One of the goals of this workshop is to work through an example from start to end, and to highlight some of the decision points and rationales behind the decisions. The first step of any single-cell project will be to get the data into R for analysis.
![]() |
Cell Ranger outputs can be read into R via the triple of files directly, or via a memory saving route with the BPCells package. We provide code for both routes in this section. |
We’ve created a simple welcome.R
script to demonstrate
the wheres and hows of the interface. The “File” pane in the lower right
should have a welcome.R
script. Click on it to open.
There should now be four panes in the RStudio window, with the
welcome.R
script in the “Script” (or “Source”) pane. It
displays the code that we will write to perform our analysis.
Currently the welcome.R
script is inert. We must execute
it in sequence to bring it to life. Highlight all the code in
welcome.R
and then click the “Run” button located along the
top of the Script pane. Three things happen in quick succession:
We note that code can be run in two ways:
Question: What is the advantage of the first way over the second?
We will create an RStudio Project to keep our files and code organized. See the Projects section of R for Data Science for a more in-depth description of what a project is and how it’s helpful.
To create a Project for this workshop, click File then
New Project…. In the New Project Wizard window that opens,
select Existing Directory, then Browse…. In the
Choose Directory window, select the ISC_R
folder by
clicking it once, and then click the Choose button. Finally,
click Create Project.
Once we do this, RStudio will restart and the Files pane (lower
right) should put us in the ~/ISC_R
folder where there is
an inputs/
folder and an ISC_R.Rproj
file.
We have included the data to be used in the workshop in the
inputs/
folder. However, the project will need to include
folders for our analysis and our analysis scripts. Let’s create that
directory structure with the dir.create()
function. We will
start by putting the following commands into the Console pane
directly.
##### Day 1 - Getting Started with Seurat
# Create project directories ----------------------------------------------
dir.create('scripts', showWarnings = FALSE, recursive = TRUE)
dir.create('results/figures', showWarnings = FALSE, recursive = TRUE)
dir.create('results/tables', showWarnings = FALSE, recursive = TRUE)
dir.create('results/rdata', showWarnings = FALSE, recursive = TRUE)
In the Files pane we should see the new results/
and
scripts/
folders.
The two most important artifacts of our analysis are the data from
Cell Ranger, and the script to analyze the data. There will be outputs
in results/
, and these will be important, but if the
contents of results/
are ever lost, the script will be able
to re-generate them if we’ve captured all our steps as code, which we
aim to do.
To create the analysis script, click File, hover over
New File, and click on R Script. A new pane in the
upper left slides into view and is the Untitled script file. Save this
file, and name it, by clicking File then Save.
Double click the scripts/
folder, and in the File name:
text box type “analysis.R”. Then click Save.
As we proceed through the workshop, we should save this file (by clicking the Floppy disk, clicking File then Save, or by typing Control + S).
Good scripting practices
In any analysis script, we recommend using comments (lines preceded by a “#”) to provide additional information about code that may not be self-evident. This is to the benefit of others that may look at the code, but also to your future-self.
For completeness, insert the dir.create
commands at the
top of our new script.
##### Day 1 - Getting Started with Seurat
# Create project directories ----------------------------------------------
dir.create('scripts', showWarnings = FALSE, recursive = TRUE)
dir.create('results/figures', showWarnings = FALSE, recursive = TRUE)
dir.create('results/tables', showWarnings = FALSE, recursive = TRUE)
dir.create('results/rdata', showWarnings = FALSE, recursive = TRUE)
We begin our analysis script by loading the libraries we expect to
use. It’s generally good practice to include all library()
calls at the top of a script for visibility.
# Load libraries ----------------------------------------------
library(Seurat)
library(BPCells)
library(tidyverse)
options(future.globals.maxSize = 1e9)
The libraries that we are loading are:
Seurat
library, developed by the Satija lab, which will provide the
essential functions used in our single-cell analysis. The Seurat documentation is
extensive and indispensible.BPCells
library, developed by Benjamin
Parks, is a recent package with the primary goal of efficiently
storing single-cell data to reduce its memory footprint. The BPCells
documentation includes many useful tutorials.tidyverse
library, developed by Posit, is an
essential package for data manipulation and plotting. The tidyverse documentation is
essential for getting a handle on the array of functions in the many
packages contained therein.The inputs/
folder has the data for the workshop stored
in two forms. The first,
inputs/10x_cellranger_filtered_triples/
is closer to what
AGC would generate with Cell Ranger, where each sample has a folder, and
within that folder there are three files:
barcodes.tsv.gz
features.tsv.gz
matrix.mtx.gz
Note, these files are filtered matrices, AGC will touch on filtered and unfiltered Cell Ranger outputs.
The second, inputs/bpcells/
is the result of the
BPCells
package, and is more machine-readable than
human-readable. It is an efficient way of storing the same data
contained in the three files above. We will use this form of the input
data because it is more memory efficient.
While the full dataset we selected has time-series information from Day 0, Day 3, Day 7, and Day 21, we have removed Day 3 to reduce the memory requirements further.
The most recent release of Seurat
, version 5, includes
changes which take advantage of the memory-efficient storage implemented
in BPCells
. To read the efficiently stored data with
BPCells
we will use the open_matrix_dir()
function.
# Puts the data "on disk" rather than "in memory" -------------
geo_mat = open_matrix_dir(dir = 'inputs/bpcells')
This reads in the expression matrix which has genes as rows and cells
as columns. The expression matrix is the precursor to creating the
Seurat
object upon which all our analysis will be done. To
create the Seurat
object:
# Create seurat object ----------------------------------------------
geo_so = CreateSeuratObject(counts = geo_mat, min.cells = 1, min.features = 50)
geo_so
An object of class Seurat
26489 features across 35216 samples within 1 assay
Active assay: RNA (26489 features, 0 variable features)
1 layer present: counts
We have specified some parameters to remove genes and cells which do not contain very much information. Specifically a gene is removed if it is expressed in 1 or fewer cells, and a cell is removed if it contains reads for 50 or fewer genes. In the context of this workshop, this helps us minimize memory usage.
Above, we’ve demonstrated how to read data that was already arranged
in the BPCells
manner. We have done this for the workshop
to reduce our memory footprint from the start, but when you get data
back from AGC, you will have to follow the steps below.
# ==========================================================================
### DO NOT RUN ###
# To load data from raw files without BPCells
### DO NOT RUN ###
# Collect the input directories ---------------------------------------
# Each sample dir contains barcodes.tsv.gz, features.tsv.gz, matrix.mtx.gz.
# Naming the sample_dirs vector makes Seurat name the
# samples in the corresponding manner, which is nice for us.
sample_dirs = list(
HODay0replicate1 = "inputs/10x_cellranger_filtered_triples/count_run_HODay0replicate1",
HODay0replicate2 = "inputs/10x_cellranger_filtered_triples/count_run_HODay0replicate2",
HODay0replicate3 = "inputs/10x_cellranger_filtered_triples/count_run_HODay0replicate3",
HODay0replicate4 = "inputs/10x_cellranger_filtered_triples/count_run_HODay0replicate4",
HODay7replicate1 = "inputs/10x_cellranger_filtered_triples/count_run_HODay7replicate1",
HODay7replicate2 = "inputs/10x_cellranger_filtered_triples/count_run_HODay7replicate2",
HODay7replicate3 = "inputs/10x_cellranger_filtered_triples/count_run_HODay7replicate3",
HODay7replicate4 = "inputs/10x_cellranger_filtered_triples/count_run_HODay7replicate4",
HODay21replicate1 = "inputs/10x_cellranger_filtered_triples/count_run_HODay21replicate1",
HODay21replicate2 = "inputs/10x_cellranger_filtered_triples/count_run_HODay21replicate2",
HODay21replicate3 = "inputs/10x_cellranger_filtered_triples/count_run_HODay21replicate3",
HODay21replicate4 = "inputs/10x_cellranger_filtered_triples/count_run_HODay21replicate4")
# Create the expression matrix from sample dirs -----------------------
# Read10X needs a *vector* instead of a *list*, so we use *unlist* to convert
geo_mat = Read10X(data.dir = unlist(sample_dirs))
# Create a Seurat object from expression matrix -----------------------
geo_so = CreateSeuratObject(counts = geo_mat, min.cells = 1, min.features = 50)
geo_so
# Save the Seurat object ----------------------------------------------
saveRDS(geo_so, file = 'results/rdata/geo_so_unfiltered.rds')
# Optional: Build BPCells input dir -----------------------------------
# If you *wanted* to use BPCells (to save some memory), you can transform
# the expression matrix data structure above into BPCells files.
# BPCells uses these files for processing, but you typically never look at
# their contents directly
write_matrix_dir(mat = geo_mat, dir = '~/ISC_R/bpcells')
# What files did that command above create?
as.matrix(list.files('~/ISC_R/bpcells'))
[,1]
[1,] "col_names"
[2,] "idxptr"
[3,] "index_data"
[4,] "index_idx"
[5,] "index_idx_offsets"
[6,] "index_starts"
[7,] "row_names"
[8,] "shape"
[9,] "storage_order"
[10,] "val"
[11,] "version"
# Create expression matrix and Seurat object from BPCells files ---------
geo_mat = open_matrix_dir(dir = '~/ISC_R/bpcells')
geo_so = CreateSeuratObject(counts = geo_mat, min.cells = 1, min.features = 50)
geo_so
# Cleanup ---------------------------------------------------------------
# Since we have a Seurat object, we no longer need the geo_mat matrix.
# We will remove from the environment and then prompt RStudio to run a
# "garbage collection" to free up unused memory
rm(geo_mat)
gc()
# ========================================================================
Some programs (like Seurat) need only the barcodes, features, and matrix (counts) and those data are conveniently provided as three separate files for each sample. Other programs (like Cell Ranger) need to store more information (e.g. probe info about certain library preps or the chip or channel that cell ran on). Instead of adding more files to complement the three base files above, 10x provides a single binary file, molecule_info.h5, which contains all information for all molecules. (Strictly only the molecules with valid barcode/UMI assigned to a gene.)
molecule_info.h5
├─ barcodes [HDF5 group]
├─ count
├─ features [HDF5 group]
├─ gem_group
├─ library_info
├─ metrics_json
├─ probes [HDF5 group]
├─ ...
└─ umi
You can see the first structures above are barcodes, features, and counts. So if you have the hdf5 libraries installed, you can use the .h5 as input to Seurat like so:
# ==========================================================================
### DO NOT RUN ###
# To load data from raw .h5 file
### DO NOT RUN ###
HODay0replicate1 = "~/ISC_Shell/cellranger_outputs/count_run_HODay0replicate1/outs/molecule_info.h5"
get_mat = Read10X_h5(filename = HODay0replicate1)
geo_so = CreateSeuratObject(counts = geo_mat, min.cells = 1, min.features = 50)
If you want to load multiple samples from .h5 files, see the function Read10X_h5_Multi_Directory in the scCustomize R library.
FYI
The Seurat
object is a complex data type, so let’s get a
birds eye view with an image from this
tutorial on single-cell analysis.
Seurat
object.The three main “slots” in the object are:
assays
slot stores the expression data as
Assay
objects.meta.data
slot which stores cell-level information,
including technical and phenotypic data.reductions
slot stores the results of dimension
reduction applied to the assays
.There are other slots which store information that becomes relevant as we progress through the analysis. We will highlight the other slots as they come up.
After reading the data in and creating the Seurat object above, we can imagine the following schematic representing our object:
Note the RNA assay contains a count
layer consisting of
a raw count matrix where the rows are genes (features, more
generically), and the columns are all cells across all samples. Note
also the presence of a meta.data
table giving information
about each cell. We’ll pay close attention to this as we proceed. The
other slots include information about the active.assay
and
active.ident
which tell Seurat which expression data to use
and how the cells are to be identified.
The only slot of the Seurat
object that we’ll typically
access or modify by hand–that is, without a function from the
Seurat
package–is the meta.data
object. In R,
slots are accessed with the @
symbol, as in:
# Examine Seurat object ----------------------------------------------
head(geo_so@meta.data)
orig.ident nCount_RNA nFeature_RNA
HODay0replicate1_AAACCTGAGAGAACAG-1 HODay0replicate1 10234 3226
HODay0replicate1_AAACCTGGTCATGCAT-1 HODay0replicate1 3158 1499
HODay0replicate1_AAACCTGTCAGAGCTT-1 HODay0replicate1 13464 4102
HODay0replicate1_AAACGGGAGAGACTTA-1 HODay0replicate1 577 346
HODay0replicate1_AAACGGGAGGCCCGTT-1 HODay0replicate1 1189 629
HODay0replicate1_AAACGGGCAACTGGCC-1 HODay0replicate1 7726 2602
Here, each row is a cell, and each column is information about that
cell. The rows of the table are named according to the uniquely
identifiable name for the cell. In this case, the day and replicate, as
well as the barcode for that cell. As we continue the workshop, we will
check in on the meta.data
slot and observe changes we want
to make, and that other functions will make. We’ll also observe the
other assays and layers and note their changes.
Let’s Save our progress as an RDS file with saveRDS()
;
this allows us to have a copy of the object that we can read back into
our session with the readRDS()
commmand. Periodically we
will be saving our Seurat object so that we can have a version of it at
different steps of the analysis. These will also help us get untangled
if we get into an odd state.
# Save the Seurat object ----------------------------------------------
saveRDS(geo_so, file = 'results/rdata/geo_so_unfiltered.rds')
Now that the state of our analysis is saved on disk, we’ll demonstrate how to power down the RStudio session, restart the session, and re-load the data we just saved.
geo_so
object with:# Load the Seurat object ----------------------------------------------
geo_so = readRDS('results/rdata/geo_so_unfiltered.rds')
In this section we:
These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Previous lesson | Top of this lesson | Next lesson |
---|