![]() |
Cell Ranger outputs can be read into R via the triple of files directly, or via a memory saving route with the BPCells package. We provide code for both routes in this section. |
One of the goals of this workshop is to work through an example from
start to end, and to highlight some of the decision points and
rationales behind the decisions. The first step of any single-cell
project will be to get the data into R for analysis.
To get started, let’s log in to the workshop server by going here: http://bfx-workshop02.med.umich.edu
The login page for the server looks like:
Enter your user credentials and click Sign In. The RStudio interface should load and look like:
Checkpoint
RStudio is an integrated development environment where you can write, execute, and see the results of your code. The interface is arranged in different panes:
Commands can be run in the Console directly. Press Enter to execute them:
> 2+2
[1] 4
Checkpoint
Commands can also be run in a script. Some beenefits to using a script rather than relying on the Console pane:
When first opening RStudio, a script file is not automatically opened. We’ll create our script file by clicking on the icon in the upper-left of the interface (a blank piece of paper with a + sign), and selecting R Script.
The new pane that opens is the Source pane:
You can think of this as a simple text editor. In the Script pane, enter:
3+2
Notice that if we press Enter in the Source pane, we get a new line instead of running the code. In order to execute the code, we press Ctrl + Enter either on the single line we want to run, or on a highlighted block of code. We then see that code executed, along with its result in in the Console pane.
Here are some of the key differences between the Console and Script panes in RStudio:
Console | Script |
---|---|
Ephemeral code | Preserved code |
Run with Enter | Run with Ctrl + Enter |
Hard to share | Easy to share |
All of the panes in RStudio have configuration options. For example, you can minimize/maximize a pane or resize panes by dragging the borders. The most important customization options for pane layout are in the View menu. Other options such as font sizes, colors/themes, and more are in the Tools menu under Global Options.
We can enable soft-wrapping of code by selecting Code and then Soft Wrap Long Lines.
To accomodate learning styles and to keep us moving along, we’ll provide code in three different ways, and you can get that code into RStudio in corresponding ways:
Source of Code | Execution of Code |
---|---|
Zoom screen share | Type the code yourself. |
Slack | Copy and paste code into RStudio. |
Website | Use code block copy button and paste into RStudio. |
Questions?
Before we begin, the folder structure of a working directory / project organizes all the relevant files. Typically we make directories for the following types of files:
data
, input
, etc,results
or output
with
subfolders for tables
, figures
, and
rdata
, andscripts
.We’ve already provided the raw data in the data/
folder,
but you’d otherwise want to move the starting, unaltered, data into this
folder.
Before we start creating directories, let’s make sure we’re in the right location. To print the current working directory:
# =========================================================================
# Getting Started with Seurat
# =========================================================================
# -------------------------------------------------------------------------
# Get current working directory
getwd()
[1] "/home/workshop/damki/workshop-intro-single-cell/source"
This means that any references to files loaded or files saved is with respect to this location. This can simplify our code a bit by allowing us to use relative paths rather than full paths. Let’s set our working directory to the ISC folder in our respective home directories.
# -------------------------------------------------------------------------
# Set current working directory
setwd('~/ISC_R')
Now that we’re sure of our working directory, let’s create some folders for our analysis scripts and results thereof.
# -------------------------------------------------------------------------
# Create directory structure
dir.create('scripts', recursive = TRUE, showWarnings = FALSE)
dir.create('results/figures', recursive = TRUE, showWarnings = FALSE)
dir.create('results/tables', recursive = TRUE, showWarnings = FALSE)
dir.create('results/rdata', recursive = TRUE, showWarnings = FALSE)
Let’s save our currently open script in the scripts/
folder as ISC_day1.R
by clicking File and then
Save.
Checkpoint
We begin our analysis script by loading the libraries we expect to
use. It’s generally good practice to include all library()
calls at the top of a script for visibility.
# -------------------------------------------------------------------------
# Load libraries
library(Seurat)
library(BPCells)
library(tidyverse)
options(future.globals.maxSize = 1e9)
The libraries that we are loading are:
Seurat
library, developed by the Satija lab, which will provide the
essential functions used in our single-cell analysis. The Seurat documentation is
extensive and indispensible.BPCells
library, developed by Benjamin
Parks, is a recent package with the primary goal of efficiently
storing single-cell data to reduce its memory footprint. The BPCells
documentation includes many useful tutorials.tidyverse
library, developed by Posit, is an
essential package for data manipulation and plotting. The tidyverse documentation is
essential for getting a handle on the array of functions in the many
packages contained therein.The inputs/10x_cellranger_filtered_triples/
folder is
closer to what AGC would generate with Cell Ranger, where each sample
has a folder, and within that folder there are three files:
barcodes.tsv.gz
features.tsv.gz
matrix.mtx.gz
Note, these files are filtered matrices, AGC will touch on filtered and unfiltered Cell Ranger outputs.
The Read10X()
function will read in the triples
organized within sample folders:
# -------------------------------------------------------------------------
# To load data from 10X Cell Ranger
# Collect the input directories
# Each sample dir contains barcodes.tsv.gz, features.tsv.gz, matrix.mtx.gz.
# Naming the sample_dirs vector makes Seurat name the
# samples in the corresponding manner, which is nice for us.
sample_dirs = list(
HODay0replicate1 = "~/ISC_R/inputs/10x_cellranger_filtered_triples/count_run_HODay0replicate1",
HODay0replicate2 = "~/ISC_R/inputs/10x_cellranger_filtered_triples/count_run_HODay0replicate2",
HODay0replicate3 = "~/ISC_R/inputs/10x_cellranger_filtered_triples/count_run_HODay0replicate3",
HODay0replicate4 = "~/ISC_R/inputs/10x_cellranger_filtered_triples/count_run_HODay0replicate4",
HODay7replicate1 = "~/ISC_R/inputs/10x_cellranger_filtered_triples/count_run_HODay7replicate1",
HODay7replicate2 = "~/ISC_R/inputs/10x_cellranger_filtered_triples/count_run_HODay7replicate2",
HODay7replicate3 = "~/ISC_R/inputs/10x_cellranger_filtered_triples/count_run_HODay7replicate3",
HODay7replicate4 = "~/ISC_R/inputs/10x_cellranger_filtered_triples/count_run_HODay7replicate4",
HODay21replicate1 = "~/ISC_R/inputs/10x_cellranger_filtered_triples/count_run_HODay21replicate1",
HODay21replicate2 = "~/ISC_R/inputs/10x_cellranger_filtered_triples/count_run_HODay21replicate2",
HODay21replicate3 = "~/ISC_R/inputs/10x_cellranger_filtered_triples/count_run_HODay21replicate3",
HODay21replicate4 = "~/ISC_R/inputs/10x_cellranger_filtered_triples/count_run_HODay21replicate4")
# Create the expression matrix from sample dirs
# Read10X needs a *vector* instead of a *list*, so we use *unlist* to convert
geo_mat = Read10X(data.dir = unlist(sample_dirs))
We could create a Seurat object with
CreateSeuratObject(counts = geo_mat)
, but we will first
output geo_mat
using the BPCells package to save
memory:
# -------------------------------------------------------------------------
# Build BPCells input dir
# To use BPCells (to save some memory), you can transform
# the expression matrix data structure above into BPCells files.
# BPCells uses these files for processing, but you typically never look at
# their contents directly
write_matrix_dir(mat = geo_mat, dir = '~/ISC_R/bpcells', overwrite = TRUE)
Warning: Matrix compression performs poorly with non-integers.
• Consider calling convert_matrix_type if a compressed integer matrix is intended.
This message is displayed once every 8 hours.
32285 x 35269 IterableMatrix object with class MatrixDir
Row names: Xkr4, Gm1992 ... AC149090.1
Col names: HODay0replicate1_AAACCTGAGAGAACAG-1, HODay0replicate1_AAACCTGGTCATGCAT-1 ... HODay21replicate4_TTTGTCACAGGGAGAG-1
Data type: double
Storage order: column major
Queued Operations:
1. Load compressed matrix from directory /home/workshop/damki/ISC_R/bpcells
# -------------------------------------------------------------------------
# Cleanup
# Since we'll now be reading in from BPCells files, we will remove geo_mat
# from the environment and then prompt RStudio to run a "garbage collection"
# to free up unused memory
rm(geo_mat)
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 6300819 336.6 12129297 647.8 8022064 428.5
Vcells 11758118 89.8 564619816 4307.8 673035795 5134.9
Let’s read the BPCells files in, and note that the
geo_mat
object takes up less memory than when we created it
with Read10X()
:
# -------------------------------------------------------------------------
# Create expression matrix and Seurat object from BPCells files
geo_mat = open_matrix_dir(dir = '~/ISC_R/bpcells')
The expression matrix is the precursor to creating the
Seurat
object upon which all our analysis will be done. To
create the Seurat
object:
# -------------------------------------------------------------------------
# Create seurat object
geo_so = CreateSeuratObject(counts = geo_mat, min.cells = 1, min.features = 50)
geo_so
An object of class Seurat
26489 features across 35216 samples within 1 assay
Active assay: RNA (26489 features, 0 variable features)
1 layer present: counts
We have specified some parameters to remove genes and cells which do not contain very much information. Specifically a gene is removed if it is expressed in 1 or fewer cells, and a cell is removed if it contains reads for 50 or fewer genes. In the context of this workshop, this helps us minimize memory usage.
Some programs (like Seurat) need only the barcodes, features, and matrix (counts) and those data are conveniently provided as three separate files for each sample. Other programs (like Cell Ranger) need to store more information (e.g. probe info about certain library preps or the chip or channel that cell ran on). Instead of adding more files to complement the three base files above, 10x provides a single binary file, molecule_info.h5, which contains all information for all molecules. (Strictly only the molecules with valid barcode/UMI assigned to a gene.)
molecule_info.h5
├─ barcodes [HDF5 group]
├─ count
├─ features [HDF5 group]
├─ gem_group
├─ library_info
├─ metrics_json
├─ probes [HDF5 group]
├─ ...
└─ umi
You can see the first structures above are barcodes, features, and counts. So if you have the hdf5 libraries installed, you can use the .h5 as input to Seurat like so:
# ==========================================================================
### DO NOT RUN ###
# To load data from raw .h5 file
### DO NOT RUN ###
HODay0replicate1 = "~/ISC_Shell/cellranger_outputs/count_run_HODay0replicate1/outs/molecule_info.h5"
get_mat = Read10X_h5(filename = HODay0replicate1)
geo_so = CreateSeuratObject(counts = geo_mat, min.cells = 1, min.features = 50)
If you want to load multiple samples from .h5 files, see the function Read10X_h5_Multi_Directory in the scCustomize R library.
FYI
The Seurat
object is a complex data type, so let’s get a
birds eye view with an image from this
tutorial on single-cell analysis.
Seurat
object.The three main “slots” in the object are:
assays
slot stores the expression data as
Assay
objects.meta.data
slot which stores cell-level information,
including technical and phenotypic data.reductions
slot stores the results of dimension
reduction applied to the assays
.There are other slots which store information that becomes relevant as we progress through the analysis. We will highlight the other slots as they come up.
After reading the data in and creating the Seurat object above, we can imagine the following schematic representing our object:
Note the RNA assay contains a count
layer consisting of
a raw count matrix where the rows are genes (features, more
generically), and the columns are all cells across all samples. Note
also the presence of a meta.data
table giving information
about each cell. We’ll pay close attention to this as we proceed. The
other slots include information about the active.assay
and
active.ident
which tell Seurat which expression data to use
and how the cells are to be identified.
The only slot of the Seurat
object that we’ll typically
access or modify by hand–that is, without a function from the
Seurat
package–is the meta.data
object. In R,
slots are accessed with the @
symbol, as in:
# -------------------------------------------------------------------------
# Examine Seurat object
head(geo_so@meta.data)
orig.ident nCount_RNA nFeature_RNA
HODay0replicate1_AAACCTGAGAGAACAG-1 HODay0replicate1 10234 3226
HODay0replicate1_AAACCTGGTCATGCAT-1 HODay0replicate1 3158 1499
HODay0replicate1_AAACCTGTCAGAGCTT-1 HODay0replicate1 13464 4102
HODay0replicate1_AAACGGGAGAGACTTA-1 HODay0replicate1 577 346
HODay0replicate1_AAACGGGAGGCCCGTT-1 HODay0replicate1 1189 629
HODay0replicate1_AAACGGGCAACTGGCC-1 HODay0replicate1 7726 2602
Here, each row is a cell, and each column is information about that
cell. The rows of the table are named according to the uniquely
identifiable name for the cell. In this case, the day and replicate, as
well as the barcode for that cell. As we continue the workshop, we will
check in on the meta.data
slot and observe changes we want
to make, and that other functions will make. We’ll also observe the
other assays and layers and note their changes.
Let’s Save our progress as an RDS file with saveRDS()
;
this allows us to have a copy of the object that we can read back into
our session with the readRDS()
commmand. Periodically we
will be saving our Seurat object so that we can have a version of it at
different steps of the analysis. These will also help us get untangled
if we get into an odd state.
# -------------------------------------------------------------------------
# Save the Seurat object
saveRDS(geo_so, file = 'results/rdata/geo_so_unfiltered.rds')
Now that the state of our analysis is saved on disk, we’ll demonstrate how to power down the RStudio session, restart the session, and re-load the data we just saved.
geo_so
object with:# -------------------------------------------------------------------------
# Load libraries
library(Seurat)
library(BPCells)
library(tidyverse)
options(future.globals.maxSize = 1e9)
setwd("~/ISC_R/")
# -------------------------------------------------------------------------
# Load the Seurat object
geo_so = readRDS('results/rdata/geo_so_unfiltered.rds')
R and RStudio are designed to grab computer memory (RAM) when necessary and release that memory when it’s no longer needed. R does this with a fancy computer science technique called garbage collection.
The R garbage collector is very good for small objects and simple analyses, but complex single-cell analysis can overwhelm it. So don’t rely on R/RStudio’s built in garbage collection and session management for single-cell analysis. Instead, do the following:
saveRDS()
. This gives you “savepoints” so if you need
to backtrack, you don’t have to go back to the beginning. This simple pattern will make your code more reproducible and your RStudio session more reliable.
![]() |
Cell Ranger outputs can be read into R via the triple of files directly, or via a memory saving route with the BPCells package. We provide code for both routes in this section. |
In this section we:
Acknowledgements These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Previous lesson | Top of this lesson | Next lesson |
---|