cellranger count
web_summary.html report to
determine sample quality and inform decisions about additional
sequencing.The initial data processing steps in scRNA-Seq transform sample FASTQ files to gene expression counts. Some of these steps are similar to bulk RNA-Seq and some are distinct. 10x Genomics provides a pipeline tool, Cell Ranger, to expedite these initial transformations [1].
Cell Ranger inputs and outputs |
---|
Cell Ranger accepts a set of sample FASTQ files and a set of reference files. It produces a QC web summary (detailed below), a set of matrix files (also detailed below) along with a large collection of supporting files. The matrix files can be used for downstream analysis in Seurat and other tools. |
A high level view of Cell Ranger steps |
---|
A. For each sequence, the barcode and UMI (on read 1) is used to label
and bin the sequences by their barcodes. B. The transcript sequences are aligned to a genome reference using a modified version of the STAR aligner. C. For each transcript that falls in a gene, count the pile-up of sequences aligned to each gene; deduplicate the reads based on UMI. Use these deduplicated counts to build a feature barcode matrix. D. Apply a cell-calling algorithm to distinguish putative cells from background [3,4] E. Subset the raw matrix to only putative cells to create the filtered matrix. F. Use information from all previous steps to build a web summary QC report as well as basic gene expression clustering visualizations and other supporting files. |
The basic data package from the UM Advanced Genomics Core includes:
It looks like the tree below.
0000-SR
├── 0000-SR.md5
├── DemuxStats_0000-SR.csv
├── fastqs_0000-SR
│ ├── 0000-SR-1-GEX_S25_R1_001.fastq.gz
│ ├── 0000-SR-1-GEX_S25_R2_001.fastq.gz
│ ├── 0000-SR-2-GEX_S26_R1_001.fastq.gz
│ ├── 0000-SR-2-GEX_S26_R2_001.fastq.gz
│ ├── 0000-SR-3-GEX_S27_R1_001.fastq.gz
│ ├── 0000-SR-3-GEX_S27_R2_001.fastq.gz
│ ├── 0000-SR-4-GEX_S28_R1_001.fastq.gz
│ └── 0000-SR-4-GEX_S28_R2_001.fastq.gz
└── README.txt
All you need to run 10x cellranger count are the above 10x sample FASTQ files and the correctly formatted reference genome files.
As a researcher, what parts of these outputs should I download and save? |
|
The cellranger count
pipeline outputs are in the Sample
directory in the outs folder (aka
~/ISC_Shell/cellranger_outputs/count_run_HODay0replicate1
).
The output is similar to the following:
count_run_HODay0replicate1
├──...
└──outs
├── analysis
├── cloupe.cloupe
├── filtered_feature_bc_matrix
├── filtered_feature_bc_matrix.h5
├── metrics_summary.csv
├── molecule_info.h5
├── possorted_genome_bam.bam
├── possorted_genome_bam.bam.bai
├── raw_feature_bc_matrix
├── raw_feature_bc_matrix.h5
└── web_summary.html
10x Genomics provides excellent documentation on all the Cell Ranger outputs. Some of the most important outputs include:
It is useful to look at the web_summary.html files to get an idea of how each of your samples performed in the 10x pipeline, epecially if there were errors/wanrings and/or if your research could benefit from additional sequencing. 10x provides a guide to interpreting this report.
Let’s traverse to the outs directory for sample HODay0replicate1 and
open the web_summary.html
. Example
count_run_HODay0replicate1 web_summary.html
The cellranger count
web summary has errors/warnings at
the top of the report (if any) with additional metrics in the
Summary
and Gene Expression
tabs.
The important parts of the web_summary.html we look at to determine the quality of the 10x analaysis include:
WARNING
alerts indicate that some
parameter is suboptimal.ERROR
alerts indicate a major
issue.Intron mode was used. This helps with cell calling and is on by default.
The run summary from cellranger count
can be viewed by
clicking Summary
in the top left tab of the HTML file. The
summary metrics describe sequencing quality and various characteristics
of the detected cells.
The GEX Barcode Rank Plot under the Cells dashboard is an interactive plot that shows all barcodes detected in an experiment, ranked from highest to lowest UMI count and which barcodes were inferred to be associated with cells. It is useful for understanding Cell Ranger’s cell calling algorithm and its performance on your data, and for providing insights into your sample quality.
Basically, all 10x Barcodes detected during sequencing (~100k) are plotted in decreasing order of the number of UMIs associated with that particular barcode. The number of UMIs detected in each GEM is then used by Cell Ranger to identify barcodes/GEMs that are likely to contain an intact cell based on the expected cell number and UMI counts. Then distinguish low RNA content cells from empty droplets, since GEMs containing cells are expected to have a greater number of transcripts (and thus UMIs) associated with them than non-cell containing GEMs.
Since barcodes can be associated with cells based on their UMI count or by their RNA profiles, some regions of the graph can contain both cell-associated and background-associated barcodes. The color of the graph represents the local density of barcodes that are cell-associated.
A steep drop-off is indicative of good separation between the cell-associated barcodes and the barcodes associated with empty partitions.
Gene Expression
tab in the top left corner. Click the
?
icons next to each section title to display information
about the secondary analyses shown in the dashboard.
The t-SNE Projection section shows the data reduced to two dimensions,
colored by UMI count (left) or clustering (right). It is a good starting
point to explore structure in the data. The projection colored by UMI
counts is indicative of the RNA content of the cells and often
correlates with cell size - redder points are cells with more RNA in
them. For the projection colored by clustering results, select the type
of clustering analysis to display from the drop-down button on the upper
right (Graph-based by default) - change the category to vary the type of
clustering and/or number of clusters (K=2-10) that are assigned to the
data.
The Top Features By Cluster table shows which genes are differentially
expressed in each cluster relative to all other clusters (Graph-based by
default). To find the genes associated with a particular cluster, click
the L2FC or p-value column headers associated with a given cluster
number to sort the table by a specific cluster
The recommended sequencing saturation is >60%. If you are below this threshold, you would probably benefit from additional sequencing for increased depth.
To think of this another way, if you hit 60% saturation, for every additional 100 reads/cell you could expect to see 40 more genes. At 40% saturation you could expect to see 60 additional genes for every additional 100 reads/cell. At 20% saturation you could expect to see 80 additional genes for every additional 100 reads/cell and so on. Therefore if your sequencing saturation is high enough, you are unlikely to gain additional biologically relevant data with additional sequencing. If your sequecing saturation is low, additional sequencing will improve your downstream analysis and reveal more about the underlying biology.
Assuming the Advanced Genomics Core has your samples it is relatively easy to request additional sequencing through the core.
cd ~/ISC_Shell/cellranger_outputs/count_run_HODay0replicate1/outs
tree raw_feature_bc_matrix
raw_feature_bc_matrix
├── barcodes.tsv.gz
├── features.tsv.gz
└── matrix.mtx.gz
0 directories, 3 files
|
|
It’s most common for the sequence provider to run Cell Ranger on your behalf. Cell Ranger’s preliminary QC outputs give the provider key information about whether the prep and sequencing run worked as expected. With that in mind it’s entirely possible you may never have to run Cell Ranger yourself.
There are a few scenarios when you might consider re-running Cell Ranger:
Also, keep in mind that if you or your collaborators have to re-run Cell Ranger for any reason you will need the original FASTQ files. (Again, you are repsonsible for your data. Your sequencing provider may not retain a copy and even if they do there may be charge and/or delay for reprocessing).
Running Cell Ranger can be challenging because the system requirements exceed the specifications for a typical personal computer/laptop. In particiular, Cell Ranger pipelines run on Linux systems that meet these minimum requirements:
Cell Ranger is run from a Linux/Unix command shell [2]. You can print the usage statement to see
what is needed to build the command:
cellranger count --help
cellranger-count
Count gene expression (targeted or whole-transcriptome) and/or feature barcode reads from a single sample and GEM well
USAGE:
cellranger count [FLAGS] [OPTIONS] --id <ID> --transcriptome <PATH>
FLAGS:
--no-bam Do not generate a bam file
--nosecondary Disable secondary analysis, e.g. clustering. Optional
--include-introns Include intronic reads in count
--no-libraries Proceed with processing using a --feature-ref but no Feature Barcode libraries
specified with the 'libraries' flag
--no-target-umi-filter Turn off the target UMI filtering subpipeline. Only applies when --target-panel is
used
--dry Do not execute the pipeline. Generate a pipeline invocation (.mro) file and stop
--disable-ui Do not serve the web UI
--noexit Keep web UI running after pipestance completes or fails
--nopreflight Skip preflight checks
-h, --help Prints help information
...
To run cellranger count
, you need to specify:
--id
. This can be any string less than
64 characters. Cell Ranger creates an output directory named for the id
string.--fastqs
path to the directory
containing the FASTQ files.--sample
argument to specify which
subset of FASTQ files to use (specified sample string should match the
beginning of the FASTQ file name).--transcriptome
a path to the
reference transcriptome. Typically you would download a specific
transcript for your organism from 10x Genomics.Recall the directory of outputs from the sequencer (from above):
0000-SR
├── 0000-SR.md5
├── DemuxStats_0000-SR.csv
├── fastqs_0000-SR
│ ├── 0000-SR-1-GEX_S25_R1_001.fastq.gz
│ ├── 0000-SR-1-GEX_S25_R2_001.fastq.gz
│ ├── 0000-SR-2-GEX_S26_R1_001.fastq.gz
│ ├── 0000-SR-2-GEX_S26_R2_001.fastq.gz
│ ├── 0000-SR-3-GEX_S27_R1_001.fastq.gz
│ ├── 0000-SR-3-GEX_S27_R2_001.fastq.gz
│ ├── 0000-SR-4-GEX_S28_R1_001.fastq.gz
│ └── 0000-SR-4-GEX_S28_R2_001.fastq.gz
└── README.txt
We can create the cellranger count
command:
cellranger count --id=Sample_0000-SR-1 \
--fastqs=/nfs/turbo/path/to/0000-SR/0000-SR_fastqs \
--sample=0000-SR-1\
--transcriptome=path/to/refereces/cellranger_count/refdata-gex-GRCh38-2020-A`
Data from a typical sample can take several hours to complete, so we’ve completed this step for you.
Previous lesson | Top of this lesson | Next lesson |
---|