Workflow Overview


wayfinder

Objectives

  • Describe the key inputs to Cell Ranger.
  • Describe the purpose and overall structure of key Cell Ranger outputs.
  • Interpret a cellranger count web_summary.html report to determine sample quality and inform decisions about additional sequencing.
  • Understand when you would need to run Cell Ranger and the relevant system requirements.

Cell Ranger Overview

The initial data processing steps in scRNA-Seq transform sample FASTQ files to gene expression counts. Some of these steps are similar to bulk RNA-Seq and some are distinct. 10x Genomics provides a pipeline tool, Cell Ranger, to expedite these initial transformations [1].

Cell Ranger inputs and outputs
Cell Ranger accepts a set of sample FASTQ files and a set of reference files. It produces a QC web summary (detailed below), a set of matrix files (also detailed below) along with a large collection of supporting files. The matrix files can be used for downstream analysis in Seurat and other tools.


A high level view of Cell Ranger steps
A. For each sequence, the barcode and UMI (on read 1) is used to label and bin the sequences by their barcodes.
B. The transcript sequences are aligned to a genome reference using a modified version of the STAR aligner.
C. For each transcript that falls in a gene, count the pile-up of sequences aligned to each gene; deduplicate the reads based on UMI. Use these deduplicated counts to build a feature barcode matrix.
D. Apply a cell-calling algorithm to distinguish putative cells from background [3,4]
E. Subset the raw matrix to only putative cells to create the filtered matrix.
F. Use information from all previous steps to build a web summary QC report as well as basic gene expression clustering visualizations and other supporting files.



Key inputs

The basic data package from the UM Advanced Genomics Core includes:

  • an *.md5 file to validate your data transfer.
  • a DemuxStats_*.csv that has some basic metrics about how your samples performed on the sequencer.
  • the fastq_* folder containing the fastq.gz files
  • a README.txt including details about how your samples were processed.

It looks like the tree below.

0000-SR
├── 0000-SR.md5
├── DemuxStats_0000-SR.csv
├── fastqs_0000-SR
│   ├── 0000-SR-1-GEX_S25_R1_001.fastq.gz
│   ├── 0000-SR-1-GEX_S25_R2_001.fastq.gz
│   ├── 0000-SR-2-GEX_S26_R1_001.fastq.gz
│   ├── 0000-SR-2-GEX_S26_R2_001.fastq.gz
│   ├── 0000-SR-3-GEX_S27_R1_001.fastq.gz
│   ├── 0000-SR-3-GEX_S27_R2_001.fastq.gz
│   ├── 0000-SR-4-GEX_S28_R1_001.fastq.gz
│   └── 0000-SR-4-GEX_S28_R2_001.fastq.gz
└── README.txt

All you need to run 10x cellranger count are the above 10x sample FASTQ files and the correctly formatted reference genome files.

As a researcher, what parts of these outputs should I download and save?
  1. If we sent it to you, it is probably important for either downstream analysis or publication of your sequencing data. So we recomend you download all of it.
  2. You can verify all of your files are intact using the *.md5 file.
  3. Keep these data safe as you are responsible for the storage of your data for publication and archival purposes. (Typically sequencing providers will regularly delete project data following delivery.)



Key outputs

The cellranger count pipeline outputs are in the Sample directory in the outs folder (aka ~/ISC_Shell/cellranger_outputs/count_run_HODay0replicate1).

The output is similar to the following:

count_run_HODay0replicate1
├──...
└──outs
    ├── analysis 
    ├── cloupe.cloupe
    ├── filtered_feature_bc_matrix
    ├── filtered_feature_bc_matrix.h5
    ├── metrics_summary.csv
    ├── molecule_info.h5
    ├── possorted_genome_bam.bam
    ├── possorted_genome_bam.bam.bai
    ├── raw_feature_bc_matrix
    ├── raw_feature_bc_matrix.h5
    └── web_summary.html

10x Genomics provides excellent documentation on all the Cell Ranger outputs. Some of the most important outputs include:

  • web_summary.html : A comprehensive report that includes quality control information and basic analysis.
  • metrics_summary.csv : Same data as the web_summary.html but in csv format.
  • raw_feature_bc_matrix: A set of three files that represent the feature x barcode count table; includes all detected barcodes. (Format details elaborated below.)
  • filtered_feature_bc_matrix : As above, but filtered to include only barcodes that passed Cell Ranger’s cell-calling algorithm [3].
  • molecule.h5 : An HDF5 (binary) file all containing information for all molecules assigned to a feature/barcode. See references [6] for more info on this format.
  • cloupe file : Used with the 10x Loupe Browser visualization software.

Web Summary html report

It is useful to look at the web_summary.html files to get an idea of how each of your samples performed in the 10x pipeline, epecially if there were errors/wanrings and/or if your research could benefit from additional sequencing. 10x provides a guide to interpreting this report.

Let’s traverse to the outs directory for sample HODay0replicate1 and open the web_summary.html. Example count_run_HODay0replicate1 web_summary.html

The cellranger count web summary has errors/warnings at the top of the report (if any) with additional metrics in the Summary and Gene Expression tabs.

The important parts of the web_summary.html we look at to determine the quality of the 10x analaysis include:

  • Warnings/Errors
  • Barcode Rank Plot
  • Sequencing Saturation Plot
  • Cell count
  • reads/cell
  • UMI count
  • gene count
  • mapping to reference

Warnings/Errors & remediations.

Alerts are generally the result of factors inherent in library preparation and sequencing or sample quality issues. Alerts do not affect the operation of the pipeline, but highlight potential causes for abnormal or missing data:

  • WARNING alerts indicate that some parameter is suboptimal.
  • ERROR alerts indicate a major issue.

Intron mode was used. This helps with cell calling and is on by default.

Other alert examples

Report with alerts
Report with alerts

Summary tab

The run summary from cellranger count can be viewed by clicking Summary in the top left tab of the HTML file. The summary metrics describe sequencing quality and various characteristics of the detected cells.

More on the Barcode Rank Plot

The y-axis is the number of UMI counts mapped to each barcode and the x-axis is the number of barcodes below that value.
The y-axis is the number of UMI counts mapped to each barcode and the x-axis is the number of barcodes below that value.

The GEX Barcode Rank Plot under the Cells dashboard is an interactive plot that shows all barcodes detected in an experiment, ranked from highest to lowest UMI count and which barcodes were inferred to be associated with cells. It is useful for understanding Cell Ranger’s cell calling algorithm and its performance on your data, and for providing insights into your sample quality.

Basically, all 10x Barcodes detected during sequencing (~100k) are plotted in decreasing order of the number of UMIs associated with that particular barcode. The number of UMIs detected in each GEM is then used by Cell Ranger to identify barcodes/GEMs that are likely to contain an intact cell based on the expected cell number and UMI counts. Then distinguish low RNA content cells from empty droplets, since GEMs containing cells are expected to have a greater number of transcripts (and thus UMIs) associated with them than non-cell containing GEMs.

Since barcodes can be associated with cells based on their UMI count or by their RNA profiles, some regions of the graph can contain both cell-associated and background-associated barcodes. The color of the graph represents the local density of barcodes that are cell-associated.

A steep drop-off is indicative of good separation between the cell-associated barcodes and the barcodes associated with empty partitions.

Gene Expression tab

The automated secondary analysis results can be viewed by clicking the Gene Expression tab in the top left corner. Click the ? icons next to each section title to display information about the secondary analyses shown in the dashboard.

The t-SNE Projection section shows the data reduced to two dimensions, colored by UMI count (left) or clustering (right). It is a good starting point to explore structure in the data. The projection colored by UMI counts is indicative of the RNA content of the cells and often correlates with cell size - redder points are cells with more RNA in them. For the projection colored by clustering results, select the type of clustering analysis to display from the drop-down button on the upper right (Graph-based by default) - change the category to vary the type of clustering and/or number of clusters (K=2-10) that are assigned to the data.

The Top Features By Cluster table shows which genes are differentially expressed in each cluster relative to all other clusters (Graph-based by default). To find the genes associated with a particular cluster, click the L2FC or p-value column headers associated with a given cluster number to sort the table by a specific cluster

The Sequencing Saturation plot
The Sequecing Saturation plot shows the effect of decreased sequencing depth on sequencing saturation, which is a measure of the fraction of library complexity that was observed. The right-most point on the line is the full sequencing depth obtained in this run.

Similarly, the Median Genes per Cell plot shows the effect of decreased sequencing depth on median genes per cell, which is a way of measuring data yield as a function of depth. The right-most point on the line is the full sequencing depth obtained in this run.

The recommended sequencing saturation is >60%. If you are below this threshold, you would probably benefit from additional sequencing for increased depth.

To think of this another way, if you hit 60% saturation, for every additional 100 reads/cell you could expect to see 40 more genes. At 40% saturation you could expect to see 60 additional genes for every additional 100 reads/cell. At 20% saturation you could expect to see 80 additional genes for every additional 100 reads/cell and so on. Therefore if your sequencing saturation is high enough, you are unlikely to gain additional biologically relevant data with additional sequencing. If your sequecing saturation is low, additional sequencing will improve your downstream analysis and reveal more about the underlying biology.

Assuming the Advanced Genomics Core has your samples it is relatively easy to request additional sequencing through the core.



Feature Barcode Matrices

cd ~/ISC_Shell/cellranger_outputs/count_run_HODay0replicate1/outs
tree raw_feature_bc_matrix
raw_feature_bc_matrix
    ├── barcodes.tsv.gz
    ├── features.tsv.gz
    └── matrix.mtx.gz
0 directories, 3 files


  1. Each element of the feature-barcode matrix is the count of UMIs associated with a feature (row) and a barcode (column). Because the matrix is very sparse, storing the matrix as three individual, compressed files saves siginificant space. [6]
  2. In relation to the Barcode Rank Plot, raw matrices include cells and background and the filtered matrix files only contain data for cells. Downstream analysis typically uses the filtered matrix files.



Re-running Cell Ranger

It’s most common for the sequence provider to run Cell Ranger on your behalf. Cell Ranger’s preliminary QC outputs give the provider key information about whether the prep and sequencing run worked as expected. With that in mind it’s entirely possible you may never have to run Cell Ranger yourself.

There are a few scenarios when you might consider re-running Cell Ranger:

  1. You need to combine new samples with previously processed samples and you want all samples to be run through the same version of cell ranger
  2. You need to adjust the run parameters (e.g. you want to exclude/include introns in the alignment).
  3. You need to align against a custom reference ( e.g. one that includes a transgene)

Also, keep in mind that if you or your collaborators have to re-run Cell Ranger for any reason you will need the original FASTQ files. (Again, you are repsonsible for your data. Your sequencing provider may not retain a copy and even if they do there may be charge and/or delay for reprocessing).

System Requirements for running Cell Ranger

Running Cell Ranger can be challenging because the system requirements exceed the specifications for a typical personal computer/laptop. In particiular, Cell Ranger pipelines run on Linux systems that meet these minimum requirements:

  • 64-bit CentOS/RedHat 7.0 or Ubuntu 14.04
  • 8-core Intel or AMD processor (16 cores recommended).
  • 64GB RAM (128GB recommended).
  • 1TB free disk space.
Note that UM Advanced Genomics Core and Bioinformatics Core use the Great Lakes Computing cluster to run Cell Ranger. If your lab has signed up for the University of Michigan Research Computing Package (UMRCP), you can also get access to the cluster and the available software.


How to run Cell Ranger

Cell Ranger is run from a Linux/Unix command shell [2]. You can print the usage statement to see what is needed to build the command: cellranger count --help

cellranger-count
Count gene expression (targeted or whole-transcriptome) and/or feature barcode reads from a single sample and GEM well
 
USAGE:
    cellranger count [FLAGS] [OPTIONS] --id <ID> --transcriptome <PATH>
     
FLAGS:
          --no-bam                  Do not generate a bam file
          --nosecondary             Disable secondary analysis, e.g. clustering. Optional
          --include-introns         Include intronic reads in count
          --no-libraries            Proceed with processing using a --feature-ref but no Feature Barcode libraries
                                    specified with the 'libraries' flag
          --no-target-umi-filter    Turn off the target UMI filtering subpipeline. Only applies when --target-panel is
                            used
          --dry                     Do not execute the pipeline. Generate a pipeline invocation (.mro) file and stop
          --disable-ui              Do not serve the web UI
          --noexit                  Keep web UI running after pipestance completes or fails
          --nopreflight             Skip preflight checks
      -h, --help                    Prints help information
  ...

To run cellranger count, you need to specify:

  • --id. This can be any string less than 64 characters. Cell Ranger creates an output directory named for the id string.
  • --fastqs path to the directory containing the FASTQ files.
  • --sample argument to specify which subset of FASTQ files to use (specified sample string should match the beginning of the FASTQ file name).
  • --transcriptome a path to the reference transcriptome. Typically you would download a specific transcript for your organism from 10x Genomics.

Recall the directory of outputs from the sequencer (from above):

0000-SR
├── 0000-SR.md5
├── DemuxStats_0000-SR.csv
├── fastqs_0000-SR
│   ├── 0000-SR-1-GEX_S25_R1_001.fastq.gz
│   ├── 0000-SR-1-GEX_S25_R2_001.fastq.gz
│   ├── 0000-SR-2-GEX_S26_R1_001.fastq.gz
│   ├── 0000-SR-2-GEX_S26_R2_001.fastq.gz
│   ├── 0000-SR-3-GEX_S27_R1_001.fastq.gz
│   ├── 0000-SR-3-GEX_S27_R2_001.fastq.gz
│   ├── 0000-SR-4-GEX_S28_R1_001.fastq.gz
│   └── 0000-SR-4-GEX_S28_R2_001.fastq.gz
└── README.txt

We can create the cellranger count command:

cellranger count --id=Sample_0000-SR-1 \
   --fastqs=/nfs/turbo/path/to/0000-SR/0000-SR_fastqs \
   --sample=0000-SR-1\
   --transcriptome=path/to/refereces/cellranger_count/refdata-gex-GRCh38-2020-A`

Data from a typical sample can take several hours to complete, so we’ve completed this step for you.


Summary

  • Cell Ranger is a command line tool from 10x Genomics. It accepts the FASTQ sequences from a 10x experiment and emits QC reports and feature-barcode-count files used for downstream analysis.
  • The Barcode Rank Plot shows how the software determined what were cells compared to background. The filtered matrix files only contain data for cells on the barcode rank plot, wheras the raw matrices include cells and background.
  • Additional sequencing is suggested if your sequencing saturation is low.
  • You may not ever need to run Cell Ranger yourself, but keep in mind that this part of the process is computationally intensive and typically requires more powerful computers, abundant storage, and significant time.