After removing low-quality cells from the data, the next task is the normalization and variance stabilization of the counts for downstream analysis.
![]() |
The counts in the feature barcode matrix are a blend of technical and biological effects. Technical effects can distort or mask the biological effects of interest, confounding downstream analyses. Seurat can model the technical effects based on overall patterns in expression across all cells. Once these technical effects are minimized, the remaining signal is primarily due to biological variance. |
SCTransform()
method.SCTransform()
.Variation in scRNA-seq data comes from biological sources:
And from technical sources:
A key driver of technical variation is cellular sequencing depth (that is, the number of UMIs sequenced per cell). In the figure below, Sample A (left, red reads) is more deeply sequenced than Sample B (right, green reads). In a test for differential expression, we want to account for the difference in sequencing depth to avoid erroneously calling a gene differentially expressed.
The assays
in a Seurat v5 object store data in “layers”.
When we first read the data in, we saw there was only one layer in the
RNA assay. In order to run SCTransform()
in Seurat v5, we
have to separate the sample-wise data into layers with the following
command:
##### Day 1 - Normalization
# Separate sample data into layers ---------------------------------------
geo_so[['RNA']] = split(geo_so[['RNA']], f = geo_so$orig.ident)
geo_so
An object of class Seurat
26489 features across 31559 samples within 1 assay
Active assay: RNA (26489 features, 0 variable features)
12 layers present: counts.HODay0replicate1, counts.HODay0replicate2, counts.HODay0replicate3, counts.HODay0replicate4, counts.HODay7replicate1, counts.HODay7replicate2, counts.HODay7replicate3, counts.HODay7replicate4, counts.HODay21replicate1, counts.HODay21replicate2, counts.HODay21replicate3, counts.HODay21replicate4
We see the 12 layers containing the count data for each of the samples. Our running schematic has now changed:
Next, run SCTransform()
:
# Normalize the data with SCTransform ------------------------------------
geo_so = SCTransform(geo_so)
To get the idea of what SCTransform()
is doing, we’ll
consider a simplified example.
Consider two cells that are identical in terms of their type, expression, etc. Imagine that we put them through the microfluidic device and then sequenced the RNA content.
If we were to plot the total cell UMI count against a particular gene UMI count, in an ideal world, the points should be directly on top of each other because the cells had identical expression and everything done to measure that expression went perfectly.
However, we don’t live in a perfect world. We will likely observe the cells have different total cell UMI counts as well as different gene UMI counts. This difference, given that the cells were identical, can be attributed to technical factors. For example, efficiency in lysis or in reverse transcription, or the stoachastic sampling that occurs during sequencing.
It is these technical factors that normalization seeks to correct for, getting us back to the “true” expression state.
Imagine doing this for thousands of cells. We would get a point cloud like the above. Importantly, that point cloud has structure. There is a relationship between the total cell UMI count and the gene UMI count for each gene.
We could fit a line through the point cloud, where we estimate the intercept, the slope, and the error.
The residuals, or the distance from the point to the line, represents the expression of the gene less the total cell UMI count influence.
In other words, the residuals represent the biological variance, and the regression removes the technical variance. Note now that the residuals are the about the same for the two cells.
The full description and justification of the
SCTransform()
function are provided in two excellent
papers:
Hafemeister & Satija, Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression, 2019, Genome Biology (link)
Choudhary & Satija, Comparison and evaluation of statistical error models for scRNA-seq, 2022, Genome Biology (link)
A benefit of the this framework for normalization is that the model
can include other terms which can account for unwanted technical
variation. One example of this is the percent mitochondrial reads
(percent.mt
). The vars.to.regress
parameter of
SCTransform()
is used for this purpose. See the documentation
for details.
From the Seurat documentation, the SCTransform()
function “replaces NormalizeData()
,
ScaleData()
, and FindVariableFeatures()
”. This
chain of functions is referred to as the “log-normalization procedure”.
You may see these three commands in other vignettes, and even in other
Seurat vignettes (source).
In the two papers referenced above, the authors show how the
log-normalization procedure does not always fully account for cell
sequencing depth and overdispersion. Therefore, we urge you to use this
alternative pipeline with caution.
By now, SCTransform()
should have finished running, so
let’s take a look at the result:
# Examine Seurat Object --------------------------------------------------
geo_so
An object of class Seurat
46957 features across 31559 samples within 2 assays
Active assay: SCT (20468 features, 3000 variable features)
3 layers present: counts, data, scale.data
1 other assay present: RNA
We now observe changes to our running schematic:
Note that we have a new assay, SCT
which has three
layers: counts
, data
, and
scale.data
. Also note additional columns in
meta.data
and the change of the active.assay
to SCT
. SCTransform()
has also determined the
common variable features across the cells to be used in our downstream
analysis. Viewed in our running schematic:
Let’s save this normalized form of our Seurat object.
# Save Seurat object -----------------------------------------------------
saveRDS(geo_so, file = 'results/rdata/geo_so_sct_normalized.rds')
Once that completes, we’ll power down the session as we earlier.
In this section we have run the SCTransform()
function
to account for variation in cell sequencing depth and to stabilize the
variance of the counts.
Next steps: PCA and integration
These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Previous lesson | Top of this lesson | Next lesson |
---|