After filtering and normalization, our next goal is to cluster the
cells according to their expression profiles. However, even after
filtering, we expect to still have several thousand cells assayed with
approximately 21,000 genes measured per cell. This means that we are
working with very “high-dimensional” data, which presents some challenges.
Starting with filtered, normalized data [gene x cell] for all samples, principal component analysis (PCA) can be used to reduce the dimensionality while still representing the overall gene expression patterns for each cell.
In this section, we will use the RunPCA() function to generate principal components for our data and IntegrateLayers() to integrate our sample data prior to clustering. Similar to the previous sections, while you may see only a single approach in a paper, and we will be showing one example of executing these steps here, testing multiple methods or parameters might be necessary.
The high dimensionality of single-cell data presents some major challenges for analysis.
In practical terms, principal component analysis (PCA) creates a simpler, more concise version of the data while preserving the same “story”. For example, we can think of watching football - attending a game in person is exciting but can take hours, especially considering the time that will be spent traveling and waiting. Conversely, watching a highlight reel of that same game might not give you the full experience, but by seeing the key plays you can get a good understanding of what happened in the game in a fraction of the time.
Similar to a highlight reel, running PCA allows us to select PCs that represent correlated gene expression, reducing the size and complexity of the data while still capturing the overall expression patterns and also reducing computation time.
To understand how PCA works for single-cell data, we can consider a smaller dataset measuring the expression of four genes in just two cells and plot the expression of those four genes, with data from one cell plotted on the x-axis and the second cell plotted on the y-axis.
If we wanted to represent the most variation across the data, we would draw a diagonal line between gene B and gene C - this would represent the first principal component (PC). However, this line doesn’t capture all the variance in this data as the genes also vary above and below the line, so we could draw another line (at a right angle to the first) representing the second most variation in the data - which would represent the second PC.
We won’t illustrate the process of generating all possible PCs, but running PCA on our data will result in a score for each cell for each PC and each gene will have a weight or loading for a given PC. By definition, the PCs capture the greatest factors of heterogeneity in decreasing order in the data set (source).
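To make scores and loadings more concrete, below is a minimal sketch using base R’s prcomp() on a small made-up expression matrix (four genes measured in four cells; the values and gene/cell names are purely illustrative and are not from our workshop dataset).
# Toy example: PCA scores and loadings with prcomp() ----------------------
expr = matrix(
  c(10,  2,  9,  1,    # geneA
     1,  8,  2,  9,    # geneB
    12,  3, 11,  2,    # geneC
     2,  9,  1, 10),   # geneD
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0('gene', LETTERS[1:4]), paste0('cell', 1:4))
)
# prcomp() expects observations (cells) in rows, so transpose the gene x cell matrix
pca = prcomp(t(expr), center = TRUE, scale. = TRUE)
pca$x         # scores: one value per cell for each PC
pca$rotation  # loadings: one weight per gene for each PC
summary(pca)  # proportion of variance explained by each PC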
In contrast to bulk RNA-seq, where the majority of the variance is usually explained by the first and second PC, for single-cell data we expect that many more PCs are contributing to the overall variance.
In addition to being “high-dimensional”, we expect the data to be both “sparse” and “noisy”. Sparse in the sense that many genes will either not be expressed or not measured in many of the cells and so have zero values. Noisy due to both biological variability and technical variability arising from the practical limitations of capture and sequencing depth (source).
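As a quick illustration of sparsity, the sketch below builds a random sparse matrix with the Matrix package and computes the fraction of zero entries; the dimensions and density here are arbitrary, but the same calculation could be applied to a real gene x cell counts matrix.
# Illustrate sparsity with a toy sparse matrix ----------------------------
library(Matrix)
counts = rsparsematrix(nrow = 1000, ncol = 500, density = 0.1)  # toy gene x cell matrix
1 - nnzero(counts) / length(counts)  # fraction of zero entries, ~0.9 here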
Note on zero “inflation” of single-cell data
Single-cell data is sometimes described as “zero inflated”; however, work by Svensson has challenged that characterization and argued that the higher number of zeros observed in scRNA-seq compared to bulk RNA-seq is more likely due to biological variance and lower sequencing saturation than to technical artifacts. Work by Choi et al. (2020) supports the idea that zeros in single-cell data are due to biology, while Jiang et al. (2022) delves more into the “controversy” of zero inflation, including common approaches for handling zeros and the downstream impacts.
For sparse, high-dimensional, biological data, we expect many genes to have correlated expression, as they are impacted by the same biological processes, while other genes have either low (noisier) expression or relatively uniform expression across cells and are therefore not as important for identifying cell-types/subtypes. So we can use the top several PCs to approximate the full data set for the purpose of clustering (source).
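The idea of approximating the full data with just the top PCs can be sketched with a small simulated matrix (again toy data, not our dataset): when the data contains correlated structure, reconstructing the matrix from only the first few PCs recovers most of that structure.
# Toy example: approximating a matrix with only its top PCs ---------------
set.seed(1)
scores_true = matrix(rnorm(20 * 2), nrow = 20)    # 2 underlying "programs" across 20 cells
loadings_true = matrix(rnorm(2 * 10), nrow = 2)   # gene weights for each program (10 genes)
m = scores_true %*% loadings_true + matrix(rnorm(200, sd = 0.1), nrow = 20)  # 20 cells x 10 genes
p = prcomp(m, center = TRUE, scale. = FALSE)
summary(p)$importance[3, 1:3]  # cumulative variance explained by the first 3 PCs
k = 2  # keep only the top 2 PCs
approx_m = sweep(p$x[, 1:k] %*% t(p$rotation[, 1:k]), 2, p$center, '+')  # add back column means
cor(as.vector(m), as.vector(approx_m))  # close to 1: the top PCs approximate the data well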
More context using PCA for single-cell data
To read more on PCA, please refer to the HBC - Theory of PCA content, from which this section is adapted, and the original source material for that content, specifically Josh Starmer’s StatQuest video.
For additional detail, the OSCA chapter on Principal components analysis includes a more descriptive overview.
Since PCA is sensitive to scale, we will use the SCT normalized assay
(reference).
We will name the reduction in an informative way to keep it clear for us in the future. In this case, our data has not yet been integrated, but it has been SCT normalized, so we will name the reduction unintegrated.sct.pca. Note that SCTransform() returned a set of highly variable genes, and the RunPCA() function will use this subset to determine the PCs and genes associated with those PCs.
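Before running PCA, we can optionally peek at those variable features. This is a small sketch assuming the geo_so object created earlier in the workshop, with an SCT assay produced by SCTransform():
# Optional check: variable features that RunPCA() will use by default -----
length(VariableFeatures(geo_so, assay = 'SCT'))  # expect 3000 from SCTransform()
head(VariableFeatures(geo_so, assay = 'SCT'))    # peek at the first few gene names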
# Day 2 - PCA and Integration
# =========================================================================
# Build a PCA and add it to the Seurat object ----------------------------
geo_so = RunPCA(geo_so, reduction.name = 'unintegrated.sct.pca')
geo_so
An object of class Seurat
46957 features across 31559 samples within 2 assays
Active assay: SCT (20468 features, 3000 variable features)
3 layers present: counts, data, scale.data
1 other assay present: RNA
1 dimensional reduction calculated: unintegrated.sct.pca
After running the command in the console, we should see that the
geo_so
Seurat object has a new reduction. Viewed in our
running schematic:
We can then visualize the first several PCs using the
DimHeatmap()
function, which orders both cells and features
(genes) according to their PCA scores and allows us to see some general
patterns in the data. Here we will specify that the first 18 PCs be
included in the figure and that 500 cells are randomly subsetted for the
plot. Note - you may need to increase your plot window size to
resolve errors regarding figure margins.
# Build heatmaps of PCAs -------------------------------------------------
# Plot cell by gene heatmaps for first several PCs
DimHeatmap(geo_so, dims=1, cells=500, balanced=TRUE, reduction = 'unintegrated.sct.pca') # look at first PC alone first
DimHeatmap(geo_so, dims=1:18, cells=500, balanced=TRUE, reduction = 'unintegrated.sct.pca') # first 18 PCs
# note - need to use png() to write to file because this isn't a ggplot
png(filename = 'results/figures/qc_pca_heatmap.png', width = 12, height = 40, units = 'in', res = 300)
DimHeatmap(geo_so, dims=1:18, cells=500, balanced=TRUE, reduction = 'unintegrated.sct.pca')
dev.off()
# =========================================================================
DimHeatmap for PC1 alone
DimHeatmap for PC1-18
As we scroll down in this plot, we see less and less structure, indicating that not only is less variation being represented by the later PCs overall, but also that those PCs are less likely to correspond to variation of interest, e.g. variation corresponding to cell types/subtypes.
We can also visualize the loadings for genes contributing to each principal component using Seurat-provided functions (source).
We can highlight the genes loaded onto dimensions of interest using VizDimLoadings():
# Visualize gene loadings for PC1 and PC2 ---------------------------------
VizDimLoadings(geo_so, dims = 1:2, reduction = 'unintegrated.sct.pca')
ggsave(filename = 'results/figures/qc_pca_loadings.png', width = 12, height = 6, units = 'in')
# =========================================================================
How does Seurat use PCA scores?
Per the Ho Lab’s materials - “To overcome the extensive technical noise in any single feature for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a ‘metafeature’ that combines information across a correlated feature set. The top principal components therefore represent a robust compression of the dataset.”
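To see what these “metafeatures” look like in practice, we can peek at the matrices Seurat stores for a reduction. A short sketch, assuming the geo_so object and the unintegrated.sct.pca reduction created above:
# Peek at the PC scores and gene loadings stored in the reduction ---------
Embeddings(geo_so, reduction = 'unintegrated.sct.pca')[1:5, 1:5]  # cells x PCs (scores)
Loadings(geo_so, reduction = 'unintegrated.sct.pca')[1:5, 1:5]    # genes x PCs (loadings)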
So how do we determine how many PCs to use before classifying cells
into clusters based on their expression? One way to evaluate how many
PCs to use for clustering is by looking at an elbow plot, which shows
the percent variance explained by successive PCs. We’ll use the
ElbowPlot()
function to do this, specifying that the first
50 PCs be plotted.
# Visualize how many PCs to include using an elbow plot -------------------
ElbowPlot(geo_so, ndims = 50, reduction = 'unintegrated.sct.pca')
ggsave(filename = 'results/figures/qc_sct_elbow_plot.png', width = 8, height = 8, units = 'in')
# =========================================================================
In this plot, we could arbitrarily choose a number along the x-axis where there looks to be a sharp change in the variance from one PC to the next, that is, an elbow. Of course, while that’s often the recommendation in tutorials, the choice of where the “elbow” falls is not always obvious, and this plot is no different.
An important consideration for determining how many PCs to select for your single-cell analysis is to understand the “resolution” of your biological question. Is answering your biological question dependent on identifying rarer cell types or specific subtypes? Or are broader cell-types more relevant, so that fewer PCs are needed?
For this dataset, we are expecting a diversity of cell types and cell populations that mediate wound healing, but also an aberrant transition to bone, which might include rarer cells in the population. So for this dataset, we can consider starting with more PCs rather than too few. Again, in a full analysis workflow, our selection at this step might be more of a starting point for further iterations than a final decision.
Instead of choosing based on the elbow plot by sight alone, we can try to quantify our choice algorithmically. Here we create a function that returns a recommended number of PCs based on two possible metrics: (A) the cumulative variance explained or (B) the size of the variance step between consecutive PCs. We can apply that function to our data to get a good starting point for the number of PCs to include.
# Estimate optimal PCs for clustering with a function -------------------------
optimal_pcs = function(so, reduction) {
  # quantitative check for number of PCs to include
  pct = so@reductions[[reduction]]@stdev / sum(so@reductions[[reduction]]@stdev) * 100
  cum = cumsum(pct)
  # (A) first PC where cumulative variance exceeds 90% and the PC itself explains < 5%
  co1 = which(cum > 90 & pct < 5)[1]
  # (B) the PC just after the last step of more than 0.1% between consecutive PCs
  co2 = sort(which((pct[1:(length(pct) - 1)] - pct[2:length(pct)]) > .1), decreasing = T)[1] + 1
  # use the smaller of the two estimates
  pcs = min(co1, co2)
  return(pcs)
}
# Apply function to our data
pcs = optimal_pcs(geo_so, 'unintegrated.sct.pca')
pcs
[1] 16
# Based on the heatmap, we'll modify to 13 PCs
pcs = 13
# =========================================================================
Again, this number is likely a starting point and may need to be revised depending on the outcome of the downstream steps.
This function is derived from the function above; it’s useful for understanding the overall variance of the PCs as well as visualizing the two kinds of cutoffs. It returns the same recommended number of PCs as the original function, but also generates a plot of the cumulative percent variance with the two cutoffs highlighted and prints a summary of each cutoff.
# Define a function to estimate optimal PCs for clustering ----------------
optimal_pcs_advanced = function(so, reduction, print_plot=TRUE, verbose=TRUE) {
  # quantitative check for number of PCs to include
  threshold_var_cum_min = 90
  threshold_var_pct_max = 5
  threshold_step_min = 0.1
  pct = so@reductions[[reduction]]@stdev / sum(so@reductions[[reduction]]@stdev) * 100
  cum = cumsum(pct)
  co1 = which(cum > threshold_var_cum_min & pct < threshold_var_pct_max)[1]
  co2 = sort(which((pct[1:(length(pct) - 1)] - pct[2:length(pct)]) > threshold_step_min), decreasing = T)[1] + 1
  pcs = min(co1, co2)
  plot_df <- data.frame(pc = 1:length(pct),
                        pct_var = pct,
                        cum_var = cum)
  plot_df$labels = ''
  co1_label = paste0('PC ', co1)
  co2_label = paste0('PC ', co2)
  if (co1 == co2) {
    plot_df$labels[plot_df$pc == co1] = paste0(co1_label, '\n', co2_label)
  } else {
    plot_df$labels[plot_df$pc == co1] = co1_label
    plot_df$labels[plot_df$pc == co2] = co2_label
  }
  p = ggplot(plot_df, aes(x=pc, y=cum_var, label=labels)) +
    geom_point(color="grey", alpha=1) +
    geom_text(hjust = 0, vjust=1, nudge_x=2, nudge_y=-5) +
    geom_step(data=filter(plot_df, pc<=co2), color="blue", alpha=0.6, direction='vh') +
    geom_step(data=filter(plot_df, pc>=co2), color="grey", alpha=0.6, direction='hv') +
    geom_hline(yintercept = 90, color = "red", alpha=0.6, linetype = 'dashed') +
    geom_point(data=filter(plot_df, pc==co1), aes(x=pc,y=cum_var), color='red') +
    geom_point(data=filter(plot_df, pc==co2), aes(x=pc,y=cum_var), color='blue') +
    scale_y_continuous(breaks = c(0,25,50,75,100, threshold_var_cum_min)) +
    theme_bw() +
    labs(title = 'Optimal num of PCs',
         x = 'Principal Component',
         y = 'Cumulative percent variance')
  if (print_plot) {
    print(p)
  }
  if (verbose) {
    results = paste(
      sprintf('Reduction %s: %s PCs (total var = %s)',
              reduction,
              nrow(plot_df),
              plot_df[plot_df$pc==nrow(plot_df), 'cum_var']),
      sprintf("\t%s (total var = %s) : Smallest PC that explains at least %s%% total variance",
              co1_label,
              round(plot_df[plot_df$pc==co1, 'cum_var'], 2),
              threshold_var_cum_min),
      sprintf("\t%s (total var = %s) : Largest PC with incremental variance step at least %s%%",
              co2_label,
              round(plot_df[plot_df$pc==co2, 'cum_var'],2),
              threshold_step_min),
      sprintf('\tRecommended num of PCs: %s', pcs),
      sep='\n')
    message(results)
  }
  return(list(recommended_pcs=pcs, plot=p, co1=co1, co2=co2, df=plot_df))
}
# =========================================================================
# Apply function to our data ----------------------------------------------
optimal_pcs = optimal_pcs_advanced(geo_so, 'unintegrated.sct.pca')
Reduction unintegrated.sct.pca: 50 PCs (total var = 100)
PC 42 (total var = 90.98) : Smallest PC that explains at least 90% total variance
PC 16 (total var = 53.21) : Largest PC with incremental variance step at least 0.1%
Recommended num of PCs: 16
# Save plot of PCs --------------------------------------------------------
ggsave(filename = 'results/figures/optimal_pcs.png',
       plot=optimal_pcs$plot,
       width = 8, height = 8, units = 'in')
# Based on the heatmaps above, we'll cluster on 13 PCs
pcs = 13
# =========================================================================
Again, this number is likely a starting point and may need to be revised depending on the outcome of the downstream steps.
While outside the scope of this workshop, there are community efforts to develop more sophisticated methods for selecting an appropriate number of PCs.
In addition to selecting a reduced number of PCs to represent our data, we can also visualize the cells in PCA space, similar to how we often look at samples for bulk RNA-seq, using the DimPlot() function, with each cell plotted along the selected PCs and colored by a selected meta-data column:
?DimPlot # default dims = c(1,2)
# Visualize PCA (PC1 and PC2, unintegrated) -------------------------------
# first label by sample
DimPlot(geo_so, reduction = 'unintegrated.sct.pca', group.by = 'orig.ident')
# then label by day
DimPlot(geo_so, reduction = 'unintegrated.sct.pca', group.by = 'day')
# save
ggsave(filename = 'results/figures/qc_pca_plot_unintegrated_sct_day.png', width = 7, height = 6, units = 'in')
# =========================================================================
In the first plot, with each cell labeled by the original sample name, it looks like there might be some batches, but it’s hard to distinguish. From the second plot, labeled by day, we can see that even after normalization there seems to be variation that corresponds more to a batch effect related to the day status than to interesting biological variation. This suggests that before clustering our data using our selected number of PCs, integration should be performed to mitigate these differences and better align our data.
Before proceeding with clustering, we will integrate our data across
all samples. This is a common step for scRNA-seq analyses that include
multiple samples and should mitigate the differences between samples we
saw in our PCA plot. Currently, each sample is stored as a layer in our geo_so object. Since earlier in the workshop we used SCTransform() to normalize our data and select variable genes, and we have since generated a PCA reduction, we can use the IntegrateLayers() function to integrate our samples.
The details of how integration works depend on the method, but for the RPCA approach that we’ll be using here, the process is similar to the canonical correlation analysis (CCA) approach illustrated above (source). However, RPCA is more efficient (faster) to run and better preserves distinct cell identities between samples (source). As described in the corresponding Seurat tutorial, each dataset is projected into the others’ PCA space and the anchors are constrained by the same mutual neighborhood requirement.
New to Seurat v5 are improvements that make selecting different
integration methods much easier. The results of each alternative
integration method are stored within the same Seurat
object, which makes comparing downstream effects much easier. So, if we
wanted to run the RPCAIntegration
method, we would run
(but we won’t here):
### DO NOT RUN ###
geo_so = IntegrateLayers(
  object = geo_so,
  method = RPCAIntegration,
  orig.reduction = 'unintegrated.sct.pca',
  normalization.method = 'SCT',
  new.reduction = 'integrated.sct.rpca')
### DO NOT RUN ###
# =========================================================================
Note that the integration is run on the layers of our geo_so Seurat object, which from our prior steps each correspond to a single sample. For additional information, refer to the Seurat v5 vignette on integrative analysis (link), which provides examples of each integration method. Note that the vignette code uses the NormalizeData(), ScaleData(), FindVariableFeatures() pipeline, so its IntegrateLayers() call does not include normalization.method = 'SCT', as ours must.
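For comparison, here is a sketch of what that log-normalization based call looks like, following the vignette’s conventions; the object name obj and the reduction names are placeholders from that vignette and are not part of our workflow.
### DO NOT RUN - log-normalization pipeline following the Seurat v5 vignette ###
obj = NormalizeData(obj)
obj = FindVariableFeatures(obj)
obj = ScaleData(obj)
obj = RunPCA(obj)
obj = IntegrateLayers(
  object = obj,
  method = RPCAIntegration,
  orig.reduction = 'pca',
  new.reduction = 'integrated.rpca')
### DO NOT RUN ###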
Note we have specified the unintegrated reduction
unintegrated.sct.pca
, which is what
IntegrateLayers()
operates on, along with the
SCT
assay. Let’s take a look to see what’s different about
the Seurat
object:
Load integrated data
Because the IntegrateLayers() function takes a while to run, we will simply load the RPCA integrated geo_so object from a file we have previously generated:
geo_so = readRDS('/home/workshop/rcavalca/ISC_R/results/rdata/geo_so_sct_integrated.rds')
# Check our updated object that we've read in from file -------------------
# Observe that we now have a new reduction, `integrated.sct.rpca`
geo_so
An object of class Seurat
46957 features across 31559 samples within 2 assays
Active assay: SCT (20468 features, 3000 variable features)
3 layers present: counts, data, scale.data
1 other assay present: RNA
2 dimensional reductions calculated: unintegrated.sct.pca, integrated.sct.rpca
Viewed in our running schematic:
We can also confirm that our integration method has helped to correct
the Day
effects we saw in the initial PCA plots:
# Visualize PCA (PC1 and PC2, integrated) ---------------------------------
DimPlot(geo_so, reduction = 'integrated.sct.rpca', group.by = 'day')
ggsave(filename = 'results/figures/qc_pca_plot_integrated_sct_day.png', width = 7, height = 6, units = 'in')
# =========================================================================
Before we move on, let’s save our updated Seurat object to file:
# Save updated seurat object to file -------------------------------------
saveRDS(geo_so, file = 'results/rdata/geo_so_sct_integrated.rds')
# =========================================================================
Other integration methods
After normalizing the data with SCTransform() and performing the dimension reduction with RunPCA(), we could alternatively use the CCA integration method with:
### DO NOT RUN ###
geo_so = IntegrateLayers(
  object = geo_so,
  method = CCAIntegration,
  orig.reduction = 'unintegrated.sct.pca',
  normalization.method = 'SCT',
  new.reduction = 'integrated.sct.cca')
### DO NOT RUN ###
In this section, we generated a PCA reduction for our normalized data, examined the PCs to choose how many to use for clustering, and integrated the sample layers with IntegrateLayers().
Next steps: Clustering and projection
These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Previous lesson | Top of this lesson | Next lesson