Introduction

After filtering and normalization, our next goal is to cluster the cells according to their expression profiles. However, even after filtering, we expect to still have several thousand cells assayed, with approximately 21,000 genes measured per cell. This means that we are working with very “high-dimensional” data, which presents some challenges.

Starting with the filtered, normalized data (gene x cell) for all samples, principal component analysis (PCA) can be used to reduce the dimensionality while still representing the overall gene expression patterns for each cell.


Objectives

  • Understand why PCA is used prior to integration and clustering for single-cell data
  • Use the RunPCA() function to generate principal components for our data
  • Choose an appropriate number of principal components to represent our data
  • Use IntegrateLayers() to integrate our sample data prior to clustering


Similar to the previous sections, while you may see only a single approach in a paper, and we will show one example of executing these steps here, testing multiple methods or parameters may be necessary for your own data.



What is PCA?

The high dimensionality of single-cell data presents two major challenges for analysis:

  • The scale makes analysis steps costly and slow
  • The sparseness means there is likely uninteresting variation that should be excluded

In practical terms, principal component analysis (PCA) creates a simpler, more concise version of the data while preserving the same “story”. For example, we can think of watching football - attending a game in person is exciting but can take hours, especially considering the time spent traveling and waiting. Conversely, watching a highlight reel of that same game might not give you the full experience, but by seeing the key plays you can get a good understanding of what happened in the game in a fraction of the time.

Similar to a highlight reel, running PCA allows us to select PCs that represent correlated gene expression, reducing the size and complexity of the data (and therefore the computation time) while still capturing the overall expression patterns.

PCA example

To understand how PCA works for single-cell data, we can consider a smaller dataset measuring the expression of four genes in just two cells, and plot the expression of those four genes with data from one cell on the x-axis and the second cell on the y-axis.

A simple PCA example (from HBC)

If we wanted to represent the most variation across the data, we would draw a diagonal line between gene B and gene C - this would represent the first principal component (PC). However, this line doesn’t capture all the variance in this data, as the genes also vary above and below the line. So we could draw another line (at a right angle to the first) representing the second most variation in the data - the second PC.

PCA gene loadings (from HBC)

We won’t illustrate the process of generating all possible PCs, but running PCA on our data will result in a score for each cell for each PC, and each gene will have a weight or loading for a given PC. By definition, the PCs capture the greatest factors of heterogeneity in the data set, in decreasing order (source).
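To make this concrete, here is a minimal sketch of the same idea using base R’s prcomp() on made-up values for the four-gene, two-cell example above. Note that in this toy layout the genes are the observations (to match the plot), whereas in a real single-cell PCA the cells are:

# Toy PCA matching the four-gene, two-cell example (values are made up)
expr = matrix(c(10, 11,   # gene A: high in both cells
                 2,  9,   # gene B: high only in cell 2
                 9,  2,   # gene C: high only in cell 1
                 5,  6),  # gene D: moderate in both cells
              ncol = 2, byrow = TRUE,
              dimnames = list(paste0('gene_', LETTERS[1:4]), c('cell_1', 'cell_2')))

pca = prcomp(expr)  # centers the data by default
pca$x               # scores: where each gene falls along PC1 and PC2
pca$rotation        # loadings: how much each cell contributes to each PC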

Why PCA is useful for single-cell data

In contrast to bulk RNA-seq, where the majority of the variance is usually explained by the first and second PCs, for single-cell data we expect many more PCs to contribute to the overall variance.

Sparsity in single-cell data

In addition to being “high-dimensional”, we expect the data to be both “sparse” and “noisy”. Sparse in the sense that many genes will either not be expressed or not measured in many of the cells and will have zero values. Noisy due to biological variability and to technical variability arising from the practical limitations of both capture efficiency and sequencing depth (source).
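As a quick way to see this sparsity in our own object, we can compute the fraction of zero entries in a counts layer. This is a minimal sketch: LayerData() is the Seurat v5 accessor, and the layer name here assumes a single layer called 'counts' - in an object split by sample, the layers may instead be named like 'counts.<sample>':

# Fraction of zero entries in a counts layer (layer name is an assumption)
counts = LayerData(geo_so, assay = 'RNA', layer = 'counts')
1 - (length(counts@x) / prod(dim(counts)))  # counts is a sparse dgCMatrix; @x holds the nonzero values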

Note on zero “inflation” of single-cell data

Single-cell data is sometimes described as “zero inflated”; however, work by Svensson has challenged that characterization and argued that the higher number of zeros observed in scRNA-seq compared to bulk RNA-seq is more likely due to biological variance and lower sequencing saturation than technical artifacts. Work by Choi et al. (2020) supports the idea that zeros in single-cell data are due to biology, while Jiang et al. (2022) delves more into the “controversy” of zero inflation, including common approaches for handling zeros and their downstream impacts.

For sparse, high-dimensional, biological data, we expect many genes to have correlated expression, as they are impacted by the same biological processes, while other genes have either low (noisier) expression or similar expression across cells and are therefore less important for identifying cell types/subtypes. So we can use the top several PCs to approximate the full data set for the purposes of downstream analysis (source).

More detail on dimensionality reduction

The Ouyang Lab has a “gentle introduction” section of their materials that goes into greater detail on dimensionality reduction, including how similar strategies are used in deep learning models. Additionally, the OSCA book has a chapter on Dimensionality reduction that provides useful context.



More context on using PCA for single-cell data

To read more on PCA, please refer to the HBC - Theory of PCA content, from which this section is adapted, as well as the original source material for that content, specifically Josh Starmer’s StatQuest video.

For additional detail, the OSCA chapter on Principal components analysis includes a more descriptive overview.


Run PCA on our dataset

Since PCA is sensitive to scale, we will use the SCT normalized assay (reference). We will name the reduction in an informative way to keep it clear for us in the future. In this case, our data has not yet been integrated, but it has been SCT normalized, so we will name the reduction unintegrated.sct.pca. Note that SCTransform() returned a set of highly variable genes, and the RunPCA() function will use this subset to determine the PCs and the genes associated with those PCs.

# Day 2 - PCA and Integration
# =========================================================================

# Build a PCA and add it to the Seurat object  ----------------------------
geo_so = RunPCA(geo_so, reduction.name = 'unintegrated.sct.pca')
geo_so
An object of class Seurat 
46957 features across 31559 samples within 2 assays 
Active assay: SCT (20468 features, 3000 variable features)
 3 layers present: counts, data, scale.data
 1 other assay present: RNA
 1 dimensional reduction calculated: unintegrated.sct.pca

After running the command in the console, we should see that the geo_so Seurat object has a new reduction. Viewed in our running schematic:

Image: Schematic after RunPCA().

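If you’d like to inspect the underlying matrices directly, standard Seurat accessors can be used (an optional sketch; the dimensions referenced in the comments come from the object summary above):

# Peek at the PCA results stored in the new reduction ---------------------
length(VariableFeatures(geo_so))  # the 3000 variable genes used by RunPCA()
Embeddings(geo_so, reduction = 'unintegrated.sct.pca')[1:5, 1:5]  # cell x PC scores
Loadings(geo_so, reduction = 'unintegrated.sct.pca')[1:5, 1:2]    # gene x PC loadings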
We can then visualize the first several PCs using the DimHeatmap() function, which orders both cells and features (genes) according to their PCA scores and allows us to see some general patterns in the data. Here we will specify that the first 18 PCs be included in the figure and that 500 cells be randomly subsampled for the plot. Note - you may need to increase your plot window size to resolve errors regarding figure margins.

# Build heatmaps of PCAs  -------------------------------------------------
# Plot cell by gene heatmaps for first several PCs
DimHeatmap(geo_so, dims=1, cells=500, balanced=TRUE, reduction = 'unintegrated.sct.pca') # look at first PC alone first
DimHeatmap(geo_so, dims=1:18, cells=500, balanced=TRUE, reduction = 'unintegrated.sct.pca') # first 18 PCs

# note - need to use png() to write to file because this isn't a ggplot
png(filename = 'results/figures/qc_pca_heatmap.png', width = 12, height = 40, units = 'in', res = 300)
  DimHeatmap(geo_so, dims=1:18, cells=500, balanced=TRUE, reduction = 'unintegrated.sct.pca')
dev.off()

#  =========================================================================
DimHeatmap for PC1 alone

DimHeatmap for PC1-18

As we scroll down this plot, we see less and less structure, indicating not only that less variation is represented by the later PCs overall, but also that those PCs are less likely to correspond to variation of interest, e.g. variation corresponding to cell types/subtypes.

What genes are contributing the most to each PC?

We can also visualize the loadings of the genes contributing to each principal component using Seurat-provided functions (source).

We can highlight genes loaded onto dimensions of interest using VizDimLoadings():

# Visualize gene loadings for PC1 and PC2 ---------------------------------
VizDimLoadings(geo_so, dims = 1:2, reduction = 'unintegrated.sct.pca')
ggsave(filename = 'results/figures/qc_pca_loadings.png', width = 12, height = 6, units = 'in')

#  =========================================================================
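For a text summary instead of a plot, the print method for the reduction lists the top positively and negatively loaded genes per PC (as in Seurat’s own tutorials):

# Print the top loaded genes for the first two PCs
print(geo_so[['unintegrated.sct.pca']], dims = 1:2, nfeatures = 5)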



How does Seurat use PCA scores?

Per the Ho Lab’s materials - “To overcome the extensive technical noise in any single feature for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a ‘metafeature’ that combines information across a correlated feature set. The top principal components therefore represent a robust compression of the dataset.”

Visualizing relative contributions of each PC

So how do we determine how many PCs to use before classifying cells into clusters based on their expression? One way to evaluate how many PCs to use for clustering is by looking at an elbow plot, which shows the standard deviation of successive PCs (a proxy for how much variance each explains). We’ll use the ElbowPlot() function to do this, specifying that the first 50 PCs be plotted.

# Visualize how many PCs to include using an elbow plot -------------------
ElbowPlot(geo_so, ndims = 50, reduction = 'unintegrated.sct.pca')
ggsave(filename = 'results/figures/qc_sct_elbow_plot.png', width = 8, height = 8, units = 'in')

#  =========================================================================

In this plot, we could choose a number along the x-axis where there is a sharp change in the variance from one PC to the next - that is, an elbow. While that’s often the recommendation in tutorials, the location of the “elbow” is not always obvious, and this plot is no different.
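The values behind the elbow plot are stored in the reduction itself, so we can look at the numbers directly (a small sketch; these per-PC standard deviations are also what the optimal_pcs() function below works from):

# The per-PC standard deviations that ElbowPlot() draws
stdev = geo_so@reductions[['unintegrated.sct.pca']]@stdev
round(head(stdev, 10), 2)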

Choosing the number of significant PCs for dimensionality reduction

An important consideration for determining how many PCs to select for your single-cell analysis is the “resolution” of your biological question. Does answering your biological question depend on identifying rarer cell types or specific subtypes? Or are broader cell types more relevant, meaning fewer PCs are needed?

For this dataset, we are expecting a diversity of cell types and cell populations that mediate wound healing, but also an aberrant transition to bone, which might include rarer cells in the population. So for this dataset, we should err on the side of starting with more PCs rather than too few. Again, in a full analysis workflow, our selection at this step might be more of a starting point for further iterations than a final decision.

An algorithmic approach

Instead of choosing based on the elbow plot by sight alone, we can try to quantify our choice algorithmically. Here we create a function that returns a recommended number of PCs based on two possible metrics: (A) cumulative variation or (B) the last PC above a minimum step size. We can apply that function to our data to get a good starting point for the number of PCs to include.

# Estimate optimal PCs for clustering with a function -------------------------
optimal_pcs = function(so, reduction) {
    # quantitative check for number of PCs to include
    stdev = so@reductions[[reduction]]@stdev
    pct = stdev / sum(stdev) * 100   # each PC's share of the total, as a percent
    cum = cumsum(pct)                # cumulative percent
    # (A) first PC where the cumulative share exceeds 90% and the PC itself is under 5%
    co1 = which(cum > 90 & pct < 5)[1]
    # (B) one past the last PC where the drop to the next PC exceeds 0.1 percentage points
    co2 = sort(which((pct[1:(length(pct) - 1)] - pct[2:length(pct)]) > 0.1), decreasing = TRUE)[1] + 1
    pcs = min(co1, co2)
    
    return(pcs)
}

# Apply function to our data
pcs = optimal_pcs(geo_so, 'unintegrated.sct.pca')
pcs
[1] 16
# Based on the heatmap, we'll modify to 13 PCs
pcs = 13

# =========================================================================

Again, this number is likely a starting point and may need to be revised depending on the outcome of the downstream steps.

Optional: A more advanced function for picking PCs

This function is derived from the function above; it’s useful for understanding the overall variance of the PCs as well as visualizing the two kinds of cutoffs. It returns the same recommended PCs as the original function, but also:

  • prints more details
  • prints a plot of the PC variance with key PCs highlighted
  • returns a named list of detailed results
# Define a function to estimate optimal PCs for clustering ----------------
optimal_pcs_advanced = function(so, reduction, print_plot=TRUE, verbose=TRUE) {
  # quantitative check for number of PCs to include
  threshold_var_cum_min = 90
  threshold_var_pct_max = 5
  threshold_step_min = 0.1
  pct = so@reductions[[reduction]]@stdev / sum(so@reductions[[reduction]]@stdev) * 100
  cum = cumsum(pct)
  # (A) cumulative-variance cutoff
  co1 = which(cum > threshold_var_cum_min & pct < threshold_var_pct_max)[1]
  # (B) step-size cutoff: one past the last PC whose drop to the next exceeds the threshold
  co2 = sort(which((pct[1:(length(pct) - 1)] - pct[2:length(pct)]) > threshold_step_min), decreasing = TRUE)[1] + 1
  pcs = min(co1, co2) 
  
  plot_df <- data.frame(pc = 1:length(pct),
                        pct_var = pct, 
                        cum_var = cum)
  plot_df$labels = ''
  co1_label = paste0('PC ', co1)
  co2_label = paste0('PC ', co2)
  if (co1 == co2) {
    plot_df$labels[plot_df$pc == co1] = paste0(co1_label, '\n', co2_label)
  } else {
    plot_df$labels[plot_df$pc == co1] = co1_label
    plot_df$labels[plot_df$pc == co2] = co2_label
  }
  
  
  p = ggplot(plot_df, aes(x=pc, y=cum_var, label=labels)) +
    geom_point(color="grey", alpha=1) +
    geom_text(hjust = 0, vjust=1, nudge_x=2, nudge_y=-5) +
    geom_step(data=filter(plot_df, pc<=co2), color="blue", alpha=0.6, direction='vh') +
    geom_step(data=filter(plot_df, pc>=co2), color="grey", alpha=0.6, direction='hv') +
    geom_hline(yintercept = 90, color = "red", alpha=0.6, linetype = 'dashed') +
    geom_point(data=filter(plot_df, pc==co1), aes(x=pc,y=cum_var), color='red') + 
    geom_point(data=filter(plot_df, pc==co2), aes(x=pc,y=cum_var), color='blue') +
    scale_y_continuous(breaks = c(0,25,50,75,100, threshold_var_cum_min)) +
    theme_bw() +
    labs(title = 'Optimal num of PCs',
         x = 'Principal Component', 
         y = 'Cumulative percent variance')

  if (print_plot) {
          print(p)
  }
  
  if (verbose) {
     results = paste(
                sprintf('Reduction %s: %s PCs (total var = %s)', 
                     reduction,
                     nrow(plot_df),
                     plot_df[plot_df$pc==nrow(plot_df), 'cum_var']),
                sprintf("\t%s (total var = %s) : Smallest PC that explains at least %s%% total variance", 
                     co1_label,
                     round(plot_df[plot_df$pc==co1, 'cum_var'], 2),
                     threshold_var_cum_min),
                sprintf("\t%s (total var = %s) : Largest PC with incremental variance step at least %s%%", 
                      co2_label,
                      round(plot_df[plot_df$pc==co2, 'cum_var'],2),
                      threshold_step_min),
                sprintf('\tRecommended num of PCs: %s', pcs),
                sep='\n')
      message(results)
  }
  
  return(list(recommended_pcs=pcs, plot=p, co1=co1, co2=co2, df=plot_df))
}

# =========================================================================
# Apply function to our data ----------------------------------------------
optimal_pcs = optimal_pcs_advanced(geo_so, 'unintegrated.sct.pca')
Reduction unintegrated.sct.pca: 50 PCs (total var = 100)
    PC 42 (total var = 90.98) : Smallest PC that explains at least 90% total variance
    PC 16 (total var = 53.21) : Largest PC with incremental variance step at least 0.1%
    Recommended num of PCs: 16

# Save plot of PCs --------------------------------------------------------
ggsave(filename = 'results/figures/optimal_pcs.png',
       plot=optimal_pcs$plot,
       width = 8, height = 8, units = 'in')

# Based on the heatmaps above, we'll cluster on 13 PCs
pcs = 13

# =========================================================================

Again, this number is likely a starting point and may need to be revised depending on the outcome of the downstream steps.

While outside the scope of this workshop, there are community efforts to develop more sophisticated methods to select an appropriate number of PCs.
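One resampling-based example built into Seurat is the JackStraw procedure, sketched below for reference only - note that JackStraw is not supported for SCTransform-normalized data, so it would not run on our SCT-based reduction:

### DO NOT RUN ###
# JackStraw permutes a subset of the data to assess PC significance
geo_so = JackStraw(geo_so, reduction = 'unintegrated.sct.pca', num.replicate = 100)
geo_so = ScoreJackStraw(geo_so, reduction = 'unintegrated.sct.pca', dims = 1:20)
JackStrawPlot(geo_so, reduction = 'unintegrated.sct.pca', dims = 1:20)
### DO NOT RUN ###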




Visualizing our PC results

In addition to selecting a reduced number of PCs to represent our data, we can also visualize the cells in PCA space, similar to how we often look at samples for bulk RNA-seq, using the DimPlot() function, with each cell plotted along the selected PCs and colored by a selected meta-data column:

?DimPlot # default dims = c(1,2)
# Visualize PCA (PC1 and PC2, unintegrated) -------------------------------
# first label by sample
DimPlot(geo_so, reduction = 'unintegrated.sct.pca', group.by = 'orig.ident') 
 # then label by day
DimPlot(geo_so, reduction = 'unintegrated.sct.pca', group.by = 'day')
 # save 
ggsave(filename = 'results/figures/qc_pca_plot_unintegrated_sct_day.png', width = 7, height = 6, units = 'in')

#  =========================================================================

In the first plot, with each cell labeled by the original sample name, it looks like there might be some batch structure, but it’s hard to distinguish. From the second plot, by day, we can see that even after normalization there seems to be variation that corresponds more to a batch effect related to the day status than to interesting biological variation. This suggests that before clustering our data using our selected number of PCs, integration should be performed to mitigate these differences and better align our data.

Integrate Layers

Before proceeding with clustering, we will integrate our data across all samples. This is a common step for scRNA-seq analyses that include multiple samples and should mitigate the differences between samples we saw in our PCA plot. Currently, each sample is stored as a layer in our geo_so object. Since, earlier in the workshop, we used SCTransform() to normalize our data and select variable genes, and we have now generated a PCA reduction, we can use the IntegrateLayers() function to perform the integration.
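To confirm what IntegrateLayers() will operate on, we can list the per-sample layers in the RNA assay (a quick check; the exact layer names depend on how the object was split earlier in the workshop):

# List the per-sample layers in the RNA assay
Layers(geo_so[['RNA']])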

CCA integration illustration, modified from Stuart and Butler, 2018 - https://doi.org/10.1101/460147

The details of how integration works depend on the method, but for the RPCA approach that we’ll be using here, the process is similar to the canonical correlation analysis (CCA) approach illustrated above (source). However, RPCA is more efficient (faster) to run and better preserves distinct cell identities between samples (source). As described in the corresponding Seurat tutorial, each dataset is projected into the others’ PCA space and the anchors are constrained by the same mutual neighborhood requirement.

New to Seurat v5 are improvements that make selecting different integration methods much easier. The results of each alternative integration method are stored within the same Seurat object, which makes comparing their downstream effects straightforward. So, if we wanted to run the RPCAIntegration method, we would run (but we won’t here):

### DO NOT RUN ###
geo_so = IntegrateLayers(
  object = geo_so, 
  method = RPCAIntegration, 
  orig.reduction = 'unintegrated.sct.pca',
  normalization.method = 'SCT',
  new.reduction = 'integrated.sct.rpca')
### DO NOT RUN ###
#  =========================================================================

Note that the integration is run on the layers of our geo_so Seurat object, which, from our prior steps, each correspond to a single sample. For additional information, the Seurat v5 vignette on integrative analysis (link) provides examples of each integration method. Note that the vignette code uses the NormalizeData(), ScaleData(), FindVariableFeatures() pipeline, so their IntegrateLayers() call does not include normalization.method = 'SCT', as ours must.

Note that we have specified the unintegrated reduction unintegrated.sct.pca, which is what IntegrateLayers() operates on, along with the SCT assay. Let’s take a look to see what’s different about the Seurat object:

Load integrated data

Because the IntegrateLayers() function takes a while to run, we will simply load the RPCA integrated geo_so object from a file we have previously generated.

geo_so = readRDS('/home/workshop/rcavalca/ISC_R/results/rdata/geo_so_sct_integrated.rds')
# Check our updated object that we've read in from file ------------------- 
# Observe that we now have a new reduction, `integrated.sct.rpca`
geo_so 
An object of class Seurat 
46957 features across 31559 samples within 2 assays 
Active assay: SCT (20468 features, 3000 variable features)
 3 layers present: counts, data, scale.data
 1 other assay present: RNA
 2 dimensional reductions calculated: unintegrated.sct.pca, integrated.sct.rpca

Viewed in our running schematic:

Image: Schematic after IntegrateLayers().
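Since each integration result is stored as its own reduction, we can also list what is available programmatically (Reductions() is a standard Seurat accessor):

# List the reductions stored in the object
Reductions(geo_so)  # expect 'unintegrated.sct.pca' and 'integrated.sct.rpca'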

We can also confirm that our integration method has helped to correct the day effects we saw in the initial PCA plots:

# Visualize PCA (PC1 and PC2, integrated) ---------------------------------
DimPlot(geo_so, reduction = 'integrated.sct.rpca', group.by = 'day')
ggsave(filename = 'results/figures/qc_pca_plot_integrated_sct_day.png', width = 7, height = 6, units = 'in')

#  =========================================================================
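To see the before/after effect directly, a side-by-side comparison can help (a sketch assuming the patchwork package is loaded, which Seurat uses for combining plots):

# Compare unintegrated vs integrated reductions side by side
p1 = DimPlot(geo_so, reduction = 'unintegrated.sct.pca', group.by = 'day') + ggtitle('Unintegrated')
p2 = DimPlot(geo_so, reduction = 'integrated.sct.rpca', group.by = 'day') + ggtitle('Integrated (RPCA)')
p1 + p2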

Save our progress

Before we move on, let’s save our updated Seurat object to file:

# Save updated seurat object to file  -------------------------------------
saveRDS(geo_so, file = 'results/rdata/geo_so_sct_integrated.rds')

#  =========================================================================

Other integration methods

Having normalized the data with SCTransform() and performed the dimensionality reduction with RunPCA(), we could alternatively use the CCA integration method with:

### DO NOT RUN ###
geo_so = IntegrateLayers(
    object = geo_so, 
    method = CCAIntegration, 
    orig.reduction = 'unintegrated.sct.pca',
    normalization.method = 'SCT',
    new.reduction = 'integrated.sct.cca')
### DO NOT RUN ###

Summary

In this section, we:

  • Discussed how PCA is used in scRNA-seq analysis.
  • Demonstrated some visualizations after computing PCA.
  • Discussed how to decide how many PCs to use in downstream analysis.
  • Integrated the data.

Next steps: Clustering and projection


These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.



