1 Differential Expression Workflow

Prior to testing for differential expression between our comparisons of interest, we’ll first generate plots that will assess how well our samples match up with our expectations (based on their treatment groups) and what we might expect to see from our differential expression comparisons.

Step Task
1 Experimental Design
2 Biological Samples / Library Preparation
3 Sequence Reads
4 Assess Quality of Raw Reads
5 Splice-aware Mapping to Genome
6 Count Reads Associated with Genes
7 Organize project files locally
8 Initialize DESeq2 and fit DESeq2 model
9 Assess expression variance within treatment groups
10 Specify pairwise comparisons and test for differential expression
11 Generate summary figures for comparisons
12 Annotate differential expression result tables

1 In this module, we will:

  • Generate common QC visualizations
  • Understand how to interpret QC visualizations
  • Understand when to revise the model used in the DESeq2 initialization
  • Understand the pitfalls of post-hoc analysis
  • Describe the causes and implications of batch effect or other QC issues in an RNA-Seq experiment

2 Sample Visualizatons for Quality Control

Yesterday we discussed aspects of quality control assessment at the sequencing data level. Today we will outline sample-level and gene level quality control to determine what we should expect from our downstream differential expression comparisons.

To do this, we will assess the similarity of our samples by using principal component analysis (PCA). This will allow us to determine how well patterns in the data fits our expectations from the experiments design and possible sources of variation in the dataset.

2.0.1 Plot Setup

We’ve already loaded the libraries we need for this module. We’ll follow best practices and create new directories to organize our output figures as well as a variable to store a descriptor of the dataset we are looking at.

# system("mkdir ./Figures") # create output folder if not already generated
# system("mkdir ./Figures/BySamples") # create output folder if not already generated
# OR 

dir.create("./Figures")
dir.create("./Figures/BySamples/")
plotPath = "./Figures/BySamples/"
Comparison <- "ko.Tx" # descriptor for the dataset we are looking at

2.1 Principle Component Analysis (PCA) Plots

A common and very useful plot for evaluating how well our samples cluster by treatment groups are Principle Component Analysis (PCA) plots. PCA is used to emphasize variation and bring out patterns in large datasets by using dimensionality redution.

This image from a helpful step by step explaination of PCA helps to illustrate the principle component projections for two genes measured in approximately 60 mouse samples. Generally, this process is repeated and after each gene’s contribution to a principle component or weight is determined, the expression and weight are summed across genes for each sample to calculate a value for each principle component.

Note: A more detailed overview of the PCA procedure is outlined in a Harvard Chan Bioinformatic Core training module and is based a more thorough description presented in a StatQuest’s video. Additionally, this TowardsDataScience blog post goes through the math behind PCAs.

Below, we will plot the rlog normalized data for our samples projected onto a 2D plane and spanned by their first two principle components to visualize the overall effect of experimental covariates and determine if there is evidence of batch effects.

We’ll generate the PCA projections for the top 500 using the plotPCA function from DESeq2, specifying Gtype.Tx as the condition of interest.

p.all <- plotPCA(rld, intgroup = c('Gtype.Tx'), ntop = 500)
p.all

Next, we’ll save this plot to file.

pdf(file = paste0(plotPath, 'PCAplot_rlog_', Comparison, '.pdf'), onefile = TRUE)
p.all
dev.off()
## quartz_off_screen 
##                 2

2.1.0.1 Optional - code to customize PCA plots

We can generate a customized plot using the ggplot2 package, including shape assignments for our treatment groups and color assignments for the individual samples.

pdf(file = paste0(plotPath, 'PCAplot_Fancy_rlog_', Comparison, '.pdf'), onefile = TRUE)

#PCA plot for Rlog-Normalized counts for all samples
CombinatoricGroup <- factor(MetaInfo$Gtype.Tx)
SampleName <- factor(row.names(MetaInfo))
    
#Generate the PCA projections using the `plotPCA` function from DESeq2. 
p.all <- plotPCA(rld, intgroup = c('Gtype.Tx'), ntop = 500)

gp <- ggplot(p.all$data, aes(x = PC1, y = PC2, color = SampleName, shape = CombinatoricGroup)) +     xlab(p.all$labels[2]) + ylab(p.all$labels[1]) + scale_shape_manual(values=1:nlevels(CombinatoricGroup), name = 'Combinatoric Group') + geom_point(size=2) + ggtitle(label = as.character('All samples Rlog-Normalized')) + theme(plot.title = element_text(hjust = 0.5)) + guides(colour=guide_legend(nrow=12, title = 'Sample'), legend.key = element_rect(size = 1), legend.key.size = unit(0, 'cm')) + theme_classic(base_size = 10) + theme(legend.margin=margin(t = 0, unit='mm'))
plot(gp)
dev.off()
## quartz_off_screen 
##                 2

2.1.1 Interpreting PCA plots

In the plot above, we see that principle component 1 (PC1) explains ~80% of the variance in our data while principle component 2 (PC2) explains ~12% of the variance. We also see that samples in both control groups are fairly tightly grouped, while samples within each Tx group do not cluster as tightly.

This helpful overview of PCA basics walks through both the generation and interpretatation of similar plots.

We generally expect most of the variance to be explained by the first two or three principle components. A screeplot is a way to visualize the variance explained by each principle component.

In these scree plot examples from BioTuring, the plot on the left fits what we would expect for a dataset with high signal from the experimental treatment, where the majority of the variance is explained by the first few principle components. The plot on the right illustrates a scenario where the variance is distributed across many components, which could be due to low signal from the experimental treatment, complex experimental design, or confounding factors. image:

2.1.1.1 Optional - Generating a ScreePlot

Scree plots are a way to see how much variance is explained by all possible principle components, giving us a way to

To generate a scree plot, the PCA results need to be used independently of plotting, such as described by this statquest post and replicated below.

# generate PCA loadings
pca <- prcomp(t(assay(rld)), scale=TRUE)

## get the scree information
pca.var <- pca$sdev^2
scree <- pca.var/sum(pca.var)
barplot((scree[1:10]*100), main="Scree Plot", xlab="Principal Component", ylab="Percent Variation") 

We can see that the majority (~65%) of the variance across our samples is explained by the first three principle components, giving us some additional confidence regarding the quality of our data.

2.1.1.2 Optional - Additional PCA plot for raw data

It can sometimes be useful to also generate a PCA plot for the raw data as well as the normalized data.

pdf(file = paste0(plotPath, 'PCAplot_raw_', Comparison, '.pdf'), onefile = TRUE)

#PCA for Raw counts for all samples
RC <- SummarizedExperiment(log2(counts(dds, normalized = FALSE)), colData=colData(dds))

p.RC <- plotPCA(DESeqTransform(RC), intgroup = 'Gtype.Tx')
  
gpRC <- ggplot(p.RC$data, aes(x = PC1, y = PC2, color = SampleName, shape = CombinatoricGroup)) +     xlab(p.RC$labels[2]) + ylab(p.RC$labels[1]) + scale_shape_manual(values=1:nlevels(CombinatoricGroup), name = 'Combinatoric Group') + geom_point(size=2) + ggtitle(label = as.character('All samples Raw Data')) + theme(plot.title = element_text(hjust = 0.5)) + guides(colour=guide_legend(nrow=12, title = 'Sample'), legend.key = element_rect(size = 1), legend.key.size = unit(0, 'cm')) + theme_classic(base_size = 10) + theme(legend.margin=margin(t = 0, unit='mm'))
plot(gpRC)

dev.off()
## quartz_off_screen 
##                 2
# embedd example of plot (rlog only)
plot(gpRC)

We see that there is less variance explained by PC1 and that the samples from the same group are not as well clustered for the raw data. Since this is prior to normalization, these differences are likely due to technical considerations like sequencing depth differences that are accounted for by the rlog normalization.

2.1.2 Evaulating batch effects or confounders

PCA plots are useful for evaulating the impact of factors other than the experimental treatment or group.

At times, batch effects can be quite obvious, such as this example from the DESeq2 vignette, where there is clear separation within the treatment groups.

If we color only by sequencing run type (paired-end vs. single-end), we see that PC2 (29% of variance) is primarily explained by this technical covariate.

However, the samples are clearly seperated by experimental condition on PC1, and since PC1 explains more variance than PC2 and since we have non-confounded batches, we could incorporate the technical covariate into our model design, such as outlined in the DESeq2 vignette.


Click for complex design discussion In experiments with more complex designs, such as when there are interesecting/multiple treatment conditions, it can be less clear what covariants are influencing expression, such as illustrated from this documenation for a microarray analysis tool. From the PCA labeled by experimental treatment, we see that samples from the treatment group do not cluster together and that there is high variance across all treatment groups. However, when the plot is color coded by the technical batches of probe labeling, we see that the patterns in the data are better explained by batch than the experimental conditions.

2.1.2.1 When to remove samples or update the design formula

Generally, the decision to remove samples should not be taken lightly, especially if the consideration is based on the expression level patterns we are visualizing in this section, as doing so without sufficient justification could potentially bias your results. Other training materials have a step-by-step example of evaluating batch effects using PCAs, that we would recommend reviewing prior to removing any samples from your comparisons.

If your PCA reveals patterns that were not accounted for in the initial model fit, the DESeq2 vignette provides an overview of how to add additional covariates to a model design before refitting the DESeq2 model for a dataset.


3 Summary

In this section, we:

  • Learned to generate PCA plots
  • Discussed variance within treatment groups
  • Discussed technical artifacts, including batches

Before moving on with our group comparisons, we’ll take a short (10 minute) break.


4 Sources Used


These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.