Prior to testing for differential expression between our comparisons of interest, we’ll first generate plots that will assess how well our samples match up with our expectations (based on their treatment groups) and what we might expect to see from our differential expression comparisons.
Step | Task |
---|---|
1 | Experimental Design |
2 | Biological Samples / Library Preparation |
3 | Sequence Reads |
4 | Assess Quality of Raw Reads |
5 | Splice-aware Mapping to Genome |
6 | Count Reads Associated with Genes |
7 | Organize project files locally |
8 | Initialize DESeq2 and fit DESeq2 model |
9 | Assess expression variance within treatment groups |
10 | Specify pairwise comparisons and test for differential expression |
11 | Generate summary figures for comparisons |
12 | Annotate differential expression result tables |
1 In this module, we will:
- Generate common QC visualizations
- Understand how to interpret QC visualizations
- Understand when to revise the model used in the DESeq2 initialization
- Understand the pitfalls of post-hoc analysis
- Describe the causes and implications of batch effect or other QC issues in an RNA-Seq experiment
Yesterday we discussed aspects of quality control assessment at the sequencing data level. Today we will outline sample-level and gene level quality control to determine what we should expect from our downstream differential expression comparisons.
To do this, we will assess the similarity of our samples by using principal component analysis (PCA). This will allow us to determine how well patterns in the data fits our expectations from the experiments design and possible sources of variation in the dataset.
We’ve already loaded the libraries we need for this module. We’ll follow best practices and create new directories to organize our output figures as well as a variable to store a descriptor of the dataset we are looking at.
# system("mkdir ./Figures") # create output folder if not already generated
# system("mkdir ./Figures/BySamples") # create output folder if not already generated
# OR
dir.create("./Figures")
dir.create("./Figures/BySamples/")
plotPath = "./Figures/BySamples/"
Comparison <- "ko.Tx" # descriptor for the dataset we are looking at
A common and very useful plot for evaluating how well our samples cluster by treatment groups are Principle Component Analysis (PCA) plots. PCA is used to emphasize variation and bring out patterns in large datasets by using dimensionality redution.
This image from a helpful step by step explaination of PCA helps to illustrate the principle component projections for two genes measured in approximately 60 mouse samples. Generally, this process is repeated and after each gene’s contribution to a principle component or weight is determined, the expression and weight are summed across genes for each sample to calculate a value for each principle component.
Note: A more detailed overview of the PCA procedure is outlined in a Harvard Chan Bioinformatic Core training module and is based a more thorough description presented in a StatQuest’s video. Additionally, this TowardsDataScience blog post goes through the math behind PCAs.
Below, we will plot the rlog normalized data for our samples projected onto a 2D plane and spanned by their first two principle components to visualize the overall effect of experimental covariates and determine if there is evidence of batch effects.
We’ll generate the PCA projections for the top 500 using the plotPCA
function from DESeq2, specifying Gtype.Tx
as the condition of interest.
p.all <- plotPCA(rld, intgroup = c('Gtype.Tx'), ntop = 500)
p.all
Next, we’ll save this plot to file.
pdf(file = paste0(plotPath, 'PCAplot_rlog_', Comparison, '.pdf'), onefile = TRUE)
p.all
dev.off()
## quartz_off_screen
## 2
We can generate a customized plot using the ggplot2
package, including shape assignments for our treatment groups and color assignments for the individual samples.
pdf(file = paste0(plotPath, 'PCAplot_Fancy_rlog_', Comparison, '.pdf'), onefile = TRUE)
#PCA plot for Rlog-Normalized counts for all samples
CombinatoricGroup <- factor(MetaInfo$Gtype.Tx)
SampleName <- factor(row.names(MetaInfo))
#Generate the PCA projections using the `plotPCA` function from DESeq2.
p.all <- plotPCA(rld, intgroup = c('Gtype.Tx'), ntop = 500)
gp <- ggplot(p.all$data, aes(x = PC1, y = PC2, color = SampleName, shape = CombinatoricGroup)) + xlab(p.all$labels[2]) + ylab(p.all$labels[1]) + scale_shape_manual(values=1:nlevels(CombinatoricGroup), name = 'Combinatoric Group') + geom_point(size=2) + ggtitle(label = as.character('All samples Rlog-Normalized')) + theme(plot.title = element_text(hjust = 0.5)) + guides(colour=guide_legend(nrow=12, title = 'Sample'), legend.key = element_rect(size = 1), legend.key.size = unit(0, 'cm')) + theme_classic(base_size = 10) + theme(legend.margin=margin(t = 0, unit='mm'))
plot(gp)
dev.off()
## quartz_off_screen
## 2
In the plot above, we see that principle component 1 (PC1) explains ~80% of the variance in our data while principle component 2 (PC2) explains ~12% of the variance. We also see that samples in both control
groups are fairly tightly grouped, while samples within each Tx
group do not cluster as tightly.
This helpful overview of PCA basics walks through both the generation and interpretatation of similar plots.
We generally expect most of the variance to be explained by the first two or three principle components. A screeplot is a way to visualize the variance explained by each principle component.
In these scree plot examples from BioTuring, the plot on the left fits what we would expect for a dataset with high signal from the experimental treatment, where the majority of the variance is explained by the first few principle components. The plot on the right illustrates a scenario where the variance is distributed across many components, which could be due to low signal from the experimental treatment, complex experimental design, or confounding factors. image:
Scree plots are a way to see how much variance is explained by all possible principle components, giving us a way to
To generate a scree plot, the PCA results need to be used independently of plotting, such as described by this statquest post and replicated below.
# generate PCA loadings
pca <- prcomp(t(assay(rld)), scale=TRUE)
## get the scree information
pca.var <- pca$sdev^2
scree <- pca.var/sum(pca.var)
barplot((scree[1:10]*100), main="Scree Plot", xlab="Principal Component", ylab="Percent Variation")
We can see that the majority (~65%) of the variance across our samples is explained by the first three principle components, giving us some additional confidence regarding the quality of our data.
It can sometimes be useful to also generate a PCA plot for the raw data as well as the normalized data.
pdf(file = paste0(plotPath, 'PCAplot_raw_', Comparison, '.pdf'), onefile = TRUE)
#PCA for Raw counts for all samples
RC <- SummarizedExperiment(log2(counts(dds, normalized = FALSE)), colData=colData(dds))
p.RC <- plotPCA(DESeqTransform(RC), intgroup = 'Gtype.Tx')
gpRC <- ggplot(p.RC$data, aes(x = PC1, y = PC2, color = SampleName, shape = CombinatoricGroup)) + xlab(p.RC$labels[2]) + ylab(p.RC$labels[1]) + scale_shape_manual(values=1:nlevels(CombinatoricGroup), name = 'Combinatoric Group') + geom_point(size=2) + ggtitle(label = as.character('All samples Raw Data')) + theme(plot.title = element_text(hjust = 0.5)) + guides(colour=guide_legend(nrow=12, title = 'Sample'), legend.key = element_rect(size = 1), legend.key.size = unit(0, 'cm')) + theme_classic(base_size = 10) + theme(legend.margin=margin(t = 0, unit='mm'))
plot(gpRC)
dev.off()
## quartz_off_screen
## 2
# embedd example of plot (rlog only)
plot(gpRC)
We see that there is less variance explained by PC1 and that the samples from the same group are not as well clustered for the raw data. Since this is prior to normalization, these differences are likely due to technical considerations like sequencing depth differences that are accounted for by the rlog normalization.
PCA plots are useful for evaulating the impact of factors other than the experimental treatment or group.
At times, batch effects can be quite obvious, such as this example from the DESeq2 vignette, where there is clear separation within the treatment groups.
If we color only by sequencing run type (paired-end vs. single-end), we see that PC2 (29% of variance) is primarily explained by this technical covariate.
However, the samples are clearly seperated by experimental condition on PC1, and since PC1 explains more variance than PC2 and since we have non-confounded batches, we could incorporate the technical covariate into our model design, such as outlined in the DESeq2 vignette.
Generally, the decision to remove samples should not be taken lightly, especially if the consideration is based on the expression level patterns we are visualizing in this section, as doing so without sufficient justification could potentially bias your results. Other training materials have a step-by-step example of evaluating batch effects using PCAs, that we would recommend reviewing prior to removing any samples from your comparisons.
If your PCA reveals patterns that were not accounted for in the initial model fit, the DESeq2 vignette provides an overview of how to add additional covariates to a model design before refitting the DESeq2 model for a dataset.
In this section, we:
Before moving on with our group comparisons, we’ll take a short (10 minute) break.
These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.