1 Differential Expression Workflow

Here we will finally test for differential expression between our comparisons of interest.

Step Task
1 Experimental Design
2 Biological Samples / Library Preparation
3 Sequence Reads
4 Assess Quality of Raw Reads
5 Splice-aware Mapping to Genome
6 Count Reads Associated with Genes
7 Organize project files locally
8 Initialize DESeq2 and fit DESeq2 model
9 Assess expression variance within treatment groups
10 Specify pairwise comparisons and test for differential expression
11 Generate summary figures for comparisons
12 Annotate differential expression result tables

In this module we will
* Generate tables of DE results
* Understand what a p-value represents.
* Understand multiple hypothesis correction application and importance


2 Generating DE results

Now that we have reviewed the plots by sample and determined that our data passed our quality control checks, specifically that the patterns we observe are likely due to our experimental treatments over technical or other cofounding factors.

This illustration from the HCB training materials illustrates the basis of the differential expression procedure, where our goal is to compare the distribution of an expressed gene across samples in each treatment groups. Image credit: Paul Pavlidis, UBC

Only where the distributions of each group are sufficiently seperated will a gene be considered differentially expressed. This is where having sufficient replicates to overcome within group variance is important, as the more replicates we have in each group the better we can determine the distributions of expression for each group.

2.1 Dispersion estimates

Since we’ve already run our DESeq analysis, the internal normalization process and specified model fit have already been added to our dds object. Let’s take another look at the dds object.

head(dds)

We can visualize the DESeq2 normalization results for our data, which center on shrinking the variance across all genes to better fit the expected spread at a given expression level by plotting the dispersion estimates with the plotDispEsts function.

plotDispEsts(dds)

We can see the raw data plotted in black, the fitted (or expected) dispersion in red, and the normalized data with scaled variance in blue. Since we have fairly small sample sizes for each condition, we see shrinkage for many genes but a reasonable correlation between the expression level and dispersions.

This HBC tutorial has a more detailed overview of estimating size factors, estimating gene dispersion, and the shrinkage procedure, as well as examples of concerning dispersion plots that may suggest reassessing quality of the experimental data.

2.2 DESeq2 statistical testing

We have already fit our DESeq2 model, specifing our model as ~ Gtype.Tx and our next step is to identify genes with significantly different expression between our contrasts of interest. To determine significance, a statistical test is required.

The first step for any statistical test is to define the null hypothesis. In this case, the null hypothesis would be that there is no difference in the expression of a gene between two groups of samples, such as illustrated at the bottom of the first figure in this module. The next step is to use a statistical test to determine if, based on the observed data, the null hypothesis can be rejected.

To do this, DESeq2 applies the Wald test to compare two groups. A Wald test statistic is computed as well as the probability that the observed value or more extreme test statistic would be observed. This probability is called the p-value of the test. If the p-value is smaller than a pre-defined threshold, we would reject the null hypothesis and state that there is evidence against the null, i.e. the gene is differentially expressed. However, if the p-value is larger than our threshold, we would fail to reject the null hypothesis, meaning that we lack evidence that the expression of this gene is different NOT that we have evidence that the expression is indeed the same in both groups.

For a more detailed overview of the statistical comparisons , please refer to this HBC tutorial or the DESeq2 vignette.

2.3 Results function

We can check what comparisons were automatically generated during fitting using the resultsNames().

resultsNames(dds)
## [1] "Intercept"                    "Gtype.Tx_ko.Tx_vs_wt.Tx"     
## [3] "Gtype.Tx_ko.control_vs_wt.Tx" "Gtype.Tx_wt.control_vs_wt.Tx"

Since we are interested in comparing each knockout versus its corresponding wild-type control, only one of the automatically generated comparisons is relevant. We can pull the Tx_ko.Tx_vs_wt.Tx comparison results using the results function and assign those result to a new object.

Comparison <- "Gtype.Tx_ko.Tx_vs_wt.Tx"
res_Tx <- results(dds, name=Comparison)

Checkpoint: Put a green check if you are seeing the same results names or a red x if you want to be put in a breakout room for help.

2.4 How to generate additional contrasts

If there are comparisons that are not included in the resultsName, since our dds object already has the fitted data we can generate Wald test results by specifying those comparisons as an additional arguement to the results function.

?results

As the function description specifies, we need to provide a list of three elements: the name of the factor in the model design, the name of the numerator for the fold-change, and the name of denominator.

res_WT <- results(dds, contrast=c("Gtype.Tx", "ko.control", "wt.control")) 
head(res_WT)
## log2 fold change (MLE): Gtype.Tx ko.control vs wt.control 
## Wald test p-value: Gtype.Tx ko.control vs wt.control 
## DataFrame with 6 rows and 6 columns
##                            baseMean       log2FoldChange              lfcSE
##                           <numeric>            <numeric>          <numeric>
## ENSMUSG00000000001 6255.78732983324  -0.0137017310529696 0.0931866166016875
## ENSMUSG00000000028 1337.98981847661    0.522678839173994  0.135975315430076
## ENSMUSG00000000031 3.77362566430888    -1.15621509092787   1.60061323542543
## ENSMUSG00000000037  27.563472571641   -0.279210113537662   0.47647640309087
## ENSMUSG00000000049 2.25696294315492     4.01133494840164   1.88908476242953
## ENSMUSG00000000056 2194.17847722401 -0.00821319405597625  0.177322023609814
##                                   stat               pvalue                padj
##                              <numeric>            <numeric>           <numeric>
## ENSMUSG00000000001  -0.147035395775079    0.883104082146075   0.961003101666123
## ENSMUSG00000000028    3.84392444702786 0.000121082296971725 0.00391536728876775
## ENSMUSG00000000031  -0.722357572296695    0.470074664977778                  NA
## ENSMUSG00000000037  -0.585989383160311    0.557882649434956   0.804474952347132
## ENSMUSG00000000049    2.12342771916846   0.0337180264640576                  NA
## ENSMUSG00000000056 -0.0463179580786245    0.963056826192806   0.988889834913244

2.5 Results table - review of output columns

Next, we’ll take a look at some of the results we generated.

head(res_Tx)
## log2 fold change (MLE): Gtype.Tx ko.Tx vs wt.Tx 
## Wald test p-value: Gtype.Tx ko.Tx vs wt.Tx 
## DataFrame with 6 rows and 6 columns
##                            baseMean     log2FoldChange              lfcSE
##                           <numeric>          <numeric>          <numeric>
## ENSMUSG00000000001 6255.78732983324 0.0796209849035635 0.0934295072241811
## ENSMUSG00000000028 1337.98981847661  0.226230261515312  0.116180019950784
## ENSMUSG00000000031 3.77362566430888  0.579460172757394   1.16211309713913
## ENSMUSG00000000037  27.563472571641 -0.555187661098399  0.430603520336244
## ENSMUSG00000000049 2.25696294315492  0.187563793195415   1.69510926353637
## ENSMUSG00000000056 2194.17847722401  -1.30631532269217  0.177857481386408
##                                 stat               pvalue               padj
##                            <numeric>            <numeric>          <numeric>
## ENSMUSG00000000001 0.852203840832806    0.394100965260126   0.66415197808866
## ENSMUSG00000000028  1.94723896252683   0.0515060925533081   0.21586297865795
## ENSMUSG00000000031 0.498626316305961    0.618042662195753                 NA
## ENSMUSG00000000037 -1.28932448268159    0.197285302952349  0.466008669950229
## ENSMUSG00000000049  0.11064997238238    0.911893918525563                 NA
## ENSMUSG00000000056 -7.34473080642647 2.06173625472245e-13 3.201811974326e-11

We can see in the results table that the row name are gene symbols and there are six columns of values.

The first column, ‘baseMean’ is the average of the normalized count values, dividing by size factors, taken over all samples, and can be interpreted as the relative expression level of that gene across all samples.

The second column, ‘log2FoldChange’, is the ratio of the expression of the numerator group (ko.Tx) over the denominator group (wt.Tx). If the value is positive, that means the expression of that gene is greater across the ko.Tx samples than across the wt.Tx samples. If the value is negative, that means the expression of that gene is greater across the wt.Tx samples. The third column, ‘lfcSE’ is the standard error for the log2 fold change estimate.

The fourth column, ‘stat’, is the calculated Wald statistic for that gene, while the fifth column ‘pvalue’ is the nominal significance for that gene.

2.5.1 Multiple hypothesis testing and FDR correction

The sixth column, ‘padj’, is the adjusted p-value and is what we use for determining significantly differently expressed genes. Why do we use values from this column instead of the ‘pvalue’ column?

Each p-value is the result of a single test for a single gene. The more genes we test, the greater chance we have of seeing a significant results. This is the multiple testing problem. If we used the p-value directly from the Wald test with a significance cut-off of p < 0.05, that means there is a 5% chance it is a false positives. So if we are testing 20,000 genes for differential expression, we would expect to see ~1,000 significant genes just by chance. This is problematic because we would need to sift through our “significant” genes to identify which ones are true positives.

DESeq2 reduces the number of genes that will be tested by removing genes with low number of counts and outlier samples (gene-level QC, see note below). However, we need to correct for multiple hypothesis testing to reduce the number of false positives, and while there are a few common approaches, the default method is False Discovery Rate(FDR)/Benjamini-Hochberg correction which is symbolized as ‘BH’ in DESeq2. Benjamini and Hochberg (1995) defined the concept of FDR and created an algorithm to control it below a specified level. An interpretation of the BH method is implemented in DESeq2 in which genes are ranked by p-value, then each ranked p-value is multiplied by the number of total tests divided by rank.

Note: From the results function help page and HBC tutorial that includes overview of multiple hypothesis correction, we can change the multiple hypothesis correction method to an alternative option using the pAdjustMethod = argument.

The default FDR rate cutoff for DESeq2 is alpha = 0.05. By setting the cutoff to < 0.05, we expect that the proportion of false positives amongst our differentially expressed genes is now controlled to 5%. For example, if we call 500 genes as differentially expressed with this FDR cutoff, we expect only 25 of them to be false positives. DESeq2 vignette’s includes a further discussion of filtering and multiple testing

Note on ‘padj’ values set to NA
As discussed in the HBC tutorial as well as the DESeq2 vignette
* If within a row, all samples have zero counts, the baseMean column will be zero, and the log2 fold change estimates, p-value and adjusted p-value will all be set to NA.
* If a row contains a sample with an extreme count outlier then the p-value and adjusted p-value will be set to NA. These outlier counts are detected by Cook’s distance.
* If a row is filtered by automatic independent filtering, for having a low mean normalized count, then only the adjusted p-value will be set to NA.


3 Summary

In this section, we:

  • Performed statistical tests for comparisons of interest
  • Generated tables of differential expression results - i.e. fold changes and adjusted pvalues for each gene in dataset
  • Discussed importance and application of multiple hypothesis correction

Now that we’ve generated our differential comparisons and have an understanding of our results, including multiple hypothesis correction, we can proceed with generating summary figures and tables for our differential expression analysis.


4 Sources Used



These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.