1 Differential Expression Workflow

Here we will

Step Task
1 Experimental Design
2 Biological Samples / Library Preparation
3 Sequence Reads
4 Assess Quality of Raw Reads
5 Splice-aware Mapping to Genome
6 Count Reads Associated with Genes
7 Organize project files locally
8 Initialize DESeq2 and fit DESeq2 model
9 Assess expression variance within treatment groups
10 Specify pairwise comparisons and test for differential expression
11 Generate summary figures for comparisons
12 Annotate differential expression result tables

In this module, we will
* Understand advantages of using gene ids when analyzing data.
* Given a list of ENSEMBL gene ids, add gene symbols and Entrez accessions.
* Generate common visualizations for differential expression comparisons
* Understand reasonable thresholds for statistically significant differentially expressed genes
* Discuss options for functional enrichments


2 Visualizing DE results

Part of differential expression analysis is generating visualizations to share our results. While the DESeq2 tutorial provides examples of other visualizations, a common visualization to summarize DE comparisons are volcano plots.

2.1 Generating Volcano plots

As described by this Galaxy project tutorial, a volcano plot is a type of scatterplot that shows statistical significance (adjusted pvalue) versus magnitude of change (fold change). It allows us to quickly identify genes with large fold changes that are also statistically significant. In a volcano plot, the most upregulated genes are towards the right, the most downregulated genes are towards the left, and the most statistically significant genes are towards the top.

First, we need to set thresholds for determining significant genes. A reasonable threshold would be a fold-change of less than -1.5 or greater than 1.5 and an adjusted pvalue less than 0.05.

fc <- 1.5
pval <- 0.05

Then, we need to sep up some objects to run our plotting code, creating a new object that will have the right shape for plotting, labeling our comparison of interest, and setting up our output directory

df<- res_WT[order(res_WT$padj),] #select our data of interest
df <- as.data.frame(df) #convert our object type
df <- cbind("id" = row.names(df), df) #set rownames to valid column
str(df) 
## 'data.frame':    24693 obs. of  7 variables:
##  $ id            : Factor w/ 24693 levels "ENSMUSG00000000001",..: 21065 14411 14377 17370 14378 12986 13375 13384 24376 11180 ...
##  $ baseMean      : num  4504 7458 876 29135 650 ...
##  $ log2FoldChange: num  -1.23 6.5 9.86 -7.65 8.78 ...
##  $ lfcSE         : num  0.0889 0.4789 0.7603 0.6086 0.7007 ...
##  $ stat          : num  -13.9 13.6 13 -12.6 12.5 ...
##  $ pvalue        : num  8.40e-44 5.71e-42 1.91e-38 2.84e-36 4.81e-36 ...
##  $ padj          : num  1.47e-39 5.00e-38 1.11e-34 1.24e-32 1.68e-32 ...
Comparison <- "ko.control_v_wt.control"

plotPath = "./Figures/"

Once we’ve setup the annotations we need, we can proceed with generating the volcano plot, including our thresholds.

pdf(file = paste0(plotPath,'VolcanoPlot_', Comparison, '.pdf'), onefile = FALSE)

# Initialize the plot, saving as object 'p' and specifying the plot type as 'geom_point'
p <- ggplot(df, aes(x = log2FoldChange, y = -log10(padj))) + geom_point(shape = 21, fill= 'darkgrey', color= 'darkgrey', size = 1) + theme_classic() + xlab('Log2 fold-change') + ylab('-Log10 adjusted p-value')

# Add threshold lines
p <- p + geom_vline(xintercept = c(0, -log2(fc), log2(fc)), linetype = c(1, 2, 2), color = c('black', 'black', 'black')) + geom_hline(yintercept = -log10(pval), linetype = 2, color = 'black')

# Add Title that includes comparison name
p <- p + ggtitle(as.character(Comparison))

print(p)
dev.off()
## quartz_off_screen 
##                 2
p

2.1.0.1 Optional- color coded volcano plot

Now we need to subset our data to label the datapoints (genes) that pass our thresholds.

df$dot <- rep(3, nrow(df))
df$dot[which(df$padj <= pval & df$log2FoldChange < 0 & abs(df$log2FoldChange) >= log2(fc))] = 2
df$dot[which(df$padj <= pval & df$log2FoldChange > 0 & abs(df$log2FoldChange) >= log2(fc))] = 1
df$sig <- df$dot

#take top 5 up, down, then combine, assign label
top <- rbind(head(subset(df, df$dot == 1), 5),head(subset(df, df$dot == 2), 5))
top$label <- top$id
df <- merge(x = df, y = top[,c('id','label')], by = "id", all.x = TRUE)
    
#count the number of significan up and down genes, assign value for legend
df$dot <- factor(df$dot,levels = c(1,2,3), labels = c(paste0('Up: ', sum(df$dot == 1)),paste0('Down: ', sum(df$dot == 2)),'NS'))

Once we’ve setup the annotations we need, we can proceed with generating the volcano plot.

pdf(file = paste0(plotPath,'VolcanoPlot_Fancier', Comparison, '.pdf'), onefile = FALSE)

p <- ggplot(df, aes(x = log2FoldChange, y = -log10(padj))) + geom_point(aes(color = df$dot), size = 1) + theme_classic() + xlab('Log2 fold-change') + ylab('-Log10 adjusted p-value')

# specify colors for up-/down-/nonsignificant genes
p <- p + scale_color_manual(name = '', values=c('#B31B21', '#1465AC', 'darkgray'))
# add threshold lines
p <- p + geom_vline(xintercept = c(0, -log2(fc), log2(fc)), linetype = c(1, 2, 2), color = c('black', 'black', 'black')) + geom_hline(yintercept = -log10(pval), linetype = 2, color = 'black')
# Add Title that includes comparison name
p <- p + ggtitle(as.character(Comparison))

print(p)
dev.off()
## quartz_off_screen 
##                 2
p

For additional visualizations for our DE results, this HBC tutorial includes some nice examples.


3 Summarizing our DE results

To generate a general summary of the DE results, we can use the summary function to generate a basic summary by DESeq2.

However, we can also use conditional statements to determine the number of genes that pass our thresholds for each comparison, which might be more informative.

#summary(res_WT)
sum(res_WT$padj < 0.05 & abs(res_WT$log2FoldChange) >= log2(1.5), na.rm = TRUE)
## [1] 735
sum(res_Tx$padj < 0.05 & abs(res_Tx$log2FoldChange) >= log2(1.5), na.rm = TRUE)
## [1] 1152

How do the number of DE genes compare to what we observed from the PCA plots?

3.0.1 Chosing thresholds

Thresholding adjusted pvalues < 0.05 is very standard

How many DE genes would we have if we only apply a more stringent threshold to our adjusted pvalues and ignore fold-change?

sum(res_WT$padj < 0.01, na.rm = TRUE)
## [1] 713

If we threshold only on significance, we still identify less DE genes than the 6057 identified in the paper. How should we interpret these differences? Are we in danger of missing relevant genes or, due to DESeq2’s stringency, are we better protected against irrelevant genes? One of the authors of DESeq2, wrote this blog post that detailed differences between DESeq2 and EdgeR but this review of multiple DE tools by Costa-Silva, Domingues, and Martins Lopes may also be useful for making choices for your own data.

3.0.1.1 Optional - Subsetting significant genes

You may be interested in identifying only the genes that pass your significance thesholds. A useful way to do this is to conditionally subset your results.

res_sig <- na.omit(res_WT)
res_sig <- res_sig[which(res_sig$padj < 0.05 & abs(res_sig$log2FoldChange) >= log2(1.5)), ]
head(res_sig)

For more details about subsetting tables in R, we recommend reviewing the Data Carpentry manipulating and analyzing data module.

We can also annotate our results to include a column that identifies our significant genes.

res_WT_DE <- res_WT # copy table
# add a column and assign all genes are non-significance
res_WT_DE$Call <- rep(FALSE, length(res_WT$baseMean)) 

# change 'Call' column to TRUE if meets conditions for significant differences
res_WT_DE[which(!is.na(res_WT_DE$padj) & res_WT$padj < 0.05 & abs(res_WT_DE$log2FoldChange) >= log2(1.5)), ]$Call <- TRUE

# reorder table to rank significant genes at the top
res_WT_DE <- res_WT_DE[order(-res_WT_DE$Call),]

head(res_WT_DE)

4 How to annotate the results table

It can be useful to annotate our results table with additional information to make them easier to interpret, such as adding additional gene information or better summarizing our DE results.

4.1 Adding genome annotations

Bioconductor provides many tools and resources to facilitate access to genomic annotation resources.

We can access additional genomic annotations using the bioMart package. To identify we’ll structure our ‘query’ or search of the bioMart resources to use the ENSEMBL id from our alignment to add the gene symbols and gene description for each gene.

4.1.1 Gene symbols versus Gene ids

Since UCSC and ENSEMBL are two difference reference builds not all genes are shared, as outlined in this comprehensive review of reference genomes by Zhao & Zhang. Additionally, since gene symbols can change over time or be ambigous we use and recommend using the EMSEMBL reference genome and ENSEMBL ids for alignments.

To proceed with our BioMart query, we will first load the biomaRt library & choose what reference we want to access. For a more detailed walk through of using biomaRt, this training module might be useful, including what to do when annotations are not 1:1 mappings.

## code to install if needed
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("biomaRt")

library("biomaRt")

# using an archived host since my R/Bioconductor versions are out of date
listMarts(host="jul2016.archive.ensembl.org")
##                biomart               version
## 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 85
## 2     ENSEMBL_MART_SNP  Ensembl Variation 85
## 3 ENSEMBL_MART_FUNCGEN Ensembl Regulation 85
## 4    ENSEMBL_MART_VEGA               Vega 65
# choose ensembl database since have ENSEMBl gene ids
ensembl <- useMart("ENSEMBL_MART_ENSEMBL", host = "jul2016.archive.ensembl.org")

# alternative host
#ensembl <- useMart("ENSEMBL_MART_ENSEMBL", host="uswest.ensembl.org")

# search for mouse 
searchDatasets(mart = ensembl, pattern = "mus")
##                   dataset                    description   version
## 39 mmusculus_gene_ensembl Mus musculus genes (GRCm38.p4) GRCm38.p4

Note that this process takes some time and will take up a larger amount of working memory so skip if your machine has less than 4G of memory

# redefine ensembl with species specific database
ensembl <- useDataset("mmusculus_gene_ensembl", mart=ensembl)

To identify possible filters to restrict our data, we can use the listFilters function. To identify the attributes we want to retrive, we can use the listAttributes function. There are also search functions to help narrow down the available options.

head(listFilters(mart = ensembl), n = 20)
##                                      name
## 1                         chromosome_name
## 2                                   start
## 3                                     end
## 4                              band_start
## 5                                band_end
## 6                            marker_start
## 7                              marker_end
## 8                                  strand
## 9                      chromosomal_region
## 10                              with_hgnc
## 11              with_hgnc_transcript_name
## 12                   with_ox_arrayexpress
## 13                              with_ccds
## 14                            with_chembl
## 15       with_ox_clone_based_ensembl_gene
## 16 with_ox_clone_based_ensembl_transcript
## 17          with_ox_clone_based_vega_gene
## 18    with_ox_clone_based_vega_transcript
## 19                               with_epd
## 20                              with_embl
##                                                  description
## 1                                            Chromosome name
## 2                                            Gene Start (bp)
## 3                                              Gene End (bp)
## 4                                                 Band Start
## 5                                                   Band End
## 6                                               Marker Start
## 7                                                 Marker End
## 8                                                     Strand
## 9  Chromosome Regions (e.g 1:100:10000:-1,1:100000:200000:1)
## 10                                           with HGNC ID(s)
## 11                              with HGNC transcript name(s)
## 12                                   with ArrayExpress ID(s)
## 13                                           with CCDS ID(s)
## 14                                         with ChEMBL ID(s)
## 15                       with clone based Ensembl gene ID(s)
## 16                 with clone based Ensembl transcript ID(s)
## 17                          with clone based VEGA gene ID(s)
## 18                    with clone based VEGA transcript ID(s)
## 19                                            with EPD ID(s)
## 20                                           with EMBL ID(s)
head(listAttributes(ensembl), n = 30)
##                               name                                description
## 1                  ensembl_gene_id                            Ensembl Gene ID
## 2            ensembl_transcript_id                      Ensembl Transcript ID
## 3               ensembl_peptide_id                         Ensembl Protein ID
## 4                  ensembl_exon_id                            Ensembl Exon ID
## 5                      description                                Description
## 6                  chromosome_name                            Chromosome Name
## 7                   start_position                            Gene Start (bp)
## 8                     end_position                              Gene End (bp)
## 9                           strand                                     Strand
## 10                            band                                       Band
## 11                transcript_start                      Transcript Start (bp)
## 12                  transcript_end                        Transcript End (bp)
## 13        transcription_start_site             Transcription Start Site (TSS)
## 14               transcript_length Transcript length (including UTRs and CDS)
## 15                  transcript_tsl             Transcript Support Level (TSL)
## 16        transcript_gencode_basic                   GENCODE basic annotation
## 17               transcript_appris                          APPRIS annotation
## 18              external_gene_name                       Associated Gene Name
## 19            external_gene_source                     Associated Gene Source
## 20        external_transcript_name                 Associated Transcript Name
## 21 external_transcript_source_name               Associated Transcript Source
## 22                transcript_count                           Transcript count
## 23           percentage_gc_content                               % GC content
## 24                    gene_biotype                                  Gene type
## 25              transcript_biotype                            Transcript type
## 26                          source                              Source (gene)
## 27               transcript_source                        Source (transcript)
## 28                          status                              Status (gene)
## 29               transcript_status                        Status (transcript)
## 30                         version                             Version (gene)
##            page
## 1  feature_page
## 2  feature_page
## 3  feature_page
## 4  feature_page
## 5  feature_page
## 6  feature_page
## 7  feature_page
## 8  feature_page
## 9  feature_page
## 10 feature_page
## 11 feature_page
## 12 feature_page
## 13 feature_page
## 14 feature_page
## 15 feature_page
## 16 feature_page
## 17 feature_page
## 18 feature_page
## 19 feature_page
## 20 feature_page
## 21 feature_page
## 22 feature_page
## 23 feature_page
## 24 feature_page
## 25 feature_page
## 26 feature_page
## 27 feature_page
## 28 feature_page
## 29 feature_page
## 30 feature_page

Next, we’ll build our query to retrive the information we want to add to our DE results table.

Note: I was geeting errors when initially testing this code, so some of us might have issues here

## wasn't working for my version of biomaRt but might work for you
GeneKey <- getBM(attributes=c('ensembl_gene_id', 'external_gene_name'),
      filters = 'ensembl_gene_id',
      values = row.names(assay(dds)),
      mart = ensembl) # will take some time to run

head(GeneKey)

For those of us whom have an up to date version of R/Bioconductor

Now that we have the ENSEMBL information and a gene symbol to match to our results, we can add this to our DE results table.

res_WT_anno <- res_WT # copy table
res_WT_anno <- cbind(genes=row.names(res_WT_anno), res_WT_anno[ ,c(1:6)])
res_WT_anno <- as.data.frame(res_WT_anno)

res_WT_anno <- merge(GeneKey, res_WT_anno, by.x = "external_gene_name", by.y="genes", all.x = FALSE, all.y = TRUE) # combine the two tables using the merge function
head(res_WT_anno)

Notice that not all genes were annotated with an ENSEMBl gene id or gene description. While we are able to annotate our results, we should be very cautious as the gene symbol is not a good unique identifier plus we did not use a UCSC annotation resource so the HUGO gene symbol may not always match. However, this code is similar to the steps needed to annotate ENSEMBL id based results, like what would have been generated from yesterday’s alignments, with more interpretable gene symbols.

Note: For additional information regarding bioMart, please consult the ENSEMBL bioMart vignette or the broader Bioconductor Annotation Resources vignette.

5 Outputting results to file

A key aspect of our analysis is preserving the relevant datasets for both our records and for downstream applications.

5.1 Count tables

The most relevant count tables are the raw, filtered count table that we used as the input for our analysis and the rlog normalized count table that we used for our quality control visualizations.

First, we’ll setup a new directory for our output tables.

dir.create("./OutputTables")
## Warning in dir.create("./OutputTables"): './OutputTables' already exists

To output the raw counts, we will need to use the counts function to access the count table from within its larger DESeqDataSet object.

write.csv(counts(dds, normalized = FALSE), file="./OutputTables/DESeq2_raw_counts.csv")

Then we’ll output the rlog count table, using the assay function to access the normalized count table from within its larger DESeqDataSet object.

write.csv(assay(rld), file="./OutputTables/DESeq2_rlogNormalized_counts.csv")

5.2 DE results table

Next we’ll write out our DE results for the KD comparison to file, since we added additional information to that table.

*Note that if the biomaRt query didn’t work for you

write.csv(res_WT_anno, 
          row.names = FALSE,
          na = ".",
          file="./OutputTables/DEResults_ko.control_v_wt.control_annotated.csv")

For the other comparsion, we could repeat our annotations or write the DE results directly to file using the assay function for the DESeqDataSet object.

write.csv(res_Tx, 
          row.names = FALSE,
          na = ".",
          file="./OutputTables/DEResults_ko.Tx_v_wt.Tx.csv")

6 Summary

In this section, we:

  • Generated a volcano plot for our differential expression results
  • Summarized our differential expression results
  • Discussed choosing thresholds
  • (Some of us) annotated our tables of results
  • Saved our results to file

7 Key takeaways

Overall, we’ve run through most of the building blocks needed to run a differential expression analysis and hopefully built up a better understanding of how differential expression comparisons work, particularly how experimental design can impact our results.

What to consider moving forward:

  • How can I control for technical variation in my experimental design?
  • How much variation is expected with a treatment group?
  • What is my RNA quality, and how can that be optimized?
  • Are there quality concerns for my sequencing data?
  • What comparisons are relevant to my biological question?
  • Are there covariates that should be considered?
  • What will a differential expression analysis tell me?

Let’s pause here for general questions about the last two days of content


8 Extension - what can we do with our DE results?

Now that we have our DE results, have we address the biological question relevant to the authors of the original paper? On the one hand, yes - we now have two tables of genes that are impacted by change in Mov10 expression. But with two lists of genes alone, it can be difficult to find patterns or understand broader biological impacts.

What if we wanted to find out what genes were signifant in both comparisons? The intersect function, such as implimented as part of the dplyr package would be useful to identify shared significant genes. A venn diagram could be a way to visualize these overlaps.

A way to determine possible broader biological interpretations from the observed DE results, is functional enrichments. There are many options, such as some included in this discussion thread. Other common functional enrichments approaches are gene set enrichment analysis, aka GSEA, Database for Annotation, Visualization and Integrated Discovery, aka DAVID, Ingenity, and [iPathway Guide]

The University of Michigan has license and support for additional tools, such as Cytoscape, so we recommend reaching out to staff with Taubman Library to learn more about resources that might be application toyour research.


10 Additional Resources


11 Session Info

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] biomaRt_2.42.1              RColorBrewer_1.1-2         
##  [3] pheatmap_1.0.12             ggrepel_0.9.1              
##  [5] dplyr_1.0.5                 tidyr_1.1.3                
##  [7] ggplot2_3.3.3               DESeq2_1.26.0              
##  [9] SummarizedExperiment_1.16.1 DelayedArray_0.12.3        
## [11] BiocParallel_1.20.1         matrixStats_0.58.0         
## [13] Biobase_2.46.0              GenomicRanges_1.38.0       
## [15] GenomeInfoDb_1.22.1         IRanges_2.20.2             
## [17] S4Vectors_0.24.4            BiocGenerics_0.32.0        
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7           bit64_4.0.5            httr_1.4.2            
##  [4] progress_1.2.2         tools_3.6.1            backports_1.2.1       
##  [7] bslib_0.2.4            utf8_1.2.1             R6_2.5.0              
## [10] rpart_4.1-15           Hmisc_4.5-0            DBI_1.1.1             
## [13] colorspace_2.0-0       nnet_7.3-15            withr_2.4.2           
## [16] prettyunits_1.1.1      tidyselect_1.1.0       gridExtra_2.3         
## [19] curl_4.3               bit_4.0.4              compiler_3.6.1        
## [22] htmlTable_2.1.0        labeling_0.4.2         sass_0.3.1            
## [25] scales_1.1.1           checkmate_2.0.0        genefilter_1.68.0     
## [28] askpass_1.1            rappdirs_0.3.3         stringr_1.4.0         
## [31] digest_0.6.27          foreign_0.8-72         rmarkdown_2.7         
## [34] XVector_0.26.0         base64enc_0.1-3        jpeg_0.1-8.1          
## [37] pkgconfig_2.0.3        htmltools_0.5.1.1      dbplyr_2.1.1          
## [40] highr_0.9              fastmap_1.1.0          htmlwidgets_1.5.3     
## [43] rlang_0.4.10           rstudioapi_0.13        RSQLite_2.2.7         
## [46] farver_2.1.0           jquerylib_0.1.3        generics_0.1.0        
## [49] jsonlite_1.7.2         RCurl_1.98-1.3         magrittr_2.0.1        
## [52] GenomeInfoDbData_1.2.2 Formula_1.2-4          Matrix_1.3-2          
## [55] Rcpp_1.0.6             munsell_0.5.0          fansi_0.4.2           
## [58] lifecycle_1.0.0        stringi_1.5.3          yaml_2.2.1            
## [61] zlibbioc_1.32.0        BiocFileCache_1.10.2   grid_3.6.1            
## [64] blob_1.2.1             crayon_1.4.1           lattice_0.20-41       
## [67] splines_3.6.1          annotate_1.64.0        hms_1.0.0             
## [70] locfit_1.5-9.4         knitr_1.33             pillar_1.6.0          
## [73] geneplotter_1.64.0     XML_3.99-0.3           glue_1.4.2            
## [76] evaluate_0.14          latticeExtra_0.6-29    data.table_1.14.0     
## [79] png_0.1-7              vctrs_0.3.7            openssl_1.4.3         
## [82] gtable_0.3.0           purrr_0.3.4            assertthat_0.2.1      
## [85] cachem_1.0.4           xfun_0.22              xtable_1.8-4          
## [88] survival_3.2-10        tibble_3.1.1           AnnotationDbi_1.48.0  
## [91] memoise_2.0.0          cluster_2.1.2          ellipsis_0.3.1

These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.