Here we will
Step | Task |
---|---|
1 | Experimental Design |
2 | Biological Samples / Library Preparation |
3 | Sequence Reads |
4 | Assess Quality of Raw Reads |
5 | Splice-aware Mapping to Genome |
6 | Count Reads Associated with Genes |
7 | Organize project files locally |
8 | Initialize DESeq2 and fit DESeq2 model |
9 | Assess expression variance within treatment groups |
10 | Specify pairwise comparisons and test for differential expression |
11 | Generate summary figures for comparisons |
12 | Annotate differential expression result tables |
In this module, we will
* Understand advantages of using gene ids when analyzing data.
* Given a list of ENSEMBL gene ids, add gene symbols and Entrez accessions.
* Generate common visualizations for differential expression comparisons
* Understand reasonable thresholds for statistically significant differentially expressed genes
* Discuss options for functional enrichments
Part of differential expression analysis is generating visualizations to share our results. While the DESeq2 tutorial provides examples of other visualizations, a common visualization to summarize DE comparisons are volcano plots.
As described by this Galaxy project tutorial, a volcano plot is a type of scatterplot that shows statistical significance (adjusted pvalue) versus magnitude of change (fold change). It allows us to quickly identify genes with large fold changes that are also statistically significant. In a volcano plot, the most upregulated genes are towards the right, the most downregulated genes are towards the left, and the most statistically significant genes are towards the top.
First, we need to set thresholds for determining significant genes. A reasonable threshold would be a fold-change of less than -1.5 or greater than 1.5 and an adjusted pvalue less than 0.05.
fc <- 1.5
pval <- 0.05
Then, we need to sep up some objects to run our plotting code, creating a new object that will have the right shape for plotting, labeling our comparison of interest, and setting up our output directory
df<- res_WT[order(res_WT$padj),] #select our data of interest
df <- as.data.frame(df) #convert our object type
df <- cbind("id" = row.names(df), df) #set rownames to valid column
str(df)
## 'data.frame': 24693 obs. of 7 variables:
## $ id : Factor w/ 24693 levels "ENSMUSG00000000001",..: 21065 14411 14377 17370 14378 12986 13375 13384 24376 11180 ...
## $ baseMean : num 4504 7458 876 29135 650 ...
## $ log2FoldChange: num -1.23 6.5 9.86 -7.65 8.78 ...
## $ lfcSE : num 0.0889 0.4789 0.7603 0.6086 0.7007 ...
## $ stat : num -13.9 13.6 13 -12.6 12.5 ...
## $ pvalue : num 8.40e-44 5.71e-42 1.91e-38 2.84e-36 4.81e-36 ...
## $ padj : num 1.47e-39 5.00e-38 1.11e-34 1.24e-32 1.68e-32 ...
Comparison <- "ko.control_v_wt.control"
plotPath = "./Figures/"
Once we’ve setup the annotations we need, we can proceed with generating the volcano plot, including our thresholds.
pdf(file = paste0(plotPath,'VolcanoPlot_', Comparison, '.pdf'), onefile = FALSE)
# Initialize the plot, saving as object 'p' and specifying the plot type as 'geom_point'
p <- ggplot(df, aes(x = log2FoldChange, y = -log10(padj))) + geom_point(shape = 21, fill= 'darkgrey', color= 'darkgrey', size = 1) + theme_classic() + xlab('Log2 fold-change') + ylab('-Log10 adjusted p-value')
# Add threshold lines
p <- p + geom_vline(xintercept = c(0, -log2(fc), log2(fc)), linetype = c(1, 2, 2), color = c('black', 'black', 'black')) + geom_hline(yintercept = -log10(pval), linetype = 2, color = 'black')
# Add Title that includes comparison name
p <- p + ggtitle(as.character(Comparison))
print(p)
dev.off()
## quartz_off_screen
## 2
p
Now we need to subset our data to label the datapoints (genes) that pass our thresholds.
df$dot <- rep(3, nrow(df))
df$dot[which(df$padj <= pval & df$log2FoldChange < 0 & abs(df$log2FoldChange) >= log2(fc))] = 2
df$dot[which(df$padj <= pval & df$log2FoldChange > 0 & abs(df$log2FoldChange) >= log2(fc))] = 1
df$sig <- df$dot
#take top 5 up, down, then combine, assign label
top <- rbind(head(subset(df, df$dot == 1), 5),head(subset(df, df$dot == 2), 5))
top$label <- top$id
df <- merge(x = df, y = top[,c('id','label')], by = "id", all.x = TRUE)
#count the number of significan up and down genes, assign value for legend
df$dot <- factor(df$dot,levels = c(1,2,3), labels = c(paste0('Up: ', sum(df$dot == 1)),paste0('Down: ', sum(df$dot == 2)),'NS'))
Once we’ve setup the annotations we need, we can proceed with generating the volcano plot.
pdf(file = paste0(plotPath,'VolcanoPlot_Fancier', Comparison, '.pdf'), onefile = FALSE)
p <- ggplot(df, aes(x = log2FoldChange, y = -log10(padj))) + geom_point(aes(color = df$dot), size = 1) + theme_classic() + xlab('Log2 fold-change') + ylab('-Log10 adjusted p-value')
# specify colors for up-/down-/nonsignificant genes
p <- p + scale_color_manual(name = '', values=c('#B31B21', '#1465AC', 'darkgray'))
# add threshold lines
p <- p + geom_vline(xintercept = c(0, -log2(fc), log2(fc)), linetype = c(1, 2, 2), color = c('black', 'black', 'black')) + geom_hline(yintercept = -log10(pval), linetype = 2, color = 'black')
# Add Title that includes comparison name
p <- p + ggtitle(as.character(Comparison))
print(p)
dev.off()
## quartz_off_screen
## 2
p
For additional visualizations for our DE results, this HBC tutorial includes some nice examples.
To generate a general summary of the DE results, we can use the summary
function to generate a basic summary by DESeq2.
However, we can also use conditional statements to determine the number of genes that pass our thresholds for each comparison, which might be more informative.
#summary(res_WT)
sum(res_WT$padj < 0.05 & abs(res_WT$log2FoldChange) >= log2(1.5), na.rm = TRUE)
## [1] 735
sum(res_Tx$padj < 0.05 & abs(res_Tx$log2FoldChange) >= log2(1.5), na.rm = TRUE)
## [1] 1152
How do the number of DE genes compare to what we observed from the PCA plots?
Thresholding adjusted pvalues < 0.05 is very standard
How many DE genes would we have if we only apply a more stringent threshold to our adjusted pvalues and ignore fold-change?
sum(res_WT$padj < 0.01, na.rm = TRUE)
## [1] 713
If we threshold only on significance, we still identify less DE genes than the 6057 identified in the paper. How should we interpret these differences? Are we in danger of missing relevant genes or, due to DESeq2’s stringency, are we better protected against irrelevant genes? One of the authors of DESeq2, wrote this blog post that detailed differences between DESeq2 and EdgeR but this review of multiple DE tools by Costa-Silva, Domingues, and Martins Lopes may also be useful for making choices for your own data.
You may be interested in identifying only the genes that pass your significance thesholds. A useful way to do this is to conditionally subset your results.
res_sig <- na.omit(res_WT)
res_sig <- res_sig[which(res_sig$padj < 0.05 & abs(res_sig$log2FoldChange) >= log2(1.5)), ]
head(res_sig)
For more details about subsetting tables in R, we recommend reviewing the Data Carpentry manipulating and analyzing data module.
We can also annotate our results to include a column that identifies our significant genes.
res_WT_DE <- res_WT # copy table
# add a column and assign all genes are non-significance
res_WT_DE$Call <- rep(FALSE, length(res_WT$baseMean))
# change 'Call' column to TRUE if meets conditions for significant differences
res_WT_DE[which(!is.na(res_WT_DE$padj) & res_WT$padj < 0.05 & abs(res_WT_DE$log2FoldChange) >= log2(1.5)), ]$Call <- TRUE
# reorder table to rank significant genes at the top
res_WT_DE <- res_WT_DE[order(-res_WT_DE$Call),]
head(res_WT_DE)
It can be useful to annotate our results table with additional information to make them easier to interpret, such as adding additional gene information or better summarizing our DE results.
Bioconductor provides many tools and resources to facilitate access to genomic annotation resources.
We can access additional genomic annotations using the bioMart
package. To identify we’ll structure our ‘query’ or search of the bioMart resources to use the ENSEMBL id from our alignment to add the gene symbols and gene description for each gene.
Since UCSC and ENSEMBL are two difference reference builds not all genes are shared, as outlined in this comprehensive review of reference genomes by Zhao & Zhang. Additionally, since gene symbols can change over time or be ambigous we use and recommend using the EMSEMBL reference genome and ENSEMBL ids for alignments.
To proceed with our BioMart query, we will first load the biomaRt library & choose what reference we want to access. For a more detailed walk through of using biomaRt, this training module might be useful, including what to do when annotations are not 1:1 mappings.
## code to install if needed
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install("biomaRt")
library("biomaRt")
# using an archived host since my R/Bioconductor versions are out of date
listMarts(host="jul2016.archive.ensembl.org")
## biomart version
## 1 ENSEMBL_MART_ENSEMBL Ensembl Genes 85
## 2 ENSEMBL_MART_SNP Ensembl Variation 85
## 3 ENSEMBL_MART_FUNCGEN Ensembl Regulation 85
## 4 ENSEMBL_MART_VEGA Vega 65
# choose ensembl database since have ENSEMBl gene ids
ensembl <- useMart("ENSEMBL_MART_ENSEMBL", host = "jul2016.archive.ensembl.org")
# alternative host
#ensembl <- useMart("ENSEMBL_MART_ENSEMBL", host="uswest.ensembl.org")
# search for mouse
searchDatasets(mart = ensembl, pattern = "mus")
## dataset description version
## 39 mmusculus_gene_ensembl Mus musculus genes (GRCm38.p4) GRCm38.p4
Note that this process takes some time and will take up a larger amount of working memory so skip if your machine has less than 4G of memory
# redefine ensembl with species specific database
ensembl <- useDataset("mmusculus_gene_ensembl", mart=ensembl)
To identify possible filters to restrict our data, we can use the listFilters
function. To identify the attributes we want to retrive, we can use the listAttributes
function. There are also search functions to help narrow down the available options.
head(listFilters(mart = ensembl), n = 20)
## name
## 1 chromosome_name
## 2 start
## 3 end
## 4 band_start
## 5 band_end
## 6 marker_start
## 7 marker_end
## 8 strand
## 9 chromosomal_region
## 10 with_hgnc
## 11 with_hgnc_transcript_name
## 12 with_ox_arrayexpress
## 13 with_ccds
## 14 with_chembl
## 15 with_ox_clone_based_ensembl_gene
## 16 with_ox_clone_based_ensembl_transcript
## 17 with_ox_clone_based_vega_gene
## 18 with_ox_clone_based_vega_transcript
## 19 with_epd
## 20 with_embl
## description
## 1 Chromosome name
## 2 Gene Start (bp)
## 3 Gene End (bp)
## 4 Band Start
## 5 Band End
## 6 Marker Start
## 7 Marker End
## 8 Strand
## 9 Chromosome Regions (e.g 1:100:10000:-1,1:100000:200000:1)
## 10 with HGNC ID(s)
## 11 with HGNC transcript name(s)
## 12 with ArrayExpress ID(s)
## 13 with CCDS ID(s)
## 14 with ChEMBL ID(s)
## 15 with clone based Ensembl gene ID(s)
## 16 with clone based Ensembl transcript ID(s)
## 17 with clone based VEGA gene ID(s)
## 18 with clone based VEGA transcript ID(s)
## 19 with EPD ID(s)
## 20 with EMBL ID(s)
head(listAttributes(ensembl), n = 30)
## name description
## 1 ensembl_gene_id Ensembl Gene ID
## 2 ensembl_transcript_id Ensembl Transcript ID
## 3 ensembl_peptide_id Ensembl Protein ID
## 4 ensembl_exon_id Ensembl Exon ID
## 5 description Description
## 6 chromosome_name Chromosome Name
## 7 start_position Gene Start (bp)
## 8 end_position Gene End (bp)
## 9 strand Strand
## 10 band Band
## 11 transcript_start Transcript Start (bp)
## 12 transcript_end Transcript End (bp)
## 13 transcription_start_site Transcription Start Site (TSS)
## 14 transcript_length Transcript length (including UTRs and CDS)
## 15 transcript_tsl Transcript Support Level (TSL)
## 16 transcript_gencode_basic GENCODE basic annotation
## 17 transcript_appris APPRIS annotation
## 18 external_gene_name Associated Gene Name
## 19 external_gene_source Associated Gene Source
## 20 external_transcript_name Associated Transcript Name
## 21 external_transcript_source_name Associated Transcript Source
## 22 transcript_count Transcript count
## 23 percentage_gc_content % GC content
## 24 gene_biotype Gene type
## 25 transcript_biotype Transcript type
## 26 source Source (gene)
## 27 transcript_source Source (transcript)
## 28 status Status (gene)
## 29 transcript_status Status (transcript)
## 30 version Version (gene)
## page
## 1 feature_page
## 2 feature_page
## 3 feature_page
## 4 feature_page
## 5 feature_page
## 6 feature_page
## 7 feature_page
## 8 feature_page
## 9 feature_page
## 10 feature_page
## 11 feature_page
## 12 feature_page
## 13 feature_page
## 14 feature_page
## 15 feature_page
## 16 feature_page
## 17 feature_page
## 18 feature_page
## 19 feature_page
## 20 feature_page
## 21 feature_page
## 22 feature_page
## 23 feature_page
## 24 feature_page
## 25 feature_page
## 26 feature_page
## 27 feature_page
## 28 feature_page
## 29 feature_page
## 30 feature_page
Next, we’ll build our query to retrive the information we want to add to our DE results table.
Note: I was geeting errors when initially testing this code, so some of us might have issues here
## wasn't working for my version of biomaRt but might work for you
GeneKey <- getBM(attributes=c('ensembl_gene_id', 'external_gene_name'),
filters = 'ensembl_gene_id',
values = row.names(assay(dds)),
mart = ensembl) # will take some time to run
head(GeneKey)
For those of us whom have an up to date version of R/Bioconductor
Now that we have the ENSEMBL information and a gene symbol to match to our results, we can add this to our DE results table.
res_WT_anno <- res_WT # copy table
res_WT_anno <- cbind(genes=row.names(res_WT_anno), res_WT_anno[ ,c(1:6)])
res_WT_anno <- as.data.frame(res_WT_anno)
res_WT_anno <- merge(GeneKey, res_WT_anno, by.x = "external_gene_name", by.y="genes", all.x = FALSE, all.y = TRUE) # combine the two tables using the merge function
head(res_WT_anno)
Notice that not all genes were annotated with an ENSEMBl gene id or gene description. While we are able to annotate our results, we should be very cautious as the gene symbol is not a good unique identifier plus we did not use a UCSC annotation resource so the HUGO gene symbol may not always match. However, this code is similar to the steps needed to annotate ENSEMBL id based results, like what would have been generated from yesterday’s alignments, with more interpretable gene symbols.
Note: For additional information regarding bioMart, please consult the ENSEMBL bioMart vignette or the broader Bioconductor Annotation Resources vignette.
A key aspect of our analysis is preserving the relevant datasets for both our records and for downstream applications.
The most relevant count tables are the raw, filtered count table that we used as the input for our analysis and the rlog normalized count table that we used for our quality control visualizations.
First, we’ll setup a new directory for our output tables.
dir.create("./OutputTables")
## Warning in dir.create("./OutputTables"): './OutputTables' already exists
To output the raw counts, we will need to use the counts
function to access the count table from within its larger DESeqDataSet
object.
write.csv(counts(dds, normalized = FALSE), file="./OutputTables/DESeq2_raw_counts.csv")
Then we’ll output the rlog count table, using the assay
function to access the normalized count table from within its larger DESeqDataSet
object.
write.csv(assay(rld), file="./OutputTables/DESeq2_rlogNormalized_counts.csv")
Next we’ll write out our DE results for the KD comparison to file, since we added additional information to that table.
*Note that if the biomaRt query didn’t work for you
write.csv(res_WT_anno,
row.names = FALSE,
na = ".",
file="./OutputTables/DEResults_ko.control_v_wt.control_annotated.csv")
For the other comparsion, we could repeat our annotations or write the DE results directly to file using the assay
function for the DESeqDataSet object.
write.csv(res_Tx,
row.names = FALSE,
na = ".",
file="./OutputTables/DEResults_ko.Tx_v_wt.Tx.csv")
In this section, we:
Overall, we’ve run through most of the building blocks needed to run a differential expression analysis and hopefully built up a better understanding of how differential expression comparisons work, particularly how experimental design can impact our results.
What to consider moving forward:
Let’s pause here for general questions about the last two days of content
Now that we have our DE results, have we address the biological question relevant to the authors of the original paper? On the one hand, yes - we now have two tables of genes that are impacted by change in Mov10 expression. But with two lists of genes alone, it can be difficult to find patterns or understand broader biological impacts.
What if we wanted to find out what genes were signifant in both comparisons? The intersect
function, such as implimented as part of the dplyr package would be useful to identify shared significant genes. A venn diagram could be a way to visualize these overlaps.
A way to determine possible broader biological interpretations from the observed DE results, is functional enrichments. There are many options, such as some included in this discussion thread. Other common functional enrichments approaches are gene set enrichment analysis, aka GSEA, Database for Annotation, Visualization and Integrated Discovery, aka DAVID, Ingenity, and [iPathway Guide]
The University of Michigan has license and support for additional tools, such as Cytoscape, so we recommend reaching out to staff with Taubman Library to learn more about resources that might be application toyour research.
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] biomaRt_2.42.1 RColorBrewer_1.1-2
## [3] pheatmap_1.0.12 ggrepel_0.9.1
## [5] dplyr_1.0.5 tidyr_1.1.3
## [7] ggplot2_3.3.3 DESeq2_1.26.0
## [9] SummarizedExperiment_1.16.1 DelayedArray_0.12.3
## [11] BiocParallel_1.20.1 matrixStats_0.58.0
## [13] Biobase_2.46.0 GenomicRanges_1.38.0
## [15] GenomeInfoDb_1.22.1 IRanges_2.20.2
## [17] S4Vectors_0.24.4 BiocGenerics_0.32.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 bit64_4.0.5 httr_1.4.2
## [4] progress_1.2.2 tools_3.6.1 backports_1.2.1
## [7] bslib_0.2.4 utf8_1.2.1 R6_2.5.0
## [10] rpart_4.1-15 Hmisc_4.5-0 DBI_1.1.1
## [13] colorspace_2.0-0 nnet_7.3-15 withr_2.4.2
## [16] prettyunits_1.1.1 tidyselect_1.1.0 gridExtra_2.3
## [19] curl_4.3 bit_4.0.4 compiler_3.6.1
## [22] htmlTable_2.1.0 labeling_0.4.2 sass_0.3.1
## [25] scales_1.1.1 checkmate_2.0.0 genefilter_1.68.0
## [28] askpass_1.1 rappdirs_0.3.3 stringr_1.4.0
## [31] digest_0.6.27 foreign_0.8-72 rmarkdown_2.7
## [34] XVector_0.26.0 base64enc_0.1-3 jpeg_0.1-8.1
## [37] pkgconfig_2.0.3 htmltools_0.5.1.1 dbplyr_2.1.1
## [40] highr_0.9 fastmap_1.1.0 htmlwidgets_1.5.3
## [43] rlang_0.4.10 rstudioapi_0.13 RSQLite_2.2.7
## [46] farver_2.1.0 jquerylib_0.1.3 generics_0.1.0
## [49] jsonlite_1.7.2 RCurl_1.98-1.3 magrittr_2.0.1
## [52] GenomeInfoDbData_1.2.2 Formula_1.2-4 Matrix_1.3-2
## [55] Rcpp_1.0.6 munsell_0.5.0 fansi_0.4.2
## [58] lifecycle_1.0.0 stringi_1.5.3 yaml_2.2.1
## [61] zlibbioc_1.32.0 BiocFileCache_1.10.2 grid_3.6.1
## [64] blob_1.2.1 crayon_1.4.1 lattice_0.20-41
## [67] splines_3.6.1 annotate_1.64.0 hms_1.0.0
## [70] locfit_1.5-9.4 knitr_1.33 pillar_1.6.0
## [73] geneplotter_1.64.0 XML_3.99-0.3 glue_1.4.2
## [76] evaluate_0.14 latticeExtra_0.6-29 data.table_1.14.0
## [79] png_0.1-7 vctrs_0.3.7 openssl_1.4.3
## [82] gtable_0.3.0 purrr_0.3.4 assertthat_0.2.1
## [85] cachem_1.0.4 xfun_0.22 xtable_1.8-4
## [88] survival_3.2-10 tibble_3.1.1 AnnotationDbi_1.48.0
## [91] memoise_2.0.0 cluster_2.1.2 ellipsis_0.3.1
These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.