Objectives:
- Understand advantages of using gene ids when analyzing data.
- Given a list of ENSEMBL gene ids, add gene symbols and Entrez accessions.
- Generate common visualizations for differential expression comparisons
- Understand choosing thresholds for calling differentially expressed genes
- Discuss options for functional enrichments
Here we will generate summary figures for our results and annotate our DE tables.
TODO: double check references to annotated table prior to running annotations
Part of differential expression analysis is generating visualizations to share our results. While the DESeq2 tutorial provides examples of other visualizations, a common visualization to summarize DE comparisons are volcano plots.
As described by this Galaxy project tutorial, a volcano plot is a type of scatterplot that shows statistical significance (adjusted p-value) versus magnitude of change (fold change). In a volcano plot, the most upregulated genes are towards the right, the most downregulated genes are towards the left, and the most statistically significant genes are towards the top.
First, we need to set thresholds for determining significant genes. A reasonable threshold would be a fold-change of less than -1.5 or greater than 1.5 and an adjusted pvalue less than 0.05.
fc <- 1.5
pval <- 0.05
Then, we need to sep up some objects to run our plotting code, creating a new object that will have the right shape for plotting, labeling our comparison of interest, and setting up our output directory
df<- res_WT[order(res_WT$padj),] #select our data of interest
df <- as.data.frame(df) #convert our object type
df <- cbind("id" = row.names(df), df) #set rownames to valid column
str(df)
## 'data.frame': 24745 obs. of 7 variables:
## $ id : chr "ENSMUSG00000105263" "ENSMUSG00000069306" "ENSMUSG00000069045" "ENSMUSG00000086503" ...
## $ baseMean : num 4505 7456 876 29135 650 ...
## $ log2FoldChange: num -1.23 6.49 9.86 -7.64 8.78 ...
## $ lfcSE : num 0.089 0.475 0.76 0.608 0.701 ...
## $ stat : num -13.9 13.7 13 -12.6 12.5 ...
## $ pvalue : num 1.25e-43 1.58e-42 1.93e-38 3.15e-36 5.06e-36 ...
## $ padj : num 2.18e-39 1.39e-38 1.13e-34 1.38e-32 1.77e-32 ...
Comparison <- "ko.control_v_wt.control"
Once we’ve setup the annotations we need, we can proceed with generating the volcano plot, including our thresholds.
pdf(file = paste0(plotPath,'VolcanoPlot_', Comparison, '.pdf'), onefile = FALSE)
# Initialize the plot, saving as object 'p' and specifying the plot type as 'geom_point'
p <- ggplot(df, aes(x = log2FoldChange, y = -log10(padj))) +
geom_point(shape = 21, colour= 'darkgrey', size = 1) +
theme_classic() +
xlab('Log2 fold-change') + ylab('-Log10 adjusted p-value')
# Add threshold lines
p <- p +
geom_vline(
xintercept = c(0, -log2(fc), log2(fc)),
linetype = c(1, 2, 2)) +
geom_hline(
yintercept = -log10(pval),
linetype = 2)
# Add Title that includes comparison name
p <- p + ggtitle(as.character(Comparison))
print(p)
dev.off()
## quartz_off_screen
## 2
p
Now we need to subset our data to label the datapoints (genes) that pass our thresholds.
df$dot <- rep(3, nrow(df))
df$dot[which(df$padj <= pval & df$log2FoldChange < 0 & abs(df$log2FoldChange) >= log2(fc))] = 2
df$dot[which(df$padj <= pval & df$log2FoldChange > 0 & abs(df$log2FoldChange) >= log2(fc))] = 1
df$sig <- df$dot
#take top 5 up, down, then combine, assign label
top <- rbind(head(subset(df, df$dot == 1), 5),head(subset(df, df$dot == 2), 5))
top$label <- top$id
df <- merge(x = df, y = top[,c('id','label')], by = "id", all.x = TRUE)
#count the number of significan up and down genes, assign value for legend
df$dot <- factor(df$dot,levels = c(1,2,3), labels = c(paste0('Up: ', sum(df$dot == 1)),paste0('Down: ', sum(df$dot == 2)),'NS'))
Once we’ve setup the annotations we need, we can proceed with generating the volcano plot.
pdf(file = paste0(plotPath,'VolcanoPlot_Fancier', Comparison, '.pdf'), onefile = FALSE)
p <- ggplot(df, aes(x = log2FoldChange, y = -log10(padj))) +
geom_point(aes(color = df$dot), size = 1) +
theme_classic() +
xlab('Log2 fold-change') + ylab('-Log10 adjusted p-value')
# specify colors for up-/down-/nonsignificant genes
p <- p + scale_color_manual(name = '', values=c('#B31B21', '#1465AC', 'darkgray'))
# add threshol lines
p <- p +
geom_vline(
xintercept = c(0, -log2(fc), log2(fc)),
linetype = c(1, 2, 2),
color = c('black', 'black', 'black')) +
geom_hline(
yintercept = -log10(pval),
linetype = 2,
color = 'black')
# Add Title that includes comparison name
p <- p + ggtitle(as.character(Comparison))
print(p)
dev.off()
## quartz_off_screen
## 2
p
For additional visualizations for our DE results, we included some example code in the Bonus Content module and this HBC tutorial includes some nice examples.
To generate a general summary of the DE results, we can use the summary
function to generate a basic summary by DESeq2.
However, we can also use conditional statements to determine the number of genes that pass our thresholds for each comparison, which might be more informative.
[Question]: How would we identify the number of genes with adjusted p-values < 0.05 and a fold-change above 1.5 (or below -1.5)?
#summary(res_WT)
sum(res_WT$padj < 0.05 & abs(res_WT$log2FoldChange) >= log2(1.5), na.rm = TRUE)
## [1] 735
sum(res_Tx$padj < 0.05 & abs(res_Tx$log2FoldChange) >= log2(1.5), na.rm = TRUE)
## [1] 1151
How do the number of DE genes compare to what we observed from the PCA plots?
Thresholding on adjusted p-values < 0.05 is a standard threshold, but depending on the research question and/or how the results will be used, other thresholds might be reasonable.
There is a nice Biostart post that discusses choosing adjusted p-value thresholds, including cases where a more relaxed threshold might be appropriate and (some heated) discussion of the dangers of adjusting the choosen threshold after running an analysis. Additionally, there is a Dalmon et al 2012 paper about p-value and fold-change thresholds for microarray data that may help provide some context.
It can be useful to annotate our results table with additional information to make them easier to interpret, such as adding additional gene information or better summarizing our DE results.
You may be interested in identifying only the genes that pass your significance thresholds. A useful way to do this is to conditionally subset your results.
Note: The tidyverse functions you learned in Software Carpentry could also be alternatively used here.
res_sig <- na.omit(res_WT)
res_sig <- res_sig[which(res_sig$padj < 0.05 & abs(res_sig$log2FoldChange) >= log2(1.5)), ]
head(res_sig)
For more details about sub-setting tables in R, we recommend reviewing the Data Carpentry manipulating and analyzing data module. We can also annotate our results to include a column that identifies our significant genes.
We can also add an annotation column to indicate if a gene passes the padj
and log2FoldChange
thresholds we set to call DE genes.
res_WT_DE <- res_WT # copy table
# add a column and assign all genes are non-significance
res_WT_DE$Call <- rep(FALSE, length(res_WT$baseMean))
# change 'Call' column to TRUE if meets conditions for significant differences
res_WT_DE[which(!is.na(res_WT_DE$padj) & res_WT_DE$padj < 0.05 & abs(res_WT_DE$log2FoldChange) >= log2(1.5)), ]$Call <- TRUE
# reorder table to rank significant genes at the top
res_WT_DE <- res_WT_DE[order(-res_WT_DE$Call),]
head(res_WT_DE)
Since, gene symbols can change over time or be ambiguous we use and recommend using the EMSEMBL reference genome and ENSEMBL ids for alignmentsand we’ve been working with tables and data where all genes are labeled only by their long ENSEML ID. However, this can make it difficult to quickly look at results for genes of interest.
Luckily, Bioconductor provides many tools and resources to facilitate access to genomic annotation resources.
To start, we will first load the biomaRt library & choose what reference we want to access. For a more detailed walk through of using biomaRt, this training module might be useful, including what to do when annotations are not 1:1 mappings.
First, we’ll load the biomaRt
library and then specify the dataset and biomart we want to use, where the values for dataset
and biomart
are based on the biomaRt vignette on how to build a query and the biomaRt vignette on accessing ensembl
library("biomaRt")
ensembl <- useEnsembl(dataset = "mmusculus_gene_ensembl", biomart='ensembl')
Note that this process takes some time and will take up a larger amount of working memory so proceed with caution if you try to run these commands on a laptop with less than 4G of memory
To identify possible filters to restrict our data, we can use the listFilters
function. To identify the attributes we want to retrive, we can use the listAttributes
function. The best approach is to use list or search functions to help narrow down the available options.
head(listFilters(mart = ensembl), n = 20)
head(listAttributes(ensembl), n = 30)
We can access additional genomic annotations using the bioMart
package. To identify we’ll structure our ‘query’ or search of the bioMart resources to use the ENSEMBL id from our alignment to add the gene symbols and gene description for each gene.
## wasn't working for my version of biomaRt but might work for you
GeneKey <- getBM(attributes=c('ensembl_gene_id', 'external_gene_name'),
filters = 'ensembl_gene_id',
values = row.names(assay(dds)),
mart = ensembl) # will take some time to run
## check the key
head(GeneKey)
## ensembl_gene_id external_gene_name
## 1 ENSMUSG00000000001 Gnai3
## 2 ENSMUSG00000000028 Cdc45
## 3 ENSMUSG00000000031 H19
## 4 ENSMUSG00000000037 Scml2
## 5 ENSMUSG00000000049 Apoh
## 6 ENSMUSG00000000056 Narf
Now that we have the ENSEMBL information and a gene symbol to match to our results, we can procced to the breakout exercise to annotate our DE table(s)
head(res_WT_anno)
## genes external_gene_name baseMean log2FoldChange lfcSE stat
## 1 ENSMUSG00000000001 Gnai3 6255.632164 -0.014024321 0.09301588 -0.15077341
## 2 ENSMUSG00000000028 Cdc45 1337.874474 0.522421732 0.13599465 3.84148730
## 3 ENSMUSG00000000031 H19 3.773571 -1.156597290 1.60080258 -0.72251088
## 4 ENSMUSG00000000037 Scml2 27.563275 -0.279611867 0.47655119 -0.58674046
## 5 ENSMUSG00000000049 Apoh 2.256350 4.010718329 1.88912289 2.12305846
## 6 ENSMUSG00000000056 Narf 2194.251314 -0.008544091 0.17722204 -0.04821122
## pvalue padj
## 1 0.8801544640 0.960702559
## 2 0.0001222911 0.003948693
## 3 0.4699804359 NA
## 4 0.5573780290 0.804647138
## 5 0.0337489531 NA
## 6 0.9615479086 0.987888734
Notice that not all genes were annotated with an ENSEMBl gene id or gene description. While we are able to annotate our results, we should be very cautious as the gene symbol is not a good unique identifier plus we did not use a UCSC annotation resource so the HUGO gene symbol may not always match. However, this code is similar to the steps needed to annotate ENSEMBL id based results, like what would have been generated from yesterday's alignments, with more interpretable gene symbols.
Note: For additional information regarding bioMart, please consult the ENSEMBL bioMart vignette or the broader Bioconductor Annotation Resources vignette.
A key aspect of our analysis is preserving the relevant datasets for both our records and for downstream applications, such as functional enrichments.
First, we’ll setup a new directory for our output tables.
dir.create("tables", showWarnings = FALSE)
Next we’ll write out our DE results for the KD comparison to file, since we added additional information to that table.
write.csv(res_WT_anno,
row.names = FALSE,
na = ".",
file="tables/DEResults_ko.control_v_wt.control_annotated.csv")
If we generated another comparision, we could repeat our annotations or write the DE results directly to file.
write.csv(res_Tx,
row.names = FALSE,
na = ".",
file="tables/DEResults_ko.Tx_v_wt.Tx.csv")
The most relevant count tables are the raw, filtered count table that we used as the input for our analysis and the rlog normalized count table that we used for our quality control visualizations.
To output the raw counts, we will need to use the counts
function to access the count table from within its larger DESeqDataSet
object.
write.csv(counts(dds, normalized = FALSE), file="tables/DESeq2_raw_counts.csv")
Then we’ll output the rlog count table, using the assay
function to access the normalized count table from within its larger DESeqDataSet
object.
write.csv(assay(rld), file="tables/DESeq2_rlogNormalized_counts.csv")
In this section, we:
Overall, we’ve run through most of the building blocks needed to run a differential expression analysis and hopefully built up a better understanding of how differential expression comparisons work, particularly how experimental design can impact our results.
What to consider moving forward:
Let’s pause here for general questions
Now that we have our DE results, have we address the biological question relevant to the authors of the original paper? On the one hand, yes - we now have two tables of genes that are impacted by change in Mov10 expression. But with two lists of genes alone, it can be difficult to find patterns or understand broader biological impacts.
What if we wanted to find out what genes were signifant in both comparisons? The intersect
function, such as implimented as part of the dplyr package would be useful to identify shared significant genes. A venn diagram could be a way to visualize these overlaps.
A way to determine possible broader biological interpretations from the observed DE results, is functional enrichments. There are many options, such as some included in this discussion thread. Other common functional enrichments approaches are gene set enrichment analysis, aka GSEA, Database for Annotation, Visualization and Integrated Discovery, aka DAVID, Ingenity, and [iPathway Guide]
The University of Michigan has license and support for additional tools, such as Cytoscape, so we recommend reaching out to staff with Taubman Library to learn more about resources that might be application toyour research.
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] biomaRt_2.50.0 rmarkdown_2.11 data.table_1.14.2
## [4] RColorBrewer_1.1-2 pheatmap_1.0.12 ggrepel_0.9.1
## [7] dplyr_1.0.7 tidyr_1.1.4 ggplot2_3.3.5
## [10] DESeq2_1.34.0 SummarizedExperiment_1.24.0 Biobase_2.54.0
## [13] MatrixGenerics_1.6.0 matrixStats_0.61.0 GenomicRanges_1.46.0
## [16] GenomeInfoDb_1.30.0 IRanges_2.28.0 S4Vectors_0.32.2
## [19] BiocGenerics_0.40.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 bit64_4.0.5 filelock_1.0.2 progress_1.2.2
## [5] httr_1.4.2 tools_4.1.1 utf8_1.2.2 R6_2.5.1
## [9] DBI_1.1.1 colorspace_2.0-2 withr_2.4.2 tidyselect_1.1.1
## [13] prettyunits_1.1.1 bit_4.0.4 curl_4.3.2 compiler_4.1.1
## [17] cli_3.0.1 xml2_1.3.2 DelayedArray_0.20.0 labeling_0.4.2
## [21] scales_1.1.1 genefilter_1.76.0 rappdirs_0.3.3 stringr_1.4.0
## [25] digest_0.6.28 XVector_0.34.0 pkgconfig_2.0.3 htmltools_0.5.2
## [29] dbplyr_2.1.1 fastmap_1.1.0 highr_0.9 rlang_0.4.11
## [33] rstudioapi_0.13 RSQLite_2.2.8 jquerylib_0.1.4 generics_0.1.0
## [37] farver_2.1.0 BiocParallel_1.28.0 RCurl_1.98-1.5 magrittr_2.0.1
## [41] GenomeInfoDbData_1.2.7 Matrix_1.3-4 Rcpp_1.0.7 munsell_0.5.0
## [45] fansi_0.5.0 lifecycle_1.0.1 stringi_1.7.5 yaml_2.2.1
## [49] zlibbioc_1.40.0 BiocFileCache_2.2.0 grid_4.1.1 blob_1.2.2
## [53] parallel_4.1.1 crayon_1.4.1 lattice_0.20-44 Biostrings_2.62.0
## [57] splines_4.1.1 annotate_1.72.0 hms_1.1.1 KEGGREST_1.34.0
## [61] locfit_1.5-9.4 knitr_1.36 pillar_1.6.3 geneplotter_1.72.0
## [65] XML_3.99-0.8 glue_1.4.2 evaluate_0.14 png_0.1-7
## [69] vctrs_0.3.8 gtable_0.3.0 purrr_0.3.4 assertthat_0.2.1
## [73] cachem_1.0.6 xfun_0.26 xtable_1.8-4 survival_3.2-11
## [77] tibble_3.1.5 AnnotationDbi_1.56.2 memoise_2.0.0 ellipsis_0.3.2
These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.