Objectives
- Understand advantages of using gene ids when analyzing RNA-seq data.
- Given a list of ENSEMBL gene ids, add gene symbols and Entrez accessions.
- Generate common visualizations for differential expression comparisons
- Understand choosing thresholds for calling differentially expressed genes
- Discuss options for functional enrichments
Here we will generate summary figures for our results and annotate our DE tables.
Since, gene symbols can change over time or be ambiguous we use, and recommend, using the EMSEMBL reference genome and ENSEMBL IDs for alignments and we’ve been working with tables and data where all genes are labeled only by their long ENSEMBL ID. However, this can make it difficult to quickly look for genes of interest.
Luckily, Bioconductor provides many tools and resources to facilitate access to genomic annotation resources.
To start, we will first load the biomaRt library and choose what reference we want to access. For a more detailed walk through of using biomaRt, this training module might be useful, including what to do when annotations are not 1:1 mappings.
We’ll start by loading the biomaRt
library and calling
the useEnsembl()
function to select the database we’ll use
to extract the information we need. This will download the mapping of
ENSEMBL IDs to gene symbols, enabling us to eventually add the gene
symbol column we want.
library('biomaRt')
ensembl = useEnsembl(dataset = 'mmusculus_gene_ensembl', biomart='ensembl')
Note - this process takes some time and will take up a larger amount of working memory so proceed with caution if you try to run these commands on a laptop with less than 4G of memory
To identify possible filters to restrict our data,
we can use the listFilters
function. To identify the
attributes we want to retrive, we can use the
listAttributes
function. The best approach is to use list
or search functions to help narrow down the available options.
head(listFilters(mart = ensembl), n = 20)
head(listAttributes(ensembl), n = 30)
We can access additional genomic annotations using the bioMart
package. To identify we’ll structure our ‘query’ or search of the
bioMart resources to use the ENSEMBL
id from our alignment to add the gene symbols and gene description
for each gene.
id_mapping = getBM(attributes=c('ensembl_gene_id', 'external_gene_name'),
filters = 'ensembl_gene_id',
values = row.names(assay(dds_fitted)),
mart = ensembl)
# will take some time to run
# Preview the result
head(id_mapping)
ensembl_gene_id external_gene_name
1 ENSMUSG00000000001 Gnai3
2 ENSMUSG00000000028 Cdc45
3 ENSMUSG00000000031 H19
4 ENSMUSG00000000037 Scml2
5 ENSMUSG00000000049 Apoh
6 ENSMUSG00000000056 Narf
Now that we have the ENSEMBL information and a gene symbol to match to our results, we can proceed in the smaller groups. As with the previous exercise, we have broken it into small steps with hints as needed.
Note: For additional information regarding bioMart, please consult the ENSEMBL bioMart vignette or the broader Bioconductor Annotation Resources vignette.
Look at the two data frames that are going to be needed:
id_mapping
and results_minus_vs_plus
.
head(id_mapping)
ensembl_gene_id external_gene_name
1 ENSMUSG00000000001 Gnai3
2 ENSMUSG00000000028 Cdc45
3 ENSMUSG00000000031 H19
4 ENSMUSG00000000037 Scml2
5 ENSMUSG00000000049 Apoh
6 ENSMUSG00000000056 Narf
head(results_minus_vs_plus)
log2 fold change (MLE): condition minus vs plus
Wald test p-value: condition minus vs plus
DataFrame with 6 rows and 7 columns
baseMean log2FoldChange lfcSE stat pvalue padj call
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <factor>
ENSMUSG00000000001 1489.83039 0.297760 0.210310 1.415815 0.156830 0.868573 NS
ENSMUSG00000000028 1748.93544 0.226421 0.176795 1.280695 0.200301 0.902900 NS
ENSMUSG00000000031 2151.87715 0.457635 0.764579 0.598545 0.549476 0.995391 NS
ENSMUSG00000000037 24.91672 0.579130 0.561259 1.031840 0.302147 0.950613 NS
ENSMUSG00000000049 7.78377 -0.899483 1.553063 -0.579167 0.562476 0.998043 NS
ENSMUSG00000000056 19653.54030 -0.174048 0.203529 -0.855151 0.392468 0.982479 NS
We want to match the id
column of
results_minus_vs_plus
to the ensembl_gene_id
column of id_mapping
, and once that match is found, we want
to extract the external_gene_name
column of
id_mapping
to get the gene symbol. Next, look at the
documentation for dplyr::left_join()
and merge the
id_mapping
table into the
results_minus_vs_plus
table on the columns
ensembl_gene_id
and external_gene_name
.
results_minus_vs_plus_annotated = as_tibble(results_minus_vs_plus, rownames = "id") %>%
left_join(id_mapping, by = c('id' = 'ensembl_gene_id'))
head(results_minus_vs_plus_annotated)
# A tibble: 6 × 9
id baseMean log2FoldChange lfcSE stat pvalue padj call external_gene_name
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
1 ENSMUSG00000000001 1490. 0.298 0.210 1.42 0.157 0.869 NS Gnai3
2 ENSMUSG00000000028 1749. 0.226 0.177 1.28 0.200 0.903 NS Cdc45
3 ENSMUSG00000000031 2152. 0.458 0.765 0.599 0.549 0.995 NS H19
4 ENSMUSG00000000037 24.9 0.579 0.561 1.03 0.302 0.951 NS Scml2
5 ENSMUSG00000000049 7.78 -0.899 1.55 -0.579 0.562 0.998 NS Apoh
6 ENSMUSG00000000056 19654. -0.174 0.204 -0.855 0.392 0.982 NS Narf
We can use some of the tidyverse
functions we’ve
encountered previously to rename the external_gene_name
column to symbol
and to move it into the second column
position? Hint: Because of the order of the packages we may have loaded,
we’ll use dplyr::rename()
and dplyr::select()
instead of just the select()
function. We can discuss this
in a moment.
results_minus_vs_plus_annotated = results_minus_vs_plus_annotated %>%
dplyr::rename('symbol' = 'external_gene_name') %>%
dplyr::select(id, symbol, everything())
results_minus_vs_plus_annotated
# A tibble: 16,249 × 9
id symbol baseMean log2FoldChange lfcSE stat pvalue padj call
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 ENSMUSG00000000001 Gnai3 1490. 0.298 0.210 1.42 0.157 0.869 NS
2 ENSMUSG00000000028 Cdc45 1749. 0.226 0.177 1.28 0.200 0.903 NS
3 ENSMUSG00000000031 H19 2152. 0.458 0.765 0.599 0.549 0.995 NS
4 ENSMUSG00000000037 Scml2 24.9 0.579 0.561 1.03 0.302 0.951 NS
5 ENSMUSG00000000049 Apoh 7.78 -0.899 1.55 -0.579 0.562 0.998 NS
6 ENSMUSG00000000056 Narf 19654. -0.174 0.204 -0.855 0.392 0.982 NS
7 ENSMUSG00000000078 Klf6 951. -0.145 0.222 -0.652 0.515 0.992 NS
8 ENSMUSG00000000085 Scmh1 727. 0.265 0.275 0.964 0.335 0.966 NS
9 ENSMUSG00000000088 Cox5a 1427. -0.272 0.212 -1.28 0.199 0.903 NS
10 ENSMUSG00000000093 Tbx2 15.2 0.542 0.687 0.788 0.430 0.985 NS
# … with 16,239 more rows
And now we have our differential expression results annotated with gene symbols, which can help in the interpretation of the results, and can be used in downstream analysis such as functional analysis.
A key aspect of our analysis is preserving the relevant datasets for both our records and for downstream applications, such as functional enrichments.
We’ll write out our DE results, now that we’ve added information to the table to help us or our collaborators interpret the results.
write.csv(results_minus_vs_plus,
row.names = FALSE,
na = ".",
file="outputs/tables/DE_results_minus_vs_plus.csv")
In addition to the individual RObj(s) we saved earlier, we can
capture a snapshot our entire session using the save.image
function. This can be loaded in the same manner as an individual
Robj.
First, we’ll save our session info so we can reference the packages and versions used to generate these data.
session_summary <- sessionInfo()
save.image(file = "outputs/Robjs/DE_iron.RData")
In this section, we:
We’ve run through most of the building blocks needed to run a differential expression analysis and hopefully built up a better understanding of how differential expression comparisons work, particularly how experimental design can impact our results.
What to consider moving forward:
Let’s pause here for general questions
A way to determine possible broader biological interpretations from the observed DE results, is functional enrichments. There are many options, such as some included in this discussion thread. Other common functional enrichments approaches are gene set enrichment analysis, aka GSEA, Database for Annotation, Visualization and Integrated Discovery, aka DAVID, Ingenity, and iPathway Guide
The University of Michigan has license and support for additional tools, such as Cytoscape, so we recommend reaching out to staff with Taubman Library to learn more about resources that might be application toyour research.
sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS 13.2.1
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.52.0 knitr_1.41 rmarkdown_2.18
[4] data.table_1.14.6 RColorBrewer_1.1-3 pheatmap_1.0.12
[7] ggrepel_0.9.2 DESeq2_1.36.0 SummarizedExperiment_1.26.1
[10] Biobase_2.56.0 MatrixGenerics_1.8.1 matrixStats_0.63.0
[13] GenomicRanges_1.48.0 GenomeInfoDb_1.32.4 IRanges_2.30.1
[16] S4Vectors_0.34.0 BiocGenerics_0.42.0 forcats_0.5.2
[19] stringr_1.5.0 dplyr_1.0.10 purrr_0.3.5
[22] readr_2.1.3 tidyr_1.2.1 tibble_3.1.8
[25] ggplot2_3.4.0 tidyverse_1.3.2
loaded via a namespace (and not attached):
[1] googledrive_2.0.0 colorspace_2.0-3 ellipsis_0.3.2 XVector_0.36.0
[5] fs_1.5.2 rstudioapi_0.14 farver_2.1.1 bit64_4.0.5
[9] AnnotationDbi_1.58.0 fansi_1.0.3 lubridate_1.9.0 xml2_1.3.3
[13] codetools_0.2-18 splines_4.2.0 cachem_1.0.6 geneplotter_1.74.0
[17] jsonlite_1.8.3 broom_1.0.1 annotate_1.74.0 dbplyr_2.2.1
[21] png_0.1-8 compiler_4.2.0 httr_1.4.4 backports_1.4.1
[25] assertthat_0.2.1 Matrix_1.5-3 fastmap_1.1.0 gargle_1.2.1
[29] cli_3.4.1 prettyunits_1.1.1 htmltools_0.5.3 tools_4.2.0
[33] gtable_0.3.1 glue_1.6.2 GenomeInfoDbData_1.2.8 rappdirs_0.3.3
[37] Rcpp_1.0.9 cellranger_1.1.0 jquerylib_0.1.4 vctrs_0.5.1
[41] Biostrings_2.64.1 xfun_0.35 rvest_1.0.3 timechange_0.1.1
[45] lifecycle_1.0.3 XML_3.99-0.13 googlesheets4_1.0.1 zlibbioc_1.42.0
[49] scales_1.2.1 vroom_1.6.0 hms_1.1.2 parallel_4.2.0
[53] curl_4.3.3 yaml_2.3.6 memoise_2.0.1 sass_0.4.4
[57] stringi_1.7.8 RSQLite_2.2.19 highr_0.9 genefilter_1.78.0
[61] filelock_1.0.2 BiocParallel_1.30.4 rlang_1.0.6 pkgconfig_2.0.3
[65] bitops_1.0-7 evaluate_0.18 lattice_0.20-45 labeling_0.4.2
[69] bit_4.0.5 tidyselect_1.2.0 magrittr_2.0.3 R6_2.5.1
[73] generics_0.1.3 DelayedArray_0.22.0 DBI_1.1.3 pillar_1.8.1
[77] haven_2.5.1 withr_2.5.0 survival_3.4-0 KEGGREST_1.36.3
[81] RCurl_1.98-1.9 modelr_0.1.10 crayon_1.5.2 utf8_1.2.2
[85] BiocFileCache_2.4.0 tzdb_0.3.0 progress_1.2.2 locfit_1.5-9.6
[89] grid_4.2.0 readxl_1.4.1 blob_1.2.3 reprex_2.0.2
[93] digest_0.6.30 xtable_1.8-4 munsell_0.5.0 bslib_0.4.1
These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.