In this module, we will learn:
- The advantages of using gene ids when analyzing RNA-seq data.
- How to find gene symbols and other annotations, using ENSEMBL gene
ids
- How to output our results to file
- General options for functional enrichments and other follow-ups
Differential Expression Workflow
Here we will generate summary figures for our results and annotate
our DE tables.
Generating gene annotations
Because gene symbols can change over time or be ambiguous, we use, and
recommend using, the ENSEMBL reference genome and ENSEMBL IDs for
alignments. As a result, we’ve been working with tables and data where all
genes are labeled only by their long ENSEMBL ID. However, this can make it
difficult to quickly look for genes of interest.
Luckily, Bioconductor provides many tools and resources to facilitate
access to genomic
annotation resources.
To start, we will load the biomaRt library and call the useEnsembl()
function to select the database we’ll use to extract the information we
need. This connects us to the resource that holds the mapping of ENSEMBL
IDs to gene symbols, enabling us to eventually add the gene symbol column
we want. For a more detailed walkthrough of biomaRt, including what to do
when annotations are not 1:1 mappings, this training module might be
useful.
library('biomaRt')
ensembl = useEnsembl(dataset = 'mmusculus_gene_ensembl', biomart='ensembl')
Note: this process takes some time and uses a substantial amount of
working memory, so proceed with caution if you run these commands on a
laptop with less than 4 GB of memory.
To identify possible filters to restrict our data, we can use the
listFilters() function. To identify the attributes we want to retrieve, we
can use the listAttributes() function. Both return long tables, so the
best approach is to use the list or search functions to narrow down the
available options.
head(listFilters(mart = ensembl), n = 20)
head(listAttributes(ensembl), n = 30)
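The narrowing-down step can be sketched offline. The example below is a minimal illustration, assuming a small hand-made data frame (`attrs`, hypothetical) in the same two-column shape that listAttributes() returns; it is not a real download. biomaRt also provides searchAttributes() and searchFilters(), which take a mart and a pattern, and may be more convenient when a connection is available.

```r
# Toy stand-in for the table returned by listAttributes(ensembl).
# These rows are illustrative only, not a real query result.
attrs <- data.frame(
  name = c("ensembl_gene_id", "external_gene_name", "description",
           "chromosome_name", "entrezgene_id"),
  description = c("Gene stable ID", "Gene name", "Gene description",
                  "Chromosome/scaffold name", "NCBI gene ID"),
  stringsAsFactors = FALSE
)

# Keep rows whose name or description mentions "gene" (case-insensitive)
is_hit <- grepl("gene", attrs$name, ignore.case = TRUE) |
  grepl("gene", attrs$description, ignore.case = TRUE)
hits <- attrs[is_hit, ]

print(hits)
```

The same grepl() pattern can be applied directly to the real listAttributes() or listFilters() output once it has been retrieved.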
We can access additional genomic annotations using the biomaRt
package. To do so, we’ll structure our ‘query’, or search, of the BioMart
resources to use the ENSEMBL IDs from our alignment as the lookup key and
retrieve the gene symbol for each gene. Other attributes, such as the gene
description, can be requested in the same way.
id_mapping = getBM(attributes=c('ensembl_gene_id', 'external_gene_name'),
filters = 'ensembl_gene_id',
values = row.names(assay(dds_batch_fitted)),
mart = ensembl)
# will take some time for the query to run
# Preview the result
head(id_mapping)
     ensembl_gene_id external_gene_name
1 ENSMUSG00000000001              Gnai3
2 ENSMUSG00000000028              Cdc45
3 ENSMUSG00000000031                H19
4 ENSMUSG00000000037              Scml2
5 ENSMUSG00000000049               Apoh
6 ENSMUSG00000000056               Narf
The id_mapping table now contains the ENSEMBL ID and a gene symbol only
for the genes included in our results. This table should look familiar, as
it’s the same table we used to annotate our results table in the last
module.
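As a reminder of how a mapping table like this is attached to results, here is a minimal sketch using base R merge(). The tables `res` and `toy_mapping` are hypothetical stand-ins with made-up values; only the column names ensembl_gene_id and external_gene_name come from the query above. It also shows one way to check for IDs that are not 1:1 before joining.

```r
# Toy results table keyed by ENSEMBL ID (values are made up for illustration)
res <- data.frame(
  gene_id = c("ENSMUSG00000000001", "ENSMUSG00000000028", "ENSMUSG00000099999"),
  log2FoldChange = c(0.5, -1.2, 2.0),
  stringsAsFactors = FALSE
)

# Toy mapping in the same shape as id_mapping; the third ID has no entry
toy_mapping <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000000001", "ENSMUSG00000000028"),
  external_gene_name = c("Gnai3", "Cdc45"),
  stringsAsFactors = FALSE
)

# IDs that appear more than once in the mapping would duplicate result rows
dup_ids <- toy_mapping$ensembl_gene_id[duplicated(toy_mapping$ensembl_gene_id)]

# Left join: keep every result row; unmatched IDs get NA for the symbol
res_annotated <- merge(res, toy_mapping,
                       by.x = "gene_id", by.y = "ensembl_gene_id",
                       all.x = TRUE)

print(res_annotated)
```

Setting all.x = TRUE keeps genes without a symbol in the table rather than silently dropping them, which is usually the safer choice for a results file.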
Note: For additional information regarding biomaRt,
please consult the ENSEMBL
biomaRt vignette or the broader Bioconductor
Annotation Resources vignette.
Outputting results to file
A key aspect of our analysis is preserving the relevant datasets for
both our records and for downstream applications, such as functional
enrichments.
DE results table
We’ll write out our DE results, now that we’ve added information to
the table to help us or our collaborators interpret the results.
write.csv(results_deficient_vs_control,
row.names = FALSE,
na = ".",
file="outputs/tables/DE_results_deficient_vs_control.csv")
write.csv(results_deficient_vs_control_annotated,
row.names = FALSE,
file="outputs/tables/DE_results_deficient_vs_control_annotated.csv")
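write.csv() will fail if the target directory does not exist, and the na = "." choice affects how the file is read back. The sketch below illustrates both points with a toy table written under tempdir(); the object and file names (`toy_results`, `toy_DE_results.csv`) are hypothetical.

```r
# Create the output directory if needed (here under a temporary folder)
out_dir <- file.path(tempdir(), "outputs", "tables")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Toy results with a missing adjusted p-value
toy_results <- data.frame(
  gene_id = c("geneA", "geneB"),
  padj = c(0.01, NA),
  stringsAsFactors = FALSE
)

out_file <- file.path(out_dir, "toy_DE_results.csv")
write.csv(toy_results, file = out_file, row.names = FALSE, na = ".")

# When reading back, declare "." as the missing-value marker so that
# padj is parsed as numeric rather than character
check <- read.csv(out_file, na.strings = ".")
str(check)
```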
Subsetting significant genes
You may be interested in creating a table of only the genes that pass
your significance thresholds. A useful way to do this is to
conditionally subset your results. Again, we already created the call
column, which makes this relatively simple to do:
# tidyverse (tibble and dplyr); requires converting the results table to a tibble
res_sig <- as_tibble(results_deficient_vs_control, rownames = "gene_ids") %>% filter(call != 'NS')
head(res_sig)
# A tibble: 6 × 8
gene_ids baseMean log2FoldChange lfcSE stat pvalue padj call
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 ENSMUSG00000000275 1662. -0.674 0.200 -3.38 0.000734 3.12e-2 Down
2 ENSMUSG00000000861 1679. 0.685 0.210 3.27 0.00109 3.98e-2 Up
3 ENSMUSG00000001281 532. 1.15 0.242 4.75 0.00000203 5.54e-4 Up
4 ENSMUSG00000002109 125. 1.06 0.300 3.54 0.000398 2.07e-2 Up
5 ENSMUSG00000002985 157. -1.16 0.345 -3.35 0.000814 3.29e-2 Down
6 ENSMUSG00000003865 89.3 2.26 0.627 3.61 0.000310 1.80e-2 Up
dim(res_sig)
[1] 189 8
Once we’ve created this table, we can also write it out to file:
write.csv(res_sig,
row.names = FALSE,
na = ".",
file="outputs/tables/DEGs-only_deficient_vs_control.csv")
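For comparison, the same kind of conditional subset can be written in base R. This is a minimal sketch on a hypothetical toy table (`toy_res`) that only mimics the call column; it is not the workshop's results object.

```r
# Toy table with a 'call' column like the one created earlier
toy_res <- data.frame(
  gene_ids = c("geneA", "geneB", "geneC", "geneD"),
  padj = c(0.001, 0.2, 0.03, NA),
  call = factor(c("Up", "NS", "Down", NA), levels = c("Down", "NS", "Up")),
  stringsAsFactors = FALSE
)

# which() drops rows where the comparison is NA, matching how
# dplyr::filter() treats NA conditions
toy_sig <- toy_res[which(toy_res$call != "NS"), ]

# Count significant genes by direction
print(table(toy_sig$call))
```

Indexing with the bare logical vector instead of which() would return an all-NA row for geneD, which is an easy mistake to miss in a large table.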
R session data
In addition to the individual Robj(s) we saved earlier, we can
capture a snapshot of our entire session using the save.image()
function. The resulting file can be loaded in the same manner as an
individual Robj.
First, we’ll save our session info so we can reference the packages
and versions used to generate these data.
session_summary <- sessionInfo()
save.image(file = "outputs/Robjs/DE_iron.RData")
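To illustrate what loading such a file looks like, here is a small sketch. It uses save() with named objects rather than save.image() so that it stays self-contained, writes to tempdir(), and loads into a separate environment to avoid overwriting anything in the current workspace. The object and file names are hypothetical.

```r
# Toy object plus the session info, saved together in one .RData file
toy_counts <- c(a = 10, b = 20)
session_summary <- sessionInfo()
rdata_file <- file.path(tempdir(), "toy_session.RData")
save(toy_counts, session_summary, file = rdata_file)

# load() returns the names of the restored objects; loading into a new
# environment keeps them apart from objects already in the workspace
restored <- new.env()
loaded_names <- load(rdata_file, envir = restored)
print(loaded_names)

# Optionally keep a plain-text copy of the session info as well
info_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(print(session_summary)), info_file)
```

A plain-text copy is handy because it can be read without opening R, for example when preparing a methods section.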
Overall takeaways
We’ve run through most of the building blocks needed to run a
differential expression analysis and hopefully built up a better
understanding of how differential expression comparisons work,
particularly how experimental design can impact our results.
What to consider moving forward:
- How can I control for technical variation in my experimental
design?
- How much variation is expected within a treatment group?
- What is my RNA quality, and how can that be optimized?
- Are there quality concerns for my sequencing data?
- What comparisons are relevant to my biological question?
- Are there covariates that should be considered?
- What will a differential expression analysis tell me?
Let’s pause here for general questions
Next steps - How do we make sense of large numbers of DE genes?
One way to explore broader biological interpretations of the observed DE
results is functional enrichment analysis.
There are many options, such as some included in this discussion
thread. Other common functional enrichment approaches are gene set
enrichment analysis (GSEA), the Database for Annotation, Visualization and
Integrated Discovery (DAVID), Ingenuity, and iPathwayGuide.
The University of Michigan has licenses and support for additional
tools, such as Cytoscape, so we recommend reaching out to staff at Taubman
Library to learn more about resources that might be applicable
to your research.
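To give a sense of what many enrichment tools compute under the hood, here is a minimal over-representation sketch in base R. It is not the method of any specific tool named above. Apart from the 189 DE genes reported earlier, all numbers (universe size, gene set size, overlap) are hypothetical.

```r
# Hypothetical inputs for one gene set
universe_size <- 10000  # assumed number of genes tested
set_size <- 200         # assumed number of tested genes in the gene set
n_de <- 189             # number of DE genes (from the subset above)
overlap <- 12           # assumed number of DE genes that are in the set

# Probability of seeing at least 'overlap' set members among the DE genes
# if DE genes were drawn at random from the universe
p_hyper <- phyper(overlap - 1, set_size, universe_size - set_size, n_de,
                  lower.tail = FALSE)

# The same question as a one-sided Fisher's exact test on a 2x2 table
contingency <- matrix(
  c(overlap, set_size - overlap,
    n_de - overlap, universe_size - set_size - n_de + overlap),
  nrow = 2
)
p_fisher <- fisher.test(contingency, alternative = "greater")$p.value

print(c(hypergeometric = p_hyper, fisher = p_fisher))
```

In practice this test is repeated across many gene sets, so the resulting p-values need multiple-testing correction, for example with p.adjust().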
Session Info
sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-apple-darwin20
Running under: macOS Sonoma 14.4.1
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Detroit
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] biomaRt_2.60.1 data.table_1.15.4
[3] RColorBrewer_1.1-3 pheatmap_1.0.12
[5] ggrepel_0.9.5 lubridate_1.9.3
[7] forcats_1.0.0 stringr_1.5.1
[9] dplyr_1.1.4 purrr_1.0.2
[11] readr_2.1.5 tidyr_1.3.1
[13] tibble_3.2.1 ggplot2_3.5.1
[15] tidyverse_2.0.0 DESeq2_1.44.0
[17] SummarizedExperiment_1.34.0 Biobase_2.64.0
[19] MatrixGenerics_1.16.0 matrixStats_1.3.0
[21] GenomicRanges_1.56.1 GenomeInfoDb_1.40.1
[23] IRanges_2.38.1 S4Vectors_0.42.1
[25] BiocGenerics_0.50.0 knitr_1.47
[27] rmarkdown_2.27
loaded via a namespace (and not attached):
[1] DBI_1.2.3 httr2_1.0.2 rlang_1.1.4
[4] magrittr_2.0.3 compiler_4.4.0 RSQLite_2.3.7
[7] png_0.1-8 vctrs_0.6.5 pkgconfig_2.0.3
[10] crayon_1.5.3 fastmap_1.2.0 dbplyr_2.5.0
[13] XVector_0.44.0 labeling_0.4.3 utf8_1.2.4
[16] tzdb_0.4.0 UCSC.utils_1.0.0 bit_4.0.5
[19] xfun_0.44 zlibbioc_1.50.0 cachem_1.1.0
[22] jsonlite_1.8.8 progress_1.2.3 blob_1.2.4
[25] highr_0.11 DelayedArray_0.30.1 BiocParallel_1.38.0
[28] parallel_4.4.0 prettyunits_1.2.0 R6_2.5.1
[31] bslib_0.7.0 stringi_1.8.4 jquerylib_0.1.4
[34] Rcpp_1.0.13 Matrix_1.7-0 timechange_0.3.0
[37] tidyselect_1.2.1 rstudioapi_0.16.0 abind_1.4-5
[40] yaml_2.3.8 codetools_0.2-20 curl_5.2.1
[43] lattice_0.22-6 withr_3.0.1 KEGGREST_1.44.1
[46] evaluate_0.23 BiocFileCache_2.12.0 xml2_1.3.6
[49] Biostrings_2.72.1 filelock_1.0.3 pillar_1.9.0
[52] BiocManager_1.30.23 generics_0.1.3 hms_1.1.3
[55] munsell_0.5.1 scales_1.3.0 glue_1.7.0
[58] tools_4.4.0 locfit_1.5-9.10 grid_4.4.0
[61] AnnotationDbi_1.66.0 colorspace_2.1-1 GenomeInfoDbData_1.2.12
[64] cli_3.6.2 rappdirs_0.3.3 fansi_1.0.6
[67] S4Arrays_1.4.1 gtable_0.3.5 sass_0.4.9
[70] digest_0.6.35 SparseArray_1.4.8 farver_2.1.2
[73] memoise_2.0.1 htmltools_0.5.8.1 lifecycle_1.0.4
[76] httr_1.4.7 bit64_4.0.5
These materials have been adapted and extended from materials listed
above. These are open access materials distributed under the terms of
the Creative
Commons Attribution license (CC BY 4.0), which permits unrestricted
use, distribution, and reproduction in any medium, provided the original
author and source are credited.
