Today we will:
* Get an overview of a differential expression workflow
* Run a differential expression analysis with DESeq2 starting from a count table
* Generate visualizations of gene expression for each sample
* Test for differential expression between experimental conditions
* Generate summary figures and tables of differential expression results
* Discuss key considerations around experimental design and your results might be impacted
Today we won’t:
* Execute every step that might be relevant to a differential expression analysis
* Generate all the figures we would normally generate in a differential expression analysis
* Exhaust possible errors or edge cases
* Execute the commands that would work best for your dataset
* Show what steps to take after analysis to interpret differential expression results
* Become RNA-seq analysis experts
Today we will proceed from a count table similar to what was generated yesterday through key steps in a differential expression analysis.
Step | Task |
---|---|
1 | Experimental Design |
2 | Biological Samples / Library Preparation |
3 | Sequence Reads |
4 | Assess Quality of Raw Reads |
5 | Splice-aware Mapping to Genome |
6 | Count Reads Associated with Genes |
7 | Organize project files locally |
8 | Initialize DESeq2 and fit DESeq2 model |
9 | Assess expression variance within treatment groups |
10 | Specify pairwise comparisons and test for differential expression |
11 | Generate summary figures for comparisons |
12 | Annotate differential expression result tables |
In this module, we will: * Overview of reproducible research & analysis setup * Broad introduction to DESeq2 & why it is widely used for differential expression comparisons * How to import and review gene count table
We’ll be exploring some RNA-seq data that is fairly representative of what we see in the core and start with input files similar to what would have been generated by the full alignment process that was introduced yesterday and similar to the count tables currently being delivered by the Advanced Sequencing Core.
To get started, we’ll open up our RStudio console and set up for our analysis.
To follow best practices as discussed by Nobel, 2009 for file organization for bioinformatics/computational projects, we will need to make sure there are separate storage locations for:
* Raw data
* Code
* Output files
To organize our files for our analysis today we’ll create a new folder, ideally in our home directory, using the ‘New Folder’ button on the right side of our screen & naming it RNASeqDemystified
.
Next, we’ll set our working directory and create a new subdirectory within RNASeqDemystified
called data
.
## alternative command
# system("mkdir ./RNASeqDemystified")
dir.create("./RNASeqDemystified")
## Warning in dir.create("./RNASeqDemystified"): './RNASeqDemystified' already
## exists
#setwd("~/RNASeqDemystified")
getwd()
## [1] "/Users/damki/repos/rnaseq_demystified_workshop_2021"
Checkpoint: Please use the red x if you are not in your own ‘RNASeqDemystified’ directory after executing these commands
Once we confirm our working directory, we’ll create a new folder to store our raw data.
## alternative command
# system("mkdir ./data")
dir.create("./RNASeqDemystified/data")
## Warning in dir.create("./RNASeqDemystified/data"): './RNASeqDemystified/data'
## already exists
Next, we’ll download the files we’ll need for today.
download.file("https://raw.githubusercontent.com/umich-brcf-bioinf/2021-04-26-umich-rnaseqDemystified/master/data/Day2Data/SampleInfo_trimmed.csv", "./RNASeqDemystified/data/SampleInfo_trimmed.csv")
download.file("https://raw.githubusercontent.com/umich-brcf-bioinf/2021-04-26-umich-rnaseqDemystified/master/data/Day2Data/gene_expected_count_trimmed.txt", "./RNASeqDemystified/data/gene_expected_count_trimmed.txt")
Checkpoint: If you have successfully downloaded the data then used the green ‘yes’ button. We’ll have a stopping point
Next, we’ll open a new ‘.R’ script file by using the toolbar at the top of our window. We’ll save our ‘Untitled1’ file as “RNASeqAnalysis”. This new “RNASeqAnalysis.R” will serve as a record of our analysis and, following best practices, is saved in a separate location from our raw data.
This code file can also serve as a ‘cheatsheet’ of commands to use for working through differential expression comparisons with other example datasets or your own data in the future.
To start off our code file, we’ll check for if the packages we’re using today were correctly installed.
# copy this line and paste into your R script file (on a new line)
missing <- setdiff(c("tidyr", "ggplot2", "pheatmap", "ggrepel", "formattable", "RColorBrewer", "matrixStats", "dplyr", "biomaRt", "DESeq2"), rownames(installed.packages()))
# place cursor on same line as command and use the 'Run' button OR Ctrl + Enter to execute command
# copy this line and pastw into your R script files (on a new line)
if (!length(missing)) { cat("Ready for Computational Foundations workshop\n")} else {cat("PROBLEM: could not install:", missing, "\n")}
## Ready for Computational Foundations workshop
# place cursor on same line as command and use the 'Run' button OR Ctrl + Enter to execute command
Checkpoint: Please use the ‘red x’ button if you are having issues with downloading the data, or if you don’t that the expected packages are installed, like shown below to be placed in a breakout room for help. If you have successfully downloaded the data then used the green ‘yes’ button
If you are seeing a PROBLEM
message after checking your package installation, here is the code to install the packages, taken from the setup instructions:
# copy these lines and paste into your R script file
install.packages("tidyr")
install.packages("ggplot2")
install.packages("pheatmap")
install.packages("ggrepel")
install.packages("formattable")
install.packages("RColorBrewer")
install.packages("matrixStats")
install.packages("dplyr")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("biomaRt","DESeq2"), update=FALSE, ask=FALSE)
# highlight all lines and use the 'run' button OR Ctrl + Enter to send commands to the Console to be executed by R
Note that there are several bonus content sections on the instruction pages, like the two below that we will not be covering in this workshop, but that may have useful context or be helpful when you review this material
>
, then R has executed the command. If you see a +
, this means that the command is not complete and R is waiting (usually for a )
).
You should already have several packages installed, so we can load them into our R session now. To do that we’ll use the library
function to load the required packages.
library(DESeq2)
library(ggplot2)
library(tidyr)
library(matrixStats)
library('ggrepel', character.only=TRUE)
library('pheatmap', character.only=TRUE)
library('RColorBrewer', character.only=TRUE)
Note: Expect to see some messages in your console while these packages are loading
As discussed in the computation foundations/prerequsite sessions, R/RStudio has great resources for getting help, including code ‘cheatsheets’ and package vignettes, like for tidyr.
Checkpoint: Please use the ‘red x’ button if you need help loading your libraries
As discussed during the seminar, a common application for bulk RNA-seq is to test for differential expression between conditions or treatments, using statistical approaches that are appropriate for biological data.
While there are several tools that can be used for differential expression comparisons, we use DESeq2 in our analysis today. DESeq2 is one of two tools, along with EdgeR, considered ‘best practice’ for differential expression, as both tools apply similar methods that account for the distributions we expect to see for RNA-seq and are fairly stringent in calling differentially expressed genes.
Additionally, DESeq2
also has an this excellent vignette click to open from Love, Anders, and Huber from which our workflow is partially adapted and is a good resource when analyzing your own data (see also: Love, Anders, and Huber. Genome Biology. 2014.).
We previously loaded several libraries into our R session, we can check the tools documentation out using the ?
operator.
?`DESeq2-package`
Checkpoint: If you see the R documentation for
DESeq2
pop up in your ‘help’ panel on the right, please indicate with the green ‘yes’ button. Please use the ‘raise hand’ button, if you see this messageNo documentation for 'DESeq2-package' in specified packages and libraries
to be placed in a breakout room for help
A key assumption is that since a vast number of RNAs are represented in each sample, the probability of counting a particular transcript is small. In addition, we expect that the mean expression of any given gene is less than the variance (or spread) of expression across different samples.
For most experiments biological variance is much greater than technical variance, especially if best practices for quality RNA isolation are followed (including DNase treatment!).
Since variance is key to the statistical approach used for DESeq2, if you try to compare treatment groups with less than two replicates, DESeq2 will give you an error, as shown in this blog post. Without replicates, statistical significance (i.e. p-values) cannot be calculated, but qualitative approaches like looking at the top expressed genes after normalization are an option.
A question we are frequently asked is “How many replicates do I need?” As mentioned in the seminar, there is often more contributing to the observed gene expression in each sample than the experimental treatment or condition.
The general goal of differential expression analysis to seperate the “interesting” biological contributions from the “uninteresting” technical or extraneous contributions that either cannot be or were not controlled in the experimental design. The more sources of variation, such as samples coming from heterogenous tissues or experiments with incomplete knockdowns, the more replicates (>3) are recommended.
For a more in depth discussion of experimental design considerations, particularly for the number of replicates, please review Koch et al. and these papers, like this one by Hart et al. that focus on estimating statistical power for RNA-seq experiments.
A related aspect to consider for experimental design is how much sequencing depth should be generated per sample. This figure shared by Illumina in their technical talks is helpful to understand the relative importance of sequencing depth versus number of replicates.
Generally, for the human and mouse genomes, the general recommendation is 30-40 million reads per sample if measuring the ~20,000 protein-coding genes (i.e.: polyA library prep) to capture both highly expressed (common) and more lowly expresssed (rarer) transcripts. However, as the image above shows, sequencing depth has less of an impact in detecting differentially expressed genes (DEGs) than number of replicates.
Another key assumption for DESeq2 is that the analysis will start with un-normalized counts.
To begin our analysis, we’ll read in the raw count data file, gene_expected_count_trimmed.txt
which is similar to what would be generated in the alignment steps yesterday (and what you could receive from AGC). We’ll discuss later a few normalizations that can be helpful for us to understand how much a gene is expressed within or between samples, but normalized data should not be used as an input for DESeq2.
CountTable <- read.table("./RNASeqDemystified/data/gene_expected_count_trimmed.txt", header = TRUE, row.names = 1)
head(CountTable) # look at the top of the table
## Sample_116498 Sample_116499 Sample_116500 Sample_116501
## ENSMUSG00000000001 8256 6680 7532 5122
## ENSMUSG00000000003 0 0 0 0
## ENSMUSG00000000028 226 244 199 193
## ENSMUSG00000000031 1 2 2 0
## ENSMUSG00000000037 29 11 22 16
## ENSMUSG00000000049 0 0 1 8
## Sample_116502 Sample_116503 Sample_116504 Sample_116505
## ENSMUSG00000000001 6684 8047 6446 5559
## ENSMUSG00000000003 0 0 0 0
## ENSMUSG00000000028 293 382 2297 2138
## ENSMUSG00000000031 2 0 9 5
## ENSMUSG00000000037 13 15 32 51
## ENSMUSG00000000049 4 2 2 0
## Sample_116506 Sample_116507 Sample_116508 Sample_116509
## ENSMUSG00000000001 5443 5906 5771 4792
## ENSMUSG00000000003 0 0 0 0
## ENSMUSG00000000028 2344 2357 2531 2225
## ENSMUSG00000000031 2 1 8 12
## ENSMUSG00000000037 54 28 25 31
## ENSMUSG00000000049 3 4 1 0
Now that the file is read into R, note that we’ve created a data frame that includes ‘gene ids’ in ENSEMBL format as rownames and count data from twelve different samples.
If we think back to the ‘expected_counts’ RSEM output, the values in the count table are likely not integers (due to how the alignment tool resolves reads that map to multiple locuses). Since DESeq2 requires whole numbers, if we try to use the RSEM ouputs without rounding the estimated counts to a whole number first, DESeq2 will give us an error. To resolve this, we’ll round down all the columns of our CountTable
that include count data (all columns since we set the gene names to be our row names).
tail(CountTable) # not all whole numbers
## Sample_116498 Sample_116499 Sample_116500 Sample_116501
## ENSMUSG00000118573 1.76 0.00 0.00 0.00
## ENSMUSG00000118574 0.00 0.00 0.00 0.00
## ENSMUSG00000118575 0.00 0.00 0.00 0.00
## ENSMUSG00000118576 3.00 0.00 1.00 1.00
## ENSMUSG00000118577 752.33 613.24 417.04 412.63
## ENSMUSG00000118578 34.49 20.58 12.10 14.88
## Sample_116502 Sample_116503 Sample_116504 Sample_116505
## ENSMUSG00000118573 0.00 0.00 0.00 2.29
## ENSMUSG00000118574 0.00 0.00 0.00 0.00
## ENSMUSG00000118575 0.00 0.00 0.00 0.00
## ENSMUSG00000118576 1.00 5.00 3.00 1.00
## ENSMUSG00000118577 429.74 553.50 479.16 825.36
## ENSMUSG00000118578 14.05 34.62 18.57 14.01
## Sample_116506 Sample_116507 Sample_116508 Sample_116509
## ENSMUSG00000118573 0.00 0.00 0.43 0.71
## ENSMUSG00000118574 0.00 0.00 0.00 0.00
## ENSMUSG00000118575 0.00 0.00 0.00 0.00
## ENSMUSG00000118576 2.00 0.00 0.00 2.00
## ENSMUSG00000118577 520.06 383.61 404.31 353.35
## ENSMUSG00000118578 11.14 4.69 11.81 7.74
CountTable <- floor(CountTable)
tail(CountTable) # now whole numbers
## Sample_116498 Sample_116499 Sample_116500 Sample_116501
## ENSMUSG00000118573 1 0 0 0
## ENSMUSG00000118574 0 0 0 0
## ENSMUSG00000118575 0 0 0 0
## ENSMUSG00000118576 3 0 1 1
## ENSMUSG00000118577 752 613 417 412
## ENSMUSG00000118578 34 20 12 14
## Sample_116502 Sample_116503 Sample_116504 Sample_116505
## ENSMUSG00000118573 0 0 0 2
## ENSMUSG00000118574 0 0 0 0
## ENSMUSG00000118575 0 0 0 0
## ENSMUSG00000118576 1 5 3 1
## ENSMUSG00000118577 429 553 479 825
## ENSMUSG00000118578 14 34 18 14
## Sample_116506 Sample_116507 Sample_116508 Sample_116509
## ENSMUSG00000118573 0 0 0 0
## ENSMUSG00000118574 0 0 0 0
## ENSMUSG00000118575 0 0 0 0
## ENSMUSG00000118576 2 0 0 2
## ENSMUSG00000118577 520 383 404 353
## ENSMUSG00000118578 11 4 11 7
tximport
is another optionrecommended the DESeq2 authors to read in the RSEM expected_counts, as this package allows for the average transcript length per gene to be used in the DE analysis and, as described by the author, the tximport-to-DESeqDataSet
constructor function round the non-integer data generated by RSEM to whole numbers.
R/Rstudio has a strong community component so if you are getting an error or wondering how to make a command work or how to perform a specific task, there is likely already a solution out there. Google is your friend but part of the challenge is figuring out what to search for. Key parts of a successful search:
R
or Bioconductor
How to get session information to aid in a search:
sessionInfo()
Highly recommend using resources like Bioconductor Support, Biostars, and Stack Overflow, including threads on specific packages or common bioinformatic tasks.
I personally use one or more of these resources every day.
These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.