1 Differential Expression Workflow

Here we will proceed with continuing to set up the inputs we need to initialize DESeq2 before testing for differential expression.

Step	Task
1	Experimental Design
2	Biological Samples / Library Preparation
3	Sequence Reads
4	Assess Quality of Raw Reads
5	Splice-aware Mapping to Genome
6	Count Reads Associated with Genes

7	Organize project files locally
8	Initialize DESeq2 and fit DESeq2 model
9	Assess expression variance within treatment groups
10	Specify pairwise comparisons and test for differential expression
11	Generate summary figures for comparisons
12	Annotate differential expression result tables

In this module, we will: * Setup our sample information & create a DESeq2 object * Understand possible confounding factors * Understand the impact of batches or additional covariates * Filter count table

2 Sample Information

For this representative dataset, we have somewhat limited information from public records, but we know these samples were isolated from either wild-type or knock-out T-cells harvest from control mice or isolated from previously transplanted mice.

2.1 Generate Sample Table

Our next step will be to describe the samples within our R session, so that we make the proper comparisons with DESeq2. The first step is to check the sample names from the count table.

colnames(CountTable)

##  [1] "Sample_116498" "Sample_116499" "Sample_116500" "Sample_116501"
##  [5] "Sample_116502" "Sample_116503" "Sample_116504" "Sample_116505"
##  [9] "Sample_116506" "Sample_116507" "Sample_116508" "Sample_116509"

When we looked at our CountTable, our samples are blinded, i.e. the sample names don’t correspond to any of the expected treatment groups so we will need to specify which sample IDs connect to which experimental conditions.

Since there are a large number of samples (and to increase the reproduciblity of our code), we would generate a sample information table in excel and exported it as a ‘.csv’ file so that it is in a ‘plain text’ format that can be easily loaded into our R session.

Click for code execution shortcut

A critical aspect of creating a sample sheet in excel is to avoid using spaces or characters that have special uses in R, such as dashes or parentheses. Simple sample group names are best. If you are unfamilar with ‘.csv’ files or how to generate them, there are tutorials available to guide you through the process.

We’ll load our ‘pre-made’ sample information sheet, SampleInfo_trimmed.csv, to unblind our samples.

MetaInfo <- read.table("~/RNASeqDemystified/data/SampleInfo_trimmed.csv", sep = ",", header = TRUE, row.names = 1)
head(MetaInfo)

##               Gtype      Tx   Gtype.Tx
## Sample_116498    wt control wt.control
## Sample_116499    wt control wt.control
## Sample_116500    wt control wt.control
## Sample_116501    ko control ko.control
## Sample_116502    ko control ko.control
## Sample_116503    ko control ko.control

Checkpoint: If you see MetaInfo in your environment panel, please indicate with the green ‘yes’ button. Otherwise, please use the ‘raise hand’ button to be placed in a breakout room for help

How are the treatment groups encoded in the table?

unique(MetaInfo$Gtype.Tx)

## [1] wt.control ko.control wt.Tx      ko.Tx     
## Levels: ko.control ko.Tx wt.control wt.Tx

Next, we’ll format our table so that we have the appropriate data type (a ordered factor) for DESeq2 to recognize our treatment groups and appropriately compare samples.

MetaInfo$Gtype.Tx <- factor(MetaInfo$Gtype.Tx, levels = c( "wt.Tx", "ko.Tx", "ko.control", "wt.control" )) ## levels = c(Case, Control); Control should be last if only two groups & first if three groups to generate default pair-wise comparisons. 

unique(MetaInfo$Gtype.Tx)

## [1] wt.control ko.control wt.Tx      ko.Tx     
## Levels: wt.Tx ko.Tx ko.control wt.control

Notice that we set the levels in a particular order. This is important for setting the ‘Control’ group as the denominator in our comparisons when we setup our DESeq2 model.

Before we proceed, we need to make sure that the sample labels (column names) in the count table match the sample information table (row names), including the order.

all(colnames(data) %in% rownames(MetaInfo)) #OR

## [1] TRUE

all(colnames(data) == rownames(MetaInfo))

## [1] TRUE

The first line of code shows us if the two lists have matching members and the second line checks if both the identity and order match between our data table and our MetaInfo.

If the sample labels don’t match, then we will see an error and need to correct the labels prior to proceeding. Checking the sample information table is extremely important to ensure that the correct samples are grouped together for comparisons.

2.2 Creating DESeq2 object

Bioconductor software packages often define and use custom classes within R for storing data in a way that better fits expectations around biological data, such as illustrated below from Huber et al. 2015.

These custom data structures have pre-specified data slots, which hold specific types/classes of data and therefore can be more easily accessed by functions from the same package.

We’ll start by creating the DESeqDataSet and then we can talk more about key parts of this object.

To create the DESeqDataSet we need two things we already have: count matrix and the “MetaInfo” sample data table. To complete the DESeqDataSet, we will also need to specify a design formula that tells DESeq2 what column indicates how the samples should be grouped.

## Create DESeq object, line by line
dds <- DESeqDataSetFromMatrix(countData = CountTable,
                              colData = MetaInfo,
                              design = ~ Gtype.Tx)

## converting counts to integer mode

Checkpoint: If you see dds in your environment panel, please indicate with the green ‘yes’ button. Otherwise, please use the ‘red x’ button to get help

The design formula specifies column(s) in the metadata table and how they should be used in the analysis. For our dataset we only have one column we are interested in, that is condition. This column has three factor levels, which tells DESeq2 that for each gene we want to evaluate gene expression change with respect to these different levels.

If we look at the dds object now, we can see how our data is organized.

str(dds)

Right now, there are many “empty” slots that will be filled in when we proceed with the model fitting. Before we do that, let’s discuss the design formula in more detail.

2.3 Making model choices

The design formula specified informs many of the DESeq2 functions how to treat the samples in the analysis, specifically which column in the samaple metadata table specifies the experimental design.

In this case, there aren’t known covariates to include in the sample table. However, if there are additional attributes of the samples that may impact the DE comparisons, like sex, date of collection, or patient of origin, these should be added as additional columns in the sample information table and added to a design formula.

Let’s test out manually adding a covariate to our MetaInfo data table and then create a new DESeq2 object.

head(MetaInfo)

##               Gtype      Tx   Gtype.Tx
## Sample_116498    wt control wt.control
## Sample_116499    wt control wt.control
## Sample_116500    wt control wt.control
## Sample_116501    ko control ko.control
## Sample_116502    ko control ko.control
## Sample_116503    ko control ko.control

MetaInfo$patient <- factor(rep(c("P1", "P2", "P3"), 4), levels = c("P1", "P2", "P3"))

head(MetaInfo)

##               Gtype      Tx   Gtype.Tx patient
## Sample_116498    wt control wt.control      P1
## Sample_116499    wt control wt.control      P2
## Sample_116500    wt control wt.control      P3
## Sample_116501    ko control ko.control      P1
## Sample_116502    ko control ko.control      P2
## Sample_116503    ko control ko.control      P3

Notice how we avoid starting with a number our patient covariate labels since R doesn’t like that.

dds_patient <- DESeqDataSetFromMatrix(countData = CountTable,
                              colData = MetaInfo,
                              design = ~ patient + Gtype.Tx)

## converting counts to integer mode

Now we have specified for DESeq2 that we want to test for the effect of the condition (the last factor) while controlling for the effect of the patient origin (the first factor).

Click for note on interaction terms

More complex questions, including determining if a fold-change due to treatment is different across groups, such as patient samples, “interaction terms” can be included in the design formula, such as outlined in this support thread.

2.3.1 Optional - Modeling batch effects with DESeq2

Differences between samples can also be due to technical reasons, such as collection on different days or different sequencing runs. Differences between samples that are not due to biological factors as called batch effects. We can include batch effects in our design model in the same way as covariates, as long as the technical groups do not overlap, or confound, the biological treatment groups.

Let’s try add some additional meta-data information where we have counfounding batch effects and create another DESeq2 object.

#MetaInfo
MetaInfo$batch <- factor(c(rep(c("Day1"), 3), 
                           rep(c("Day2"), 3), 
                           rep(c("Day3"), 3), 
                           rep(c("Day4"), 3)), 
                         levels = c("Day1", "Day2", "Day3", "Day4"))

dds_batch <- DESeqDataSetFromMatrix(countData = CountTable,
                          colData = MetaInfo,
                          design = ~ batch + Gtype.Tx)

Notice that if you run the above command, the error indicates that variables in the design formula “are linear combinations” which means that batch and condition are correlated and the function is unable to fill in a required ‘slot’ in the DESeq2 object. So if batches are not balanced by including both case and controls (like in the patient covariate example) then we cannot control for those technical effects throughs statistical modeling.

2.4 Pre-filtering

While not necessary, pre-filtering the count table to exclude genes that were poorly measured helps to not only reduce the size of the DESeq2 object, but also gives you a sense of how many genes were measured at the sequencing depth generated for your samples.

Here we will filter out any genes that have less than 10 counts across any of the samples. This is a fairly standard level of filtering, but can filter data less/more depending on quality control metrics from alignments and/or total number of samples.

keep <- rowSums(counts(dds)) >= 10

dds <- dds[keep,]

Notice how the dds object now has less elements than the dds_patient object after filtering, so there were quite a number of genes that were not measured in our experiment.

We’ll also filter the dds_patient genes.

keep <- rowSums(counts(dds_patient)) >= 10

dds_patient <- dds_patient[keep,]

3 Summary

In this section, we:
* Loaded the necessary input files into our R session
* Discussed model design for DESeq2
* Initialized a DESeq2 data set
* Filtered our count data

Now that we’ve created our DESeq2 objects, including specifying what model is appropriate for our data, and filtered our data, we can proceed with assessing the impact of the experimental conditions on gene expression across our samples.

Before moving on with our group comparisons, we’ll take a short (10 minute) break.

4 Sources

4.1 Training resources used to develop materials

HBC DGE setup: https://hbctraining.github.io/DGE_workshop/lessons/01_DGE_setup_and_overview.html
HBC Count Normalization: https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
DESeq2 standard vignette: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
DESeq2 beginners vignette: https://bioc.ism.ac.jp/packages/2.14/bioc/vignettes/DESeq2/inst/doc/beginner.pdf
Bioconductor RNA-seq Workflows: https://www.bioconductor.org/help/course-materials/2015/LearnBioconductorFeb2015/B02.1_RNASeq.html
CCDL Gastric cancer training materials: https://alexslemonade.github.io/training-modules/RNA-seq/03-gastric_cancer_exploratory.nb.html
CCDL Neuroblastoma training materials: https://alexslemonade.github.io/training-modules/RNA-seq/05-nb_cell_line_DESeq2.nb.html

These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Day 2 - Module 07: DESeq2 Initialization

UM Bioinformatics Core

2021-04-26