Here we will proceed with continuing to set up the inputs we need to initialize DESeq2 before testing for differential expression.
Step | Task |
---|---|
1 | Experimental Design |
2 | Biological Samples / Library Preparation |
3 | Sequence Reads |
4 | Assess Quality of Raw Reads |
5 | Splice-aware Mapping to Genome |
6 | Count Reads Associated with Genes |
7 | Organize project files locally |
8 | Initialize DESeq2 and fit DESeq2 model |
9 | Assess expression variance within treatment groups |
10 | Specify pairwise comparisons and test for differential expression |
11 | Generate summary figures for comparisons |
12 | Annotate differential expression result tables |
In this module, we will: * Setup our sample information & create a DESeq2 object * Understand possible confounding factors * Understand the impact of batches or additional covariates * Filter count table
For this representative dataset, we have somewhat limited information from public records, but we know these samples were isolated from either wild-type or knock-out T-cells harvest from control mice or isolated from previously transplanted mice.
Our next step will be to describe the samples within our R session, so that we make the proper comparisons with DESeq2. The first step is to check the sample names from the count table.
colnames(CountTable)
## [1] "Sample_116498" "Sample_116499" "Sample_116500" "Sample_116501"
## [5] "Sample_116502" "Sample_116503" "Sample_116504" "Sample_116505"
## [9] "Sample_116506" "Sample_116507" "Sample_116508" "Sample_116509"
When we looked at our CountTable
, our samples are blinded, i.e. the sample names don’t correspond to any of the expected treatment groups so we will need to specify which sample IDs connect to which experimental conditions.
Since there are a large number of samples (and to increase the reproduciblity of our code), we would generate a sample information table in excel and exported it as a ‘.csv’ file so that it is in a ‘plain text’ format that can be easily loaded into our R session.
We’ll load our ‘pre-made’ sample information sheet, SampleInfo_trimmed.csv
, to unblind our samples.
MetaInfo <- read.table("~/RNASeqDemystified/data/SampleInfo_trimmed.csv", sep = ",", header = TRUE, row.names = 1)
head(MetaInfo)
## Gtype Tx Gtype.Tx
## Sample_116498 wt control wt.control
## Sample_116499 wt control wt.control
## Sample_116500 wt control wt.control
## Sample_116501 ko control ko.control
## Sample_116502 ko control ko.control
## Sample_116503 ko control ko.control
Checkpoint: If you see
MetaInfo
in your environment panel, please indicate with the green ‘yes’ button. Otherwise, please use the ‘raise hand’ button to be placed in a breakout room for help
How are the treatment groups encoded in the table?
unique(MetaInfo$Gtype.Tx)
## [1] wt.control ko.control wt.Tx ko.Tx
## Levels: ko.control ko.Tx wt.control wt.Tx
Next, we’ll format our table so that we have the appropriate data type (a ordered factor) for DESeq2 to recognize our treatment groups and appropriately compare samples.
MetaInfo$Gtype.Tx <- factor(MetaInfo$Gtype.Tx, levels = c( "wt.Tx", "ko.Tx", "ko.control", "wt.control" )) ## levels = c(Case, Control); Control should be last if only two groups & first if three groups to generate default pair-wise comparisons.
unique(MetaInfo$Gtype.Tx)
## [1] wt.control ko.control wt.Tx ko.Tx
## Levels: wt.Tx ko.Tx ko.control wt.control
Notice that we set the levels in a particular order. This is important for setting the ‘Control’ group as the denominator in our comparisons when we setup our DESeq2 model.
Before we proceed, we need to make sure that the sample labels (column names) in the count table match the sample information table (row names), including the order.
all(colnames(data) %in% rownames(MetaInfo)) #OR
## [1] TRUE
all(colnames(data) == rownames(MetaInfo))
## [1] TRUE
The first line of code shows us if the two lists have matching members and the second line checks if both the identity and order match between our data
table and our MetaInfo
.
If the sample labels don’t match, then we will see an error and need to correct the labels prior to proceeding. Checking the sample information table is extremely important to ensure that the correct samples are grouped together for comparisons.
Bioconductor software packages often define and use custom classes within R for storing data in a way that better fits expectations around biological data, such as illustrated below from Huber et al. 2015.
These custom data structures have pre-specified data slots, which hold specific types/classes of data and therefore can be more easily accessed by functions from the same package.
We’ll start by creating the DESeqDataSet and then we can talk more about key parts of this object.
To create the DESeqDataSet we need two things we already have: count matrix and the “MetaInfo” sample data table. To complete the DESeqDataSet, we will also need to specify a design formula that tells DESeq2 what column indicates how the samples should be grouped.
## Create DESeq object, line by line
dds <- DESeqDataSetFromMatrix(countData = CountTable,
colData = MetaInfo,
design = ~ Gtype.Tx)
## converting counts to integer mode
Checkpoint: If you see
dds
in your environment panel, please indicate with the green ‘yes’ button. Otherwise, please use the ‘red x’ button to get help
The design formula specifies column(s) in the metadata table and how they should be used in the analysis. For our dataset we only have one column we are interested in, that is condition
. This column has three factor levels, which tells DESeq2 that for each gene we want to evaluate gene expression change with respect to these different levels.
If we look at the dds
object now, we can see how our data is organized.
str(dds)
Right now, there are many “empty” slots that will be filled in when we proceed with the model fitting. Before we do that, let’s discuss the design formula in more detail.
The design formula specified informs many of the DESeq2 functions how to treat the samples in the analysis, specifically which column in the samaple metadata table specifies the experimental design.
In this case, there aren’t known covariates to include in the sample table. However, if there are additional attributes of the samples that may impact the DE comparisons, like sex, date of collection, or patient of origin, these should be added as additional columns in the sample information table and added to a design formula.
Let’s test out manually adding a covariate to our MetaInfo
data table and then create a new DESeq2 object.
head(MetaInfo)
## Gtype Tx Gtype.Tx
## Sample_116498 wt control wt.control
## Sample_116499 wt control wt.control
## Sample_116500 wt control wt.control
## Sample_116501 ko control ko.control
## Sample_116502 ko control ko.control
## Sample_116503 ko control ko.control
MetaInfo$patient <- factor(rep(c("P1", "P2", "P3"), 4), levels = c("P1", "P2", "P3"))
head(MetaInfo)
## Gtype Tx Gtype.Tx patient
## Sample_116498 wt control wt.control P1
## Sample_116499 wt control wt.control P2
## Sample_116500 wt control wt.control P3
## Sample_116501 ko control ko.control P1
## Sample_116502 ko control ko.control P2
## Sample_116503 ko control ko.control P3
Notice how we avoid starting with a number our patient covariate labels since R doesn’t like that.
dds_patient <- DESeqDataSetFromMatrix(countData = CountTable,
colData = MetaInfo,
design = ~ patient + Gtype.Tx)
## converting counts to integer mode
Now we have specified for DESeq2 that we want to test for the effect of the condition (the last factor) while controlling for the effect of the patient origin (the first factor).
Differences between samples can also be due to technical reasons, such as collection on different days or different sequencing runs. Differences between samples that are not due to biological factors as called batch effects. We can include batch effects in our design model in the same way as covariates, as long as the technical groups do not overlap, or confound, the biological treatment groups.
Let’s try add some additional meta-data information where we have counfounding batch effects and create another DESeq2 object.
#MetaInfo
MetaInfo$batch <- factor(c(rep(c("Day1"), 3),
rep(c("Day2"), 3),
rep(c("Day3"), 3),
rep(c("Day4"), 3)),
levels = c("Day1", "Day2", "Day3", "Day4"))
dds_batch <- DESeqDataSetFromMatrix(countData = CountTable,
colData = MetaInfo,
design = ~ batch + Gtype.Tx)
Notice that if you run the above command, the error indicates that variables in the design formula “are linear combinations” which means that batch and condition are correlated and the function is unable to fill in a required ‘slot’ in the DESeq2 object. So if batches are not balanced by including both case and controls (like in the patient covariate example) then we cannot control for those technical effects throughs statistical modeling.
While not necessary, pre-filtering the count table to exclude genes that were poorly measured helps to not only reduce the size of the DESeq2 object, but also gives you a sense of how many genes were measured at the sequencing depth generated for your samples.
Here we will filter out any genes that have less than 10 counts across any of the samples. This is a fairly standard level of filtering, but can filter data less/more depending on quality control metrics from alignments and/or total number of samples.
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
Notice how the dds
object now has less elements than the dds_patient
object after filtering, so there were quite a number of genes that were not measured in our experiment.
We’ll also filter the dds_patient
genes.
keep <- rowSums(counts(dds_patient)) >= 10
dds_patient <- dds_patient[keep,]
In this section, we:
* Loaded the necessary input files into our R session
* Discussed model design for DESeq2
* Initialized a DESeq2 data set
* Filtered our count data
Now that we’ve created our DESeq2 objects, including specifying what model is appropriate for our data, and filtered our data, we can proceed with assessing the impact of the experimental conditions on gene expression across our samples.
Before moving on with our group comparisons, we’ll take a short (10 minute) break.
These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.