Objectives:

  • ‘Unblind’ our sample IDs
  • Understand possible confounding factors
  • Understand the impact of batches or additional covariates
  • Filter count table

1 Differential Expression Workflow

Here we will proceed setting up the inputs needed to initialize DESeq2 before testing for differential expression.


2 DESeq2 objects

Bioconductor software packages often define and use custom classes within R to store data in a way that better fits expectations around biological data, such as illustrated below from Huber et al. 2015.

These custom data structures have pre-specified data slots, which hold specific types/classes of data and therefore can be more easily accessed by functions from the same package.

To create the DESeqDataSet we need two tables:

  1. a count matrix (which we already loaded)

  2. a table that assigns the condition labels for each sample (which we will generate below).

3 Sample Information

For this representative dataset, we have somewhat limited information from public records, but we know these samples were isolated from either wild-type or knock-out T-cells harvested from control mice or isolated from previously transplanted mice.

3.1 Generate Sample Table

Our next step will be to describe the samples within our R session, so that we make the proper comparisons with DESeq2. The first step is to check the sample names from the count table.

colnames(CountTable)
##  [1] "Sample_116498" "Sample_116499" "Sample_116500" "Sample_116501" "Sample_116502" "Sample_116503"
##  [7] "Sample_116504" "Sample_116505" "Sample_116506" "Sample_116507" "Sample_116508" "Sample_116509"

When we looked at our CountTable, our samples are blinded, e.g. the sample names don’t correspond to any of the expected treatment groups. So we will need to specify which sample IDs connect to which experimental conditions.

Since there are a large number of samples (and to increase the reproduciblity of our code), we would generate a sample information table in excel and exported it as a ‘.csv’ file so that it is in a ‘plain text’ format that can be easily loaded into our R session.

Click for sample naming conventions

A critical aspect of creating a sample sheet in excel is to avoid using spaces or characters that have special uses in R, such as dashes or parentheses. Simple sample group names are best. If you are unfamilar with ‘.csv’ files or how to generate them, there are tutorials available to guide you through the process.

We’ll load our ‘pre-made’ sample information sheet, SampleInfo_trimmed.csv, to unblind our samples.

MetaInfo <- read.table("data/SampleInfo_trimmed.csv", 
                       sep = ",", 
                       header = TRUE, 
                       row.names = 1)
head(MetaInfo)
##               Gtype      Tx   Gtype.Tx
## Sample_116498    wt control wt.control
## Sample_116499    wt control wt.control
## Sample_116500    wt control wt.control
## Sample_116501    ko control ko.control
## Sample_116502    ko control ko.control
## Sample_116503    ko control ko.control

Checkpoint: If you have loaded MetaInfo, please indicate with the green ‘check’ button. Otherwise, please use the red ‘x’ button to have the command repeated

How are the treatment groups encoded in the table we just loaded?

unique(MetaInfo$Gtype.Tx)
## [1] "wt.control" "ko.control" "wt.Tx"      "ko.Tx"

Next, we’ll format our table so that we have the appropriate data type (a ordered factor) for DESeq2 to recognize our treatment groups and appropriately compare samples.

MetaInfo$Gtype.Tx <- factor(MetaInfo$Gtype.Tx, 
                            levels = c( "wt.Tx", "ko.Tx", "ko.control", "wt.control" ))

unique(MetaInfo$Gtype.Tx)
## [1] wt.control ko.control wt.Tx      ko.Tx     
## Levels: wt.Tx ko.Tx ko.control wt.control

Notice that we set the levels in a particular order. This is important for setting the ‘Control’ group as the denominator in our comparisons when we setup our DESeq2 model. The group name should be last if there are only two groups & first if there are only three groups to generate default pair-wise comparisons.

Before we proceed, we need to make sure that the sample labels (column names) in the count table match the sample information table (row names), including the order.

If the sample labels don’t match, then we will see an error and need to correct the labels prior to proceeding. Checking the sample information table is extremely important to ensure that the correct samples are grouped together for comparisons.

all(colnames(data) == rownames(MetaInfo))
## [1] TRUE

This line of code checks if both the identity and order match between our data table and our MetaInfo.

Checkpoint: If you your sample info check returns TRUE, please indicate with the green ‘yes’ button. Otherwise, please use the red ‘x’ button to have the command repeated

4 Creating DESeq2 object

To create the DESeqDataSet we will need the count matrix, the “MetaInfo” sample data table, and we will also need a design formula to specify our model.

4.1 Making model choices

The design formula specified informs many of the DESeq2 functions how to treat the samples in the analysis, specifically which column in the samaple metadata table contains the experimental design.

In this case, we aren’t aware of any covariates that should be considered in our comparisons. However, if there are additional attributes of the samples that may impact the DE comparisons, like sex, date of collection, or patient of origin, these should be added as additional columns in the sample information table and added to a design formula.

Click for Note More complex questions, including determining if a fold-change due to treatment is different across groups, such as patient samples, “interaction terms” can be included in the design formula, such as outlined in this support thread.

The design formula specifies the column(s) in the metadata table and how they should be used in the analysis. For our dataset we only have one column we are interested in, that is Gtype.Tx. This column has three factor levels, which tells DESeq2 that for each gene we want to evaluate gene expression change with respect to these different levels.

## Create DESeq object, line by line
dds <- DESeqDataSetFromMatrix(countData = CountTable,
                              colData = MetaInfo,
                              design = ~ Gtype.Tx)
## converting counts to integer mode

Checkpoint: If you see dds in your environment panel, please indicate with the green ‘check’ button. Otherwise, please use use the red ‘x’ button in your zoom reaction panel to have this step repeated. You can use the red ‘x’ to be put in a breakout room for help

Click for how to model batch effects with DESeq2

Differences between samples can also be due to technical reasons, such as collection on different days or different sequencing runs. Differences between samples that are not due to biological factors as called batch effects. We can include batch effects in our design model in the same way as covariates, as long as the technical groups do not overlap, or confound, the biological treatment groups.
Let’s try add some additional meta-data information where we have counfounding batch effects and create another DESeq2 object.

#MetaInfo
MetaInfo$batch <- factor(c(rep(c("Day1"), 3),
                       rep(c("Day2"), 3),
                       rep(c("Day3"), 3),
                       rep(c("Day4"), 3)),
                     levels = c("Day1", "Day2", "Day3", "Day4"))

dds_batch <- DESeqDataSetFromMatrix(countData = CountTable,
                      colData = MetaInfo,
                      design = ~ batch + Gtype.Tx)
Notice that if you run the above command, the error indicates that variables in the design formula “are linear combinations” which means that batch and condition are correlated and the function is unable to fill in a required ‘slot’ in the DESeq2 object. So if batches are not balanced by including both case and controls (like in the patient covariate example) then we cannot control for those technical effects through statistical modeling.

5 Pre-filtering

While not necessary, pre-filtering the count table helps to not only reduce the size of the DESeq2 object, but also gives you a sense of how many genes were reasonably measured at the sequencing depth generated for your samples.

Here we will filter out any genes that have less than 10 counts across any of the samples. This is a fairly standard level of filtering, but can filter data less/more depending on quality control metrics from alignments and sequencing depth or total number of samples.

keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

Notice how the dds object now has less elements after filtering, so there were quite a number of genes that were not measured in our experiment.

Checkpoint: Questions?

5.0.0.1 [Optional exercise]: Adding covariates to sample information

You can test out manually adding a covariate to our MetaInfo data table and then create a new DESeq2 object.

Click for how to add a covariate to sample information
head(MetaInfo)
MetaInfo$patient <- factor(rep(c("P1", "P2", "P3"), 4), levels = c("P1", "P2", "P3"))
head(MetaInfo)

Notice how we avoid starting with a number our patient covariate labels since R doesn’t like that.

dds_patient <- DESeqDataSetFromMatrix(countData = CountTable,
                          colData = MetaInfo,
                          design = ~ patient + Gtype.Tx)
Now we have specified for DESeq2 that we want to test for the effect of the condition (the last factor) while controlling for the effect of the patient origin (the first factor).

6 Summary

In this section, we:

  • Loaded the necessary input files into our R session
  • Discussed model design for DESeq2
  • Initialized a DESeq2 data set
  • Filtered our count data

Now that we’ve created our DESeq2 objects, including specifying what model is appropriate for our data, and filtered our data, we can proceed with assessing the impact of the experimental conditions on gene expression across our samples.


7 Sources

7.1 Training resources used to develop materials


These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

---
title: "Day 2 - Module 07: DESeq2 Initialization"
author: "UM Bioinformatics Core"
date: "`r Sys.Date()`"
output:
        html_document:
            includes:
                in_header: header.html
            theme: paper
            toc: true
            toc_depth: 4
            toc_float: true
            number_sections: true
            fig_caption: true
            markdown: GFM
            code_download: true
---
<style type="text/css">
body{ /* Normal  */
      font-size: 14pt;
  }
pre {
  font-size: 12pt
}
code.r{
  font-size: 12pt;
}
</style>

<!--- Allow the page to be wider --->
<style>
    body .main-container {
        max-width: 1200px;
    }
</style>

> # Objectives:    
> * 'Unblind' our sample IDs
> * Understand possible confounding factors
> * Understand the impact of batches or additional covariates
> * Filter count table

```{r LoadRunningData, eval=TRUE, echo=FALSE, message=FALSE, warning=FALSE}
library(DESeq2)
load("rdata/RunningData.RData")
```


# Differential Expression Workflow

Here we will proceed setting up the inputs needed to initialize DESeq2 before testing for differential expression.

![](./images/wayfinder/wayfinder-04.png){width=75%}

---

# DESeq2 objects

Bioconductor software packages often define and use custom classes within R to store data in a way that better fits expectations around biological data, such as illustrated below from [Huber et al. 2015](https://www.nature.com/articles/nmeth.3252).

![](./images/SummarizedExperiment.jpg)

These custom data structures have pre-specified data slots, which hold specific types/classes of data and therefore can be more easily accessed by functions from the same package.

To create the DESeqDataSet we need two tables:

1. a count matrix (which we already loaded)

1. a table that assigns the condition labels for each sample (which we will generate below).


# Sample Information

For this representative dataset, we have somewhat limited information from [public records](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE144588), but we know these samples were isolated from either wild-type or knock-out T-cells harvested from control mice or isolated from previously transplanted mice.

## Generate Sample Table

Our next step will be to describe the samples within our R session, so that we make the proper comparisons with DESeq2. The first step is to check the sample names from the count table.
```{r ColumnNames}
colnames(CountTable)
```

When we looked at our `CountTable`, our samples are blinded, e.g. the sample names don't correspond to any of the expected treatment groups. So we will need to specify which sample IDs connect to which experimental conditions.

Since there are a large number of samples (and to increase the reproduciblity of our code), we would generate a sample information table in excel and exported it as a '.csv' file so that it is in a 'plain text' format that can be easily loaded into our R session.


<details>
    <summary>*Click for sample naming conventions*</summary>
    A critical aspect of creating a sample sheet in excel is to avoid using spaces or characters that have special uses in R, such as dashes or parentheses. Simple sample group names are best.
    If you are unfamilar with '.csv' files or how to generate them, there are [tutorials](https://www.wikihow.com/Create-a-CSV-File) available to guide you through the process.

</details>

We'll load our 'pre-made' sample information sheet, `SampleInfo_trimmed.csv`, to unblind our samples.
```{r SampleInfo1, eval=TRUE, echo=FALSE}
MetaInfo <- read.table("../data/SampleInfo_trimmed.csv", sep = ",", header = TRUE, row.names = 1)
```
```{r SampleInfo2, eval=FALSE}
MetaInfo <- read.table("data/SampleInfo_trimmed.csv", 
                       sep = ",", 
                       header = TRUE, 
                       row.names = 1)
```
```{r CheckSampleInfo}
head(MetaInfo)
```

**Checkpoint**: *If you have loaded `MetaInfo`, please indicate with the green 'check' button. Otherwise, please use the red 'x' button to have the command repeated*


How are the treatment groups encoded in the table we just loaded?
```{r TreatmentGroupTable}
unique(MetaInfo$Gtype.Tx)
```


Next, we'll format our table so that we have the appropriate data type (a ordered [factor](https://swcarpentry.github.io/r-novice-inflammation/12-supp-factors/)) for DESeq2 to recognize our treatment groups and appropriately compare samples.
```{r SampleInfoFactor}

MetaInfo$Gtype.Tx <- factor(MetaInfo$Gtype.Tx, 
                            levels = c( "wt.Tx", "ko.Tx", "ko.control", "wt.control" ))

unique(MetaInfo$Gtype.Tx)
```
Notice that we set the levels in a particular order. This is important for setting the 'Control' group as the denominator in our comparisons when we setup our DESeq2 model. The group name should be last if there are only two groups & first if there are only three groups to generate default pair-wise comparisons.

Before we proceed, we need to make sure that the sample labels (column names) in the count table match the sample information table (row names), including the order.

If the sample labels don't match, then we will see an error and need to correct the labels prior to proceeding. Checking the sample information table is extremely important to ensure that the correct samples are grouped together for comparisons.

```{r CheckSampleInfos}
all(colnames(data) == rownames(MetaInfo))
```
This line of code checks if both the identity and order match between our `data` table and our `MetaInfo`.

**Checkpoint**: *If you your sample info check returns `TRUE`, please indicate with the green 'yes' button. Otherwise, please use the red 'x' button to have the command repeated*

# Creating DESeq2 object

To create the DESeqDataSet we will need the count matrix, the "MetaInfo" sample data table, and we will also need a **design formula** to specify our model.

## Making model choices

The design formula specified informs many of the DESeq2 functions how to treat the samples in the analysis, specifically which column in the samaple metadata table contains the experimental design.

In this case, we aren't aware of any [**covariates**](https://methods-sagepub-com.proxy.lib.umich.edu/reference/encyc-of-research-design/n85.xml) that should be considered in our comparisons. However, if there are additional attributes of the samples that may impact the DE comparisons, like sex, date of collection, or patient of origin, these should be added as [additional columns](https://support.bioconductor.org/p/75309/) in the sample information table and [added to a design formula](https://support.bioconductor.org/p/98700/).


<details>
    <summary>*Click for Note*</summary>
    More complex questions, including determining if a fold-change due to treatment is different across groups, such as patient samples, "interaction terms" can be included in the design formula, such as outlined in [this support thread](https://support.bioconductor.org/p/98628/).
</details>

The design formula specifies the column(s) in the metadata table and how they should be used in the analysis. For our dataset we only have one column we are interested in, that is `Gtype.Tx`. This column has three factor levels, which tells DESeq2 that for each gene we want to evaluate gene expression change with respect to these different levels.

```{r DESeq2Object}
## Create DESeq object, line by line
dds <- DESeqDataSetFromMatrix(countData = CountTable,
                              colData = MetaInfo,
                              design = ~ Gtype.Tx)
```

**Checkpoint**: *If you see `dds` in your environment panel, please indicate with the green 'check' button. Otherwise, please use  use the red 'x' button in your zoom reaction panel to have this step repeated. You can use the red 'x' to be put in a breakout room for help*


<details>
    <summary>*Click for how to model batch effects with DESeq2*</summary>
    Differences between samples can also be due to technical reasons, such as collection on different days or different sequencing runs. Differences between samples that are not due to biological factors as called **batch effects**. We can include batch effects in our design model in the same way as covariates, as long as the technical groups do not overlap, or **confound**, the biological treatment groups.  
    Let's try add some additional meta-data information where we have counfounding batch effects and create another DESeq2 object.
```{r Confounders, eval=FALSE}
#MetaInfo
MetaInfo$batch <- factor(c(rep(c("Day1"), 3),
                       rep(c("Day2"), 3),
                       rep(c("Day3"), 3),
                       rep(c("Day4"), 3)),
                     levels = c("Day1", "Day2", "Day3", "Day4"))

dds_batch <- DESeqDataSetFromMatrix(countData = CountTable,
                      colData = MetaInfo,
                      design = ~ batch + Gtype.Tx)
```
    Notice that if you run the above command, the error indicates that variables in the design formula "are linear combinations" which means that batch and condition are correlated and the function is unable to fill in a required 'slot' in the DESeq2 object.
    So if batches are **not balanced** by including both case and controls (like in the patient covariate example) then we cannot control for those technical effects through statistical modeling.
</details>

---


# Pre-filtering

While not necessary, [pre-filtering](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#pre-filtering) the count table helps to not only reduce the size of the DESeq2 object, but also gives you a sense of how many genes were reasonably measured at the sequencing depth generated for your samples.

Here we will filter out any genes that have less than 10 counts across any of the samples. This is a fairly standard level of filtering, but can filter data less/more depending on quality control metrics from alignments and sequencing depth or total number of samples.
```{r PreFilter}
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
```
Notice how the `dds` object now has less elements after filtering, so there were quite a number of genes that were not measured in our experiment.

**Checkpoint**: *Questions?*


#### [Optional exercise]: Adding covariates to sample information

You can test out manually adding a covariate to our `MetaInfo` data table and then create a new DESeq2 object.

<details>
    <summary>*Click for how to add a covariate to sample information*</summary>
```{r Covariates, eval=FALSE}
head(MetaInfo)
MetaInfo$patient <- factor(rep(c("P1", "P2", "P3"), 4), levels = c("P1", "P2", "P3"))
head(MetaInfo)
```
    Notice how we avoid starting with a number our patient covariate labels since R doesn't like that.

```{r CovariateDESeq2Object, eval=FALSE}
dds_patient <- DESeqDataSetFromMatrix(countData = CountTable,
                          colData = MetaInfo,
                          design = ~ patient + Gtype.Tx)
```
    Now we have specified for DESeq2 that we want to test for the effect of the condition (the last factor) while controlling for the effect of the patient origin (the first factor).
</details>



# Summary

In this section, we:

* Loaded the necessary input files into our R session    
* Discussed model design for DESeq2    
* Initialized a DESeq2 data set     
* Filtered our count data     


Now that we've created our DESeq2 objects, including specifying what model is appropriate for our data, and filtered our data, we can proceed with assessing the impact of the experimental conditions on gene expression across our samples.




---

# Sources
## Training resources used to develop materials
* HBC DGE setup: https://hbctraining.github.io/DGE_workshop/lessons/01_DGE_setup_and_overview.html
* HBC Count Normalization: https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html
* DESeq2 standard vignette: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
* DESeq2 beginners vignette: https://bioc.ism.ac.jp/packages/2.14/bioc/vignettes/DESeq2/inst/doc/beginner.pdf
* Bioconductor RNA-seq Workflows: https://www.bioconductor.org/help/course-materials/2015/LearnBioconductorFeb2015/B02.1_RNASeq.html
* CCDL Gastric cancer training materials: https://alexslemonade.github.io/training-modules/RNA-seq/03-gastric_cancer_exploratory.nb.html
* CCDL Neuroblastoma training materials: https://alexslemonade.github.io/training-modules/RNA-seq/05-nb_cell_line_DESeq2.nb.html



```{r WriteOut.RData, eval=TRUE, echo=FALSE, message=FALSE, warning=FALSE}
#Hidden code block to write out data for knitting

save.image(file = "rdata/RunningData.RData")

```

---

These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the [Creative Commons Attribution license (CC BY 4.0)](http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
