Objectives

  • Additional visualizations for gene level QC assessment

0.1 Count boxplots

To understand how skewed our raw data was and how well our normalization worked, we can look at distributions of raw and normalized counts. First, we need to set up some tables and labels.

## setup for raw counts
pdata <- data.frame(colData(dds))
mat <- as.matrix(assay(dds))
title <- 'Raw counts'
y_label <- 'log2(counts)'
Comparison <- "ko.Tx"

Then, we’ll add the relevant annotations to the count table.

# create annotation table for raw plots
annot_df = data.frame(
    sample = row.names(pdata),
    row.names = row.names(pdata),
    stringsAsFactors = FALSE
)

# join counts and annotation table
tidy_mat = tidyr::gather(as_tibble(mat), key = 'sample', value = 'counts') %>%
    left_join(annot_df, by = 'sample')
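
Note that `as_tibble()` drops the row names of the matrix, so the long table above no longer records which gene each count came from. That is fine for per-sample boxplots. If gene identifiers are needed later, one option is to build the long table explicitly. The following is a minimal base R sketch on a made-up matrix (`toy_mat` and `tidy_toy` are hypothetical names, not part of the workshop data).

```r
# Toy matrix with genes as rows and samples as columns (hypothetical values)
toy_mat <- matrix(1:6, nrow = 3,
                  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))

# Long format: one row per gene/sample pair, keeping the gene identifier.
# as.vector() unrolls the matrix column by column, so the gene names repeat
# once per sample and each sample name repeats once per gene.
tidy_toy <- data.frame(
    gene = rep(rownames(toy_mat), times = ncol(toy_mat)),
    sample = rep(colnames(toy_mat), each = nrow(toy_mat)),
    counts = as.vector(toy_mat),
    stringsAsFactors = FALSE
)
tidy_toy
```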

Once we set up the input data, we can plot the raw counts for our samples.

box_plot = ggplot(tidy_mat, aes(x = sample, y = log2(counts))) +
    geom_boxplot(notch = TRUE) +
    labs(
        title = title,
        x = '',
        y = y_label) +
    theme_bw() + theme(axis.text.x = element_text(angle = 90))
box_plot

After generating the plot with ggplot, we’ll save it as a file in the directory we set up.

ggsave(filename = paste0(plotPath, "BoxPlot_Gtype.Tx_raw.pdf"), plot = box_plot, height = 8, width = 8, dpi = 300)
## Warning: Removed 22526 rows containing non-finite values (stat_boxplot).
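
The warning above arises because `log2(0)` is `-Inf`, and ggplot2 drops non-finite values before computing the boxplot statistics, so genes with zero counts are left out of the raw count plot. A common alternative, sketched here on a made-up matrix (`toy_counts` is hypothetical, not the workshop data), is to add a pseudocount of 1 before taking the log so that zero counts are retained.

```r
# Toy count matrix (hypothetical values); columns are samples s1 and s2
toy_counts <- matrix(c(0, 4, 15, 0, 7, 31), nrow = 3,
                     dimnames = list(paste0("gene", 1:3), c("s1", "s2")))

# log2 of a zero count is -Inf, which ggplot2 treats as non-finite
raw_log <- log2(toy_counts)
n_dropped <- sum(!is.finite(raw_log))

# Adding a pseudocount of 1 maps zero counts to 0 instead of -Inf
pseudo_log <- log2(toy_counts + 1)
n_dropped_pseudo <- sum(!is.finite(pseudo_log))

n_dropped
n_dropped_pseudo
```

In the plotting code this would correspond to using `log2(counts + 1)` in `aes()` and updating the y-axis label to match.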

To understand how the rlog normalization impacted the distributions of counts for each sample, we can plot boxplots for the normalized data and compare that to our plot of the raw data.

## rlog counts
pdata = data.frame(colData(rld))
mat = as.matrix(assay(rld))
title = 'Rlog normalized counts'
y_label = 'rlog(counts)'

annot_df = data.frame(
    sample = row.names(pdata),
    row.names = row.names(pdata),
    stringsAsFactors = FALSE
)

tidy_mat = tidyr::gather(as_tibble(mat), key = 'sample', value = 'counts') %>%
    left_join(annot_df, by = 'sample')

box_plot = ggplot(tidy_mat, aes(x = sample, y = counts)) +
    geom_boxplot(notch = TRUE) +
    labs(
        title = title,
        x = '',
        y = y_label) +
    theme_bw() + theme(axis.text.x = element_text(angle = 90))
box_plot

ggsave(filename = paste0(plotPath, "BoxPlot_Gtype.Tx_rlog.pdf"), plot = box_plot, height = 8, width = 8, dpi = 300)

0.2 Heatmaps

To understand the patterns of expression across all our samples, including how well our samples cluster by group labels, we can generate heatmaps.

The first heatmap to generate is of the 500 most highly expressed genes across all samples. To start, we’ll set our color palette using the RColorBrewer package, which provides the ColorBrewer palettes.

# heatmap of the top 500 expressed genes, rlog normalized data
colors <- colorRampPalette(brewer.pal(9, 'Blues'))(255)

Then, we’ll select the 500 most highly expressed genes across all our samples. Restricting the heatmap to this set of genes makes patterns easier to observe.

select <- order(rowMeans(assay(rld)), decreasing=TRUE)[1:500]
df <- data.frame(Group = colData(rld)[,c('Gtype.Tx')], row.names = rownames(colData(rld)))

Next, we’ll set up a PDF file and plot our heatmap. Saving the plot as an object allows us to view the figure within our session as well as writing the plot to file.

The pheatmap function does quite a lot in a single step, including scaling the data by row and clustering both the samples (columns) and genes (rows).
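
To make the row scaling concrete: with `scale="row"`, each gene is centered on its own mean and divided by its own standard deviation, so the colors reflect relative differences across samples rather than absolute expression. The sketch below reproduces that calculation in base R on a made-up matrix (`toy_expr` and `row_z` are hypothetical names); it illustrates the idea and is not a copy of pheatmap's internals.

```r
# Two toy genes with the same pattern across samples but different magnitudes
toy_expr <- matrix(c(1, 2, 3,
                     10, 20, 30),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

# Row-wise z-score: subtract each gene's mean, divide by each gene's sd.
# The per-row vectors recycle down the columns, so every row is scaled
# by its own mean and standard deviation.
row_z <- (toy_expr - rowMeans(toy_expr)) / apply(toy_expr, 1, sd)
row_z
```

After scaling, both genes have identical rows, which is why a row-scaled heatmap highlights shared patterns across samples regardless of how highly each gene is expressed.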

Note: This blog post has a nice step-by-step overview of the pheatmap options, using basketball data as an example.

pdf(file = paste0(plotPath,'Heatmap_TopExp_', Comparison, '.pdf'), onefile = FALSE, width=10, height=20)
p <- pheatmap(assay(rld)[select,], scale="row",  cluster_rows=TRUE, show_rownames=FALSE, cluster_cols=TRUE, annotation_col=df, fontsize = 7, las = 2, fontsize_row = 7, color = colors, main = '500 Top Expressed Genes Heatmap')
p

dev.off()
## pdf 
##   3

Looking at the heatmap, we see that samples within the same treatment group cluster together, fitting our understanding of the experimental design. We also see clusters of genes that appear to have contrasting patterns between the treatment groups, which is promising for our differential expression comparisons.

Note: Heatmaps are helpful visualizations, especially for sharing an overview of your RNA-seq data. Why and how to use them properly can be confusing, as outlined in the questions and answers in this biostars post, which adds context to the overview in this workshop.

0.2.0.1 Sample Distance and Top Variably Expressed Heatmaps

This blog post reviews the data transformation procedure for generating heatmaps and is a useful resource. It also walks through the steps for generating a sample correlation heatmap similar to the sample distance plot generated below.

# heatmap of normalized data, sample distance matrix
sampleDists <- dist(t(assay(rld))) #rld
sampleDistMatrix <- as.matrix(sampleDists) # convert to matrix
colnames(sampleDistMatrix) <- NULL

colors <- colorRampPalette(rev(brewer.pal(9, 'Blues')))(255)
pdf(file = paste0(plotPath,'Heatmap_Dispersions_', Comparison, '.pdf'), onefile = FALSE)
p <- pheatmap(sampleDistMatrix, 
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         color=colors)
p

dev.off()
## pdf 
##   3

If we look at the heatmap of the sampleDists object, we see from the blocks of similar samples along the diagonal that there appear to be two major groups of samples, with better defined subgroups in the bottom right quadrant.

Overall, like the heatmap of the top 500 most expressed genes, we see that samples in the same treatment groups cluster well together when the full dataset is considered.
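
As a small illustration of what the sample distance heatmap summarizes, the sketch below computes Euclidean distances between the columns of a made-up matrix and clusters the samples with `hclust()`. The matrix `toy_rld` and its sample names are hypothetical, and complete linkage is used on the assumption that it matches the pheatmap default.

```r
# Toy transformed expression values: 3 genes x 4 samples (hypothetical)
toy_rld <- matrix(c(5, 5.1, 9, 9.2,
                    2, 2.2, 6, 6.1,
                    7, 6.9, 1, 1.3),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("gene", 1:3),
                                  c("ctl_1", "ctl_2", "trt_1", "trt_2")))

# dist() works on rows, so transpose to get sample-to-sample distances
toy_dists <- dist(t(toy_rld))
toy_dist_mat <- as.matrix(toy_dists)

# Hierarchical clustering of the samples, then cut the tree into two groups
toy_clust <- hclust(toy_dists, method = "complete")
groups <- cutree(toy_clust, k = 2)

round(toy_dist_mat, 2)
groups
```

The diagonal of the distance matrix is always zero, since each sample is at distance zero from itself. The grouping comes from the small off-diagonal distances between replicates.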

Another informative heatmap shows the most variably expressed genes in the dataset. An example of this code is shown below.

colors <- colorRampPalette(brewer.pal(9, 'Blues'))(255)

select <- order(rowVars(assay(rld)), decreasing=TRUE)[1:500]
df <- data.frame(Group = colData(rld)[,c('Gtype.Tx')], row.names = rownames(colData(rld)))

pdf(file = paste0(plotPath,'Heatmap_TopVar_', Comparison, '.pdf'), onefile = FALSE, width=10, height=20)
pheatmap(assay(rld)[select,], scale="row",  cluster_rows=TRUE, show_rownames=FALSE, cluster_cols=TRUE, annotation_col=df, fontsize = 7, las = 2, fontsize_row = 7, color = colors, main = '500 Top Variably Expressed Genes Heatmap')
dev.off()
## pdf 
##   3
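
The two gene selections used above can pick quite different genes: ranking by mean favors highly expressed genes even if they are flat across samples, while ranking by variance favors genes that change between samples. The sketch below shows this on a made-up matrix (`toy_expr` is hypothetical), using base R `apply(..., 1, var)` as a stand-in for `rowVars()`.

```r
# Toy expression matrix: 4 genes x 4 samples (hypothetical values)
toy_expr <- matrix(c(10, 10, 10, 10,   # high but constant
                      2,  8,  2,  8,   # moderate mean, highly variable
                      5,  5,  6,  6,   # moderate mean, slightly variable
                      1,  1,  1,  2),  # low mean, nearly constant
                   nrow = 4, byrow = TRUE,
                   dimnames = list(paste0("gene", 1:4), paste0("s", 1:4)))

# Top 2 genes by mean expression
top_by_mean <- rownames(toy_expr)[order(rowMeans(toy_expr), decreasing = TRUE)[1:2]]

# Top 2 genes by variance across samples
top_by_var <- rownames(toy_expr)[order(apply(toy_expr, 1, var), decreasing = TRUE)[1:2]]

top_by_mean
top_by_var
```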

2 Session Info

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-2          pheatmap_1.0.12            
##  [3] ggrepel_0.9.1               dplyr_1.0.5                
##  [5] tidyr_1.1.3                 ggplot2_3.3.3              
##  [7] DESeq2_1.26.0               SummarizedExperiment_1.16.1
##  [9] DelayedArray_0.12.3         BiocParallel_1.20.1        
## [11] matrixStats_0.58.0          Biobase_2.46.0             
## [13] GenomicRanges_1.38.0        GenomeInfoDb_1.22.1        
## [15] IRanges_2.20.2              S4Vectors_0.24.4           
## [17] BiocGenerics_0.32.0        
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7           bit64_4.0.5            tools_3.6.1           
##  [4] backports_1.2.1        bslib_0.2.4            utf8_1.2.1            
##  [7] R6_2.5.0               rpart_4.1-15           Hmisc_4.5-0           
## [10] DBI_1.1.1              colorspace_2.0-0       nnet_7.3-15           
## [13] withr_2.4.2            tidyselect_1.1.0       gridExtra_2.3         
## [16] bit_4.0.4              compiler_3.6.1         htmlTable_2.1.0       
## [19] labeling_0.4.2         sass_0.3.1             scales_1.1.1          
## [22] checkmate_2.0.0        genefilter_1.68.0      stringr_1.4.0         
## [25] digest_0.6.27          foreign_0.8-72         rmarkdown_2.7         
## [28] XVector_0.26.0         base64enc_0.1-3        jpeg_0.1-8.1          
## [31] pkgconfig_2.0.3        htmltools_0.5.1.1      highr_0.9             
## [34] fastmap_1.1.0          htmlwidgets_1.5.3      rlang_0.4.10          
## [37] rstudioapi_0.13        RSQLite_2.2.7          farver_2.1.0          
## [40] jquerylib_0.1.3        generics_0.1.0         jsonlite_1.7.2        
## [43] RCurl_1.98-1.3         magrittr_2.0.1         GenomeInfoDbData_1.2.2
## [46] Formula_1.2-4          Matrix_1.3-2           Rcpp_1.0.6            
## [49] munsell_0.5.0          fansi_0.4.2            lifecycle_1.0.0       
## [52] stringi_1.5.3          yaml_2.2.1             zlibbioc_1.32.0       
## [55] grid_3.6.1             blob_1.2.1             crayon_1.4.1          
## [58] lattice_0.20-41        splines_3.6.1          annotate_1.64.0       
## [61] locfit_1.5-9.4         knitr_1.33             pillar_1.6.0          
## [64] geneplotter_1.64.0     XML_3.99-0.3           glue_1.4.2            
## [67] evaluate_0.14          latticeExtra_0.6-29    data.table_1.14.0     
## [70] png_0.1-7              vctrs_0.3.7            gtable_0.3.0          
## [73] purrr_0.3.4            assertthat_0.2.1       cachem_1.0.4          
## [76] xfun_0.22              xtable_1.8-4           survival_3.2-10       
## [79] tibble_3.1.1           AnnotationDbi_1.48.0   memoise_2.0.0         
## [82] cluster_2.1.2          ellipsis_0.3.1

These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.