workshop wayfinder

Objectives

  • Compare between sets of genes from ORA results
  • Summarize expression differences for genes associated with a GO term of interest

Getting started

In this section, we’ll generate additional visualizations - beyond what was generated by WebGestaltR - for our bulk RNA-seq ORA results.

First, we’ll load a library called ComplexHeatmap that we will be using for some of our visualizations in addition to the plotting that is available to use from ggplot2 after previously loading the tidyverse.

# =========================================================================
# Advanced Visualizations I
# =========================================================================

# -------------------------------------------------------------------------
# Load additional libraries
library(tidyverse)
library(ComplexHeatmap)
# -------------------------------------------------------------------------
# Check current working directory
getwd()

Generate visualizations for ORA results

To create visualizations for the ORA enrichments, we’ll start by reading in the table that WebGestaltR output to file.

# -------------------------------------------------------------------------
# Read in results
enriched_ora = read_delim('results/Project_deficient_vs_control_ORA_GO_BP/enrichment_results_deficient_vs_control_ORA_GO_BP.txt')
Rows: 6 Columns: 11
── Column specification ──────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (5): geneSet, description, link, overlapId, userId
dbl (6): size, overlap, expect, enrichmentRatio, pValue, FDR

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(enriched_ora)
# A tibble: 6 × 11
  geneSet   description link   size overlap expect enrichmentRatio  pValue    FDR overlapId userId
  <chr>     <chr>       <chr> <dbl>   <dbl>  <dbl>           <dbl>   <dbl>  <dbl> <chr>     <chr> 
1 GO:00709… neuron dea… http…   263      14  3.73             3.76 1.98e-5 0.0157 11816;12… Apoe;…
2 GO:00987… detoxifica… http…    77       7  1.09             6.41 1.03e-4 0.0314 11816;14… Apoe;…
3 GO:00620… regulation… http…   235      12  3.33             3.60 1.19e-4 0.0314 11816;12… Apoe;…
4 GO:00065… cellular m… http…   120       8  1.70             4.70 2.91e-4 0.0575 14381;14… Mthfd…
5 GO:00971… intrinsic … http…   227      11  3.22             3.42 3.64e-4 0.0575 12048;12… Bcl2l…
6 GO:00436… regulation… http…    50       5  0.709            7.05 6.72e-4 0.0886 12608;13… Vegfa…

In addition to the enrichment statistics, the table also includes a column called userId that includes the names of the genes that were both annotated for that GO term and part of the list of DE genes of interest that we provided for the enrichment. Let’s pull out those values and transform them into a structure that is more R friendly.

# -------------------------------------------------------------------------
# create empty list
go_geneSets <- list()

# -------------------------------------------------------------------------
# create lists of overlapping DE genes for each GO term
for(i in 1:length(enriched_ora$description)){
  go_name <- enriched_ora$description[i] # descriptive GO term names
  go_genes <- str_split_1(enriched_ora$userId[i], ";") # list of gene symbols from table entry
  go_geneSets[[go_name]] <- c(go_genes)
}

# -------------------------------------------------------------------------
# check the named lists
head(go_geneSets)
$`neuron death`
 [1] "Apoe"    "Bcl2l1"  "Vegfa"   "Ddit3"   "Slc7a11" "Trim2"   "Coro1a"  "G6pdx"   "Mt1"    
[10] "Egln1"   "Jun"     "Cebpb"   "Hipk2"   "Bnip3"  

$detoxification
[1] "Apoe"   "Sesn2"  "Gsr"    "Mt2"    "Mt1"    "Srxn1"  "Hbb-bs"

$`regulation of small molecule metabolic process`
 [1] "Apoe"     "Bcl2l1"   "Grb10"    "Clcn2"    "Acmsd"    "Slc7a11"  "Sesn2"    "Trib3"   
 [9] "Midn"     "Epm2aip1" "Cltc"     "Erfe"    

$`cellular modified amino acid metabolic process`
[1] "Mthfd2"  "P4ha1"   "Chac1"   "Slc7a11" "Cth"     "G6pdx"   "Gsr"     "Egln1"  

$`intrinsic apoptotic signaling pathway`
 [1] "Bcl2l1" "E2f2"   "Cdkn1a" "Ddit3"  "Stk25"  "Chac1"  "Nupr1"  "Trib3"  "Cebpb"  "Hipk2" 
[11] "Bnip3" 

$`regulation of DNA-templated transcription in response to stress`
[1] "Vegfa" "Ddit3" "Sesn2" "Jun"   "Cebpb"

Now we have a list of the names of each GO term and then nested lists of the DE genes for each term.

Investigate gene overlaps

Earlier, we saw that based on the DAG of GO terms in our initial enrichment using the web-browser version of WebGestalt that the “neuron death” and “intrinsic apoptotic signaling pathway” terms. Let’s check what DE genes are shared between those two terms:

# -------------------------------------------------------------------------
# check what genes are shared between the two terms
intersect(go_geneSets$`neuron death`, go_geneSets$`intrinsic apoptotic signaling pathway`)
[1] "Bcl2l1" "Ddit3"  "Cebpb"  "Hipk2"  "Bnip3" 

We see that five genes are shared between the two terms out of the total set that are both annotated for each GO term and provided as DE genes of interest for the enrichment.

While comparing genes between two sets of genes is fairly straightforward, if we wanted to understand how many genes are shared versus unique across all the significant terms in our ORA enrichmemt results using WebGestaltR this approach would be quite tedious.

Instead we can opt for visualizations that use a broader type of set analysis. One option is something called an UpSet plot. Unlike that doesn’t limit the number of sets that can be compared, unlike more traditional approaches like Venn diagrams. With the Complex Heatmap package, we can use the named list as our starting point.

# -------------------------------------------------------------------------
# First make the combination matrix from our list of genes for each GO term
m = make_comb_mat(go_geneSets)
m
A combination matrix with 6 sets and 17 combinations.
  ranges of combination set size: c(1, 7).
  mode for the combination size: distinct.
  sets are on rows.

Top 8 combination sets are:
  neuron death detoxification regulation of small molecule metabolic process cellular modified amino acid metabolic process intrinsic apoptotic signaling pathway regulation of DNA-templated transcription in response to stress   code size
                                                                           x                                                                                                                                                      001000    7
                                                                                                                                                                x                                                                 000010    4
                            x                                                                                                                                                                                                     010000    3
                                                                                                                          x                                                                                                       000100    3
             x                                                                                                                                                  x                                                               x 100011    2
             x                                                                                                            x                                                                                                       100100    2
             x                                                                                                                                                  x                                                                 100010    2
             x                                                                                                                                                                                                                  x 100001    2

Sets are:
                                                              set size
                                                     neuron death   14
                                                   detoxification    7
                   regulation of small molecule metabolic process   12
                   cellular modified amino acid metabolic process    8
                            intrinsic apoptotic signaling pathway   11
  regulation of DNA-templated transcription in response to stress    5

In the combination matrix results, we can see the total number of sets being compared and a representation of the overlaps. This is useful not as nice as something more visual.

Using the combination matrix, we can create a simple UpSet plot comparing the DE genes annotated for each GO term:

# -------------------------------------------------------------------------
# Make simple UpSet plot
UpSet(m)

Let’s breakdown the parts of this plot:

  • On the left are the descriptive names of the GO terms, that we provided.
  • On the right are bar plots summarizing the total set size, which in this case is the total number of genes that were DE in our data for each term.
  • The center area shows the set comparison. The single dots on the right represents the values that are included in that term but not the other terms being compared. The dots connected by lines show the shared values between the indicated terms. In this case, we see sets of 2-3 terms that have at least one shared value.
  • The top bar plots show the counts for each set comparison. The default mode is ‘distinct’ so the height of the bar corresponds to the number of genes that are only part of that set intersection.

Now that we understand the parts of the plot, let’s make some adjustments to make the plot easier to read. You can use the same colors or pick your own from ggplots named options:

# -------------------------------------------------------------------------
# Make simple UpSet plot
upsetGO <- UpSet(m, comb_order = rev(order(comb_size(m))), lwd = 3,
                 comb_col = c("black", "blue", "gold3")[comb_degree(m)],
                 left_annotation = upset_left_annotation(m, add_numbers = TRUE))
upsetGO

Now we have the GO term dots connected with a thicker line to indicate the combination and a color designated for each degree of intersection (black = unique, blue = shared by two terms, gold = shared by three terms). We also have the intersections sorted by size and the set size bar moved to the left with the values added.

Question: What term has the fewest unique associated genes? Which term has the most?

Note - This is just a fraction of all the possible customizations that are available via the ComplexHeatmap package but for the sake of time, we’ll limit our plot adjustments to this, at least for now.

Let’s output our customized UpSet plot to file:

# -------------------------------------------------------------------------
# Output UpSet plot to file
png(file = "./results/figures/deficient_vs_control_GO-BP_UpSetPlot.png", width = 1000, height = 400)
  draw(upsetGO) # draw the plot
dev.off() # Close the graphics device
quartz_off_screen 
                2 

We only have a single comparison between deficient and control conditions for our bulk results, but for experiments that have more than one comparison we can also use these approaches to compare what GO terms are shared/unique between different DE comparisons instead of comparing what genes are unique/shared.


Summarize gene expression for ORA results

Since ORA does not account for the fold-change size or direction, we can combine our enrichment results with our DE statistics to summarize the gene expression for terms of interest.

# -------------------------------------------------------------------------
# If not already loaded, read in diffex results
rsd_diffex = read_csv('inputs/bulk_de_results/de_deficient_vs_control_annotated.csv')
Rows: 16249 Columns: 9
── Column specification ──────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (3): id, symbol, call
dbl (6): baseMean, log2FoldChange, lfcSE, stat, pvalue, padj

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# -------------------------------------------------------------------------
# Take another look at our DE table
head(rsd_diffex)
# A tibble: 6 × 9
  id                 symbol baseMean log2FoldChange lfcSE   stat pvalue   padj call 
  <chr>              <chr>     <dbl>          <dbl> <dbl>  <dbl>  <dbl>  <dbl> <chr>
1 ENSMUSG00000000001 Gnai3   1490.            0.278 0.148  1.88  0.0605  0.325 NS   
2 ENSMUSG00000000028 Cdc45   1749.            0.222 0.129  1.72  0.0853  0.386 NS   
3 ENSMUSG00000000031 H19     2152.            0.136 0.284  0.478 0.633   0.867 NS   
4 ENSMUSG00000000037 Scml2     24.9           0.600 0.562  1.07  0.286  NA     NS   
5 ENSMUSG00000000049 Apoh       7.78         -1.23  1.15  -1.07  0.285  NA     NS   
6 ENSMUSG00000000056 Narf   19654.           -0.201 0.167 -1.20  0.229   0.596 NS   

To visualize the expression for genes annotated as part of small molecule metabolism, which might be interesting since the treatment was a difference in diet, we’ll need to subset our results to the DE genes of interest for that term.

# -------------------------------------------------------------------------
# Subset RSD table to genes that match those in our list for the GO term of interest

rsd_smallmolecule <- rsd_diffex %>% filter(symbol %in% go_geneSets$`regulation of small molecule metabolic process`)
head(rsd_smallmolecule)
# A tibble: 6 × 9
  id                 symbol  baseMean log2FoldChange lfcSE  stat   pvalue        padj call 
  <chr>              <chr>      <dbl>          <dbl> <dbl> <dbl>    <dbl>       <dbl> <chr>
1 ENSMUSG00000002985 Apoe        157.         -1.16  0.345 -3.35 8.14e- 4 0.0329      Down 
2 ENSMUSG00000007659 Bcl2l1    21934.         -0.668 0.209 -3.19 1.40e- 3 0.0459      Down 
3 ENSMUSG00000020176 Grb10      3086.          1.26  0.240  5.27 1.38e- 7 0.0000662   Up   
4 ENSMUSG00000022843 Clcn2      1016.          1.18  0.201  5.90 3.74e- 9 0.00000374  Up   
5 ENSMUSG00000026348 Acmsd      1203.         -0.658 0.200 -3.30 9.66e- 4 0.0362      Down 
6 ENSMUSG00000027737 Slc7a11     119.          2.28  0.367  6.23 4.78e-10 0.000000561 Up   

Let’s start by using the subsetted table to create a simple barplot of the fold-change values for each gene:

# -------------------------------------------------------------------------
# Simple barplot of fold-changes
ggplot(rsd_smallmolecule, aes(x = symbol, y = log2FoldChange)) +
  geom_col()

This is helpful but could be improved with some formatting adjustments and labels. Let’s do that next:

# -------------------------------------------------------------------------
# Create a more sophisticated barplot of fold-changes

rsd_smallmolecule_barplot <- ggplot(rsd_smallmolecule, aes(x = reorder(symbol, -log2FoldChange), y = log2FoldChange, fill = log2FoldChange > 0)) +
  geom_col() +
  labs(
    title = "Fold-Change of Categories",
    x = "Gene symbol",
    y = "log2 fold-change"
  ) +
  scale_fill_manual(values = c("TRUE" = "darkred", "FALSE" = "darkblue"),
                    labels = c("Down-regulated", "Up-regulated"),
                    name = "Regulation") +
  theme_minimal() 

rsd_smallmolecule_barplot

Now we can see we see that the genes order by their fold-change values from our DE results and colored by their status (Up- vs Down- regulated).

Does anything jump out about these results? Are you surprised that some genes are upregulated but other genes in the same term are down regulated?

Let’s also write this out to file:

# -------------------------------------------------------------------------
# Output the barplot to file
png(file = "./results/figures/deficient_vs_control_fold-changes_regulation-of-small-molecule-metabolic-process.png", width = 800, height = 500, res = 100)
  rsd_smallmolecule_barplot
dev.off() # Remember to close the graphics device
quartz_off_screen 
                2 

Clean up session

To ensure that we don’t leave objects in our environment that will slow down our log-in and session load time tomorrow morning, let’s clear our environment and re-start our R session now.

  1. To power down the session, click the orange power button in the upper right corner of the RStudio window. Note: There is no confirmation dialogue, the session will simply end.
  2. Click the “Start New Session” button to restart the RStudio session.
  3. Re-load the libraries:
# -------------------------------------------------------------------------
# Load primary libraries
library(WebGestaltR)
library(tidyverse)

setwd("~/IFUN_R/")



Summary

  • Set comparisons at the gene level can be useful for understanding our enrichment results
  • UpSet plots are an easy way to compare and visualize the common and unique values for a large number of sets.
  • We can use more familiar visualization approaches like barplots to see if there are any interest patterns in our expression data for genes associated with functional terms of interest



Previous lesson Top of this lesson Next lesson