Workflow Overview




Introduction


Starting with labeled clustered data, for each labeled cluster, we can compare between case and control within that cluster to identify genes that are impacted by the experimental perturbation(s) for that cell-type or subtype.


After identifying what cell-types are likely present in our data, we can finally consider experimental conditions and use differential expression comparisons to address the biological question at hand for an experiment.

Objectives

  • Run “standard” differential expression comparisons with cells as replicates
  • Run “pseudobulk” differential expression comparisons with samples as replicates

We already introduced DE comparisons in the marker identification section of this workshop, but here we will show how to run comparisons between experimental conditions for each annotated cluster.

As a reminder, our data includes cells isolated from tissue from day 0 (prior to injury) as controls, and days 7 and 21 post-injury as experimental conditions.



Differential Expression

For single-cell data there are generally two types of approaches for running differential expression: either a cell-level or a sample-level approach.

For cell-level comparisons, simpler statistical methods like a t-test or the Wilcoxon rank-sum test can be used, as can single-cell specific methods that model cells individually, like MAST.
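To make the cell-level idea concrete, here is a minimal base R sketch that tests a single gene with every cell treated as an independent replicate. The expression values are simulated and the variable names are hypothetical; this is not the workshop data, and it is not how Seurat implements its tests internally.

```r
# Toy example: simulated log-normalized expression for ONE gene in two groups of cells.
# All values are simulated; nothing here comes from the workshop dataset.
set.seed(42)
expr_case    = rnorm(200, mean = 1.5, sd = 1)  # e.g. cells from one condition
expr_control = rnorm(200, mean = 1.0, sd = 1)  # e.g. cells from the other condition

# Wilcoxon rank-sum test, treating every cell as an independent replicate
wilcox_res = wilcox.test(expr_case, expr_control)
wilcox_res$p.value

# Welch t-test on the same values, for comparison
ttest_res = t.test(expr_case, expr_control)
ttest_res$p.value
```

With hundreds of cells per group, even modest shifts in expression give very small p-values, which is part of why fold-change thresholds matter when interpreting cell-level results.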

As mentioned earlier, many of the tools developed for bulk RNA-seq, such as edgeR or DESeq2, have been shown to have good performance for single-cell data, particularly when the count data is aggregated into sample-level “pseudobulk” values for each cluster.
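As a rough illustration of what aggregating into pseudobulk values means, the base R sketch below sums raw counts across all cells that share a cluster and sample label. The counts matrix, the metadata, and the sample names are made up for this example.

```r
# Toy counts matrix: 4 genes x 6 cells (simulated values)
set.seed(1)
counts = matrix(rpois(24, lambda = 5), nrow = 4,
                dimnames = list(paste0('gene', 1:4), paste0('cell', 1:6)))

# Hypothetical per-cell metadata: cluster label and sample (replicate) of origin
cell_meta = data.frame(
  cell_type = c('Pericyte', 'Pericyte', 'Pericyte', 'Pericyte', 'Monocyte', 'Monocyte'),
  sample    = c('Day7_rep1', 'Day7_rep1', 'Day21_rep1', 'Day21_rep1', 'Day7_rep1', 'Day21_rep1'))

# One pseudobulk group per cluster + sample combination
group = paste(cell_meta$cell_type, cell_meta$sample, sep = '_')

# rowsum() sums rows by group, so transpose to cells x genes, sum, and transpose back
pseudobulk = t(rowsum(t(counts), group = group))
pseudobulk
```

The result has one column per cluster and sample combination, so the replicates in a downstream test are samples rather than individual cells.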

As discussed in the single-cell best practices book and in the Ouyang Lab’s marker gene identification materials, there are active benchmarking efforts and threshold considerations for single-cell data.

Standard comparisons

First we’ll run cell-level comparisons for our data for the pericyte cluster, which seemed to have an interesting pattern between time points in the UMAP plots, starting with cells from the D21 vs D7 conditions. We’ll need to ensure our cells are labeled to reflect both the cluster and condition identities before running our comparison using FindMarkers() and summarizing the results:

# =========================================================================
# Differential Expression Analysis
# =========================================================================

# Create combined label of day + celltype to use for DE contrasts
geo_so$day.celltype = paste(geo_so$time, geo_so$cell_type, sep = '_')
# check labels
unique(geo_so$day.celltype)
 [1] "Day0_Hematopoietic Stem Cell"  "Day0_Pericyte"                 "Day0_Monocyte"                 "Day0_Dendritic Cell"          
 [5] "Day0_Fibroblast"               "Day0_B cell"                   "Day0_Platelet"                 "Day0_Stem Cell"               
 [9] "Day0_NA"                       "Day0_Myofibroblast"            "Day0_Unknown"                  "Day0_T Cell"                  
[13] "Day0_Infl. Macrophage"         "Day0_Smooth muscle cell"       "Day7_Monocyte"                 "Day7_Fibroblast"              
[17] "Day7_Infl. Macrophage"         "Day7_Hematopoietic Stem Cell"  "Day7_Myofibroblast"            "Day7_Platelet"                
[21] "Day7_Dendritic Cell"           "Day7_Smooth muscle cell"       "Day7_Pericyte"                 "Day7_Stem Cell"               
[25] "Day7_B cell"                   "Day7_T Cell"                   "Day7_Unknown"                  "Day7_NA"                      
[29] "Day21_Hematopoietic Stem Cell" "Day21_Dendritic Cell"          "Day21_Infl. Macrophage"        "Day21_Pericyte"               
[33] "Day21_Myofibroblast"           "Day21_Monocyte"                "Day21_Unknown"                 "Day21_Platelet"               
[37] "Day21_T Cell"                  "Day21_Stem Cell"               "Day21_Smooth muscle cell"      "Day21_NA"                     
[41] "Day21_Fibroblast"              "Day21_B cell"                 
# Reset cell identities to the combined condition + celltype label
Idents(geo_so) = 'day.celltype'
# -------------------------------------------------------------------------
# Consider pericyte cluster D21 v D7 & run DE comparison using wilcoxon test
de_cell_pericyte_D21_vs_D7 = FindMarkers(
    object = geo_so,
    slot = 'data', test.use = 'wilcox',
    ident.1 = 'Day21_Pericyte', ident.2 = 'Day7_Pericyte')

head(de_cell_pericyte_D21_vs_D7)
                p_val avg_log2FC pct.1 pct.2     p_val_adj
Gm10076 3.861416e-198 -1.2639245 0.761 0.976 7.939458e-194
Mgp     1.611030e-122  2.3590159 0.883 0.658 3.312439e-118
Dcn     1.023011e-116  0.9844431 0.998 0.995 2.103414e-112
Cst3    7.775960e-113  1.9614277 0.956 0.894 1.598815e-108
Tpt1    2.353162e-111  0.6086616 0.996 0.996 4.838337e-107
Rps19   8.676751e-106 -0.5712829 0.990 0.992 1.784027e-101
# -------------------------------------------------------------------------
# Add gene symbols names and save

# Add rownames as a column for output
de_cell_pericyte_D21_vs_D7$gene = rownames(de_cell_pericyte_D21_vs_D7)

# save to file
write_csv(de_cell_pericyte_D21_vs_D7, 
          file = 'results/tables/de_standard_pericyte_D21_vs_D7.csv')

# summarize diffex results
table(de_cell_pericyte_D21_vs_D7$p_val_adj < 0.05 & 
        abs(de_cell_pericyte_D21_vs_D7$avg_log2FC) > 1.5)

FALSE  TRUE 
 9869   268 

After the first 3 lines of the code block above, the schematic changes as follows:

Image: Schematic after setting the Idents().

Note - the avg_log2FC threshold of 1.5 we use here is quite stringent (it corresponds to a roughly 2.8-fold difference), as the default log2FC threshold for the function is 0.25. However, the default threshold corresponds to only a 19% difference in RNA levels, which is quite permissive.
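The arithmetic behind these thresholds can be checked directly. The helper below is a small sketch; log2fc_to_percent is a name made up for this example, not a Seurat function.

```r
# Convert a log2 fold-change into a percent difference relative to the reference group
log2fc_to_percent = function(log2fc) {
  (2^log2fc - 1) * 100
}

2^0.25                   # fold-change at the 0.25 threshold (~1.19)
log2fc_to_percent(0.25)  # ~19% difference
2^1.5                    # fold-change at the 1.5 threshold (~2.83)
log2fc_to_percent(1.5)   # ~183% difference
```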

If there is enough time - we can also compare between cells from the D7 and D0 conditions within the pericyte population.

# -------------------------------------------------------------------------
# Compare pericyte cluster D7 v D0
de_cell_pericyte_D7_vs_D0 = FindMarkers(
    object = geo_so,
    slot = 'data', test.use = 'wilcox',
    ident.1 = 'Day7_Pericyte', ident.2 = 'Day0_Pericyte')

head(de_cell_pericyte_D7_vs_D0)
                p_val avg_log2FC pct.1 pct.2     p_val_adj
Gsn     1.247574e-234  -4.492126 0.732 1.000 2.565138e-230
Tsc22d3 9.259539e-203  -4.184011 0.101 0.744 1.903854e-198
Gm10076 4.733506e-184   2.469264 0.976 0.513 9.732561e-180
Postn   3.125845e-181   4.354646 0.925 0.194 6.427050e-177
Cthrc1  4.189361e-180   6.740953 0.860 0.030 8.613744e-176
Fn1     2.443921e-143   2.105758 0.986 0.653 5.024947e-139
# -------------------------------------------------------------------------
# Add rownames for D7 v D0 results
de_cell_pericyte_D7_vs_D0$gene = rownames(de_cell_pericyte_D7_vs_D0)

# summarize results
table(de_cell_pericyte_D7_vs_D0$p_val_adj < 0.05 & 
        abs(de_cell_pericyte_D7_vs_D0$avg_log2FC) > 1.5)

FALSE  TRUE 
10851   766 

This same approach can be extended to run pairwise comparisons between conditions for each annotated cluster of interest.
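One way to organize that extension is to build a small table of contrasts first and then loop over it. The sketch below only constructs the identity labels in base R, assuming the 'day_celltype' label pattern used earlier in this lesson; the FindMarkers() loop is left commented out because it needs the Seurat object from the workshop.

```r
# Labels follow the '<day>_<celltype>' pattern created earlier in this lesson
days = c('Day0', 'Day7', 'Day21')
cell_type = 'Pericyte'

# All pairwise combinations of time points, with the later day listed first
day_pairs = t(combn(rev(days), 2))
contrasts = data.frame(
  ident.1 = paste(day_pairs[, 1], cell_type, sep = '_'),
  ident.2 = paste(day_pairs[, 2], cell_type, sep = '_'))
contrasts

# Each row could then be passed to FindMarkers(); commented out here because it
# requires the geo_so object and Seurat to be loaded:
# de_results = lapply(seq_len(nrow(contrasts)), function(i) {
#   FindMarkers(object = geo_so, test.use = 'wilcox',
#               ident.1 = contrasts$ident.1[i], ident.2 = contrasts$ident.2[i])
# })
```

The same table could be built for each annotated cluster by wrapping this in a loop over cell type labels.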

Pseudobulk comparisons

Advances in the technology and decreased sequencing costs now allow for larger-scale single-cell experiments that include replicates. In addition, a study by Squair et al (2021) highlighted the possibility of inflated false discovery rates for cell-level approaches, since cells isolated from the same sample are unlikely to be statistically independent. For both reasons, the use of sample-level or “pseudobulk” comparisons can be advantageous.
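The non-independence problem can be illustrated with a small simulation. The sketch below is a toy example in base R, not the analysis from Squair et al: it simulates one gene with no true difference between conditions, adds sample-to-sample variation, and compares how often a cell-level t-test and a sample-level t-test report p < 0.05. The sample sizes and variances are arbitrary assumptions chosen for illustration.

```r
set.seed(123)
n_sim = 200      # number of simulated null genes
n_samples = 3    # replicate samples per condition
n_cells = 100    # cells per sample

one_sim = function() {
  # Simulate one condition: each sample has its own offset shared by all its cells
  sim_group = function() {
    sample_effect = rnorm(n_samples, sd = 1)
    sample_id = rep(seq_len(n_samples), each = n_cells)
    expr = sample_effect[sample_id] + rnorm(n_samples * n_cells, sd = 1)
    list(expr = expr, sample_means = as.numeric(tapply(expr, sample_id, mean)))
  }
  a = sim_group()
  b = sim_group()
  c(cell   = t.test(a$expr, b$expr)$p.value,                  # cells as replicates
    sample = t.test(a$sample_means, b$sample_means)$p.value)  # samples as replicates
}

p_vals = replicate(n_sim, one_sim())

# Fraction of null genes called significant by each approach
fpr = rowMeans(p_vals < 0.05)
fpr
```

Because there is no real difference between the simulated conditions, every significant call is a false positive, and the cell-level test makes far more of them when cells within a sample are correlated.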

We’ll run pseudobulk comparisons for our data for the pericyte cluster, starting with the D21 vs D7 conditions. We’ll need to generate the aggregated counts first (ensuring that we are grouping cells by replicate labels), before labeling the aggregated samples to reflect the cluster and condition. Then we will run our comparison using FindMarkers() but specifying DESeq2 as our method before summarizing the results:

# -------------------------------------------------------------------------
# Create pseudobulk object
pseudo_catch_so = 
  AggregateExpression(geo_so, 
                      assays = 'RNA',
                      return.seurat = TRUE,
                      group.by = c('cell_type', 'time', 'replicate'))

# Set up labels to use for comparisons & assign as cell identities
pseudo_catch_so$day.celltype = paste(pseudo_catch_so$time, pseudo_catch_so$cell_type, sep = '_')
Idents(pseudo_catch_so) = 'day.celltype'
# -------------------------------------------------------------------------
# Run pseudobulk comparison between Day 21 and Day 7, using DESeq2
de_pseudo_pericyte_D21_vs_D7 = FindMarkers(
    object = pseudo_catch_so, 
    ident.1 = 'Day21_Pericyte', ident.2 = 'Day7_Pericyte', 
    test.use = 'DESeq2')

# Take a look at the table
head(de_pseudo_pericyte_D21_vs_D7)
              p_val avg_log2FC pct.1 pct.2    p_val_adj
H2-Ab1 8.308772e-79  -2.518827     1     1 2.200911e-74
Mgp    8.525039e-64   2.602263     1     1 2.258197e-59
Robo1  1.414700e-61  -1.286805     1     1 3.747400e-57
Cxcl1  1.277125e-60  -1.849225     1     1 3.382975e-56
Tyrobp 6.476869e-57  -1.694030     1     1 1.715658e-52
Ank3   4.042556e-53  -1.498265     1     1 1.070833e-48
# -------------------------------------------------------------------------
# add genes rownames as a column for output
de_pseudo_pericyte_D21_vs_D7$gene = rownames(de_pseudo_pericyte_D21_vs_D7)

# save results
write_csv(de_pseudo_pericyte_D21_vs_D7,
          file = 'results/tables/de_pseudo_pericyte_D21_vs_D7.csv')

# review pseudobulk results, using the same thresholds
table(de_pseudo_pericyte_D21_vs_D7$p_val_adj < 0.05 & 
        abs(de_pseudo_pericyte_D21_vs_D7$avg_log2FC) > 1.5)

FALSE  TRUE 
22403    34 

Since we’re working with pseudobulk data, unlike in the marker identification section, there is no percentage of cells expressing each gene that we need to represent, so we can summarize our DE results with a volcano plot:

# -------------------------------------------------------------------------
# Make a volcano plot of pseudobulk diffex results
pseudo_pericyte_D21_vs_D7_volcano = 
  ggplot(de_pseudo_pericyte_D21_vs_D7, aes(x = avg_log2FC, y = -log10(p_val))) + 
  geom_point()

ggsave(filename = 'results/figures/volcano_de_pseudo_pericyte_D21_vs_D7.png', 
       plot = pseudo_pericyte_D21_vs_D7_volcano,
       width = 7, height = 7, units = 'in')

pseudo_pericyte_D21_vs_D7_volcano

Further examining DE results

We can also overlay the expression of interesting differentially expressed genes back onto our UMAP plots to highlight the localization and possible function, again using the FeaturePlot function.

# -------------------------------------------------------------------------
# UMAP feature plot of Cd55 gene
FeaturePlot(geo_so, features = "Cd55", split.by = "time")

We found Cd55 based on the differential expression comparison in the pericyte population between Day 7 and Day 21, but looking at the feature plot of expression, we also see high expression in a subset of cells on Day 0. This is interesting, since according to Shin et al (2019), CD55 regulates bone mass in mice.

It also looks like a high percentage of cells express this gene in some of the other precursor populations at the top right of our plots, which might suggest a subpopulation that we could try to identify, particularly given the role of this gene and our interest in determining why aberrant bone can form after injury.

Next steps

While looking at individual genes can reveal interesting patterns, as in the case of Cd55, it’s not a very efficient process. So after running ‘standard’ and/or pseudobulk differential expression comparisons, we can use the same types of tools used downstream of bulk RNA-seq to interpret these results, which we’ll touch on in the next section.

Save our progress

# -------------------------------------------------------------------------
# Discard all ggplot objects currently in environment
# (Ok since we saved the plots as we went along.)
rm(list=names(which(unlist(eapply(.GlobalEnv, is.ggplot))))); 
gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   9554216  510.3   18239464  974.1  12246073  654.1
Vcells 348249914 2657.0  922625602 7039.1 922622737 7039.1

We’ll save a copy of the final Seurat object to file.

# -------------------------------------------------------------------------
# Save Seurat object
saveRDS(geo_so, file = 'results/rdata/geo_so_sct_integrated_final.rds')

Summary


Starting with labeled clustered data, for each labeled cluster, we can compare between case and control within that cluster to identify genes that are impacted by the experimental perturbation(s) for that cell-type or subtype.


Reviewing these results should allow us to identify genes of interest that are impacted by injury and in the context of the cell-types in which they are differentially expressed, formalize some hypotheses for what cell-types or biological processes might be contributing to aberrant bone formation.


These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.



