A frequent bottleneck in single-cell analysis is annotating the algorithmicly generated clusters, as it requires bridging the gap between the data and prior knowledge (source). While generating markers for each cluster it important, it may or may not be sufficient to assign cell-type or sub-type labels without bringing in additional sources.
![]() |
Starting with clustered data, we can summarize and compare gene expression for known gene markers within each cluster, irrespective of sample or condition, to help label clusters with the appropriate cell-type or subtype. |
scCATCH
cell-type predictions to annotate our
clustersLike the previous sections, the process to assign cell-types to clusters can be very iterative. In addition, the steps to reach a “Figure 1” of labeled clusters may not be presented in detail, can be very dataset dependent, and often is more challenging for less characterized tissues.
Automated tools have the advantage of being able to compare between the expression patterns in our dataset and large numbers of reference datasets or databases at a scale that is not feasible to do manually.
As described in more detail by the Ouyang Lab and summarized in the figure below, there are many computational tools that aim to assign cell type labels for single-cell data. These methods generally fall into three categories:
However, across any of these approaches the quality of the reference data (and reliability of the authors labels) and relevancy to your specific tissue/experiment (and the resolution of your biological question) is crucial. Additionally, it’s important to consider that rare or novel cell populations may not be present or well-characterized in available references and that even after filtering, some clusters might correspond to stressed or dying cells and not a particular cell-type or subtype. Therefore, any prediction should be reviewed and considered in the context both marker gene expression for the dataset and knowledge of the biological system and broader literature.
Some tools and references are available solely or primarily for human tissues (and not mouse or rat), particular for tissues other than PBMCs and the brain. For human data, if a relevant reference is available for your experiment, we would recommend trying Azimuth (created by authors of Seurat). 10x has a tutorial that includes example of using Azimuth, including a feature of the tool that allows for first pass of cell-type assignment of more common cell-types followed by identifying rarer populations that may not be identified in the first pass.
A tool we often use for both mouse and human data cell-type predictions is called scCATCH which, per the author’s description in Shao et al (2020), annotates cell-types using a “tissue-specific cellular taxonomy reference database (CellMatch) and [an] evidence-based scoring (ES) protocol”. The CellMatch reference is compiled from CellMarker (Zhang et al., 2019b), MCA (Han et al., 2018), CancerSEA (Yuan et al., 2019), and the CD Marker Handbook and PMIDs for relevant literature are reported in the prediction results.
First, we need to load the scCATCH library. Then, we’ll double check
that we are using the expected resolution cluster results (this is
particularly important if we generated multiple resolutions in our
clustering steps), before creating a new object from our
counts
data with createscCATCH()
and adding
our marker genes to the scCATCCH object.
To increase the speed and accuracy of our predictions, we’ll create query of relevant tissues (which requires some prior knowledge of the experiment and using the scCATCH wiki to select tissues from the species) before we run the tool:
##### Day 3 - Cell Type Annotation
# Load scCATCH -----------------------------------------------------------
library(scCATCH)
# check that cell identities are set to expected resolution
all(Idents(geo_so) == geo_so$integrated.sct.rpca.clusters)
[1] TRUE
# Annotate clusters using scCATCH ----------------------------------------
# create scCATCH object, using count data
geo_catch = createscCATCH(data = geo_so@assays$SCT@counts, cluster = as.character(Idents(geo_so)))
# add marker genes to use for predictions
geo_catch@markergene = geo_markers
# specify tissues/cell-types from the scCATCH reference
geo_catch@marker = cellmatch[cellmatch$species == 'Mouse' & cellmatch$tissue %in% c('Blood', 'Peripheral Blood', 'Muscle', 'Skeletal muscle', 'Epidermis', 'Skin'), ]
# run scCATCH to generate predictions
geo_catch = findcelltype(geo_catch)
# look at the predictions
geo_catch@celltype %>% select(cluster, cell_type, celltype_score)
cluster cell_type celltype_score
1 0 Pericyte 0.75
2 1 Hematopoietic Stem Cell 0.86
3 2 Monocyte 0.82
4 3 Muscle Progenitor Cell 0.71
5 4 Macrophage 0.82
6 5 Dendritic Cell 0.86
7 6 Hematopoietic Stem Cell 0.9
8 7 Stem Cell 0.88
9 8 Hematopoietic Stem Cell 0.91
10 9 Hematopoietic Stem Cell 0.86
11 10 Dendritic Cell, Monocyte 0.82, 0.82
12 11 Regulatory T Cell 0.9
13 12 Dendritic Cell 0.85
14 13 Dendritic Cell, Progenitor Cell 0.71, 0.71
15 14 Muscle Satellite Cell 0.94
16 15 CD8+ T Cell 0.88
17 16 Stem Cell 0.84
18 17 Hematopoietic Stem Cell 0.87
19 18 Stromal Cell 0.66
20 19 Hematopoietic Stem Cell 0.88
21 20 Muscle Cell 0.69
22 21 Basophil 0.73
23 22 Stem Cell 0.8
When we look at our results we can see the cell type score, which
gives us an idea of the confidence of that prediction. Not shown here
but the full celltype
table also includes marker genes and
PMIDs for relevant literature for each prediction.
In our experience, these kinds of results often help guide cluster annotation but scores can vary and the predictions may need to be revised based on researcher’s knowledge of the biological system. As these cell-types correspond to the cell-types and subtypes we’d expect to be present in these data and most of the prediction scores are quite high, we can reasonably use these results to annotate our clusters with some minor adjustments.
To confirm and refine the scCATCH predictions, we’ll spot check some known markers for immune populations. Then we’ll look look at some other key marker genes from some other relevant resources like Chen et al (2021), Buechler et al (2021) Roman (2023), Li et al (2022) and Nestorowa et al (2016) to see if other modifications should be made to the scCATCH predictions:
# Create lists of immune cells and associated gene markers --------------
immune_markers = list()
immune_markers[['Inflam. Macrophage']] = c('Cd14', 'Cxcl2') # Cd14 a- monocyte/macrophage cells
immune_markers[['Platelet']] = c('Pf4')
immune_markers[['Mast cells']] = c('Gata2', 'Kit')
immune_markers[['NK cells']] = c('Nkg7', 'Klrd1')
immune_markers[['B-cell']] = c( 'Ly6d', 'Cd19', 'Cd79b', 'Ms4a1')
immune_markers[['T-cell']] = c( 'Cd3d','Cd3e','Cd3g') # also Thy1
# Plot other immune to assist with cluster identification ---------------
immune_markers_plot = DotPlot(geo_so, features = immune_markers, assay = 'SCT') +
theme(text=element_text(size=10), axis.text.x = element_text(angle = 45, vjust = 0.5))
Warning: The `facets` argument of `facet_grid()` is deprecated as of ggplot2 2.2.0.
ℹ Please use the `rows` argument instead.
ℹ The deprecated feature was likely used in the Seurat package.
Please report the issue at <https://github.com/satijalab/seurat/issues>.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
# save to file
ggsave(filename = 'results/figures/immune_markers_sct_dot_plot.png',
plot = immune_markers_plot, width = 10, height = 5, units = 'in')
immune_markers_plot
# Create lists of other cells and associated gene markers ---------------------------------------
other_markers = list()
other_markers[['Pericyte']] = c('Acan','Sox9')
other_markers[['SMC']] = c('Acta2', 'Myh11') # SMC = mesenchymal smooth-muscle cell/mesenchymal lineage
other_markers[['Keratinocytes']] = c('Thy1', 'Dlk1') # fibro progenitors aso=Thy1
other_markers[['Myofibroblasts']] = c('Tmem100', 'Cd34', 'Ly6c1') # hematopoetic stem/activated fibroblast=Cd34
other_markers[['Fibroblast']] = c('Dpt', 'Fn1', 'Col3a1') # activated fib = Fn1
other_markers[['Endothelial']] = c('Pecam1', 'Cd38') # from wound healing; Pecam1 also exp in endothelial
other_markers[['HSC']] = c('Ltb', 'Cd74') # less well defined/conflicting definitions
other_markers[['Erythroid']] = c('Hba-a1')
# Plot known cell-type markers ---------------------------------------
other_markers_dot_plot = DotPlot(geo_so, features = other_markers, assay = 'SCT') +
theme(text=element_text(size=10), axis.text.x = element_text(angle = 45, vjust = 0.5))
# save to file
ggsave(filename = 'results/figures/other_markers_sct_dot_plot.png',
plot = other_markers_dot_plot, width = 12, height = 5, units = 'in')
other_markers_dot_plot
In the first plot, B-cell and T-cell markers seem to line up with the
predictions and are limited to single clusters. However, macrophage and
dendrocyte markers match to multiple clusters including some annotated
with different cell types, so we can consider modifying those cluster
labels.
From the other marker genes, the patterns are less clear so we may want to test other clustering parameters and discuss the results with a researcher familiar with the expected cell types. However, we can notice some patterns that we can use to refine our cluster annotations.
Often we have prior information about what cell types are expected in
our samples and key marker genes for those populations. This can be an
important part of evaluating our clusters, since if genes that are known
markers for a specific cell type are found in too many or too few
clusters as that can suggest that re-clustering is needed or that some
of the clusters should be manually combined before annotating. We can
create lists of markers used in figures from the original
paper before using the same DotPlot()
function to
visualize the expression level and frequency of these genes in our
current clusters:
# Visualize manually selected marker genes --------------------------------
# Create lists of genes from paper
fig1g_markers = c('Cxcl1', 'Cxcl2', 'Ccl2', 'Ccl3', 'Ccl4', 'Il1b', 'Il6b', 'Tnf', 'Tgfb1', 'Tgfb2', 'Tgfb3', 'Cxcl5')
fig1h_markers = c('Cxcr2', 'Csf1r', 'Csf3r', 'Tgfbr1', 'Tgfbr3', 'Il1r1', 'Il6ra', 'Lifr', 'Tgfbr2')
# create DotPlots for genes from paper
fig1g_sct_dot_plot = DotPlot(geo_so, features = fig1g_markers, assay = 'SCT')
Warning: The following requested variables were not found: Il6b
fig1h_sct_dot_plot = DotPlot(geo_so, features = fig1h_markers, assay = 'SCT')
# save plots to file
ggsave(filename = 'results/figures/markers_fig1g_sct_dot_plot.png',
plot = fig1g_sct_dot_plot, width = 8, height = 6, units = 'in')
ggsave(filename = 'results/figures/markers_fig1h_sct_dot_plot.png',
plot = fig1h_sct_dot_plot, width = 8, height = 6, units = 'in')
fig1g_sct_dot_plot
fig1h_sct_dot_plot
For known marker genes, it’s important to note that since scRNA-seq is
only measuring transcriptional signals that markers at the protein level
(e.g used for approaches like FACS) may be less effective. An
alternative or complement to using marker genes could be methods like
using gene set enrichment (GSEA) as
demonstrated in the OSCA book to aid in annotations. However, the
book “Best
practices for single-cell analysis across modalities” by Heumos, Schaar,
Lance, et al. points out that “it is often useful to work together
with experts … [like a] biologist who has more extensive knowledge of
the tissue, the biology, the expected cell types and markers etc.”. In
our experience, we find that experience and knowledge of the researchers
we work with is invaluable.
We can also generate the same plots, but using the unintegrated data
by specifying the RNA
assay:
# Visualize manually selected marker genes (from unintegrated data) ------
rna_dot_plot = DotPlot(geo_so, features = fig1g_markers, assay = 'RNA')
fig1h_rna_dot_plot = DotPlot(geo_so, features = fig1h_markers, assay = 'RNA')
ggsave(filename = 'results/figures/markers_fig1g_rna_dot_plot.png', plot = fig1g_rna_dot_plot, width = 8, height = 6, units = 'in')
ggsave(filename = 'results/figures/markers_fig1h_rna_dot_plot.png', plot = fig1h_rna_dot_plot, width = 8, height = 6, units = 'in')
Next, we’ll modify the cell type predictions and add the labels to our Seurat object to replace our clusters’ numerical identities. Note: we will create a new metadata object where we join cell types. However, this will destroy the row names - which will cause a problem in Seurat - so we have to add them back.
# Annotate clusters using modified predictions ----------------------------
# First - Extract the cell types only from the predictions
celltype_annos = geo_catch@celltype %>% select(cluster, cell_type) %>%
mutate(cluster = factor(cluster, levels = c(0:22))) %>% arrange(cluster)
celltype_annos
cluster cell_type
1 0 Pericyte
2 1 Hematopoietic Stem Cell
3 2 Monocyte
4 3 Muscle Progenitor Cell
5 4 Macrophage
6 5 Dendritic Cell
7 6 Hematopoietic Stem Cell
8 7 Stem Cell
9 8 Hematopoietic Stem Cell
10 9 Hematopoietic Stem Cell
11 10 Dendritic Cell, Monocyte
12 11 Regulatory T Cell
13 12 Dendritic Cell
14 13 Dendritic Cell, Progenitor Cell
15 14 Muscle Satellite Cell
16 15 CD8+ T Cell
17 16 Stem Cell
18 17 Hematopoietic Stem Cell
19 18 Stromal Cell
20 19 Hematopoietic Stem Cell
21 20 Muscle Cell
22 21 Basophil
23 22 Stem Cell
# Update annotations, remembering that cluster 0 = row 1 in table ---------
celltype_annos$cell_type[c(6,15,16)] <- "Inflammatory macrophage" # resolve cluster 5, 14, 15
celltype_annos$cell_type[c(12)] <- "Macrophage"
celltype_annos$cell_type[c(5,8,10)] <- "Platelet" # clusters 4,7,9
celltype_annos$cell_type[c(1,22)] <- "Pericyte"
celltype_annos$cell_type[c(2,9)] <- "Fibroblast" # revise clusters 1,8 based on markers
celltype_annos$cell_type[c(7)] <- "Myofibroblast" # revise cluster 6
celltype_annos$cell_type[c(4,14,17)] <- "Hematopoietic stem cell" # based on markers but could further revise
celltype_annos$cell_type[c(11,20)] <- "Mesenchymal stem/stromal cell" # based on Acta2 signal; cluster 10, 19
celltype_annos$cell_type[c(19)] <- "Erythroid" # cluster 18
celltype_annos$cell_type[c(23)] <- "Mast"
celltype_annos$cell_type[c(18, 21)] <- "Unknown" # since such small populations, reset cluster 17 & 20 as unknown for now
# Merge cell types in but as a new table to slide into @meta.data ----------
copy_metadata = geo_so@meta.data
new_metadata = copy_metadata %>% left_join(celltype_annos, by = c('integrated.sct.rpca.clusters' = 'cluster'))
rownames(new_metadata) = rownames(geo_so@meta.data) # We are implicitly relying on the same row order!
# Replace the meta.data
geo_so@meta.data = new_metadata
head(geo_so@meta.data)
orig.ident | nCount_RNA | nFeature_RNA | day | replicate | percent.mt | nCount_SCT | nFeature_SCT | integrated.sct.rpca.clusters | seurat_clusters | unintegrated.sct.pca.clusters | cell_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
HODay0replicate1_AAACCTGAGAGAACAG-1 | HO.Day0.replicate1 | 10234 | 3226 | Day0 | replicate1 | 1.240962 | 6062 | 2867 | 1 | 2 | 2 | Fibroblast |
HODay0replicate1_AAACCTGGTCATGCAT-1 | HO.Day0.replicate1 | 3158 | 1499 | Day0 | replicate1 | 7.536416 | 4607 | 1509 | 1 | 2 | 2 | Fibroblast |
HODay0replicate1_AAACCTGTCAGAGCTT-1 | HO.Day0.replicate1 | 13464 | 4102 | Day0 | replicate1 | 3.112002 | 5314 | 2370 | 1 | 2 | 2 | Fibroblast |
HODay0replicate1_AAACGGGAGAGACTTA-1 | HO.Day0.replicate1 | 577 | 346 | Day0 | replicate1 | 1.559792 | 3877 | 1031 | 10 | 8 | 8 | Mesenchymal stem/stromal cell |
HODay0replicate1_AAACGGGAGGCCCGTT-1 | HO.Day0.replicate1 | 1189 | 629 | Day0 | replicate1 | 3.700589 | 4166 | 915 | 1 | 2 | 2 | Fibroblast |
HODay0replicate1_AAACGGGCAACTGGCC-1 | HO.Day0.replicate1 | 7726 | 2602 | Day0 | replicate1 | 2.938131 | 5865 | 2588 | 1 | 2 | 2 | Fibroblast |
Checkpoint : Has the metadata for your
geo_so
object been updated?
We have now added a “cell_type” column to the meta.data
table:
Lastly, we can generate a revised UMAP plot with our descriptive
cluster labels by using our updated Seurat object and providing the new
cell_type
label for the group.by
argument:
# Make a labeled UMAP plot of clusters ------------------------------------
catch_umap_plot = DimPlot(geo_so, group.by = 'cell_type', label = TRUE, reduction = 'umap.integrated.sct.rpca')
catch_umap_plot
ggsave(filename = 'results/figures/umap_integrated_catch.png', plot = catch_umap_plot, width = 10, height = 8, units = 'in')
catch_umap_condition_plot = DimPlot(geo_so, group.by = 'cell_type', split.by = 'day', label = TRUE, reduction = 'umap.integrated.sct.rpca')
ggsave(filename = 'results/figures/umap_integrated_catch_byCondition.png',
plot = catch_umap_condition_plot, width = 10, height = 8, units = 'in')
catch_umap_condition_plot
# Discard all ggplot objects currently in environment ---------------------
# Ok since we saved the plots as we went along
rm(list=names(which(unlist(eapply(.GlobalEnv, is.ggplot)))));
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 6904063 368.8 11863601 633.6 11863601 633.6
Vcells 336463314 2567.1 1109312410 8463.4 1108694243 8458.7
We’ll save the scCATCH object. The Seurat object has not been changed in this module.
# Save Seurat object and annotations --------------------------------------
saveRDS(geo_so, file = 'results/rdata/geo_so_sct_integrated_with_catch.rds')
saveRDS(geo_catch, file = 'results/rdata/geo_catch.rds')
Now that we have generated reasonable annotations for our clusters, we can proceed with the step of differential expression which is essential to addressing our biological question for this experiment.
Next steps: Differential Expression
These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Previous lesson | Top of this lesson | Next lesson |
---|