Workflow Overview
Introduction
One of our goals in a single-cell analysis is to generate clusters
that reasonably approximate cell-types or sub-types of interest in our
samples before determining if there are differences in the proportions
of these populations or differences in gene expression within these
populations between experimental conditions.
|
A. The prior step used the filtered, normalized counts as input to a
prinicpal component analysis. The top principle components (PCs) are the
input to the clustering step. B. Cells are plotted in a
high-dimensional space (where the num dimesnions = num of included PCs,
e.g. 15 dimensions); cells with similar gene expression profiles are
closer to each other and are clustered together. C. The cells
clusters are projected from the high-dimensional space (e.g. 15
dimensions) down to two-dimensional space for visualization.
|
In this section, we will demonstrate how to generate clusters using
Seurat’s graph based clustering approach and visualize those clustering
assignments via a lower-dimensional projection of the full dataset.
Like other steps in our analysis, multiple parameters may need to be
tested and evaluated while we would expect that only the final would be
reported. Clustering is considered part of data exploration so an
iterative approach is reasonable, and often expected (source).
Objectives
- Understand the clustering process and input parameters
- Generate initial clusters using ``
- Visualize our clustering results
Clustering and projection
Now that we selected a number of PCs that we think are likely to
represent biological variation and integrated our data across
samples/batches, our next task is clustering.
An important aspect of parameter selection for clustering is to
understand the “resolution” of the underlying biology and your
experimental design:
- Is answering your biological question dependent on identifying rarer
cell types or specific subtypes?
- Or are broader cell-types more relevant to address your biological
question?
The OSCA book has a helpful
analogy comparing clustering to microscopy and points out that
“asking for an unqualified ‘best’ clustering is akin to asking for the
best magnification on a microscope without any context”.
To generate clusters, we will generate “communities” of cells using
the PCs we selected, before choosing a resolution parameter to divide
those communities into discrete clusters.
Clustering
Seurat uses a graph-based clustering approach to assign cells to
clusters using a distance metric based on the previously generated PCs,
with improvements based on work by (Xu
and Su 2015) and CyTOF data (Levine et
al. 2015) implemented in Seurat v3 and v5 and building on the
initial strategies for droplet-based single-cell technology (Macosko et
al. 2015) (source).
A key aspect of this process is that while the clusters are based on
similarity of expression between the cells, the clustering is based on
the selected PCs and not the full data set.
Image: kNN example - section on graph based
clustering (from Cambridge Bioinformatics course)
To briefly summarize, cells are embedded in a k-nearest neighbors
(kNN) graph (illustrated above) based on “the euclidean distance in PCA
space” between the cells and the edge weights between any two cells
(e.g. their “closeness”) is refined based on Jaccard similarity (source).
Additional context and sources for graph-based clustering
Cambridge
Bioinformatics’ Analysis of single cell RNA-seq data course
materials, the source of the image above, delves into kNN and other
graph based clustering methods in much greater detail, including
outlining possible downsides for these methods. To described kNN, we
have also drawn from the Ho
Lab’s description of this process for Seurat v3 as well as the HBC
materials on clustering and the OSCA
book’s more general overview of graph based clustering, which also
describes the drawbacks for these methods.
This process is performed with the FindNeighbors()
command,
using the number of principal components we selected in the previous
section.
The second step is to iteratively partition the kNN graph into
“cliques” or clusters using the Louvain modularity optimization
algorithm (for the default parameters), with the “granularity” of the
clusters set by a resolution
parameter (source).
Image: K-means clustering example (from
Cambridge Bioinformatics course)
We’ll use the FindClusters()
function,
selecting a resolution of 0.4
to start, although we could
also add other resolutions at this stage to look at in later steps. See
Waltman
and Jan van Eck (2013) for the underlying algorithms.
Again, how a “cell type” or “subtype” should be defined for your data
is important to consider in selecting a resolution - we’d start with a
higher resolution for smaller/more rare clusters and a lower resolution
for larger/more general clusters.
Then, when we look at the meta data we should see that cluster labels
have been added for each cell:
# Cluster PCAs ------------------------------------------------------------
# Create KNN graph with `FindNeighbors()`
geo_so = FindNeighbors(geo_so, dims = 1:pcs, reduction = 'integrated.sct.rpca')
# generate clusters
geo_so = FindClusters(geo_so, resolution = 0.4, cluster.name = 'integrated.sct.rpca.clusters')
# look at meta.data to see cluster labels
head(geo_so@meta.data)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 31560
Number of edges: 1047628
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9428
Number of communities: 20
Elapsed time: 9 seconds
orig.ident nCount_RNA nFeature_RNA day replicate percent.mt nCount_SCT nFeature_SCT integrated.sct.rpca.clusters seurat_clusters
HODay0replicate1_AAACCTGAGAGAACAG-1 HO.Day0.replicate1 10234 3226 Day0 replicate1 1.240962 6062 2867 1 1
HODay0replicate1_AAACCTGGTCATGCAT-1 HO.Day0.replicate1 3158 1499 Day0 replicate1 7.536415 4607 1509 1 1
HODay0replicate1_AAACCTGTCAGAGCTT-1 HO.Day0.replicate1 13464 4102 Day0 replicate1 3.112002 5314 2370 6 6
HODay0replicate1_AAACGGGAGAGACTTA-1 HO.Day0.replicate1 577 346 Day0 replicate1 1.559792 3877 1031 10 10
HODay0replicate1_AAACGGGAGGCCCGTT-1 HO.Day0.replicate1 1189 629 Day0 replicate1 3.700589 4166 915 0 0
HODay0replicate1_AAACGGGCAACTGGCC-1 HO.Day0.replicate1 7726 2602 Day0 replicate1 2.938131 5865 2588 1 1
The result of FindNeighbors()
adds graph information to
the graphs
slot:
Image: Schematic after FindNeighbors().
The result of FindClusters()
adds two columns to the
meta.data
table and changes the active.ident
to the “seurat_clusters” column. In other words, the cells now belong to
clusters rather than to their orig.ident
.
Image: Schematic after FindClusters().
Generally it’s preferable to err on the side of too many clusters, as
they can be combined manually in later steps. In our experience, this is
another parameter that often needs to be iteratively revised and
reviewed.
Resolution parameters recommendations
More details on choosing a clustering resolution
The Seurat
clustering tutorial recommends selecting a resolution between 0.4 -
1.2 for datasets of approximately 3k cells, while the HBC
course recommends 0.4-1.4 for 3k-5k cells. However, in our
experience reasonable starting resolutions can be very dataset
dependent.
Cluster plots
To visualize the cell clusters, we can use dimensionality reduction
techniques to visualize and explore our large, high-dimensional dataset.
Two popular methods that are supported by Seurat are t-distributed
stochastic neighbor embedding (t-SNE) and Uniform Manifold Approximation
and Projection (UMAP) techniques. These techniques allow us to visualize
our high-dimensional single-cell data in 2D space and see if cells
grouped together within graph-based clusters co-localize in these
representations (source).
While we unfortunately don’t have time to compare and contrast tSNE,
and UMAP, we would highly recommend this blog post
contrasting tSNE and UMAP for illustrative examples. The Seurat
authors additionally caution that while these methods are useful for
data exploration, to avoid drawing biological conclusions solely based
on these visualizations (source).
To start this process, we’ll use the RunUMAP()
function to
calculate the UMAP reduction for our data. Notice how the previous
dimensionality choices carry through the downstream analysis and that
the number of PCs selected in the previous steps are included as an
argument.
# Create UMAP reduction ---------------------------------------------------
geo_so = RunUMAP(geo_so, dims = 1:pcs, reduction = 'integrated.sct.rpca', reduction.name = 'umap.integrated.sct.rpca')
geo_so # Notice a third reduction has been added: `umap.integrated.sct.rpca`
An object of class Seurat
46955 features across 31560 samples within 2 assays
Active assay: SCT (20466 features, 3000 variable features)
3 layers present: counts, data, scale.data
1 other assay present: RNA
3 dimensional reductions calculated: unintegrated.sct.pca, integrated.sct.rpca, umap.integrated.sct.rpca
The resulting Seurat object now has an additional
umap.integrated.sct.rpca
in the reduction
slot:
Image: Schematic after RunUMAP().
Visualizing and evaluating clustering
After we generate the UMAP reduction, we can then visualize the
results using the DimPlot()
function,
labeling our plot by the auto generated seurat_clusters
that correspond to the most recent clustering results generated.
At this stage, we want to determine if the clusters look fairly well
separated, if they seem to correspond to how cells are grouped in the
UMAP, and if the number of clusters is generally aligned with the
resolution of our biological question. Again, if there are “too many”
clusters that’s not necessarily a problem.
We can also look at the same UMAP labeled by day
to
visually inspect if the UMAP structure corresponds to the
day
.
# Visualize UMAP cluster --------------------------------------------------
# cluster ID labels
post_integration_umap_plot_clusters = DimPlot(geo_so, group.by = 'seurat_clusters', label = TRUE, reduction = 'umap.integrated.sct.rpca') + NoLegend()
post_integration_umap_plot_clusters
ggsave(filename = 'results/figures/umap_integrated_sct_clusters.png', plot = post_integration_umap_plot_clusters, width = 6, height = 6, units = 'in')
# clusters with labels, split by condition
post_integration_umap_plot_split_clusters = DimPlot(geo_so, group.by = 'seurat_clusters', split.by = 'day', label = TRUE, reduction = 'umap.integrated.sct.rpca') + NoLegend()
post_integration_umap_plot_split_clusters
ggsave(filename = 'results/figures/umap_integrated_sct_split_clusters.png', plot = post_integration_umap_plot_clusters, width = 14, height = 6, units = 'in')
# UMAP with day labels (note - we added this column to the meta-data yesterday)
post_integration_umap_plot_day = DimPlot(geo_so, group.by = 'day', label = FALSE, reduction = 'umap.integrated.sct.rpca')
post_integration_umap_plot_day
ggsave(filename = 'results/figures/umap_integrated_sct_day.png', plot = post_integration_umap_plot_day, width = 8, height = 6, units = 'in')
Similar to the PCA plots, the day
labeled UMAP can tell
us if technical sources of variation might be driving or stratifying the
clusters, which would suggest that the normalization and integration
steps should be revisted before proceeding.
Another approach is to evaluate the number of cells per cluster using
the table()
function, split by day
or split by
orig.ident
to see if the individual samples are driving any
of the UMAP structure:
# Make tables of cell counts ----------------------------------------------
# cells per cluster, split by condition
table(geo_so@meta.data$day, geo_so@meta.data$integrated.sct.rpca.clusters)
# cells per cluster per sample
table(geo_so@meta.data$orig.ident, geo_so@meta.data$integrated.sct.rpca.clusters)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Day0 119 946 25 213 44 190 240 789 186 92 121 71 30 270 1 76 138 161 133 30
Day7 3561 1325 2919 2443 2693 1684 1237 133 597 552 542 684 650 351 570 334 97 285 93 105
Day21 1502 1320 39 325 148 484 375 449 453 452 266 166 68 123 2 135 271 40 194 18
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
HO.Day0.replicate1 31 279 10 72 20 58 64 196 50 18 23 22 11 71 0 14 22 48 34 10
HO.Day0.replicate2 15 132 8 33 7 35 24 165 29 24 20 8 7 52 1 15 25 13 13 3
HO.Day0.replicate3 48 329 2 66 4 56 90 221 64 26 36 20 5 75 0 25 41 59 52 3
HO.Day0.replicate4 25 206 5 42 13 41 62 207 43 24 42 21 7 72 0 22 50 41 34 14
HO.Day7.replicate1 521 343 1089 572 898 446 307 49 208 106 0 191 214 29 0 76 41 57 29 39
HO.Day7.replicate2 1269 285 600 658 552 522 276 12 153 163 0 197 159 242 0 101 18 87 15 15
HO.Day7.replicate3 990 530 767 759 565 454 422 63 145 187 0 194 158 57 0 129 28 96 37 45
HO.Day7.replicate4 781 167 463 454 678 262 232 9 91 96 542 102 119 23 570 28 10 45 12 6
HO.Day21.replicate1 382 415 13 85 50 152 140 131 134 131 68 56 16 47 0 40 71 13 58 2
HO.Day21.replicate2 198 233 7 64 25 85 51 140 71 68 56 25 14 24 1 33 53 5 28 5
HO.Day21.replicate3 218 199 4 65 23 97 57 117 85 78 44 24 18 20 0 22 56 7 38 6
HO.Day21.replicate4 704 473 15 111 50 150 127 61 163 175 98 61 20 32 1 40 91 15 70 5
Comparing to unintegrated data
If we had proceeded with our filtered data and only normalized our
data without doing any integration, including through the dimensionality
reduction and clustering steps and then labeled the cells with their
sample of origin, then we would see the following for our data:
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 31560
Number of edges: 998270
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9475
Number of communities: 22
Elapsed time: 9 seconds
In the plot at left, we see that while there are distinct clusters,
those clusters seem to stratified by day. This suggests that without
integration, these batch effects could skew the biological variability
in our data. While on the right, we see little stratification within our
clusters which means the integration seems to have removed those batch
effects.
Rewind: Pre-integration evaluation clustering and visualization
(code)
Prior to integration, we could follow the same steps we’ve just run
for the integrated to see if the resulting clusters tend to be
determined by sample or condition (in this case, the day):
geo_so = FindNeighbors(geo_so, dims = 1:pcs, assay = 'RNA', reduction = 'unintegrated.sct.pca', graph.name = c('RNA_nn', 'RNA_snn'))
geo_so = FindClusters(geo_so, resolution = 0.4, graph.name = 'RNA_snn', cluster.name = 'unintegrated.sct.clusters')
geo_so = RunUMAP(geo_so, dims = 1:pcs, reduction = 'unintegrated.sct.pca', reduction.name = 'umap.unintegrated.sct.pca')
The plots above were generated with:
# Code block - show unintegrated
pre_integration_umap_plot_orig.ident = DimPlot(geo_so, group.by = 'orig.ident', label = FALSE, reduction = 'umap.unintegrated.sct.pca')
ggsave(filename = 'results/figures/umap_unintegrated_sct_orig.ident.png', plot = pre_integration_umap_plot_orig.ident, width = 8, height = 6, units = 'in')
pre_integration_umap_plot_day = DimPlot(geo_so, group.by = 'day', label = FALSE, reduction = 'umap.unintegrated.sct.pca')
ggsave(filename = 'results/figures/umap_unintegrated_sct_day.png', plot = pre_integration_umap_plot_day, width = 8, height = 6, units = 'in')
Alternative clustering resolutions
While we show a single resolution, we can generate and plot multiple
resolutions iteratively and compare between them before selecting a
clustering result for the next steps:
resolutions = c(0.4, 0.8)
for(res in resolutions) {
message(res)
cluster_column = sprintf('SCT_snn_res.%s', res)
umap_file = sprintf('results/figures/umap_integrated_sct_%s.png', res)
geo_so = FindClusters(geo_so, resolution = res)
DimPlot(geo_so, group.by = cluster_column, label = TRUE, reduction = 'umap.integrated.sct.rpca') + NoLegend()
ggsave(filename = umap_file, width = 8, height = 7, units = 'in')
}
In the results, we’ll see multiple resolutions should now be added to
the metadata slot.
head(geo_so@meta.data)
Save our progress
Before moving on to our next section, we will output our updated
Seurat object to file:
# Save Seurat object ------------------------------------------------------
saveRDS(geo_so, file = 'results/rdata/geo_so_sct_clustered.rds')
Summary
In this section we:
- Generated cluster assignments for our cells using
FindNeighbors()
and FindClusters()
- Evaluated our initial clusters using
RunUMAP
dimensional reduction and visualization
Next steps: Marker genes
These materials have been adapted and extended from materials listed
above. These are open access materials distributed under the terms of
the Creative
Commons Attribution license (CC BY 4.0), which permits unrestricted
use, distribution, and reproduction in any medium, provided the original
author and source are credited.
---
title: "Clustering and Projection"
author: "UM Bioinformatics Core"
date: "`r Sys.Date()`"
output:
        html_document:
            includes:
                in_header: header.html
            theme: paper
            toc: true
            toc_depth: 4
            toc_float: true
            number_sections: false
            fig_caption: true
            markdown: GFM
            code_download: true
---

<style type="text/css">
body, td {
   font-size: 18px;
}
code.r{
  font-size: 12px;
}
pre {
  font-size: 12px
}

table.fig, th.fig, td.fig {
  border: 1px solid black;
  border-collapse: collapse;
  padding: 15px;
}

table.fig, th.fig, td.fig {
  border: 1px solid black;
  border-collapse: collapse;
  padding: 15px;
}
</style>

```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(lang = c("r", "markdown", "bash"), position = c("top", "right"))
```

```{r, include = FALSE}
source("../bin/chunk-options.R")
knitr_fig_path("05-ProjectionAndClustering/05-")
```

# Workflow Overview {.unlisted .unnumbered}

<br/>
<img src="images/wayfinder/wayfinder.png" alt="wayfinder" style="height: 400px;"/>
<br/>
<br/>

# Introduction

<!--- General goal: to generate clusters that reasonably approximate cell-types or sub-types of interest --->
One of our goals in a single-cell analysis is to generate clusters that reasonably approximate cell-types or sub-types of interest in our samples before determining if there are differences in the proportions of these populations or differences in gene expression within these populations between experimental conditions.

<table class='fig'>
<tr class='fig'><td class='fig'>![](images/graphical_abstracts/graphical_abstract_cluster_projection.png)</td></tr>
<tr class='fig'><td class='fig'>A. The prior step used the filtered, normalized counts as input to a prinicpal component analysis. The top principle components (PCs) are the input to the clustering step. <br/>
B. Cells are plotted in a high-dimensional space (where the num dimesnions = num of included PCs, e.g. 15 dimensions); cells with similar gene expression profiles are closer to each other and are clustered together. <br/>
C. The cells clusters are projected from the high-dimensional space (e.g. 15 dimensions) down to two-dimensional space for visualization.
</td></tr>
</table>
<br/>


In this section, we will demonstrate how to generate clusters using Seurat's graph based clustering approach and visualize those clustering assignments via a lower-dimensional projection of the full dataset.

Like other steps in our analysis, multiple parameters may need to be tested and evaluated while we would expect that only the final would be reported. Clustering is considered part of data exploration so an iterative approach is reasonable, and often expected ([source](https://bioconductor.org/books/3.15/OSCA.basic/clustering.html)). 


## Objectives

- Understand the clustering process and input parameters
- Generate initial clusters using `` 
- Visualize our clustering results

---

```{r, read_rds_hidden, echo = FALSE, warning = FALSE, message = FALSE}
if(!exists('geo_so')) {
  library(Seurat)
  library(BPCells)
  library(tidyverse)

  options(future.globals.maxSize = 1e9)

  geo_so = readRDS('results/rdata/geo_so_sct_integrated.rds')
}
```

# Clustering and projection

Now that we selected a number of PCs that we think are likely to represent biological variation and integrated our data across samples/batches, our next task is clustering. 


An important aspect of parameter selection for clustering is to understand the "resolution" of the underlying biology and your experimental design:   

- Is answering your biological question dependent on identifying rarer cell types or specific subtypes?   
- Or are broader cell-types more relevant to address your biological question?   

The OSCA book has a [helpful analogy comparing clustering to microscopy](https://bioconductor.org/books/3.15/OSCA.basic/clustering.html#overview-1) and points out that "asking for an unqualified 'best' clustering is akin to asking for the best magnification on a microscope without any context". 

To generate clusters, we will generate "communities" of cells using the PCs we selected, before choosing a resolution parameter to divide those communities into discrete clusters.

<!--- Contrast the previous dimensionality reduction versus nearest neighbors clustering and plotting the cells in lower dimensionality with the cluster labels? --->

## Clustering

Seurat uses a graph-based clustering approach to assign cells to clusters using a distance metric based on the previously generated PCs, with improvements based on work by ([Xu and Su 2015](https://academic.oup.com/bioinformatics/article/31/12/1974/214505)) and CyTOF data ([Levine et al. 2015](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4508757/)) implemented in Seurat v3 and v5 and building on the initial strategies for droplet-based single-cell technology ([Macosko et al. 2015](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4481139/))  ([source](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html)). A key aspect of this process is that while the clusters are based on similarity of expression between the cells, the clustering is based on the selected PCs and not the full data set. 

<center>
![Image: kNN example - section on graph based clustering (from Cambridge Bioinformatics course)](./images/curriculum/05-ProjectionAndClustering/BioCellGene-NearestNeighborNetworks.png)
</center>

To briefly summarize, cells are embedded in a k-nearest neighbors (kNN) graph (illustrated above) based on "the euclidean distance in PCA space" between the cells and the edge weights between any two cells (e.g. their "closeness") is refined based on Jaccard similarity ([source](https://hbctraining.github.io/scRNA-seq_online/lessons/07_SC_clustering_cells_SCT.html)).

<details>
    <summary>*Additional context and sources for graph-based clustering*</summary>
    [Cambridge Bioinformatics' Analysis of single cell RNA-seq data course materials](https://biocellgen-public.svi.edu.au/mig_2019_scrnaseq-workshop/clustering-and-cell-annotation.html), the source of the image above, delves into kNN and other graph based clustering methods in much greater detail, including outlining possible downsides for these methods. 
     To described kNN, we have also drawn from the [Ho Lab's description of this process for Seurat v3](https://holab-hku.github.io/Fundamental-scRNA/downstream.html#perform-linear-dimensional-reduction) as well as the [HBC materials on clustering](https://hbctraining.github.io/scRNA-seq_online/lessons/07_SC_clustering_cells_SCT.html) and the [OSCA book's more general overview of graph based clustering](https://bioconductor.org/books/3.15/OSCA.basic/clustering.html#clustering-graph), which also describes the drawbacks for these methods.
</details>
<br>

This process is performed with the `FindNeighbors()` [command](https://satijalab.org/seurat/reference/findneighbors), using the number of principal components we selected in the previous section.

The second step is to iteratively partition the kNN graph into "cliques" or clusters using the Louvain modularity optimization algorithm (for the default parameters), with the "granularity" of the clusters set by a `resolution` parameter ([source](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html)).

![Image: K-means clustering example (from Cambridge Bioinformatics course)](./images/curriculum/05-ProjectionAndClustering/HBC-07-k-meansFigure.png)

We'll use the `FindClusters()` [function](https://satijalab.org/seurat/reference/findclusters), selecting a resolution of `0.4` to start, although we could also add other resolutions at this stage to look at in later steps. See [Waltman and Jan van Eck (2013)](https://link.springer.com/article/10.1140/epjb/e2013-40829-0) for the underlying algorithms.

Again, how a “cell type” or “subtype” should be defined for your data is important to consider in selecting a resolution - we'd start with a higher resolution for smaller/more rare clusters and a lower resolution for larger/more general clusters.  

Then, when we look at the meta data we should see that cluster labels have been added for each cell:

```{r, find_neighbors, eval = FALSE}
# Cluster PCAs ------------------------------------------------------------
# Create KNN graph with `FindNeighbors()`
geo_so = FindNeighbors(geo_so, dims = 1:pcs, reduction = 'integrated.sct.rpca')

# generate clusters
geo_so = FindClusters(geo_so, resolution = 0.4, cluster.name = 'integrated.sct.rpca.clusters')

# look at meta.data to see cluster labels
head(geo_so@meta.data)
```

```{r, find_neighbors_hidden, cache = TRUE, cache.lazy = FALSE, warning = FALSE, message = FALSE, echo=FALSE}
# Cluster PCAs ------------------------------------------------------------
# Create KNN graph with `FindNeighbors()`
geo_so = FindNeighbors(geo_so, dims = 1:pcs, reduction = 'integrated.sct.rpca')

# generate clusters
geo_so = FindClusters(geo_so, resolution = 0.4, cluster.name = 'integrated.sct.rpca.clusters')

# look at meta.data to see cluster labels
head(geo_so@meta.data)
```

The result of `FindNeighbors()` adds graph information to the `graphs` slot:

![Image: Schematic after FindNeighbors().](images/seurat_schematic/Slide8.png)

<br>

The result of `FindClusters()` adds two columns to the `meta.data` table and changes the `active.ident` to the "seurat_clusters" column. In other words, the cells now belong to clusters rather than to their `orig.ident`.

![Image: Schematic after FindClusters().](images/seurat_schematic/Slide9.png)

Generally it's preferable to err on the side of too many clusters, as they can be combined manually in later steps. In our experience, this is another parameter that often needs to be iteratively revised and reviewed. 

### Resolution parameters recommendations
<details>
    <summary>*More details on choosing a clustering resolution*</summary>
    The [Seurat clustering tutorial](https://holab-hku.github.io/Fundamental-scRNA/downstream.html#perform-linear-dimensional-reduction) recommends selecting a resolution between 0.4 - 1.2 for datasets of approximately 3k cells, while the [HBC course](https://hbctraining.github.io/scRNA-seq_online/lessons/07_SC_clustering_cells_SCT.html) recommends 0.4-1.4 for 3k-5k cells. However, in our experience reasonable starting resolutions can be very dataset dependent.
</details>
<br>

# Cluster plots 

To visualize the cell clusters, we can use dimensionality reduction techniques to visualize and explore our large, high-dimensional dataset. Two popular methods that are supported by Seurat are t-distributed stochastic neighbor embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) techniques. These techniques allow us to visualize our high-dimensional single-cell data in 2D space and see if cells grouped together within graph-based clusters co-localize in these representations ([source](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html#run-non-linear-dimensional-reduction-umaptsne)).

While we unfortunately don't have time to compare and contrast tSNE, and UMAP, we would highly recommend [this blog post contrasting tSNE and UMAP](https://pair-code.github.io/understanding-umap/) for illustrative examples. The Seurat authors additionally caution that while these methods are useful for data exploration, to avoid drawing biological conclusions solely based on these visualizations ([source](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html#run-non-linear-dimensional-reduction-umaptsne)).

To start this process, we'll use the `RunUMAP()` [function](https://satijalab.org/seurat/reference/runumap) to calculate the UMAP reduction for our data. Notice how the previous dimensionality choices carry through the downstream analysis and that the number of PCs selected in the previous steps are included as an argument.

```{r, run_umap, cache = TRUE, cache.lazy = FALSE, warning = FALSE, message = FALSE}
# Create UMAP reduction ---------------------------------------------------
geo_so = RunUMAP(geo_so, dims = 1:pcs, reduction = 'integrated.sct.rpca', reduction.name = 'umap.integrated.sct.rpca')
geo_so # Notice a third reduction has been added: `umap.integrated.sct.rpca`
```

The resulting Seurat object now has an additional `umap.integrated.sct.rpca` in the `reduction` slot:

![Image: Schematic after RunUMAP().](images/seurat_schematic/Slide10.png)

# Visualizing and evaluating clustering

<!--- How many clusters should I get and how do I adjust the number? --->

<!--- Add example of changing resolution?--->

After we generate the UMAP reduction, we can then visualize the results using the `DimPlot()` [function](https://satijalab.org/seurat/reference/dimplot), labeling our plot by the auto generated `seurat_clusters` that correspond to the most recent clustering results generated.

At this stage, we want to determine if the clusters look fairly well separated, if they seem to correspond to how cells are grouped in the UMAP, and if the number of clusters is generally aligned with the resolution of our biological question. Again, if there are "too many" clusters that's not necessarily a problem.

We can also look at the same UMAP labeled by `day` to visually inspect if the UMAP structure corresponds to the `day`.

```{r, umap_by_clusters, fig.show='hold'}
# Visualize UMAP cluster --------------------------------------------------
# cluster ID labels
post_integration_umap_plot_clusters = DimPlot(geo_so, group.by = 'seurat_clusters', label = TRUE, reduction = 'umap.integrated.sct.rpca') + NoLegend()
post_integration_umap_plot_clusters

ggsave(filename = 'results/figures/umap_integrated_sct_clusters.png', plot = post_integration_umap_plot_clusters, width = 6, height = 6, units = 'in')

# clusters with labels, split by condition
post_integration_umap_plot_split_clusters = DimPlot(geo_so, group.by = 'seurat_clusters', split.by = 'day', label = TRUE, reduction = 'umap.integrated.sct.rpca') + NoLegend()
post_integration_umap_plot_split_clusters

ggsave(filename = 'results/figures/umap_integrated_sct_split_clusters.png', plot = post_integration_umap_plot_clusters, width = 14, height = 6, units = 'in')

# UMAP with day labels (note - we added this column to the meta-data yesterday)
post_integration_umap_plot_day = DimPlot(geo_so, group.by = 'day', label = FALSE, reduction = 'umap.integrated.sct.rpca')
post_integration_umap_plot_day

ggsave(filename = 'results/figures/umap_integrated_sct_day.png', plot = post_integration_umap_plot_day, width = 8, height = 6, units = 'in')
```

Similar to the PCA plots, the `day` labeled UMAP can tell us if technical sources of variation might be driving or stratifying the clusters, which would suggest that the normalization and integration steps should be revisted before proceeding.

Another approach is to evaluate the number of cells per cluster using the `table()` function, split by `day` or split by `orig.ident` to see if the individual samples are driving any of the UMAP structure:

```{r, table_by_day, eval = FALSE}
# Make tables of cell counts ----------------------------------------------
# cells per cluster, split by condition
table(geo_so@meta.data$day, geo_so@meta.data$integrated.sct.rpca.clusters)

# cells per cluster per sample
table(geo_so@meta.data$orig.ident, geo_so@meta.data$integrated.sct.rpca.clusters)
```

```{r, table_by_day_hidden, echo = FALSE}
# Make tables of cell counts ----------------------------------------------
# cells per cluster, split by condition
table(geo_so@meta.data$day, geo_so@meta.data$integrated.sct.rpca.clusters)

# cells per cluster per sample
table(geo_so@meta.data$orig.ident, geo_so@meta.data$integrated.sct.rpca.clusters)
```

<!-- consider adding later 
Add DimPlot for % mitochondria to see if high % are across dataset or limited to one/few clusters (related to what Olivia talked about during Day 1)

# Other approaches for visualizing scRNA-sq data
e.g. code for tSNE visualization 
-->

# Comparing to unintegrated data

If we had proceeded with our filtered data and only normalized our data without doing any integration, including through the dimensionality reduction and clustering steps and then labeled the cells with their sample of origin, then we would see the following for our data:

<!--Add updated example of UMAP / sample labels for unintegrated data - need to check with Raymond regarding how best to do this in place --> 

```{r, run_umap_unintegrated, echo = FALSE, cache = TRUE, cache.lazy = FALSE, warning = FALSE, message = FALSE, out.height='60%'}
# Create KNN graph with `FindNeighbors()` for unintegrated data
geo_so = FindNeighbors(geo_so, dims = 1:pcs, reduction = 'unintegrated.sct.pca', verbose=FALSE)

# generate clusters for unintegrated data
geo_so = FindClusters(geo_so, resolution = 0.4, cluster.name = 'unintegrated.sct.pca.clusters')

geo_so = RunUMAP(geo_so, dims = 1:pcs, reduction = 'unintegrated.sct.pca', reduction.name = 'umap.unintegrated.sct.pca')

pre_integration_umap_plot_day = DimPlot(geo_so, group.by = 'day', label = FALSE, reduction = 'umap.unintegrated.sct.pca')

print(pre_integration_umap_plot_day + post_integration_umap_plot_day)
```

In the plot at left, we see that while there are distinct clusters, those clusters seem to stratified by day. This suggests that without integration, these batch effects could skew the biological variability in our data. While on the right, we see little stratification within our clusters which means the integration seems to have removed those batch effects.

<!-- Edit section to incorporate --->

<details>
<summary>**Rewind: Pre-integration evaluation clustering and visualization (code)**</summary>
Prior to integration, we could follow the same steps we've just run for the integrated to see if the resulting clusters tend to be determined by sample or condition (in this case, the day):

```{r, eval=FALSE}
geo_so = FindNeighbors(geo_so, dims = 1:pcs, assay = 'RNA', reduction = 'unintegrated.sct.pca', graph.name = c('RNA_nn', 'RNA_snn'))
geo_so = FindClusters(geo_so, resolution = 0.4, graph.name = 'RNA_snn', cluster.name = 'unintegrated.sct.clusters')
geo_so = RunUMAP(geo_so, dims = 1:pcs, reduction = 'unintegrated.sct.pca', reduction.name = 'umap.unintegrated.sct.pca')
```
	
The plots above were generated with:

```{r, eval=FALSE}
# Code block - show unintegrated
pre_integration_umap_plot_orig.ident = DimPlot(geo_so, group.by = 'orig.ident', label = FALSE, reduction = 'umap.unintegrated.sct.pca')
ggsave(filename = 'results/figures/umap_unintegrated_sct_orig.ident.png', plot = pre_integration_umap_plot_orig.ident, width = 8, height = 6, units = 'in')

pre_integration_umap_plot_day = DimPlot(geo_so, group.by = 'day', label = FALSE, reduction = 'umap.unintegrated.sct.pca')
ggsave(filename = 'results/figures/umap_unintegrated_sct_day.png', plot = pre_integration_umap_plot_day, width = 8, height = 6, units = 'in')
```
</details>
<br>
<br>

<details>
<summary>**Alternative clustering resolutions**</summary>
While we show a single resolution, we can generate and plot multiple resolutions iteratively and compare between them before selecting a clustering result for the next steps:

```{r, eval=FALSE}
resolutions = c(0.4, 0.8)

for(res in resolutions) {
    message(res)

    cluster_column = sprintf('SCT_snn_res.%s', res)
    umap_file = sprintf('results/figures/umap_integrated_sct_%s.png', res)

    geo_so = FindClusters(geo_so, resolution = res)

    DimPlot(geo_so, group.by = cluster_column, label = TRUE, reduction = 'umap.integrated.sct.rpca') + NoLegend()
    ggsave(filename = umap_file, width = 8, height = 7, units = 'in')
}
```

In the results, we'll see multiple resolutions should now be added to the metadata slot.

```{r, eval=FALSE}
head(geo_so@meta.data)
```
</details>
<br>
<br>

# Save our progress

Before moving on to our next section, we will output our updated Seurat object to file:

```{r, save_rds_hidden, echo = FALSE}
if(!file.exists('results/rdata/geo_so_sct_clustered.rds')) {
  saveRDS(geo_so, file = 'results/rdata/geo_so_sct_clustered.rds')
}
```

```{r, eval=FALSE}
# Save Seurat object ------------------------------------------------------
saveRDS(geo_so, file = 'results/rdata/geo_so_sct_clustered.rds')
```

# Summary

In this section we:

- Generated cluster assignments for our cells using `FindNeighbors()` and `FindClusters()`
- Evaluated our initial clusters using `RunUMAP` dimensional reduction and visualization

Next steps: Marker genes

----

These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the [Creative Commons Attribution license (CC BY 4.0)](http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

<br/>
<br/>
<hr/>
| [Previous lesson](04-PCAandIntegration.html) | [Top of this lesson](#top) | [Next lesson](06-MarkerVisualization.html) |
| :--- | :----: | ---: |
