
Clustering is a critical step in single-cell analysis. It often takes a few iterations to be confident that the clusters accurately represent the different cell types or cell states. As we get started, it’s useful to step back for a moment and consider some key questions.
Why do we cluster on Principal Components (PCs) instead of genes?
It’s much, much faster to cluster 15 PCs than 20k genes.
PCs naturally diminish noise and aggregate signal in expression patterns across related genes; this makes patterns clearer.
Clustering on genes means clustering in ~20k dimensions. That doesn’t work very well mathematically because in high-dimensional space the notion of distance breaks down. Put another way, in 20k dimensions everything is too far from everything else. This is sometimes called the curse of dimensionality.
Consider this 2D plot that shows two clusters. The clusters are well separated, i.e. the within-cluster distances are small but the between-cluster distances are large.

If we take these two clusters and start adding dimensions, the within-cluster and between-cluster distances begin far apart (i.e. the clusters are well separated in lower dimensions) but quickly converge in higher dimensions. Most points become nearly equidistant in higher dimensions; as distances become uniform, it becomes difficult to form distinct clusters.
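
The collapse of distances described above can be reproduced with a small simulation. The following is a minimal sketch using only the Python standard library; the cluster centers, noise level, point counts, and the helper names `mean_distance` and `separation_ratio` are illustrative assumptions, not part of this lesson's data or software.

```python
import math
import random


def mean_distance(points_a, points_b):
    """Average Euclidean distance over all pairs, skipping a point paired with itself."""
    total, count = 0.0, 0
    for a in points_a:
        for b in points_b:
            if a is b:
                continue
            total += math.dist(a, b)
            count += 1
    return total / count


def separation_ratio(n_dims, n_points=40, seed=0):
    """Between-cluster / within-cluster mean distance for two toy clusters.

    The clusters differ only in the first two dimensions. Every additional
    dimension is pure noise with the same distribution in both clusters.
    """
    rng = random.Random(seed)

    def make_cluster(center_x, center_y):
        cluster = []
        for _ in range(n_points):
            point = [rng.gauss(center_x, 1.0), rng.gauss(center_y, 1.0)]
            point += [rng.gauss(0.0, 1.0) for _ in range(n_dims - 2)]
            cluster.append(point)
        return cluster

    left = make_cluster(0.0, 0.0)
    right = make_cluster(6.0, 6.0)
    within = (mean_distance(left, left) + mean_distance(right, right)) / 2
    between = mean_distance(left, right)
    return between / within


# Same two clusters, embedded in more and more noisy dimensions.
ratios = {d: separation_ratio(d) for d in (2, 20, 200, 2000)}
for n_dims, ratio in ratios.items():
    print(f"{n_dims:>5} dims: between/within = {ratio:.2f}")
```

A ratio well above 1 means the clusters are easy to tell apart. As noise dimensions are added, the ratio drifts toward 1, which is the point where between-cluster and within-cluster distances become nearly indistinguishable.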

So, we reduce ~20k genes to a handful of PCs and then cluster those PCs. This avoids the curse of dimensionality while preserving most of the original signal.
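
To make this concrete, here is a toy sketch, assuming simulated data rather than the lesson's dataset, of how a single PC can pool a weak signal that is spread across several related genes. The first PC is estimated with plain power iteration, a stand-in chosen so the example needs no external libraries; it is not necessarily how your analysis software computes PCs. All names and sizes (`make_cell`, `first_pc`, `separation`, 10 program genes, 30 noise genes) are hypothetical.

```python
import math
import random

rng = random.Random(1)
N_PER_GROUP, N_PROGRAM, N_NOISE = 30, 10, 30  # assumed toy sizes


def make_cell(shift):
    """One simulated cell: 'program' genes share a shift, the rest are noise."""
    program = [rng.gauss(shift, 1.0) for _ in range(N_PROGRAM)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(N_NOISE)]
    return program + noise


# Two cell groups that differ modestly in each of the program genes.
cells = [make_cell(0.0) for _ in range(N_PER_GROUP)]
cells += [make_cell(1.5) for _ in range(N_PER_GROUP)]
labels = [0] * N_PER_GROUP + [1] * N_PER_GROUP

# Center each gene (column) before computing the PC.
n_genes = N_PROGRAM + N_NOISE
gene_means = [sum(c[g] for c in cells) / len(cells) for g in range(n_genes)]
centered = [[c[g] - gene_means[g] for g in range(n_genes)] for c in cells]


def first_pc(matrix, n_iter=200):
    """Leading principal axis via power iteration on X^T X (unit-length loadings)."""
    n_cols = len(matrix[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    for _ in range(n_iter):
        scores = [sum(x * w for x, w in zip(row, v)) for row in matrix]
        v = [sum(s * row[g] for s, row in zip(scores, matrix)) for g in range(n_cols)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return v


def separation(values, groups):
    """Absolute difference in group means divided by the pooled standard deviation."""
    a = [v for v, g in zip(values, groups) if g == 0]
    b = [v for v, g in zip(values, groups) if g == 1]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return abs(mean_a - mean_b) / math.sqrt((var_a + var_b) / 2)


loadings = first_pc(centered)
pc1_scores = [sum(x * w for x, w in zip(row, loadings)) for row in centered]

best_gene_sep = max(
    separation([row[g] for row in centered], labels) for g in range(n_genes)
)
pc1_sep = separation(pc1_scores, labels)
program_weight = sum(w * w for w in loadings[:N_PROGRAM])

print(f"best single-gene separation: {best_gene_sep:.2f}")
print(f"PC1 separation:              {pc1_sep:.2f}")
print(f"share of PC1 loading on program genes: {program_weight:.2f}")
```

In this toy setup no single gene separates the two groups as cleanly as PC1 does, because PC1 sums the small, consistent differences across the related genes while the unrelated noise largely averages out.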
How can such a small number of PCs faithfully represent so many genes? Aren’t we throwing away a ton of information?
| Genes look like this | A Principal Component shows this |
|---|---|
| ![]() | ![]() |
So, what’s the correct resolution and number of PCs?
That depends on how many cell types/cell states you expect. And the answer to that depends on:
For our sample data (bone biopsies from mice), we anticipate about 20 cell types.
Couldn’t the computer figure out the best PCs and resolution?
So how do we pick the right PCs and resolution to get the right clusters?
[Figure: Different cluster counts across resolutions and PCs]
What do the cluster UMAPs look like at different PCs and resolutions?
| resolution | 5 PCs | 10 PCs | 16 PCs | 30 PCs |
|---|---|---|---|---|
| 0.0 | | | | |
| 0.2 | | | | |
| 0.4 | | | | |
| 0.8 | | | | |
| 1.2 | | | | |
But these are all still just … clusters - how do I know if they are biologically accurate?
We need two more tools to validate these are good biological clusters:
We will cover both of these topics in the next sections.