
Making clusters is a very important step in single-cell analysis. It
often takes a few iterations be confident that the clusters accurately
represent the different cell types or cell states. As we get started,
it’s useful to step back for a moment and consider some key
questions.
PCs vs genes?
Why do we cluster on Principle Components (PC) instead of
genes?
- Most genes show little variance at all; those that do often behave
in concert (regulating up and down in a coherent experssion program).
Because of this behavior, clustering directly on genes doesn’t
work very well mathematically. (Because in 20k dimensions, the
notion of distance breaks down; everything is too far from everything
else.)
- PCs naturally diminish noise and aggregate signal in expression
patterns across related genes; this makes patterns clearer.
- It’s much much faster to cluster 15 PCs than 20k
genes.
Just a few PCs?
How can such a small number of PCs faithfully represent so
many genes? Aren’t we throwing away a ton of information?
- Selecting a few PCs helps by bringing the most important parts of
the picture into focus. Remember that each PCs represents a gene
expression pattern across many genes.
- Recall from the earlier elbow
plot that almost all the variance (signal) in the dataset is
captured in the first handful of PCs. In our example dataset, the first
50 PCs capture over 99% of the cumulative variance.
Resolution? PCs?
So, what’s the correct resolution and number of
PCs?
That depends on how many cell types/cell states do you
expect. And the answer to that depends on:
- the tissue/bio-fluid of origin
- the quality of the sample / suspension
- your experimental design
- your research question
- the number of cells profiled
- the number of genes detected
- the range and specificity of cellular expression programs
For our sample data (bone biopsies from mice), we anticipate
about 20 cell types.
A little help?
Couldn’t the computer figure out the best PCs and
resolution?
- The computer could try a bunch of PCs and resolutions to get the
right number of clusters - and those clusters might look good
numerically. But they wouldn’t be the best representation of the
different cells types because the algorithm doesn’t understand
the cell biology.
Resolution, PCs, and clusters
So how do we pick the right PCs and resolution to get the
right clusters?
- You will want to iterate through a few combinations of PCs and
resolutions.
- It helps to know how PCs and resolution work together to influence
the number of clusters.
- We applied the clustering and UMAP code blocks from earlier and tried a bunch of
clusterings each with a different number of PCs and a different
resolution.
Different cluster counts across resolutions and PCs
|
|
- Each dot represents the number of clusters produced with a specific
resolution and number of PCs.
- Each colored line is a different resolution.
- Each column is a different num of PCs.
- Note that PCs have a modest effect on the number of clusters while
resolution has a strong effect on the number of clusters.
|
Don’t stop yet!
But these are all still just … clusters - how do I know if
they are biologically accurate?