Why do we cluster on Principal Components (PCs) instead of genes?


How can such a small number of PCs faithfully represent so many genes? Aren’t we throwing away a ton of information?
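One intuition is that genes do not vary independently: many of them rise and fall together, so a few axes of shared variation can summarize most of the signal. The sketch below illustrates this on synthetic data only. The two hidden "programs", the noise level, and all variable names are assumptions made for illustration, not values from this lesson's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 300, 40

# ASSUMPTION: two hidden "programs" (hypothetical, e.g. cell identity and
# cell cycle) drive the expression of every gene.
programs = rng.normal(size=(n_cells, 2))

# Each gene follows exactly one program: first half program 0, second half program 1.
loadings = np.zeros((2, n_genes))
loadings[0, : n_genes // 2] = 1.0
loadings[1, n_genes // 2 :] = 1.0

# Per-gene noise that is unrelated to either program.
noise = 0.3 * rng.normal(size=(n_cells, n_genes))
expression = programs @ loadings + noise

# PCA via SVD of the gene-centered matrix.
centered = expression - expression.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
variance_per_pc = singular_values**2 / (n_cells - 1)
explained = variance_per_pc / variance_per_pc.sum()

top2 = explained[:2].sum()
print(f"PCs 1-2 capture {top2:.0%} of the variance across {n_genes} genes")
```

In this toy setup most of what is left out by keeping only the first two PCs is the per-gene noise. Real expression data is messier, which is why more than two PCs are normally kept.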

[Figure: left panel, "Genes look like this"; right panel, "A Principal Component shows this"]

So, what’s the correct resolution and number of PCs?


Couldn’t the computer figure out the best PCs and resolution?


So how do we pick the right PCs and resolution?

Different cluster counts across resolutions and PCs
  • Each dot represents the number of clusters produced with a specific resolution and number of PCs.
  • Each colored line is a different resolution.
  • Each column is a different number of PCs.
  • Note that the number of PCs has only a modest effect on the number of clusters, while resolution has a strong effect.
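A grid like the one in this plot can be produced by re-running the clustering once per combination of PCs and resolution and counting the distinct labels each time. The sketch below shows only that loop. `sweep_cluster_counts` and `toy_cluster_fn` are hypothetical names, and `toy_cluster_fn` is a stand-in that makes the example runnable, not a real clustering algorithm; in practice it would be replaced by a wrapper around whatever clustering call the analysis uses.

```python
from itertools import product


def sweep_cluster_counts(cluster_fn, n_pcs_values, resolutions):
    """Run cluster_fn on every (n_pcs, resolution) pair and count distinct labels."""
    counts = {}
    for n_pcs, resolution in product(n_pcs_values, resolutions):
        labels = cluster_fn(n_pcs=n_pcs, resolution=resolution)
        counts[(n_pcs, resolution)] = len(set(labels))
    return counts


# STAND-IN ONLY (an assumption for illustration): returns one label per cell,
# with more distinct labels at higher resolution. It does no real clustering.
def toy_cluster_fn(n_pcs, resolution, n_cells=100):
    n_clusters = 1 + round(resolution * 10) + n_pcs // 10
    return [cell % n_clusters for cell in range(n_cells)]


counts = sweep_cluster_counts(
    toy_cluster_fn,
    n_pcs_values=[5, 10, 16, 30],
    resolutions=[0.2, 0.4, 0.8, 1.2],
)
for (n_pcs, resolution), n in sorted(counts.items()):
    print(f"pcs={n_pcs:>2} res={resolution} clusters={n}")
```

Keeping the clustering step behind a single function argument means the same loop works regardless of which toolkit produces the labels.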

What do the cluster UMAPs look like at different PCs and resolutions?

Number of clusters in each UMAP panel (rows: resolution; columns: number of PCs)

resolution   pcs=5   pcs=10   pcs=16   pcs=30
0.0          0       0        0        0
0.2          11      14       13       16
0.4          17      18       24       24
0.8          28      27       30       33
1.2          32      36       36       38
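The cluster counts listed above can be used to check how much each parameter matters. The short sketch below copies those counts and compares the spread (maximum minus minimum) when only the number of PCs changes against the spread when only the resolution changes. The resolution 0 row is left out because it reports zero clusters everywhere; `spread` is a helper name introduced here for illustration.

```python
# Cluster counts copied from the grid above: counts[resolution][n_pcs]
counts = {
    0.2: {5: 11, 10: 14, 16: 13, 30: 16},
    0.4: {5: 17, 10: 18, 16: 24, 30: 24},
    0.8: {5: 28, 10: 27, 16: 30, 30: 33},
    1.2: {5: 32, 10: 36, 16: 36, 30: 38},
}


def spread(values):
    """Difference between the largest and smallest value."""
    values = list(values)
    return max(values) - min(values)


# Vary the number of PCs while holding resolution fixed.
spread_across_pcs = {res: spread(row.values()) for res, row in counts.items()}

# Vary the resolution while holding the number of PCs fixed.
pcs_values = [5, 10, 16, 30]
spread_across_res = {
    n_pcs: spread(counts[res][n_pcs] for res in counts) for n_pcs in pcs_values
}

print("spread across PCs, per resolution:", spread_across_pcs)
print("spread across resolutions, per PC count:", spread_across_res)
```

For these numbers, changing the number of PCs from 5 to 30 shifts the cluster count by 5 to 7, while changing the resolution from 0.2 to 1.2 shifts it by 21 to 23.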

But these are all still just … clusters. How do I know if they are biologically accurate?

