Workflow Overview
Introduction
As discussed at the start of the workshop, single-cell experiments
using 10x Chromium instrument aim to have droplets with one cell plus
one bead. However this is an inherently imperfect process and there are
other important considerations like how healthy or intact the cell was
at the time of measurement.
In this section, our goal is to use filtering thresholds to remove
“cells” that were poorly measured, not cells at all, or included more
than one cell.
|
The Cell Ranger barcode feature matrix outputs (A) is a mix of good and
poor quality cells (B). We can use expression patterns to further
visualize and filter to the subset of putative healthy cells (C) to
improve downstream analysis.
|
Similar to many other areas of research, there are often gaps
between how single-cell data is presented versus the reality of running
an analysis. For example, only the final filtering thresholds might be
reported in a paper but our process for choosing those is likely to be
more iterative and include some trial and error.
Objectives
- Discuss QC measures and learn how to calculate and plot them.
- Discuss cell-filtering approaches and apply them to our
dataset.
Adding metadata
We’re going to alter and add some columns to meta.data
to ease some downstream analysis and plotting steps. This will often be
necessary, as sample names contain combined information about
phenotypes. Let’s take a look at the first few rows again, to reacquaint
ourselves.
##### Day 1 - Secondary QC and filtering
# Examine Seurat metdata ----------------------------------------------
head(geo_so@meta.data)
orig.ident nCount_RNA nFeature_RNA
HODay0replicate1_AAACCTGAGAGAACAG-1 HODay0replicate1 10234 3226
HODay0replicate1_AAACCTGGTCATGCAT-1 HODay0replicate1 3158 1499
HODay0replicate1_AAACCTGTCAGAGCTT-1 HODay0replicate1 13464 4102
HODay0replicate1_AAACGGGAGAGACTTA-1 HODay0replicate1 577 346
HODay0replicate1_AAACGGGAGGCCCGTT-1 HODay0replicate1 1189 629
HODay0replicate1_AAACGGGCAACTGGCC-1 HODay0replicate1 7726 2602
We can add arbitrary per-cell information to this table such as:
- Summary statistics, such as percent mitochondrial reads for each
cell.
- Batch, condition, etc. for each cell.
- Cluster membership for each cell.
- Cell cycle phase for each cell.
- Other custom annotations for each cell.
The following code chunk will create new columns from the
orig.ident
column, one for the day
and one for
the replicate
. We’ll also use factors to make the days
appear in the correct order (by default R will order them Day0, Day21,
Day7).
# Make metadata more granular ------------------------------------------
# (So it's easier for us to understand)
orig.ident_levels = apply(
expand.grid(c('HO'), c('replicate1', 'replicate2', 'replicate3', 'replicate4'), c('Day0','Day7','Day21')),
MARGIN = 1, FUN = function(row){paste(row[1], row[3], row[2], sep = '.')})
tmp_meta = geo_so@meta.data
tmp_meta$orig.ident = gsub('HO', 'HO.', tmp_meta$orig.ident)
tmp_meta$orig.ident = gsub('rep', '.rep', tmp_meta$orig.ident)
tmp_meta$orig.ident = factor(tmp_meta$orig.ident, levels = orig.ident_levels)
# Add day information
tmp_meta$day = factor(str_split(tmp_meta$orig.ident, pattern = '[.]', simplify = TRUE)[,2], levels = c('Day0', 'Day7', 'Day21'))
# Add replicate column
tmp_meta$replicate = str_split(tmp_meta$orig.ident, pattern = '[.]', simplify = TRUE)[,3]
# Assign the updated metadata to that in the Seurat object
geo_so@meta.data = tmp_meta
# Set the Idents(geo_so) to the new orig.ident
Idents(geo_so) = 'orig.ident'
Taking a look at the result, we see orig.ident
looks a
little different, and we have new day
and
replicate
columns. Note that there are many ways to have
accomplished this, the above code is just one route.
# Re-examine Seurat metdata -------------------------------------------
head(geo_so@meta.data)
orig.ident nCount_RNA nFeature_RNA day replicate
HODay0replicate1_AAACCTGAGAGAACAG-1 HO.Day0.replicate1 10234 3226 Day0 replicate1
HODay0replicate1_AAACCTGGTCATGCAT-1 HO.Day0.replicate1 3158 1499 Day0 replicate1
HODay0replicate1_AAACCTGTCAGAGCTT-1 HO.Day0.replicate1 13464 4102 Day0 replicate1
HODay0replicate1_AAACGGGAGAGACTTA-1 HO.Day0.replicate1 577 346 Day0 replicate1
HODay0replicate1_AAACGGGAGGCCCGTT-1 HO.Day0.replicate1 1189 629 Day0 replicate1
HODay0replicate1_AAACGGGCAACTGGCC-1 HO.Day0.replicate1 7726 2602 Day0 replicate1
Percent mitochondrial reads
The PercentageFeatureSet()
function enables us to
quickly determine the counts belonging to a subset of the possible
features for each cell. Since mitochondrial transcripts in mouse begin
with “mt”, we will use that pattern to count the percentage of reads
coming from mitochondrial transcripts.
# Consider mitochondrial transcripts ----------------------------------
# We use "mt" because this is mouse, depending on the organism, this might need to be changed
geo_so$percent.mt = PercentageFeatureSet(geo_so, pattern = '^mt-')
# Use summary() to quickly check the range of values
summary(geo_so$percent.mt)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.817 2.582 5.281 3.865 98.244
Just looking at the summary, we can see that there are some cells
with a high percentage of mitochondrial reads.
Image: Schematic after altering some meta data
columns.
Quality Metrics
Cell Ranger is a first-pass filter to determine what is a “cell” and
what is not. It only considers one sample at a time, and does not
consider the cells relative to one another.
Let’s dig deeper to determine when a droplet might contain two cells,
a very stressed cell, or some technical issue in the library
preparation. We will use three metrics to determine low-quality cells
based on their expression profiles (reference).
- The total number of UMIs detected. Cells with a small number of UMIs
detected may indicate loss of RNA during library preparation via cell
lysis or inefficient cDNA capture / amplification. Cells with relatively
high number of UMIs detected may indicate a doublet.
- The number of expressed features, defined as number of genes with
non-zero counts. Cells with very few measured genes are likely to be of
low-quality, and may distort downstream variance estimation or dimension
reduction steps.
- The proportion of reads mapped to the mitochondrial genome. High
proportions of mitochondrial transcripts may indicate a damaged cell,
the measure of which may also distort downstream analysis steps.
The number of UMIs detected (nCount
) and number of
expressed features (nFeature
) are already given in the meta
data table.
Why total UMIs instead of total reads?
Since a single-cell inherently contains a limited amount of RNA
molecules, a higher amount of PCR amplification is required to generate
the final sequencing library.
Since PCR can skew proportions of initial input materials, specific
sequences are included in the initial capture probes called unique
molecule identifiers (UMIs). As each initial probe has a different UMI
sequence, each RNA captured will be tagged with a different UMI, which
allows those initial RNAs and subsequent PCR duplicates to be identified
and duplicates collapsed as part of the initial processing by
CellRanger.
Other meanings of nFeatures
For other single-cell data types, nFeatures
would
represent what’s being measured in that experiment. For single-cell
ATAC-seq, nFeatures
would represents the total number of
peaks (e.g. accessible areas of DNA) per cell.
Cell counts
Cell counts per sample (based on the numbrer of unique cellular
barcodes detected by Cell Ranger) can indicate if an entire sample was
of poor quality. The table below is the number of cells called by Cell
Ranger, with the added filter from CreateSeurateObject()
in
the previous lesson: min.cells = 1, min.features = 50
.
Recall, this means a gene is removed if it is expressed in 1 or fewer
cells, and a cell is removed if it contains reads for 50 or fewer
genes.
# Assign overall cell counts ------------------------------------------
cell_counts_pre_tbl = geo_so@meta.data %>% count(orig.ident, name = 'prefilter_cells')
cell_counts_pre_tbl
orig.ident prefilter_cells
1 HO.Day0.replicate1 1183
2 HO.Day0.replicate2 689
3 HO.Day0.replicate3 1310
4 HO.Day0.replicate4 1053
5 HO.Day7.replicate1 5765
6 HO.Day7.replicate2 6020
7 HO.Day7.replicate3 6285
8 HO.Day7.replicate4 5166
9 HO.Day21.replicate1 2321
10 HO.Day21.replicate2 1322
11 HO.Day21.replicate3 1275
12 HO.Day21.replicate4 2827
It appears that the Day 7 samples have systematically more cells than
the other days. If you had an idea of how many cells you expected to see
per sample, this table can help you check that expectation and determine
if a sample failed and should be dropped.
Visualizing quality metrics
A violin plot shows the distribution of a quantity among the cells in
a single sample, or across many samples. Seurat has a built-in function,
VlnPlot()
for this purpose. Let’s orient with a violin plot
of nFeature_RNA
for only one sample,
HO.Day21.replicate1
:
A violin plot is similar to a box plot, but it shows the density of
the data at different values. Here the individual points are the cells
from HO.Day21.replicate1
and the y-axis is the value of
nFeature_RNA
for that cell. The violin part of the function
is essentially showing the density of the cells at different values of
nFeature_RNA
.
Genes per cell
Let’s look at the nFeature_RNA
violin plot across all
the samples. Again, this is the number of genes detected per cell.
# Review feature violin plots ----------------------------------------
VlnPlot(geo_so, features = 'nFeature_RNA', assay = 'RNA', layer = 'counts') + NoLegend() + geom_hline(yintercept = 500) + geom_hline(yintercept = 400) + geom_hline(yintercept = 300) + geom_hline(yintercept = 200)
ggsave(filename = 'results/figures/qc_nFeature_violin.png', width = 12, height = 6, units = 'in')
We observe that samples behave similarly within a day, but have
different distributions across days. We also note that many cells appear
to have a low number of genes detected (< 500).
UMI counts per cell
Now let’s look at nCount_RNA
plot, the total number of
UMIs detected.
# Review count violin plots ------------------------------------------
VlnPlot(geo_so, features = 'nCount_RNA', assay = 'RNA', layer = 'counts') + NoLegend()
ggsave(filename = 'results/figures/qc_nCount_violin.png', width = 12, height = 6, units = 'in')
We observe a similar within-and-across day behavior among the
samples. Overall it appears that the Day 0 samples have fewer UMIs
detected per cell than Day 7 and Day 21. Also of note is the outlier
cell in HO.Day0.replicate1, with around 100K UMIs detected. This cell
might be a doublet.
Percent mitochondrial reads
Finally, lets plot the percent mitochondrial reads,
percent.mt
.
# Review mitochondrial violin plots ----------------------------------
VlnPlot(geo_so, features = 'percent.mt', assay = 'RNA', layer = 'counts') + NoLegend() + geom_hline(yintercept = 25) + geom_hline(yintercept = 20) + geom_hline(yintercept = 15) + geom_hline(yintercept = 10)
ggsave(filename = 'results/figures/qc_mito_violin.png', width = 12, height = 6, units = 'in')
We observe the majority of cells seem to have < 25% mitochondrial
reads, but there are many cells with > 25% that we may want to
remove.
Generally, many tutorials use a cutoff of 5 - 10% mitochondrial
reads. However, for some experiments, high mitochondrial reads would be
expected (such as in cases where the condition/treatment or genotype
increases cell death). In this case, a relaxed threshold would help
preserve biologically relevant cells. In our case, the cells were
collected as part of an injury model, perhaps justifying a more lenient
percent mitochondrial read filter.
Filtering approaches
Fixed thresholds
Fixed thresholds for any combination of nFeature_RNA
,
nCount_RNA
, and percent.mt
can be selected to
remove low quality cells. In general, selecting very stringent
thresholds for each of the metrics may lead to removing informative
cells. As per the HBC training materials (link),
we recommend setting individual thresholds to err on the permissive
side.
Sorkin et
al. chose these thresholds:
We filtered out cells with less than 500 genes per cell and with more
than 25% mitochondrial read content.
In other words, cells with > 500 nFeature_RNA
and
< 25% percent.mt
were retained by the authors. We could
preview what the resulting cell counts would be with these
thresholds:
# Consider excluding suspect cells -------------------------------------
subset(geo_so, subset = nFeature_RNA > 500 & percent.mt < 25)@meta.data %>%
count(orig.ident, name = 'postfilter_cells')
orig.ident postfilter_cells
1 HO.Day0.replicate1 1049
2 HO.Day0.replicate2 615
3 HO.Day0.replicate3 1191
4 HO.Day0.replicate4 943
5 HO.Day7.replicate1 4928
6 HO.Day7.replicate2 4932
7 HO.Day7.replicate3 4897
8 HO.Day7.replicate4 4242
9 HO.Day21.replicate1 2000
10 HO.Day21.replicate2 1182
11 HO.Day21.replicate3 1166
12 HO.Day21.replicate4 2470
Adaptive thresholds
For this workshop we will use fixed thresholds, but another option is
to remove low-quality cells adaptively. This approach assumes that most
of the cells are of acceptable quality. For more information see the
relevant section in Bioconductor’s online book: Orchestrating
Single-Cell Analysis (link).
Removing low-quality cells
We will deviate from the publication slightly and retain the cells
with nFeature_RNA > 300
and
percent.mt < 15
. Noting that the
nFeature_RNA
plot had a gap around 300, and that we want to
be more lenient than the usual 5-10% cutoff for mitochondrial reads.
# Filter to exclude suspect cells and assign to new Seurat object ------
geo_so = subset(geo_so, subset = nFeature_RNA > 300 & percent.mt < 15)
geo_so
An object of class Seurat
26489 features across 31560 samples within 1 assay
Active assay: RNA (26489 features, 0 variable features)
1 layer present: counts
And let’s record the number of cells remaining after filtering:
# Examine remaining cell counts ----------------------------------------
cell_counts_post_tbl = geo_so@meta.data %>% count(orig.ident, name = 'postfilter_cells')
cell_counts_post_tbl
orig.ident postfilter_cells
1 HO.Day0.replicate1 1053
2 HO.Day0.replicate2 629
3 HO.Day0.replicate3 1222
4 HO.Day0.replicate4 971
5 HO.Day7.replicate1 5215
6 HO.Day7.replicate2 5324
7 HO.Day7.replicate3 5626
8 HO.Day7.replicate4 4690
9 HO.Day21.replicate1 2004
10 HO.Day21.replicate2 1186
11 HO.Day21.replicate3 1178
12 HO.Day21.replicate4 2462
Let’s combine the pre and post tables:
# Show cell counts before and after filtering --------------------------
cell_counts_tbl = cell_counts_pre_tbl %>% left_join(cell_counts_post_tbl, by = 'orig.ident')
cell_counts_tbl
orig.ident prefilter_cells postfilter_cells
1 HO.Day0.replicate1 1183 1053
2 HO.Day0.replicate2 689 629
3 HO.Day0.replicate3 1310 1222
4 HO.Day0.replicate4 1053 971
5 HO.Day7.replicate1 5765 5215
6 HO.Day7.replicate2 6020 5324
7 HO.Day7.replicate3 6285 5626
8 HO.Day7.replicate4 5166 4690
9 HO.Day21.replicate1 2321 2004
10 HO.Day21.replicate2 1322 1186
11 HO.Day21.replicate3 1275 1178
12 HO.Day21.replicate4 2827 2462
We can now easily see how many cells we started with, and how many we
retained for downstream analysis. Let’s write this table to a file.
write_csv(cell_counts_tbl, file = 'results/tables/cell_filtering_counts.csv')
Revising thresholds
It is important to recognize that thresholds for retaining cells are
not set in stone. It may happen that the downstream analysis suggests
that thresholds should be more or less stringent. In which case, you may
experiment with the thresholds and observe the downstream effects.
Save our progress
Let’s save this filtered form of our Seurat object. It will also
include our changes to the meta.data
:
# Save the current Seurat object ---------------------------------------
saveRDS(geo_so, file = 'results/rdata/geo_so_filtered.rds')
Summary
In this section we:
- Discussed the big three quality metrics:
nFeature
s,
nCount
s, and percent.mt
.
- Visualized these metrics across cells / samples to help identify
low-quality cells.
- Filtered low-quality cells using fixed thresholds.
Next steps: Normalization
These materials have been adapted and extended from materials listed
above. These are open access materials distributed under the terms of
the Creative
Commons Attribution license (CC BY 4.0), which permits unrestricted
use, distribution, and reproduction in any medium, provided the original
author and source are credited.
---
title: "Secondary Quality Control and Filtering"
author: "UM Bioinformatics Core"
date: "`r Sys.Date()`"
output:
        html_document:
            includes:
                in_header: header.html
            theme: paper
            toc: true
            toc_depth: 4
            toc_float: true
            number_sections: false
            fig_caption: true
            markdown: GFM
            code_download: true
---

<style type="text/css">
body, td {
   font-size: 18px;
}
code.r{
  font-size: 12px;
}
pre {
  font-size: 12px
}

table.fig, th.fig, td.fig {
  border: 1px solid black;
  border-collapse: collapse;
  padding: 15px;
}

table{
   width:100%;
}
</style>

```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(lang = c("r", "markdown", "bash"), position = c("top", "right"))
```

```{r, include = FALSE}
source("../bin/chunk-options.R")
knitr_fig_path("02-QCandFiltering/02-")
```

# Workflow Overview {.unlisted .unnumbered}

<br/>
<img src="images/wayfinder/wayfinder.png" alt="wayfinder" style="height: 400px;"/>
<br/>
<br/>

# Introduction

As discussed at the start of the workshop, single-cell experiments using 10x Chromium instrument aim to have droplets with one cell plus one bead. However this is an inherently imperfect process and there are other important considerations like how healthy or intact the cell was at the time of measurement.

In this section, our goal is to use filtering thresholds to remove "cells" that were poorly measured, not cells at all, or included more than one cell.

<table class='fig'><tr><td class='fig'>
![](images/graphical_abstracts/graphical_abstract_QC_and_filtering.png)
</td></tr><tr><td class='fig'>The Cell Ranger barcode feature matrix outputs (A) is a mix of good and poor quality cells (B). We can use expression patterns to further visualize and filter to the subset of putative healthy cells (C) to improve downstream analysis.
</td></tr></table>
<br/>
Similar to many other areas of research, there are often gaps between how single-cell data is presented versus the reality of running an analysis. For example, only the final filtering thresholds might be reported in a paper but our process for choosing those is likely to be more iterative and include some trial and error.


## Objectives

- Discuss QC measures and learn how to calculate and plot them.
- Discuss cell-filtering approaches and apply them to our dataset.

<!-- Challenge for instructors: Every vignette uses different filters, how to harmonize/give guidance? Related, how much to discuss arbitrary cutoffs and continued maturation of field?--> 
<!-- Add links to relevant resources throughout --> 

<!-- General guidance - likely to be moved to earlier section:
- Note with each function call what gets added to the Seurat object.
- Adding checks to ensure object is updated by learners since want to avoid generating copied objects
- Note what layers should be used for what, for example, counts used for FeaturePlots. RNA vs SCT assay. -->

---

```{r, read_rds_hidden, echo = FALSE, warning = FALSE, message = FALSE}
if(!exists('geo_so')) {
  library(Seurat)
  library(BPCells)
  library(tidyverse)

  options(future.globals.maxSize = 1e9)

  geo_so = readRDS('results/rdata/geo_so_unfiltered.rds')
}
```

# Adding metadata

We're going to alter and add some columns to `meta.data` to ease some downstream analysis and plotting steps. This will often be necessary, as sample names contain combined information about phenotypes. Let's take a look at the first few rows again, to reacquaint ourselves.

```{r, preview_metadata1}
##### Day 1 - Secondary QC and filtering

# Examine Seurat metdata  ----------------------------------------------
head(geo_so@meta.data)
```

We can add arbitrary per-cell information to this table such as:

- Summary statistics, such as percent mitochondrial reads for each cell.
- Batch, condition, etc. for each cell.
- Cluster membership for each cell.
- Cell cycle phase for each cell.
- Other custom annotations for each cell.

The following code chunk will create new columns from the `orig.ident` column, one for the `day` and one for the `replicate`. We'll also use factors to make the days appear in the correct order (by default R will order them Day0, Day21, Day7).

```{r, alter_metadata}
# Make metadata more granular ------------------------------------------
# (So it's easier for us to understand)  
orig.ident_levels = apply(
    expand.grid(c('HO'), c('replicate1', 'replicate2', 'replicate3', 'replicate4'), c('Day0','Day7','Day21')), 
    MARGIN = 1, FUN = function(row){paste(row[1], row[3], row[2], sep = '.')})

tmp_meta = geo_so@meta.data

tmp_meta$orig.ident = gsub('HO', 'HO.', tmp_meta$orig.ident)
tmp_meta$orig.ident = gsub('rep', '.rep', tmp_meta$orig.ident)

tmp_meta$orig.ident = factor(tmp_meta$orig.ident, levels = orig.ident_levels)

# Add day information
tmp_meta$day = factor(str_split(tmp_meta$orig.ident, pattern = '[.]', simplify = TRUE)[,2], levels = c('Day0', 'Day7', 'Day21'))

# Add replicate column
tmp_meta$replicate = str_split(tmp_meta$orig.ident, pattern = '[.]', simplify = TRUE)[,3]

# Assign the updated metadata to that in the Seurat object
geo_so@meta.data = tmp_meta

# Set the Idents(geo_so) to the new orig.ident
Idents(geo_so) = 'orig.ident'
```

Taking a look at the result, we see `orig.ident` looks a little different, and we have new `day` and `replicate` columns. Note that there are many ways to have accomplished this, the above code is just one route.

```{r, preview_metadata3}
# Re-examine Seurat metdata  -------------------------------------------
head(geo_so@meta.data)
```

## Percent mitochondrial reads

The `PercentageFeatureSet()` function enables us to quickly determine the counts belonging to a subset of the possible features for each cell. Since mitochondrial transcripts in mouse begin with "mt", we will use that pattern to count the percentage of reads coming from mitochondrial transcripts.

```{r, assign_percent_mt}
# Consider mitochondrial transcripts  ----------------------------------
# We use "mt" because this is mouse, depending on the organism, this might need to be changed
geo_so$percent.mt = PercentageFeatureSet(geo_so, pattern = '^mt-')

# Use summary() to quickly check the range of values
summary(geo_so$percent.mt)
```

Just looking at the summary, we can see that there are some cells with a high percentage of mitochondrial reads.

![Image: Schematic after altering some meta data columns.](images/seurat_schematic/Slide3.png)

# Quality Metrics

Cell Ranger is a first-pass filter to determine what is a "cell" and what is not. It only considers one sample at a time, and does not consider the cells relative to one another. 

Let's dig deeper to determine when a droplet might contain two cells, a very stressed cell, or some technical issue in the library preparation. We will use three metrics to determine low-quality cells based on their expression profiles ([reference](https://bioconductor.org/books/3.12/OSCA/quality-control.html#choice-of-qc-metrics)).

1. The total number of UMIs detected. Cells with a small number of UMIs detected may indicate loss of RNA during library preparation via cell lysis or inefficient cDNA capture / amplification. Cells with relatively high number of UMIs detected may indicate a doublet.
2. The number of expressed features, defined as number of genes with non-zero counts. Cells with very few measured genes are likely to be of low-quality, and may distort downstream variance estimation or dimension reduction steps.
3. The proportion of reads mapped to the mitochondrial genome. High proportions of mitochondrial transcripts may indicate a damaged cell, the measure of which may also distort downstream analysis steps.

The number of UMIs detected (`nCount`) and number of expressed features (`nFeature`) are already given in the meta data table.

> **Why total UMIs instead of total reads?**
> 
> Since a single-cell inherently contains a limited amount of RNA molecules, a higher amount of PCR amplification is required to generate the final sequencing library.
> 
> Since PCR can skew proportions of initial input materials, specific sequences are included in the initial capture probes called unique molecule identifiers (UMIs). As each initial probe has a different UMI sequence, each RNA captured will be tagged with a different UMI, which allows those initial RNAs and subsequent PCR duplicates to be identified and duplicates collapsed as part of the initial processing by CellRanger.  
>

> **Other meanings of `nFeatures`**
> 
> For other single-cell data types, `nFeatures` would represent what's being measured in that experiment. For single-cell ATAC-seq, `nFeatures` would represents the total number of peaks (e.g. accessible areas of DNA) per cell.
>

<!-- Matt may have discussed UMIs as part of CellRanger processing and Liv/Tricia may touch on as part of library generation at end of Day 2 -->

## Cell counts

Cell counts per sample (based on the numbrer of unique cellular barcodes detected by Cell Ranger) can indicate if an entire sample was of poor quality. The table below is the number of cells called by Cell Ranger, with the added filter from `CreateSeurateObject()` in the previous lesson: `min.cells = 1, min.features = 50`. Recall, this means a gene is removed if it is expressed in 1 or fewer cells, and a cell is removed if it contains reads for 50 or fewer genes.

```{r, prefilter_cell_counts}
# Assign overall cell counts  ------------------------------------------
cell_counts_pre_tbl = geo_so@meta.data %>% count(orig.ident, name = 'prefilter_cells')
cell_counts_pre_tbl
```

It appears that the Day 7 samples have systematically more cells than the other days. If you had an idea of how many cells you expected to see per sample, this table can help you check that expectation and determine if a sample failed and should be dropped.

## Visualizing quality metrics

A violin plot shows the distribution of a quantity among the cells in a single sample, or across many samples. Seurat has a built-in function, `VlnPlot()` for this purpose. Let's orient with a violin plot of `nFeature_RNA` for only one sample, `HO.Day21.replicate1`:

```{r, single_violin_plot, echo = FALSE}
VlnPlot(subset(geo_so, orig.ident == 'HO.Day21.replicate1'), features = 'nFeature_RNA', assay = 'RNA', layer = 'counts') + NoLegend()
```

A violin plot is similar to a box plot, but it shows the density of the data at different values. Here the individual points are the cells from `HO.Day21.replicate1` and the y-axis is the value of `nFeature_RNA` for that cell. The violin part of the function is essentially showing the density of the cells at different values of `nFeature_RNA`.

### Genes per cell

Let's look at the `nFeature_RNA` violin plot across all the samples. Again, this is the number of genes detected per cell.

```{r, feature_plot, fig.show = 'hold'}
# Review feature violin plots   ----------------------------------------
VlnPlot(geo_so, features = 'nFeature_RNA', assay = 'RNA', layer = 'counts') + NoLegend() + geom_hline(yintercept = 500) + geom_hline(yintercept = 400) + geom_hline(yintercept = 300) + geom_hline(yintercept = 200)
ggsave(filename = 'results/figures/qc_nFeature_violin.png', width = 12, height = 6, units = 'in')
```

We observe that samples behave similarly within a day, but have different distributions across days. We also note that many cells appear to have a low number of genes detected (< 500).

### UMI counts per cell

Now let's look at `nCount_RNA` plot, the total number of UMIs detected.

```{r, count_plot, fig.show = 'hold'}
# Review count violin plots   ------------------------------------------
VlnPlot(geo_so, features = 'nCount_RNA', assay = 'RNA', layer = 'counts') + NoLegend()
ggsave(filename = 'results/figures/qc_nCount_violin.png', width = 12, height = 6, units = 'in')
```

We observe a similar within-and-across day behavior among the samples. Overall it appears that the Day 0 samples have fewer UMIs detected per cell than Day 7 and Day 21. Also of note is the outlier cell in HO.Day0.replicate1, with around 100K UMIs detected. This cell might be a doublet.

### Percent mitochondrial reads

Finally, lets plot the percent mitochondrial reads, `percent.mt`.

```{r, mito_plot, fig.show = 'hold'}
# Review mitochondrial violin plots   ----------------------------------
VlnPlot(geo_so, features = 'percent.mt', assay = 'RNA', layer = 'counts') + NoLegend() + geom_hline(yintercept = 25) + geom_hline(yintercept = 20) + geom_hline(yintercept = 15) + geom_hline(yintercept = 10)
ggsave(filename = 'results/figures/qc_mito_violin.png', width = 12, height = 6, units = 'in')
```

We observe the majority of cells seem to have < 25% mitochondrial reads, but there are many cells with > 25% that we may want to remove.

Generally, many tutorials use a cutoff of 5 - 10% mitochondrial reads. However, for some experiments, high mitochondrial reads would be expected (such as in cases where the condition/treatment or genotype increases cell death). In this case, a relaxed threshold would help preserve biologically relevant cells. In our case, the cells were collected as part of an injury model, perhaps justifying a more lenient percent mitochondrial read filter.

# Filtering approaches

## Fixed thresholds

Fixed thresholds for any combination of `nFeature_RNA`, `nCount_RNA`, and `percent.mt` can be selected to remove low quality cells. In general, selecting very stringent thresholds for each of the metrics may lead to removing informative cells. As per the HBC training materials ([link](https://hbctraining.github.io/scRNA-seq_online/lessons/04_SC_quality_control.html)), we recommend setting individual thresholds to err on the permissive side.

[Sorkin et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7002453/) chose these thresholds:

> We filtered out cells with less than 500 genes per cell and with more than 25% mitochondrial read content.

In other words, cells with > 500 `nFeature_RNA` and < 25% `percent.mt` were retained by the authors. We could preview what the resulting cell counts would be with these thresholds:

```{r, preview_postfilter_cell_counts}
# Consider excluding suspect cells -------------------------------------
subset(geo_so, subset = nFeature_RNA > 500 & percent.mt < 25)@meta.data %>% 
    count(orig.ident, name = 'postfilter_cells')
```

## Adaptive thresholds

For this workshop we will use fixed thresholds, but another option is to remove low-quality cells adaptively. This approach assumes that most of the cells are of acceptable quality. For more information see the relevant section in Bioconductor's online book: Orchestrating Single-Cell Analysis ([link](https://bioconductor.org/books/3.12/OSCA/quality-control.html#quality-control-outlier)).

# Removing low-quality cells

We will deviate from the publication slightly and retain the cells with `nFeature_RNA > 300` and `percent.mt < 15`. Noting that the `nFeature_RNA` plot had a gap around 300, and that we want to be more lenient than the usual 5-10% cutoff for mitochondrial reads.

```{r, filter_seurat}
# Filter to exclude suspect cells and assign to new Seurat object ------
geo_so = subset(geo_so, subset = nFeature_RNA > 300 & percent.mt < 15)
geo_so
```

And let's record the number of cells remaining after filtering:

```{r, postfilter_cell_counts}
# Examine remaining cell counts ----------------------------------------
cell_counts_post_tbl = geo_so@meta.data %>% count(orig.ident, name = 'postfilter_cells')
cell_counts_post_tbl
```

Let's combine the pre and post tables:

```{r, cell_counts}
# Show cell counts before and after filtering --------------------------
cell_counts_tbl = cell_counts_pre_tbl %>% left_join(cell_counts_post_tbl, by = 'orig.ident')
cell_counts_tbl
```

We can now easily see how many cells we started with, and how many we retained for downstream analysis. Let's write this table to a file.

```{r, write_cell_counts}
write_csv(cell_counts_tbl, file = 'results/tables/cell_filtering_counts.csv')
```

<!--What thresholds would we start with based on the plots alone? How does that compare to the thresholds reported in paper?-->

<!--Other “advanced” methods: out of scope for this workshop but there are packages  specifically developed to detect "doublets", e.g. droplets that contained more than one cell, such as DoubleFinder-->

<!--Additional aside - For single-nuclei experiments removing background/ambient RNA with CellBlender OR DecontX is an important additional step since nuclei are both sticky and porous-->

# Revising thresholds

It is important to recognize that thresholds for retaining cells are not set in stone. It may happen that the downstream analysis suggests that thresholds should be more or less stringent. In which case, you may experiment with the thresholds and observe the downstream effects.

# Save our progress

Let's save this filtered form of our Seurat object. It will also include our changes to the `meta.data`:

```{r, save_rds_hidden, echo = FALSE}
if(!file.exists('results/rdata/geo_so_filtered.rds')) {
  saveRDS(geo_so, file = 'results/rdata/geo_so_filtered.rds')
}
```

```{r, save_rds, eval = FALSE}
# Save the current Seurat object ---------------------------------------
saveRDS(geo_so, file = 'results/rdata/geo_so_filtered.rds')
```

# Summary

In this section we:

- Discussed the big three quality metrics: `nFeature`s, `nCount`s, and `percent.mt`.
- Visualized these metrics across cells / samples to help identify low-quality cells.
- Filtered low-quality cells using fixed thresholds.

Next steps: Normalization

----

These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the [Creative Commons Attribution license (CC BY 4.0)](http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

<br/>
<br/>

-------

| [Previous lesson](00B-CellRangerInAction.html) | [Top of this lesson](#top) | [Next lesson](03-Normalization.html) |
| :--- | :----: | ---: |
