Workflow Overview


wayfinder

Introduction

To interpret our initial clustering results for all cells in our experiment, we’ll need to characterize the gene expression driving separation between the clusters.

Expression profiles (A) for for each cell cluster are evaluated against all other clusters (B) to identify marker genes, i.e. genes useful for segregating the cells into clusters (C).


In this section, we will demonstrate: - How to generate “analytical” markers based on gene expression in each cluster - How to visualize the expression of genes of interest to aid in cluster identification and labeling

This process can be a highly variable, from seeing well-characterized marker genes as top markers to needing to dig through the literature and plot the expression for multiple sets of genes of interest. Like the previous sections (and like other areas of research), when this process is more involved or iterative, often only the final results are reported.

Objectives

  • Determine gene markers for each of the clusters using FindAllMarkers()
  • Visualize expression across clusters for genes of interest using DotPlot()

Cluster markers and characterization

After generating clusters, we need to perform differential expression analysis to identify the genes that distinguish those clusters (source). This should allow us to get visibility on some key questions for our clusters, as highlighted by the HBC materials, namely:

  • Are there are biologically meaningful gene expression differences between the clusters?
  • Does the gene expression of the generated clusters correspond to expected cell-types or sub-types?
  • Are there clusters with similar expression that should be combined and/or clusters that might need to be sub-clustered into smaller populations?

Many of the differential expression (DE) tools designed for bulk RNA-seq samples have been benchmarked for performance on scRNA-seq in Soneson and Robinson (2018) and there are also dedicated DE tools for scRNA-seq, like MAST, that use models that account for the expected sparse structure of scRNA-seq data. However in our and others’ experience, the default Wilcoxon test is often sufficient for simple pairwise DE comparisons, while edgeR (Robinson, McCarthy, and Smyth 2010) is recommended by the Ouyang Lab for more complex comparison designs, such as those that include covariates.

Image: Comparison of the performance of DE methods applied to scRNA-seq datasets. Methods are ranked by their average performance across all the listed criteria. Image taken from Soneson and Robinson (2018).
Image: Comparison of the performance of DE methods applied to scRNA-seq datasets. Methods are ranked by their average performance across all the listed criteria. Image taken from Soneson and Robinson (2018).
Additional considerations for differential expression The Ouyang Lab has a section of their tutorial that discusses the methods available for differential expression including some highlighted in the figure below, as well as a more extensive section on threshold considerations, while the HBC section on marker genes identification highlights the different types of marker identification options available via Seurat.


Marker identification

First, we’ll change identities of the cells to “integrated.sct.rpca.clusters” explicitly with SetIdent(). The “seurat_clusters” column is the default column for cell identities and changes each time a new clustering is performed. Then, we’ll ensure that the correct resolution is selected from our Seurat object and then we’ll use the PrepSCTFindMarkers() function in preparation for DE comaprisons to “reverse the individual SCT regression model using minimum of median UMI as the sequencing depth covariate” according to the Seurat documentation. Remember that we’ve performed integration and clustering to assign the cells to clusters regardless of their experimental condition but that we now want to ensure that the data is normalized but not with the SCTransformation scaling needed for the previous steps.

Then we’ll run theFindAllMarkers() function to generate comparisons between each cluster and all other cells, regardless of the experimental group. Note - the statistical test to perform can be specified in FindAllMarkers(), but the default is a Wilcoxon test.

##### Day 3 - Marker identification and visualization 

# Find empirical markers  -------------------------------------------------
# Prep for cluster comparisons
geo_so = SetIdent(geo_so, value = 'integrated.sct.rpca.clusters')
geo_so = PrepSCTFindMarkers(geo_so)

# Run comparisons for each cluster to generate markers
geo_markers = FindAllMarkers(geo_so, only.pos = TRUE, min.pct = 0.05)

# Take a look at the first few rows of the result
head(geo_markers)
        p_val avg_log2FC pct.1 pct.2 p_val_adj cluster    gene
Col12a1     0   4.223843 0.915 0.132         0       0 Col12a1
Thbs2       0   3.851482 0.896 0.151         0       0   Thbs2
Lox         0   3.026665 0.904 0.163         0       0     Lox
Tnc         0   3.936619 0.823 0.098         0       0     Tnc
Col11a1     0   4.996025 0.753 0.062         0       0 Col11a1
Aebp1       0   2.294299 0.972 0.295         0       0   Aebp1
# Write out full cluster marker results to file
write_csv(geo_markers, file = 'results/tables/marker_genes_0.4res.csv')
Image: Schematic after SetIdent().
Image: Schematic after SetIdent().
Seurat v5 improvements For marker generation, Seurat v5 uses the presto package to reduce the time required to run DE comparisons, particularly for large datasets. For users who are not using presto, Seurat recommends increasing the min.pct and logfc.threshold parameterscto increase the speed of DE testing (source).


Note that over-interpretation of these results should be avoided, since each cell is used as a replicate in these comparisons which can lead to inflated (e.g. very low) p-values, the top markers are more likely to be trustworthy (source).

Therefore, it’s useful to filter the results to highlight the top positive markers (since a positive fold-change would mean that gene is more highly expressed in the cluster compared to all other cells), before looking at our results

# Identify marker genes for each cluster ----------------------------------
# Create table of top 5 markers per cluster (using default ranking)
top_5 = geo_markers %>% filter(p_val_adj < 0.01) %>% group_by(cluster) %>% slice_head(n = 5)

# Look at results
head(top_5, n = 10)
# A tibble: 10 × 7
# Groups:   cluster [2]
   p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene    
   <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>   
 1     0       4.22 0.915 0.132         0 0       Col12a1 
 2     0       3.85 0.896 0.151         0 0       Thbs2   
 3     0       3.03 0.904 0.163         0 0       Lox     
 4     0       3.94 0.823 0.098         0 0       Tnc     
 5     0       5.00 0.753 0.062         0 0       Col11a1 
 6     0       4.28 0.883 0.108         0 1       Clec3b  
 7     0       2.83 0.953 0.29          0 1       Serping1
 8     0       3.38 0.903 0.242         0 1       Igfbp6  
 9     0       4.07 0.725 0.07          0 1       Tnxb    
10     0       2.91 0.764 0.147         0 1       Rarres2 
# Optional - Create table of top 5 markers per cluster (ranked by logFC)
top_5_by_log2FC = geo_markers %>% group_by(cluster) %>% arrange(p_val_adj, desc(avg_log2FC)) %>% slice_head(n = 5)

# Look at results after ranking
head(top_5_by_log2FC, n = 10) # notice difference in pct.1 column between tables
# A tibble: 10 × 7
# Groups:   cluster [2]
   p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene 
   <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>
 1     0       6.32 0.127 0.004         0 0       Cilp2
 2     0       5.84 0.418 0.013         0 0       Scx  
 3     0       5.68 0.133 0.005         0 0       Kera 
 4     0       5.66 0.083 0.003         0 0       Ucma 
 5     0       5.66 0.08  0.003         0 0       Matn3
 6     0       5.22 0.347 0.018         0 1       Myoc 
 7     0       5.18 0.153 0.006         0 1       Inmt 
 8     0       5.15 0.213 0.008         0 1       C7   
 9     0       4.97 0.096 0.006         0 1       Dact2
10     0       4.86 0.162 0.006         0 1       Lvrn 

We expect to see several columns:

  • gene: gene symbol
  • p_val: p-value not adjusted for multiple test correction
  • avg_logFC: average log fold change. Positive values indicate that the gene is more highly expressed in the cluster.
  • pct.1: percentage of cells where the gene is detected in the cluster
  • pct.2: percentage of cells where the gene is detected on average across all other clusters
  • p_val_adj: adjusted p-value based on bonferroni correction using all genes in the dataset, used to determine significance
  • cluster: cluster represented by pct.1 and for which the statistics in the row are reported

When looking at the output, it is important to prioritize marker genes with both larger fold-change differences and larger difference between pct.1 and pct.2, particularly if pct.1 is high (e.g. if 80% of cells in the cluster evaluated express the gene that more reliable than if only 20% of cells express that gene) (source).

Marker visualization

Now that we have generated a set of marker genes for our clusters, it is useful to visualize the expression of those markers to aid in evaluating them. While the expression of individual genes per cell can be overlaid on our UMAPs (as with the FeaturePlot() function), it’s often more useful to visualize the expression of multiple genes simultaneously. While there are multiple methods supported by Seurat for visualizing marker gene expression, a heatmap or a related plot called a dotplot are commonly used.

We’ll use the DotPlot() function with the SCT values to visualize the top 5 marker genes per cluster:

# Visualize top marker genes as dot plot ----------------------------------
top_5_sct_dot_plot = DotPlot(geo_so, features = unique(top_5$gene)) + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
    labs(title = 'Top 5 Cluster Genes by FDR and avg_log2FC') + coord_flip()
top_5_sct_dot_plot

ggsave(filename = 'results/figures/markers_top_5_sct_dot_plot.png', plot = top_5_sct_dot_plot, width = 8, height = 18, units = 'in') 

In the dotplot we can see that the color indicates the expression of the gene while the size of the dot indicates the proportion of cells expressing that gene in each cluster (source).

Using raw RNA values in Dotplots

In addition to plotting the SCT values, the raw or normalized RNA values can be plotted as well:

# Add RNA values to dot plot  ---------------------------------------------
top_5_rna_dot_plot = DotPlot(geo_so, features = unique(top_5_by_log2FC$gene), assay = 'RNA') + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
    labs(title = 'Top 5 Cluster Genes by FDR and avg_log2FC') + coord_flip()

ggsave(filename = 'results/figures/markers_top_5_rna_dot_plot.png', plot = top_5_rna_dot_plot, width = 8, height = 18, units = 'in') 



Before proceeding with cluster annotations, we’ll also check the percentage of mitochondrial genes to determine if there are any clusters (or sub populations) that might correspond to interesting cell death patterns (or might indicate further filtering is needed):

# Check mitochondrial gene expression -------------------------------------
percent_mito_plot = FeaturePlot(geo_so, features='percent.mt')
percent_mito_plot

# save to file
ggsave(filename = 'results/figures/percent_umap_mito_plot.png', plot = percent_mito_plot, width = 6, height = 6, units = 'in')

We see that a higher % seem to be somewhat concentrated in a few places, but if cell death might be of interest for the research question, we’d want to consider investigating this pattern further by splitting up the plots by day and/or waiting until after running initial differential expression analysis to determine if these cell populations are interesting biology or not.

Save our progress

Finally, we’ll create an output file for our updated Seurat object and for the cluster marker results:

# Save Seurat object and gene marker data ---------------------------------
saveRDS(geo_so, file = 'results/rdata/geo_so_sct_integrated_with_markers.rds')
saveRDS(geo_markers, file = 'results/rdata/geo_markers.rds')

Summary

Now that we have characterized the expression of both analytical marker genes and literature / knowledge-based marker genes, we may have a better sense of what cell-types or subtypes our clusters might correspond to.

However, marker genes alone might not be sufficient to determine cell-type or sub-type labels for our clusters so we will discuss other more automated approaches to complement these results.

Next steps: Cell type prediction tools


These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.




Previous lesson Top of this lesson Next lesson
---
title: "Marker identification and visualization"
author: "UM Bioinformatics Core"
date: "`r Sys.Date()`"
output:
        html_document:
            includes:
                in_header: header.html
            theme: paper
            toc: true
            toc_depth: 4
            toc_float: true
            number_sections: false
            fig_caption: true
            markdown: GFM
            code_download: true
---

<style type="text/css">
body, td {
   font-size: 18px;
}
code.r{
  font-size: 12px;
}
pre {
  font-size: 12px
}

table.fig, th.fig, td.fig {
  border: 1px solid black;
  border-collapse: collapse;
  padding: 15px;
}
</style>

```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(lang = c("r", "markdown", "bash"), position = c("top", "right"))
```

```{r, include = FALSE}
source("../bin/chunk-options.R")
knitr_fig_path("06-MarkerVisualization/06-")
```

# Workflow Overview {.unlisted .unnumbered}

<br/>
<img src="images/wayfinder/wayfinder.png" alt="wayfinder" style="height: 400px;"/>
<br/>
<br/>

# Introduction

To interpret our initial clustering results for all cells in our experiment, we'll need to characterize the gene expression driving separation between the clusters. 

<table class='fig'>
<tr class='fig'><td class='fig'>![](images/graphical_abstracts/graphical_abstract_marker_id.png)</td></tr>
<tr class='fig'><td class='fig'>Expression profiles (A) for for each cell cluster are evaluated against all other clusters (B) to identify marker genes, i.e. genes useful for segregating the cells into clusters (C). 
</td></tr>
</table>
<br/>

In this section, we will demonstrate:
- How to generate "analytical" markers based on gene expression in each cluster 
- How to visualize the expression of genes of interest to aid in cluster identification and labeling

This process can be a highly variable, from seeing well-characterized marker genes as top markers to needing to dig through the literature and plot the expression for multiple sets of genes of interest. Like the previous sections (and like other areas of research), when this process is more involved or iterative, often only the final results are reported.

## Objectives
<!--Add specific goals for section-->
- Determine gene markers for each of the clusters using `FindAllMarkers()`
- Visualize expression across clusters for genes of interest using `DotPlot()`

---

```{r, read_rds_hidden, echo = FALSE, warning = FALSE, message = FALSE}
if(!exists('geo_so')) {
  library(Seurat)
  library(BPCells)
  library(tidyverse)

  options(future.globals.maxSize = 1e9)

  geo_so = readRDS('results/rdata/geo_so_sct_clustered.rds')
}
```

# Cluster markers and characterization

After generating clusters, we need to perform differential expression analysis to identify the genes that distinguish those clusters ([source](https://ouyanglab.com/singlecell/clust.html#identifying-marker-genes)).  This should allow us to get visibility on some key questions for our clusters, as [highlighted by the HBC materials](https://hbctraining.github.io/scRNA-seq_online/lessons/09_merged_SC_marker_identification.html), namely:

- Are there are biologically meaningful gene expression differences between the clusters? 
- Does the gene expression of the generated clusters correspond to expected cell-types or sub-types? 
- Are there clusters with similar expression that should be combined and/or clusters that might need to be sub-clustered into smaller populations?

Many of the differential expression (DE) tools designed for bulk RNA-seq samples have been benchmarked for performance on scRNA-seq in [Soneson and Robinson (2018)](https://pubmed.ncbi.nlm.nih.gov/29481549/) and there are also dedicated DE tools for scRNA-seq, like [MAST](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0844-5), that use models that account for the expected sparse structure of scRNA-seq data. However in our and others' experience, the default Wilcoxon test is often sufficient for simple pairwise DE comparisons, while edgeR ([Robinson, McCarthy, and Smyth 2010](https://pubmed.ncbi.nlm.nih.gov/19910308/)) is recommended by the [Ouyang Lab]([Ouyang](https://ouyanglab.com/singlecell/clust.html#identifying-marker-genes)) for more complex comparison designs, such as those that include covariates.

<!-- Additional context from [Ouyang](https://ouyanglab.com/singlecell/clust.html#identifying-marker-genes) regarding tool options and [HBC materials](https://hbctraining.github.io/scRNA-seq_online/lessons/09_merged_SC_marker_identification.html) regarding FindMarkers vs FindConservedMarkers, etc. -->

![Image: Comparison of the performance of DE methods applied to scRNA-seq datasets. Methods are ranked by their average performance across all the listed criteria. Image taken from Soneson and Robinson (2018).](./images/curriculum/06-MarkerVisualization/Ouyang_clust-deCompare.png)

<details>
    <summary>*Additional considerations for differential expression*</summary>
    The Ouyang Lab has a [section of their tutorial](https://ouyanglab.com/singlecell/clust.html#sec:diffexpr) that discusses the methods available for differential expression including some highlighted in the figure below, as well as a more extensive section on [threshold considerations](https://ouyanglab.com/singlecell/clust.html#sec:diffexpr), while the [HBC section on marker genes identification](https://hbctraining.github.io/scRNA-seq_online/lessons/09_merged_SC_marker_identification.html) highlights the different types of marker identification options available via Seurat.
</details>
<br>

## Marker identification

First, we'll change identities of the cells to "integrated.sct.rpca.clusters" explicitly with `SetIdent()`. The "seurat_clusters" column is the default column for cell identities and changes each time a new clustering is performed. Then, we'll ensure that the correct resolution is selected from our Seurat object and then we'll use the `PrepSCTFindMarkers()` [function](https://satijalab.org/seurat/reference/prepsctfindmarkers) in preparation for DE comaprisons to "reverse the individual SCT regression model using minimum of median UMI as the sequencing depth covariate" according to the [Seurat documentation](https://satijalab.org/seurat/reference/prepsctfindmarkers). Remember that we've performed integration and clustering to assign the cells to clusters regardless of their experimental condition but that we now want to ensure that the data is normalized but not with the SCTransformation scaling needed for the previous steps.

<!-- if we want to use the normalized RNA data (not the integrated data) `DefaultAssay(geo_so) <- "RNA"`, although need to check if that still makes sense for Seurat v5 --> 

Then we'll run the`FindAllMarkers()` [function](https://satijalab.org/seurat/reference/findallmarkers) to generate comparisons between each cluster and all other cells, regardless of the experimental group. Note - the statistical test to perform can be specified in `FindAllMarkers()`, but the default is a Wilcoxon test. 

```{r, find_all_markers, cache = TRUE, cache.lazy = FALSE, warning = FALSE, message = FALSE}
##### Day 3 - Marker identification and visualization 

# Find empirical markers  -------------------------------------------------
# Prep for cluster comparisons
geo_so = SetIdent(geo_so, value = 'integrated.sct.rpca.clusters')
geo_so = PrepSCTFindMarkers(geo_so)

# Run comparisons for each cluster to generate markers
geo_markers = FindAllMarkers(geo_so, only.pos = TRUE, min.pct = 0.05)

# Take a look at the first few rows of the result
head(geo_markers)

# Write out full cluster marker results to file
write_csv(geo_markers, file = 'results/tables/marker_genes_0.4res.csv')
```


![Image: Schematic after SetIdent().](images/seurat_schematic/Slide11.png)

<!--- Consider how to represent the result of PrepSCTFindMarkers() in the schematic. It's a little too subtle... -->

<!--- Consider adding figure showing example of what's being compared (circle cluster 1 and then circle all other cells for example) -->

<details>
    <summary>*Seurat v5 improvements*</summary>
    For marker generation, Seurat v5 uses the [presto package](https://www.biorxiv.org/content/10.1101/653253v1) to reduce the time required to run DE comparisons, particularly for large datasets. For users who are not using presto, Seurat recommends increasing the min.pct and logfc.threshold parameterscto increase the speed of DE testing ([source](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html#finding-differentially-expressed-features-cluster-biomarkers)).
</details>
<br>

Note that over-interpretation of these results should be avoided, since each cell is used as a replicate in these comparisons which can lead to inflated (e.g. very low) p-values, the top markers are more likely to be trustworthy ([source](https://hbctraining.github.io/scRNA-seq_online/lessons/09_merged_SC_marker_identification.html)). 

Therefore, it's useful to filter the results to highlight the top positive markers (since a positive fold-change would mean that gene is more highly expressed in the cluster compared to all other cells), before looking at our results

```{r, top_5_cluster_markers}
# Identify marker genes for each cluster ----------------------------------
# Create table of top 5 markers per cluster (using default ranking)
top_5 = geo_markers %>% filter(p_val_adj < 0.01) %>% group_by(cluster) %>% slice_head(n = 5)

# Look at results
head(top_5, n = 10)

# Optional - Create table of top 5 markers per cluster (ranked by logFC)
top_5_by_log2FC = geo_markers %>% group_by(cluster) %>% arrange(p_val_adj, desc(avg_log2FC)) %>% slice_head(n = 5)

# Look at results after ranking
head(top_5_by_log2FC, n = 10) # notice difference in pct.1 column between tables
```

We expect to see several columns:

- `gene`: gene symbol
- `p_val`: p-value not adjusted for multiple test correction
- `avg_logFC`: average log fold change. Positive values indicate that the gene is more highly expressed in the cluster.
- `pct.1`: percentage of cells where the gene is detected in the cluster 
- `pct.2`: percentage of cells where the gene is detected on average across all other clusters 
- `p_val_adj`: adjusted p-value based on bonferroni correction using all genes in the dataset, used to determine significance
- `cluster`: cluster represented by `pct.1` and for which the statistics in the row are reported

When looking at the output, it is important to prioritize marker genes with both larger fold-change differences and larger difference between pct.1 and pct.2, particularly if pct.1 is high (e.g. if 80% of cells in the cluster evaluated express the gene that more reliable than if only 20% of cells express that gene) ([source](https://hbctraining.github.io/scRNA-seq_online/lessons/09_merged_SC_marker_identification.html)).

## Marker visualization

Now that we have generated a set of marker genes for our clusters, it is useful to visualize the expression of those markers to aid in evaluating them. While the expression of individual genes per cell can be overlaid on our UMAPs (as with the `FeaturePlot()` [function](https://satijalab.org/seurat/reference/featureplot)), it's often more useful to visualize the expression of multiple genes simultaneously. While there are [multiple methods supported by Seurat](https://satijalab.org/seurat/articles/visualization_vignette) for  visualizing marker gene expression, a heatmap or a related plot called a dotplot are commonly used. 

We'll use the `DotPlot()` [function](https://satijalab.org/seurat/reference/dotplot) with the SCT values to visualize the top 5 marker genes per cluster:

```{r, top_5_dot_plot_sct, fig.height = 18, fig.width = 8}
# Visualize top marker genes as dot plot ----------------------------------
top_5_sct_dot_plot = DotPlot(geo_so, features = unique(top_5$gene)) + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
    labs(title = 'Top 5 Cluster Genes by FDR and avg_log2FC') + coord_flip()
top_5_sct_dot_plot

ggsave(filename = 'results/figures/markers_top_5_sct_dot_plot.png', plot = top_5_sct_dot_plot, width = 8, height = 18, units = 'in') 
```

In the dotplot we can see that the color indicates the expression of the gene while the size of the dot indicates the proportion of cells expressing that gene in each cluster ([source](https://ouyanglab.com/singlecell/clust.html#visualising-marker-genes)). 

<details>
    <summary>*Using raw RNA values in Dotplots*</summary>
    In addition to plotting the SCT values, the raw or normalized RNA values can be plotted as well:

```{r, top_5_dot_plot_rna, eval=FALSE}
# Add RNA values to dot plot  ---------------------------------------------
top_5_rna_dot_plot = DotPlot(geo_so, features = unique(top_5_by_log2FC$gene), assay = 'RNA') + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
    labs(title = 'Top 5 Cluster Genes by FDR and avg_log2FC') + coord_flip()

ggsave(filename = 'results/figures/markers_top_5_rna_dot_plot.png', plot = top_5_rna_dot_plot, width = 8, height = 18, units = 'in') 
```

</details>
<br>
<br>

Before proceeding with cluster annotations, we'll also check the percentage of mitochondrial genes to determine if there are any clusters (or sub populations) that might correspond to interesting cell death patterns (or might indicate further filtering is needed):
```{r, echo=TRUE, eval= TRUE}
# Check mitochondrial gene expression -------------------------------------
percent_mito_plot = FeaturePlot(geo_so, features='percent.mt')
percent_mito_plot

# save to file
ggsave(filename = 'results/figures/percent_umap_mito_plot.png', plot = percent_mito_plot, width = 6, height = 6, units = 'in')
```

We see that a higher % seem to be somewhat concentrated in a few places, but if cell death might be of interest for the research question, we'd want to consider investigating this pattern further by splitting up the plots by day and/or waiting until after running initial differential expression analysis to determine if these cell populations are interesting biology or not.



# Save our progress

Finally, we'll create an output file for our updated Seurat object and for the cluster marker results:

```{r, save_rds_hidden, echo = FALSE}
if(!file.exists('results/rdata/geo_so_sct_integrated_with_markers.rds')) {
  saveRDS(geo_so, file = 'results/rdata/geo_so_sct_integrated_with_markers.rds')
}

if(!file.exists('results/rdata/geo_markers.rds')) {
  saveRDS(geo_markers, file = 'results/rdata/geo_markers.rds')
}
```

```{r, save_rds, eval=FALSE}
# Save Seurat object and gene marker data ---------------------------------
saveRDS(geo_so, file = 'results/rdata/geo_so_sct_integrated_with_markers.rds')
saveRDS(geo_markers, file = 'results/rdata/geo_markers.rds')
```

# Summary

Now that we have characterized the expression of both analytical marker genes and literature / knowledge-based marker genes, we may have a better sense of what cell-types or subtypes our clusters might correspond to.   

However, marker genes alone might not be sufficient to determine cell-type or sub-type labels for our clusters so we will discuss other more automated approaches to complement these results.

Next steps: Cell type prediction tools

----

These materials have been adapted and extended from materials listed above. These are open access materials distributed under the terms of the [Creative Commons Attribution license (CC BY 4.0)](http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

<br/>
<br/>
<hr/>
| [Previous lesson](05-ProjectionAndClustering.html) | [Top of this lesson](#top) | [Next lesson](07-CellTypeAnnos.html) |
| :--- | :----: | ---: |
