Workflow Overview




Workshop exercises

Build on the content and analysis steps covered in the workshop sessions by working through these independent exercises. Note - if you work on the exercises make sure to restart R session to clear out environment before closing the window (like we have at the end of each session) to avoid lags when logging in the next day


Reminder: RStudio, Seurat, and Memory Management

  • R and RStudio are designed to grab computer memory (RAM) when necessary and release that memory when it’s no longer needed. R does this with a fancy computer science technique called garbage collection.

  • The R garbage collector is very good for small objects and simple analyses, but complex single-cell analysis can overwhelm it. So don’t rely on R/RStudio’s built in garbage collection and session management for single-cell analysis. Instead, do the following:


  1. Save intermediate versions of the Seurat object as you go along using saveRDS(). This gives you “savepoints” so if you need to backtrack, you don’t have to go back to the beginning.
  2. Save ggplots into image files as you go along. Periodically clear ggplot objects from the R environment.
  3. At the end of each day, wherever you are in the analysis, explicitly save the Seurat object and “power down” your session.
  4. The next day, you can load the Seruat object using readRDS().

This simple pattern will make your code more reproducible and your RStudio session more reliable.


Day 1

You don’t need to have a fresh environment for these exercises. All variable names will be different from the lessons. Load the libraries (it’s okay if they’re already loaded) and the unfiltered Seurat object with the following code:

# =========================================================================
# Independent Exercise - Day 1 Startup
# =========================================================================

# After restarting our R session, load the required libraries & input data
library(Seurat)
library(BPCells)
library(tidyverse)

setwd('~/ISC_R')

# Use provided copy of integrated data
exso = readRDS('./inputs/prepared_data/rdata/geo_so_unfiltered.rds')

# Add percent.mt column to meta.data
exso$percent.mt = PercentageFeatureSet(exso, pattern = '^mt-')

## NOTE - BEFORE STOPPING WORK ON THE EXERCISES REMEMBER TO POWER DOWN AND RESTART R SESSION !!!!

Let’s practice filtering cells based on different criteria than we’re using in the lessons.

Day 1 Exercise 1

Subset exso so that all cells have nFeature_RNA greater than or equal to 1000. How many cells remain?

Hint 1: For a reminder of how to subset a Seurat object, see the “Removing low-quality cells” section of the Secondary QC and Filtering module. Hint 2: For a reminder, of how to count the number cells per sample, see the “Cell counts” section of the Secondary QC and Filtering module.

Day 1 Exercise 2

Subset exso so that all cells have nCount_RNA greater than or equal to 5000. How many cells remain?

Day 1 Exercise 3

Subset exso so that all cells have percent.mt less than 10%. How many cells remain?

Day 1 Exercise 4

Subset exso so that each of the previous three conditions are satisfied. How many cells remain?

Example Solutions - Day 1
# =========================================================================
# Independent Exercise - Day 1 Startup
# =========================================================================

# After restarting our R session, load the required libraries & input data
library(Seurat)
library(BPCells)
library(tidyverse)

setwd('~/ISC_R')

# Load the unfiltered version and give it a new variable name
exso = readRDS('./inputs/prepared_data/rdata/geo_so_unfiltered.rds')

# Add percent.mt column to meta.data
exso$percent.mt = PercentageFeatureSet(exso, pattern = '^mt-')

## NOTE - BEFORE STOPPING WORK ON THE EXERCISES REMEMBER TO POWER DOWN AND RESTART R SESSION !!!!

# Exercise 1
exso = subset(exso, nFeature_RNA >= 1000)

ex1 = exso@meta.data %>% count(orig.ident, name = 'postfilter_cells')
ex1

# Exercise 2
exso = subset(exso, nCount_RNA >= 1000)

ex2 = exso@meta.data %>% count(orig.ident, name = 'postfilter_cells')
ex2

# Exercise 3
exso = subset(exso, percent.mt < 10)

ex3 = exso@meta.data %>% count(orig.ident, name = 'postfilter_cells')
ex3

# Exercise 4
exso = subset(exso, nFeature_RNA >= 1000 & nCount_RNA >= 1000 & percent.mt < 10)

ex4 = exso@meta.data %>% count(orig.ident, name = 'postfilter_cells')
ex4



Day 2

Since we are working with larger object sizes, it’s best to start with a fresh session. All variable names will be different from the lessons. Load the libraries (it’s okay if they’re already loaded) and the integrated Seurat object with the following code:

# =========================================================================
# Independent Exercise - Day 2 Startup
# =========================================================================

# After restarting our R session, load the required libraries & input data
library(Seurat)
library(BPCells)
library(tidyverse)

setwd('~/ISC_R')

# Use provided copy of integrated data
exso2 = readRDS('inputs/prepared_data/rdata/geo_so_sct_integrated.rds')
exso2 # check that object loaded

## NOTE - BEFORE STOPPING WORK ON THE EXERCISES REMEMBER TO POWER DOWN AND RESTART R SESSION !!!!


Day 2 Exercise 1 - Clustering with reduced number of PCs

Test how the clustering results would change if you used fewer or more PCs when clustering these data.

# -------------------------------------------------------------------------
# Testing fewer PCs for clustering 

# look at elbow plot to check PCs and consider alternatives to the number selected for main materials
ElbowPlot(exso2, ndims = 50, reduction = 'unintegrated.sct.pca')

# select alternative value to try (choose a number <10 or >10 PCs)
pcs = # Your value here #

# generate nearest neighbor (graph), using selected number of PCs
exso2 = FindNeighbors(exso2, dims = 1:pcs, reduction = 'integrated.sct.rpca')


After generating clustering, set resolution parameter - remember this only impacts the how the boundaries of the neighbors are drawn, not the underlying NN graph/structure.

# -------------------------------------------------------------------------
# Testing resolution options to see impact

# start with one resolution
res = # Your value here #

# generate clusters, using `pcs` and `res` to make a custom cluster name that will be added to the metadata
exso2 = FindClusters(exso2, resolution = res, 
                     cluster.name = paste0('int.sct.rpca.clusters', res))

# look at meta.data to see cluster labels
head(exso2@meta.data)

# run UMAP reduction to prepare for plotting
exso2 = RunUMAP(exso2, dims = 1:pcs, 
                  reduction = 'integrated.sct.rpca', 
                  reduction.name = paste0('umap.integrated.sct.rpca_alt', res))

# check object to see if named reduction was added
exso2


Challenge 1: How could we generate clusters for multiple alternative resolutions (e.g. 0.4, 0.8, & 1.0)? Hint - for loops can be useful for iteration in many programming languages, including R.


Day 2 Exercise 2 - Plotting alternative clustering results

After generating our clustering results it’s time to visualize them - create a UMAP plot, ensuring that you generate a UMAP reduction for the appropriate reduction

Hint 1 - Remember the previous code chunk used paste0('umap.integrated.sct.rpca_alt', res)) for the cluster name parameter.

Hint 2 - For a reminder of how to generate a UMAP plot, look at the Visualizing and evaluating clustering from the Clustering and Projection module.


Challenge 2: Create UMAP plots for the alternative resolutions (e.g. 0.4, 0.8, & 1.0) generated in the previous challenge.


Check-in Questions - UMAP for alternative PCs

How does the UMAP plot look when fewer or more PCs are used? Do you notice any relationship between the PC parameter and resolution parameter?


Clean up session

Before closing out the window, make sure to clear environment and restart R session to manage the memory usage

## ----------------------------------------------------------
## Clean up session, including any plot objects
rm(list=names(which(unlist(eapply(.GlobalEnv, is.ggplot)))));
gc()

## Save copy of Seurat object in current state to file
saveRDS(exso2, file = paste0('results/rdata/geo_so_sct_integrated_exercise.rds'))
rm(exso2)
gc()



Example Solutions - Day 2
# =========================================================================
# Independent Exercise - Day 2 Startup
# =========================================================================

# After restarting our R session, load the required libraries & input data
library(Seurat)
library(BPCells)
library(tidyverse)

setwd('~/ISC_R')

# Use provided copy of integrated data
exso2 = readRDS('inputs/prepared_data/rdata/geo_so_sct_integrated.rds')
exso2 # check that object loaded


### Day 2 Exercise 1 - Clustering with reduced number of PCs
# -------------------------------------------------------------------------
# Testing fewer PCs for clustering 

# look at elbow plot to check PCs and consider alternatives to the number selected for main materials
ElbowPlot(exso2, ndims = 50, reduction = 'unintegrated.sct.pca')

# select alternative value to try (can choose a number <10 or >10 PCs)
pcs = 6

# generate nearest neighbor (graph), using selected number of PCs
exso2 = FindNeighbors(exso2, dims = 1:pcs, reduction = 'integrated.sct.rpca')

# -------------------------------------------------------------------------
# start with one resolution
res = 0.2

# generate clusters, using `pcs` and `res` to make a custom cluster name that will be added to the metadata
exso2 = FindClusters(exso2, resolution = res, 
                     cluster.name = paste0('int.sct.rpca.clusters', res))

# look at meta.data to see cluster labels
head(exso2@meta.data)

# run UMAP reduction to prepare for plotting
exso2 = RunUMAP(exso2, dims = 1:pcs, 
                reduction = 'integrated.sct.rpca', 
                reduction.name = paste0('umap.integrated.sct.rpca_alt', res))

# check object to see if named reduction was added
exso2


## Challenge 1 - solution option
# -------------------------------------------------------------------------
# use for loop to generate clustering with alternative resolutions
for(i in c(0.4, 0.8, 1.0)){
  exso2 = FindClusters(exso2, resolution = i, 
                       cluster.name = paste0('int.sct.rpca.clusters', i))
  
  # look at meta.data to see cluster labels
  head(exso2@meta.data)
  
  # run UMAP reduction to prepare for plotting
  exso2 = RunUMAP(exso2, dims = 1:pcs, 
                  reduction = 'integrated.sct.rpca', 
                  reduction.name = paste0('umap.integrated.sct.rpca_alt', i))
  
  # check object to see if multiple cluster resolutions are added
  head(exso2@meta.data)
}



### Day 2 Exercise 2 - Plotting alternative clustering results
# -------------------------------------------------------------------------
# plot clustering results
post_integration_umap_clusters_testing = 
  DimPlot(exso2, group.by = paste0('int.sct.rpca.clusters', res), label = TRUE, 
          reduction = paste0('umap.integrated.sct.rpca_alt', res)) + NoLegend()
post_integration_umap_clusters_testing # look at plot

# output to file, including the number of PCs and resolution used to generate results
ggsave(filename = paste0('results/figures/umap_int_sct_clusters_exercise_', 
                         pcs,'PC.',res,'res','.png'),
       plot = post_integration_umap_clusters_testing, 
       width = 8, height = 6, units = 'in')


## Challenge 2 - solution option
# -------------------------------------------------------------------------
# use for loop to visualize clustering across tested resolutions
post_integration_umap_plots <- c()
for(i in c(0.4, 0.8, 1.0)){
  res_type = paste0("res_", i)
  post_integration_umap_plots[[res_type]] = 
    DimPlot(exso2, group.by = paste0('int.sct.rpca.clusters', i), label = TRUE, 
            reduction = paste0('umap.integrated.sct.rpca_alt', i)) + NoLegend()
}

# look at plots for each resolution stored in list
post_integration_umap_plots
# remove plot list to clean up session 
rm(post_integration_umap_plots)


## ----------------------------------------------------------
## Clean up session, including any plot objects
rm(list=names(which(unlist(eapply(.GlobalEnv, is.ggplot)))));
gc()

## Save copy of Seurat object in current state to file
saveRDS(exso2, file = paste0('results/rdata/geo_so_sct_clustered_exercise.rds'))
rm(exso2)
gc()



Day 3

Since we are working with larger object sizes, it’s best to start with a fresh session. All variable names will be different from the lessons. Load the libraries (it’s okay if they’re already loaded) and a Seurat object with alternative clustering using the following code:

# =========================================================================
# Independent Exercise - Day 3 Startup
# =========================================================================

# After restarting our R session, load the required libraries
library(Seurat)
library(BPCells)
library(tidyverse)
library(scCATCH)

# Load in seurat object with alternative clustering results from yesterday's exercises
exso3 = readRDS('results/rdata/geo_so_sct_integrated_exercise.rds')
exso3 # check that object loaded

## NOTE - BEFORE STOPPING WORK ON THE EXERCISES REMEMBER TO POWER DOWN AND RESTART R SESSION !!!!


Before testing what marker genes and cell-type predictions are generated for an alternative clustering, start by ensuring the Seurat object is set up and has an expected identity set.

## ----------------------------------------------------------
# Check what identities are set
Idents(exso3) %>% head()

pcs = ## use same values as yesterday
res = ## use same values as yesterday

## Set identities to clustering for selected resolution
exso3 = SetIdent(exso3, value = paste0('int.sct.rpca.clusters', res))
Idents(exso3) %>% head()

Day 3 Exercise 1 - examine marker genes for alternative clusters

Using the approach from the marker identification module, run PrepSCTFindMarkers and generate markers for the alternative clustering results. Bonus to generate a table of the top 5 markers and outputing that to file.

Check-in Question - clustering impact on marker genes After you generate markers for the “fewer” pcs option, how do the results differ from the markers found for pcs=10 in the workshop session? What do you think would happen if you tried this with the “more” results?


Day 3 Exercise 2 - Generate predictions for alternative clustering

Using scCATCH like we did for our annotation process, generate predictions for the alternative clustering results.


Day 3 Exercise 3 - Add predictions to Seurat object

Skipping the refinement step, add the cell-type predictions to the Seurat object using the numeric labels as a key, similarly to the our annotation step.


Day 3 Exercise 4 - Plot UMAP with new cluster labels

Using a similar approach used to visualize the annotations in the workshop, generate a UMAP plot

Hint - use the same reduction name as yesterday’s exercise and make sure the group_by parameter matches the column name added to the metadata that stores the predicted cell-types.


Check-in Question

How did the number of pcs and/or resolution change the predictions? Do you think the predictions correspond better or worse to the cluster structure we see in the UMAP?


Clean up session

Before closing out the window, make sure to clear environment and restart R session to manage the memory usage

## ----------------------------------------------------------
## Clean up session 
rm(list=names(which(unlist(eapply(.GlobalEnv, is.ggplot))))); 
rm(catch_celltypes, catch_markers, geo2_catch, geo2_markers, new_metadata, top_5); 
gc()

## (Optional) - Save copy of exso3
saveRDS(exso3, file = paste0('results/rdata/geo_so_sct_integrated_with_markers_exercise.rds'))

## BEFORE CLOSING WINDOW - POWER DOWN AND RESTART R SESSION




Example Solutions - Day 3
# =========================================================================
# Independent Exercise - Day 3 Startup
# =========================================================================

# After restarting our R session, load the required libraries
library(Seurat)
library(BPCells)
library(tidyverse)
library(scCATCH)

# Load in seurat object with alternative clustering results from yesterday's exercises
exso3 = readRDS('results/rdata/geo_so_sct_integrated_exercise.rds')
exso3 # check that object loaded

## NOTE - BEFORE STOPPING WORK ON THE EXERCISES REMEMBER TO POWER DOWN AND RESTART R SESSION !!!!


### Day 3 Exercise 1 - examine marker genes and cell type predictions for alternative clustering
## ----------------------------------------------------------
# Check what identities are set
Idents(exso3) %>% head()

## use same values as previous exercise
pcs = 6
res = 0.2 

## Set identities to clustering for selected resolution
exso3 = SetIdent(exso3, value = paste0('int.sct.rpca.clusters', res))
Idents(exso3) %>% head()


## ----------------------------------------------------------
## Generate cluster markers to see how that changes with new parameters 
exso3 = PrepSCTFindMarkers(exso3, assay = "SCT") # NOTE - this step will take some time to run
exso3_markers = FindAllMarkers(exso3, only.pos = TRUE)
head(exso3)

# Create table of top 5 markers per cluster (using default ranking)
top_5 = exso3_markers %>% filter(p_val_adj < 0.01) %>% group_by(cluster) %>% slice_head(n = 5)
head(top_5, n = 10) # look at results
write_csv(top_5, file = paste0('results/tables/top5_marker_genes_exercise.csv'))


#### Day 3 Exercise 2 - Generate predictions for alternative clustering 
## ----------------------------------------------------
## Next - run scCATCH predictions for alternative clustering results
exso3_catch = createscCATCH(data = exso3@assays$SCT@counts, cluster = as.character(Idents(exso3)))
catch_markers = exso3_markers %>% rename('logfc' = 'avg_log2FC')
exso3_catch@markergene = exso3_markers
exso3_catch@marker = cellmatch[cellmatch$species == 'Mouse' & cellmatch$tissue %in% c('Blood', 'Peripheral Blood', 'Muscle', 'Skeletal muscle', 'Epidermis', 'Skin'), ]
exso3_catch = findcelltype(exso3_catch)

# Check predictions
exso3_catch@celltype %>% select(cluster, cell_type, celltype_score)


#### Day 3 Exercise 3 - Add predictions to Seurat object
## ------------------------------------------------------
## Use predictions to label clusters and UMAP plot
catch_celltypes = exso3_catch@celltype %>% select(cluster, cell_type)
colnames(catch_celltypes)[2] = paste0('cell_type.',pcs,'PC.',res,'res')
new_metadata = exso3@meta.data %>% 
  left_join(catch_celltypes, 
            by = c('seurat_clusters' = 'cluster')) # using `seurat_clusters`, which will store the most recently generated cluster labels for each cell
rownames(new_metadata) = rownames(exso3@meta.data) #  We are implicitly relying on the same row order!
exso3@meta.data = new_metadata # Replace the meta.data
head(exso3@meta.data)


#### Day 3 Exercise 4 - Plot UMAP with new cluster labels
## ------------------------------------------------------
catch_umap_plot = DimPlot(exso3, group.by = paste0('cell_type.',pcs,'PC.',res,'res'), 
                          label = TRUE, reduction = paste0('umap.integrated.sct.rpca_alt', res))
catch_umap_plot


# Save the plot to file
# output to file, including the number of PCs and resolution used to generate results
ggsave(filename = paste0('results/figures/umap_int_catch-labeled_', 
                         pcs,'PC.',res,'res','.png'),
       plot = catch_umap_plot, 
       width = 8, height = 6, units = 'in')


## ----------------------------------------------------------
## Clean up session 
rm(list=names(which(unlist(eapply(.GlobalEnv, is.ggplot))))); 
rm(catch_celltypes, catch_markers, exso3_catch, exso3_markers, new_metadata, top_5); 
gc()

## (Optional) - Save copy of exso3
saveRDS(exso3, file = paste0('results/rdata/geo_so_sct_integrated_with_markers_exercise.rds'))
rm(exso3)
gc()

## BEFORE PROCEEDING TO THE NEXT SECTION or closing window - POWER DOWN AND RESTART R SESSION