2021-04-26-umich-rnaseqDemystified

More QC - Cutadapt and MultiQC

In this module we will learn:

Differential Expression Workflow

As a reminder, our overall differential expression workflow is shown below. In this lesson, we will go over the bold part of the workflow.

Step Task
1 Experimental Design
2 Biological Samples / Library Preparation
3 Sequence Reads
4 Assess Quality of Reads
5 Splice-aware Mapping to Genome
6 Count Reads Associated with Genes
:--: ----
7 Organize project files locally
8 Initialize DESeq2 and fit DESeq2 model
9 Assess expression variance within treatment groups
10 Specify pairwise comparisons and test for differential expression
11 Generate summary figures for comparisons
12 Annotate differential expression result tables

Cutadapt

Cutadapt is a very widely used read trimming and fastq processing software, cited several thousands of times. It's written in python, and is user-friendly and reasonably fast.

It is used for removing adapter sequences, primers, and poly-A tails, for trimming based on quality thresholds, for filtering reads based on characteristics, etc.

It can operate on both FASTA and FASTQ file formats, and it supports compressed or raw inputs and outputs.

Notably, cutadapt's error-tolerant adapter trimming likely contributed greatly to its early popularity. We will use it to trim the adapters from our reads. As usual, we'll view the help page to get a sense for how to structure our command.





Cutadapt Exercise:

  1. View the help page of the cutadapt tool
  2. Construct a cutadapt command to trim the adapters from paired-end reads
  3. View the output of cutadapt, and verify that it's correct
  4. Construct commands to trim the reads for all of our samples
Click here for solution - cutadapt exercise
  1. Log back in to aws instance with ssh <username>@50.17.210.255

  2. View cutadapt help page

     cutadapt --help | less
     # Will need to type `q` to exit from `less`
    
  3. Trim the adapters from sample_01 with cutadapt

     # Need to create directory for trimmed outputs
     mkdir ~/analysis/trimmed
    
     cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ~/analysis/trimmed/sample_01.trimmed.fastq.gz -p ~/analysis/trimmed/sample_01_R2.trimmed.fastq.gz ~/data/reads/sample_01_R1.fastq.gz ~/data/reads/sample_01_R2.fastq.gz
    
  4. View cutadapt output

     ls -l ~/analysis/trimmed
    
  5. Construct commands to trim the reads for all of our samples

Note: We're re-using the same command. We can update $SAMPLE, then press 'up' to re-run cutadapt command with newly defined variable.

    SAMPLE=sample_02
    cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ~/analysis/trimmed/${SAMPLE}_R1.trimmed.fastq.gz -p ~/analysis/trimmed/${SAMPLE}_R2.trimmed.fastq.gz ~/data/reads/${SAMPLE}_R1.fastq.gz ~/data/reads/${SAMPLE}_R2.fastq.gz

    SAMPLE=sample_03
    cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ~/analysis/trimmed/${SAMPLE}_R1.trimmed.fastq.gz -p ~/analysis/trimmed/${SAMPLE}_R2.trimmed.fastq.gz ~/data/reads/${SAMPLE}_R1.fastq.gz ~/data/reads/${SAMPLE}_R2.fastq.gz

    SAMPLE=sample_04
    cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ~/analysis/trimmed/${SAMPLE}_R1.trimmed.fastq.gz -p ~/analysis/trimmed/${SAMPLE}_R2.trimmed.fastq.gz ~/data/reads/${SAMPLE}_R1.fastq.gz ~/data/reads/${SAMPLE}_R2.fastq.gz




Re-running FastQC

Now that we've run cutadapt and trimmed the adapters from our reads, we will quickly re-run FastQC on these trimmed read FASTQs. This will confirm that we've successfully trimmed the adapters, and we'll see that our FASTQ files are ready for sequencing.

Re-running FastQC Exercise:

  1. Create directory for new fastqc results
  2. Construct and execute FastQC command to evaluate trimmed read FASTQ files
  3. View the output (filenames)
Click here for solution - re-running FastQC exercise
  1. Create directory for new fastqc results

     mkdir ~/analysis/fastqc_trimmed
    
  2. FastQC command to evaluate trimmed FASTQ files

     fastqc -o ~/analysis/fastqc_trimmed ~/analysis/trimmed/*.fastq.gz
    




MultiQC

FastQC is an excellent tool, but it can be tedious to look at the report for each sample separately, while keeping track of what trends emerge. It would be much easier to look at all the FastQC reports compiled into a single report. MultiQC is a tool that does exactly this.

MultiQC is designed to interpret and aggregate reports from various tools and output a single report as an HTML document.





MultiQC Exercise:

  1. View the multiQC help page
  2. Construct a MultiQC command to aggregate our QC results into a single report
  3. Transfer the MultiQC report to personal computer using scp
  4. View the MultiQC report
MultiQC solution
  1. View MultiQC help page

     multiqc --help | less
     # Will need to type `q` to exit from `less`
    
  2. MultiQC command to process our trimmed read results

     multiqc --outdir ~/analysis/multiqc ~/analysis/fastqc_trimmed/
    
  3. Log out of aws instance and use scp to transfer MultiQC report to local computer

     exit # log out from remote
    
     # Now on local
     scp <username>@50.17.210.255:~/analysis/multiqc/multiqc_report.html ~/workshop_rsd/
    
  4. View MultiQC report Use GUI file manager to find your ~/workshop_rsd folder Double-click multiqc_report.html (open it with an internet browser)

Opening the HTML report, we see it is organized by the same modules and each plot has all samples for which FastQC was run. We can see the report confirms that the adapters have been trimmed from our sequence.


These materials have been adapted and extended from materials created by the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.