In this module we will learn:
As a reminder, our overall differential expression workflow is shown below. In this lesson, we will go over the bold part of the workflow.
Step | Task |
---|---|
1 | Experimental Design |
2 | Biological Samples / Library Preparation |
3 | Sequence Reads |
4 | Assess Quality of Reads |
5 | Splice-aware Mapping to Genome |
6 | Count Reads Associated with Genes |
:--: | ---- |
7 | Organize project files locally |
8 | Initialize DESeq2 and fit DESeq2 model |
9 | Assess expression variance within treatment groups |
10 | Specify pairwise comparisons and test for differential expression |
11 | Generate summary figures for comparisons |
12 | Annotate differential expression result tables |
Cutadapt is a very widely used read trimming and fastq processing software, cited several thousands of times. It's written in python, and is user-friendly and reasonably fast.
It is used for removing adapter sequences, primers, and poly-A tails, for trimming based on quality thresholds, for filtering reads based on characteristics, etc.
It can operate on both FASTA and FASTQ file formats, and it supports compressed or raw inputs and outputs.
Notably, cutadapt's error-tolerant adapter trimming likely contributed greatly to its early popularity. We will use it to trim the adapters from our reads. As usual, we'll view the help page to get a sense for how to structure our command.
Cutadapt Exercise:
Log back in to aws instance with ssh <username>@50.17.210.255
View cutadapt help page
cutadapt --help | less
# Will need to type `q` to exit from `less`
Trim the adapters from sample_01 with cutadapt
# Need to create directory for trimmed outputs
mkdir ~/analysis/trimmed
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ~/analysis/trimmed/sample_01.trimmed.fastq.gz -p ~/analysis/trimmed/sample_01_R2.trimmed.fastq.gz ~/data/reads/sample_01_R1.fastq.gz ~/data/reads/sample_01_R2.fastq.gz
View cutadapt output
ls -l ~/analysis/trimmed
Construct commands to trim the reads for all of our samples
Note: We're re-using the same command. We can update $SAMPLE
, then press 'up' to re-run cutadapt command with newly defined variable.
SAMPLE=sample_02
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ~/analysis/trimmed/${SAMPLE}_R1.trimmed.fastq.gz -p ~/analysis/trimmed/${SAMPLE}_R2.trimmed.fastq.gz ~/data/reads/${SAMPLE}_R1.fastq.gz ~/data/reads/${SAMPLE}_R2.fastq.gz
SAMPLE=sample_03
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ~/analysis/trimmed/${SAMPLE}_R1.trimmed.fastq.gz -p ~/analysis/trimmed/${SAMPLE}_R2.trimmed.fastq.gz ~/data/reads/${SAMPLE}_R1.fastq.gz ~/data/reads/${SAMPLE}_R2.fastq.gz
SAMPLE=sample_04
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ~/analysis/trimmed/${SAMPLE}_R1.trimmed.fastq.gz -p ~/analysis/trimmed/${SAMPLE}_R2.trimmed.fastq.gz ~/data/reads/${SAMPLE}_R1.fastq.gz ~/data/reads/${SAMPLE}_R2.fastq.gz
Now that we've run cutadapt and trimmed the adapters from our reads, we will quickly re-run FastQC on these trimmed read FASTQs. This will confirm that we've successfully trimmed the adapters, and we'll see that our FASTQ files are ready for sequencing.
Re-running FastQC Exercise:
Create directory for new fastqc results
mkdir ~/analysis/fastqc_trimmed
FastQC command to evaluate trimmed FASTQ files
fastqc -o ~/analysis/fastqc_trimmed ~/analysis/trimmed/*.fastq.gz
FastQC is an excellent tool, but it can be tedious to look at the report for each sample separately, while keeping track of what trends emerge. It would be much easier to look at all the FastQC reports compiled into a single report. MultiQC is a tool that does exactly this.
MultiQC is designed to interpret and aggregate reports from various tools and output a single report as an HTML document.
MultiQC Exercise:
View MultiQC help page
multiqc --help | less
# Will need to type `q` to exit from `less`
MultiQC command to process our trimmed read results
multiqc --outdir ~/analysis/multiqc ~/analysis/fastqc_trimmed/
Log out of aws instance and use scp to transfer MultiQC report to local computer
exit # log out from remote
# Now on local
scp <username>@50.17.210.255:~/analysis/multiqc/multiqc_report.html ~/workshop_rsd/
View MultiQC report Use GUI file manager to find your ~/workshop_rsd folder Double-click multiqc_report.html (open it with an internet browser)
Opening the HTML report, we see it is organized by the same modules and each plot has all samples for which FastQC was run. We can see the report confirms that the adapters have been trimmed from our sequence.
These materials have been adapted and extended from materials created by the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.