In this Quality Control (QC) module, we will learn:
An overview of the steps of entire differential expression experiment and analysis workflow is shown below. In this lesson, we will go over the bold part of the workflow.
Step | Task |
---|---|
1 | Experimental Design |
2 | Biological Samples / Library Preparation |
3 | Sequence Reads |
4 | Assess Quality of Reads |
5 | Splice-aware Mapping to Genome |
6 | Count Reads Associated with Genes |
:--: | ---- |
7 | Organize project files locally |
8 | Initialize DESeq2 and fit DESeq2 model |
9 | Assess expression variance within treatment groups |
10 | Specify pairwise comparisons and test for differential expression |
11 | Generate summary figures for comparisons |
12 | Annotate differential expression result tables |
Analysis of RNA-seq data requires the use of many types of data. The three most common that we will encounter are:
We will introduce each data type and their formats in more detail as they enter the workflow, but for now some general notes about the formats we will encounter:
Raw sequenced reads are stored in FASTQ format. This is a plain text, human readable file format that stores information about each read. For a single read there are four lines:
@A00437:266:HVFMYDSXX:1:1101:1181:1000 1:N:0:AGCCTATC+TGCGTAAC
TTTATTGTTGATGGTTATTTTTTGTTTATGGTTATTTTGTGTTTATGGTTATTTTTTGTTTATGGTTATTTTTTA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
Line | Description |
---|---|
1 | Begins with @ and has information about the read. From AGC will contain flowcell and barcode information. |
2 | The DNA sequence. |
3 | Always begins with + and may contain the same information as line 1. |
4 | A string of characters representing the quality scores. Always the same number of characters as line 2. |
The fourth line encodes the quality of each nucleotide in the read. The most common encoding is referred to as PHRED-33, and the following maps characters to qualities:
Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
| | | | |
Quality score: 0........10........20........30........40
The above example is a rather boring read with all bases having F
, meaning a score of 37, which is good. Each quality score represents the probability that the corresponding nucleotide call is incorrect. The quality score is logarithmic and is calculated as:
Q = -10 x log10(P), where P is the probability that the base call is erroneous
Another way to interpret the scores is in terms of the following table:
PHRED Quality Score | Probability of incorrect base call | Base call accuracy |
---|---|---|
10 | 1 in 10 | 90% |
20 | 1 in 100 | 99% |
30 | 1 in 1000 | 99.9% |
40 | 1 in 10000 | 99.99% |
It would clearly be painstaking to go through all the base quality scores of all the millions of reads obtained from a sequencing run, so we typically don't evaluate the FASTQ files by sight. Instead, we use tools that take in FASTQ files and summarize the base quality scores (and other characteristics) to determine the overall quality of the reads from the sequencer.
The first step in many high-throughput sequencing analyses is to examine the quality of the raw reads with FastQC. It analyzes various aspects of each FASTQ file and provides a quick impression of the overall quality metrics which are useful to determine before proceeding with downstream analysis. The following modules are present in an easy to read HTML report.
Let's try running FastQC on our FASTQ input files.
We can see some HTML reports, but there isn't a way to open them on the remote machine in the command-line interface (CLI). This is a good opportunity to learn how to pull files from a remote server onto our local computer that we're using.
Ensure you're on the remote aws instance
View FastQC help file
fastqc --help | less
# Will need to type `q` to exit from `less`
Prepare to run FastQC
# Make the output directory before running
mkdir ~/analysis/fastqc
fastqc -o ~/analysis/fastqc ~/data/reads/*.fastq.gz
Watch FastqQC output during processing
View output of FastQC
ls -l analysis/fastqc/
Transferring Files Exercise:
Ensure you're on your local computer
# If currently on remote:
exit
View scp manual page
man scp
# Will need to type `q` to exit the manual
Use scp to transfer a FastQC report to local computer
scp <username>@50.17.210.255:~/analysis/fastqc/sample_01_R1_fastqc.html ~/workshop_rsd
# you will need to enter your password to transfer the file
Ensure file is on local computer
ls ~/workshop_rsd/
View the FastQC report Use GUI file manager to find your ~/workshop_rsd folder Double-click sample_01_R1_fastqc.html (open it with an internet browser)
FastQC reports are organized around the the different modules characterizing different aspects of the reads. The FastQC help page has detailed descriptions for each module.
The report summary is a PASS, WARN, FAIL overview of the modules. It's important to not take them too seriously in all cases as there are a number of assumptions that go into the thresholds that may not apply in your particular experiment.
An important note: For computational reasons, FastQC only evaluates the first 100,000 reads in a FASTQ file. The assumption is that the first 100,000 reads sufficiently represent the full library.
A per base boxplot of the quality scores with helpful coloring for good, fair, and poor qualities. If a sample has quality scores that degrade at the 3' end of a read (as in the second example below), the sample doesn't necessarily have to be dropped. Some aligners, e.g. STAR, will automatically trim poor quality read ends.
An example of a good quality run.
An example of a poor quality run.
High sequence duplication, indicated by a skew to the right of the blue curve in the plots below, can be caused by low complexity libraries, too many cycles of PCR amplification, or too little starting material.
On the other hand, if the library was prepared with an enrichment step (e.g. ChIP-seq), then sequence duplication might not be unexpected.
An example of a sample with low sequence duplication levels.
An example of a sample with high sequence duplication levels.
Overrepresented sequences are those making up more than 0.1% of the total (of the 100,000 reads sampled). Finding an overrepresented sequence is an indication of something biologically meaningful (e.g. an enrichment step in the library prep process), or it is an issue stemming from the library preparation process (e.g. adapters, a failure to fully remove polyA tails, or contamination).
We see that there is a high number of reads with polyA tails. This could be a result of an inefficient removal of polyA tails.
Adapter sequence typically is not present in standard RNA-seq libraries. However, if the fragment sizes resulting from library prep are smaller than the read length on the sequencer, then it is possible to get "read-through" of the adapter.
This can be dealt with by adding a step to the workflow that trims the reads of adapter content with a tool such as cutadapt. We will see this in a later module.
Good QC on raw reads means that RNA quality, library preparation, and sequencing seem fine to proceed with the next steps of analysis. It does not mean that you will find differentially expressed genes downstream.
The source of bad QC measures are not always easy to determine, and doesn't necessarily mean the data cannot be used downstream. However, bad QC measures may indicate that hazards lie ahead, and that certain steps upstream of sequencing (sample handling, RNA extraction, library preparation) might need to be altered.
These materials have been adapted and extended from materials created by the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.