In this module, we will learn:
We'll discuss the alignment and gene quantification steps which gather the necessary data that we will need prior to testing for differential expression, the topic of Day 2.
Step | Task |
---|---|
1 | Experimental Design |
2 | Biological Samples / Library Preparation |
3 | Sequence Reads |
4 | Assess Quality of Raw Reads |
5 | Splice-aware Mapping to Genome |
6 | Count Reads Associated with Genes |
:--: | ---- |
7 | Organize project files locally |
8 | Initialize DESeq2 and fit DESeq2 model |
9 | Assess expression variance within treatment groups |
10 | Specify pairwise comparisons and test for differential expression |
11 | Generate summary figures for comparisons |
12 | Annotate differential expression result tables |
The FASTQ files of raw sequenced reads are untethered from any notion of where they came from in the genome, and which transcribed genes the sequence belongs to. The alignment and gene quantification steps fill in that gap and allow us to proceed with the question we are really interested in: Which genes are differentially expressed between groups of samples?
We will use RSEM (paper and GitHub) combined with STAR Aligner (paper and GitHub) to accomplish the task of read mapping and gene quantifcation simultaneously.
The "Spliced Transcripts Alignment to a Reference" (STAR) Aligner is aware of splice-sites of transcripts and is able to align reads that span them. The figure below illustrates the difference between splice-unaware aligners (e.g. Bowtie2) and splice-aware aligners (e.g. STAR).
Some benefits of splice-aware aligners include:
We should note that the default parameters for STAR are optimized for mammalian genomes.
RSEM (RNA-seq by Expectation Maximization) determines gene and isoform abundance using an expectation maximization (EM) algorithm to determine the probability that any particular read originated from a particular transcript. From there, gene-level quantification is reported by effectively collapsing the isoform quantifications over all isoforms belonging to the gene.
The primary issue that RSEM attempts to solve is that reads can align to multiple isoforms (when, for example, they share an exon), and that creates ambiguity in deciding which isoform a read gets assigned to for quantification. See the image below for an illustration of this problem.
RSEM can be run with just two commands: the first rsem-prepare-reference
(manual) builds an index for STAR and RSEM to use, and the second rsem-calculate-expression
(manual) does the alignment and gene quantification.
rsem-prepare-reference
A reference index is essentially a lookup table that speeds up the finding of sequence matches for alignment. In the case of a splice-aware aligner, the reference index is also aware of the various splice junctions at locations in the gene model, and a subset of reads will map across these. This allows us to infer isoform usage later on.
RSEM Prepare Reference Exercise:
The result of rsem-prepare-reference
is a folder containing files for RSEM and STAR to be able to look up genomic location and gene model information as quickly and efficiently as possible.
Ensure you're on the remote aws instance
View rsem-prepare-reference help page
rsem-prepare-reference --help
# Will need to type `q` to exit from help page
Command to index our example genome
rsem-prepare-reference --gtf ~/data/refs/GRCh38.98.chr22reduced.gtf --star --num-threads 1 ~/data/refs/GRCh38.98.chr22reduced.fa ~/data/refs/GRCh38.98.chr22reduced
Examine the output of rsem-prepare-reference
ls -l ~/data/refs
rsem-calculate-expression
After preparing the reference index, we can do alignment and quantification with the rsem-calculate-expression
(manual) command. For our inputs, we will be using our FASTQ reads and the path to the reference index we just created.
RSEM Calculate Expression Exercise:
View rsem-calculate-expression
help page
Command to align and quantify sample_01
# Need to create directory for rsem + STAR outputs
mkdir ~/analysis/rsem_star
SAMPLE=sample_01
rsem-calculate-expression --star --num-threads 1 --star-gzipped-read-file --star-output-genome-bam --keep-intermediate-files --keep-intermediate-files --paired-end ~/analysis/trimmed/${SAMPLE}_R1.trimmed.fastq.gz ~/analysis/trimmed/${SAMPLE}_R2.trimmed.fastq.gz ~/data/refs/GRCh38.98.chr22reduced ~/analysis/rsem_star/${SAMPLE}
Iterate through our samples to align them all
Note: We're re-using the same command. We can update $SAMPLE
, then press 'up' to re-run cutadapt command with newly defined variable.
SAMPLE=sample_02
rsem-calculate-expression --star --num-threads 1 --star-gzipped-read-file --star-output-genome-bam --keep-intermediate-files --paired-end ~/analysis/trimmed/${SAMPLE}_R1.trimmed.fastq.gz ~/analysis/trimmed/${SAMPLE}_R2.trimmed.fastq.gz ~/data/refs/GRCh38.98.chr22reduced ~/analysis/rsem_star/${SAMPLE}
SAMPLE=sample_03
rsem-calculate-expression --star --num-threads 1 --star-gzipped-read-file --star-output-genome-bam --keep-intermediate-files --paired-end ~/analysis/trimmed/${SAMPLE}_R1.trimmed.fastq.gz ~/analysis/trimmed/${SAMPLE}_R2.trimmed.fastq.gz ~/data/refs/GRCh38.98.chr22reduced ~/analysis/rsem_star/${SAMPLE}
SAMPLE=sample_04
rsem-calculate-expression --star --num-threads 1 --star-gzipped-read-file --star-output-genome-bam --keep-intermediate-files --paired-end ~/analysis/trimmed/${SAMPLE}_R1.trimmed.fastq.gz ~/analysis/trimmed/${SAMPLE}_R2.trimmed.fastq.gz ~/data/refs/GRCh38.98.chr22reduced ~/analysis/rsem_star/${SAMPLE}
RSEM+STAR, after completing above, outputs the following files for our sample:
File | Description |
---|---|
sample_N.genome.bam |
The alignments in genomic coordinates. Used for visualization in a genome browser such as IGV. |
sample_N.transcript.bam |
The alignments in transcriptomic coordinates. Not used for this workshop. |
sample_N.genes.results |
Gene-level results to be used in downstream DE analysis. |
sample_N.isoforms.results |
Isoform-level results. Not used for this workshop. |
The two results we will use most often from RSEM+STAR are the gene-level quantifications (sample_N.genes.results
) and the alignments in genome-coordinates (sample_N.genome.bam
). Each sample for which we run RSEM+STAR will have these output files named after the sample.
The sample_N.genome.bam
alignments file is a special, compressed, version of a SAM file (sequence alignment/map). In order to view it, we have to use a special program called samtools
.
If we were too peek inside of sample_N.genome.bam
, we would see:
$ samtools view ~/workshop_data/rsem_star/sample_N.genome.bam | head -2
NB551521:212:H5L73AFX2:1:11101:16446:1034 0 2 10022660 255 148M * 0 0 GANAGACAGATATCCTACAAAACACAGAAAGACTAATAAACTCTTATGTTGACTATGAAAGCTGTAAGAAACTTCCAGAAGAAATATTGAAAATGTAGAATAACTGAAGTGTGCTGTGTGTCCATAGCTGTTCTGCTGAGGAAACATT AA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE<EEEEAEAEEAEA<A<AAAAEEEEEA NH:i:1 HI:i:1 AS:i:145 NM:i:1 MD:Z:2A145
NB551521:212:H5L73AFX2:1:11101:16366:1035 0 X 48488697 255 146M * 0 0 TANGTACGCACACAAATTGATCCATACCTTTACTTCCTTTTTTTCCAGCTACTGAATAAGGGGACCTTTCTATTCCTTTGTGTCTCACCATTTTATTGTCTTTCAGAATCTTCACCTGGTCCATTCATTCCTCTACCCTCTCCTGT AA#EEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEE<EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEAEEEEEEEEEEEEEEEEEAEEEEEEEAAEEEE/<E<AA<<<AAAA<AEE NH:i:1 HI:i:1 AS:i:143 NM:i:1 MD:Z:2G143
The SAM format gives information about where each read maps to in the genome (one read per line), and has information about that mapping.
If we were to look at the top 3 lines of sample_N.genes.results
, we see it is a plain-text file separated by tabs where each row is a gene, and the columns are described the first row.
$ head -3 ~/workshop_data/rsem_star/sample_N.genes.results
gene_id transcript_id(s) length effective_length expected_count TPM FPKM
ENSMUSG00000000001 ENSMUST00000000001 3262.00 3116.28 601.00 45.50 36.70
ENSMUSG00000000003 ENSMUST00000000003,ENSMUST00000114041 799.50 653.78 0.00 0.00 0.00
Column | Description |
---|---|
gene_id | The ID from the gene model GTF. |
transcript_id(s) | The transcript IDs corresponding to the gene in the gene model GTF. |
length | The weighted average of its transcripts' lengths. |
effective_length | The weighted average, over its transcripts, of the mean number of positions from which a fragment may start within the sequence of transcript. |
expected_count | The sum, over all transcripts, of the estimated counts from the EM algorithm. |
TPM | Transcript per million, a relative measure of transcript abundance where the sum of all TPMs is 1 million. |
FPKM | Fragments per kilobase of transcript per million mapped reads. |
The genes.results
files for each sample can be directly imported into DESeq2 using the tximport
R Bioconductor package.
Alternatively, we can combine these results into a count matrix. The count matrix can be very useful, since it contains summary-level data in a widely supported format, this makes it great for sharing and as input into different analyses.
These materials have been adapted and extended from materials created by the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.