Cutadapt
Cutadapt is a very widely used read-trimming and FASTQ-processing tool that has been cited several thousand times. It’s written in Python, and is user-friendly and reasonably fast.
It is used for removing adapter sequences, primers, and poly-A tails, for trimming based on quality thresholds, and for filtering reads by characteristics such as length.
It can operate on both FASTA and FASTQ file formats, and it supports compressed or raw inputs and outputs.
Notably, cutadapt’s error-tolerant adapter trimming likely contributed greatly to its early popularity. We will use it to trim the adapters from our reads. As in the previous module, we’ll discuss the details of cutadapt’s functionality and input/output files before proceeding to an exercise where we can try running the software ourselves.
Cutadapt details
Cutadapt’s input and output files are simple to understand given its stated purpose. Both input and output are FASTQ files: the input is the set of FASTQ files that need processing, and the output is the same set of files after processing. Depending on the parameters chosen, outputs often have shorter read lengths due to trimming and fewer total reads due to filtering.
# Given the paired-end input files:
reads/sample_01_R1.fastq.gz
reads/sample_01_R2.fastq.gz
# Suitable output filename/paths:
out_trimmed/sample_01_R1.trimmed.fastq.gz
out_trimmed/sample_01_R2.trimmed.fastq.gz
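One way to see the effect of trimming and filtering is to compare read counts and mean read lengths before and after processing. The sketch below is a minimal illustration, not part of cutadapt: `fastq_stats` is a hypothetical helper name, it assumes standard four-line FASTQ records, and it builds a tiny demo file so it can run anywhere.

```shell
# fastq_stats is a hypothetical helper, not a cutadapt feature.
# Assumes each FASTQ record is exactly 4 lines, with the sequence on line 2.
fastq_stats() {
    gzip -cd "$1" | awk 'NR % 4 == 2 { n++; total += length($0) }
        END {
            if (n > 0) printf "reads=%d mean_length=%.1f\n", n, total / n
            else print "reads=0"
        }'
}

# Build a two-read demo file (read lengths 6 and 4) in a temporary directory
demo_dir=$(mktemp -d)
printf '@read1\nACGTAC\n+\nIIIIII\n@read2\nTTGA\n+\nIIII\n' | gzip > "$demo_dir/demo.fastq.gz"

fastq_stats "$demo_dir/demo.fastq.gz"
# prints: reads=2 mean_length=5.0
```

In the workshop setting, the same function could be pointed at an input file and its trimmed counterpart, for example `reads/sample_01_R1.fastq.gz` and `out_trimmed/sample_01_R1.trimmed.fastq.gz`. Adapter trimming alone would be expected to lower the mean length while leaving the read count unchanged, unless a filtering option such as `-m` (minimum length) is also given.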
As mentioned above, cutadapt has many capabilities. Depending on the parameters given, we can invoke different functionalities. Given our results from the previous QC module, we know that we need to trim adapters from the reads in our fastq files.
Cutadapt Exercise
- Create a directory for our trimmed reads
- View the help page of the cutadapt tool
- Construct a cutadapt command to trim the adapters from paired-end reads
- View the output of cutadapt, and verify that it’s correct
# Create a directory for the trimmed reads
mkdir out_trimmed
# View the help page of Cutadapt
cutadapt --help
# Construct a cutadapt command to trim adapters from paired-end reads
cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o out_trimmed/sample_01_R1.trimmed.fastq.gz -p out_trimmed/sample_01_R2.trimmed.fastq.gz ../data/reads/sample_01_R1.fastq.gz ../data/reads/sample_01_R2.fastq.gz
# View the output of cutadapt (verify presence of output files and peek into the files)
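The last step of the exercise, verifying the output, could be sketched as follows. This is one possible approach under stated assumptions: `peek_fastq` is a hypothetical helper name, and the file paths are the output names used in the cutadapt command above.

```shell
# peek_fastq is a hypothetical helper: print the first N records (4 lines each)
# of a gzipped FASTQ file. N defaults to 1.
peek_fastq() {
    local file=$1 n_records=${2:-1}
    gzip -cd "$file" | head -n $(( n_records * 4 ))
}

# Check that each expected output exists and is non-empty, then peek inside
for fq in out_trimmed/sample_01_R1.trimmed.fastq.gz \
          out_trimmed/sample_01_R2.trimmed.fastq.gz; do
    if [ -s "$fq" ]; then
        echo "Found $fq"
        peek_fastq "$fq" 1
    else
        echo "Missing or empty: $fq"
    fi
done
```

Each record shown should still have four lines, with the sequence and quality lines the same length as each other.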
At this point, we’ve run cutadapt on one of our samples. We could construct a series of similar commands by altering the sample names. However, there’s an easier way. For this, we’ll use a bash variable.
Bash Variable Exercise
- Use a bash variable to echo “Hello, World!”
- Use a bash variable to echo “Hello, <Your Name>!”
noun=World
echo "Hello, $noun!"
noun=Travis
echo "Hello, $noun!"
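When a variable is placed directly in front of other characters, as it will be inside a filename, bash needs curly braces to know where the variable name ends. The short sketch below illustrates this; the variable names `sample`, `without_braces`, and `with_braces` are chosen only for this example.

```shell
sample=sample_01

# Without braces, bash looks for a variable named "sample_R1",
# which is unset, so it expands to nothing
without_braces="reads/$sample_R1.fastq.gz"
echo "$without_braces"
# prints: reads/.fastq.gz

# With braces, the variable name ends where we intend it to
with_braces="reads/${sample}_R1.fastq.gz"
echo "$with_braces"
# prints: reads/sample_01_R1.fastq.gz
```

In the “Hello” example the braces are not needed, because `!` cannot be part of a variable name.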
Now that we’ve learned the basics of bash variables and of running Cutadapt, let’s try an exercise.
Cutadapt All Samples Exercise (Breakout)
- View the help page, and construct a cutadapt command with a bash variable in it
- Use variable reassignment along with our command to trim all samples
Running cutadapt on all samples using a bash variable
An example of using a bash variable to run cutadapt on all of our samples is available here: https://gist.github.com/twsaari/aaa43ae3ad45ad4cb2f28f2268e71148
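A minimal sketch of the idea, assuming the samples follow the `sample_01`, `sample_02`, ... naming pattern (only `sample_01` appears in this lesson, so the second name is an assumption). The leading `echo` makes this a dry run that prints the command rather than running it.

```shell
# Assumption: sample names follow the pattern sample_01, sample_02, ...
sample=sample_01

# The leading "echo" only prints the command so it can be inspected;
# delete "echo" to actually run cutadapt
echo cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG \
    -o "out_trimmed/${sample}_R1.trimmed.fastq.gz" \
    -p "out_trimmed/${sample}_R2.trimmed.fastq.gz" \
    "../data/reads/${sample}_R1.fastq.gz" \
    "../data/reads/${sample}_R2.fastq.gz"

# Reassign the variable, then recall the same cutadapt command
# (for example with the up-arrow key) to process the next sample
sample=sample_02
```

Because only the variable changes between runs, the cutadapt command itself never needs to be retyped or edited.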
Now that we’ve run cutadapt and trimmed the adapters from our reads, we will quickly re-run FastQC on the trimmed FASTQ files. This will confirm that we’ve successfully trimmed the adapters and that our FASTQ files are ready for the next step of the workflow. Since we discussed FastQC’s input/output and functionality in the previous module, we’ll go straight to an exercise re-running FastQC on the trimmed read data.
Re-running FastQC Exercise:
- Construct and execute a FastQC command to evaluate the trimmed read FASTQ files
- View the output (filenames)
# We'll have to create an output directory first
mkdir out_fastqc_trimmed
# Construct the fastqc command
fastqc -o out_fastqc_trimmed out_trimmed/*.fastq.gz
# Execute the command
# Then verify that the output files are present
ls -l out_fastqc_trimmed
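Beyond listing the directory, one could check that every trimmed FASTQ file has a matching report. The sketch below assumes FastQC names each report `<name>_fastqc.html`, where `<name>` is the input filename without `.fastq.gz`, which matches the report filename used in this lesson. `expected_report` is a hypothetical helper name.

```shell
# expected_report is a hypothetical helper: map a trimmed FASTQ path to the
# HTML report we expect FastQC to have written for it
expected_report() {
    local name
    name=$(basename "$1" .fastq.gz)
    echo "out_fastqc_trimmed/${name}_fastqc.html"
}

for fq in out_trimmed/*.fastq.gz; do
    [ -e "$fq" ] || continue   # skip the literal pattern when nothing matches
    report=$(expected_report "$fq")
    if [ -f "$report" ]; then
        echo "OK       $report"
    else
        echo "MISSING  $report"
    fi
done
```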
Optional exercise - Transfer a FastQC report to personal computer
Make sure you’re running scp on your local computer, requesting a file from the remote computer we were just using.
The scp command format, using the address of the AWS remote:
# Usage: scp [source] [destination]
scp <username>@bfx-workshop01.med.umich.edu:~/analysis/out_fastqc_trimmed/sample_01_R1.trimmed_fastqc.html ~/rsd-workshop/
Opening the HTML report, we see it is organized into the same modules as the report from the previous module. The report confirms that the adapters have been trimmed from our reads.
These materials have been adapted and extended from materials created by the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.