Cutadapt
Cutadapt is
a very widely used read trimming and FASTQ processing tool, cited
several thousand times. It’s written in Python, and is user-friendly
and reasonably fast.
It is used for removing adapter sequences, primers, and poly-A tails,
for trimming based on quality thresholds, for filtering reads based on
characteristics, etc.
It can operate on both FASTA and FASTQ file formats, and it supports
compressed or raw inputs and outputs.
Notably, cutadapt’s error-tolerant adapter trimming likely
contributed greatly to its early popularity. We will use it to quality
trim our reads. Similar to earlier, we’ll discuss the details of
cutadapt’s functionality and input/output files, before proceeding to an
exercise where we can try running the software ourselves.
Cutadapt details
Cutadapt’s input and output files are simple to understand given its
stated purpose. Both input and output are FASTQ files: the input is
the FASTQ files that need processing, and the output is the same files
after they’ve been processed. Depending on the parameters chosen,
outputs often have shorter reads due to trimming and fewer total reads
due to filtering.
# Given the single-end input file:
reads/sample_A_R1.fastq.gz
# Suitable output filename/paths:
out_trimmed/sample_A_R1.trimmed.fastq.gz
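One way to see the effect of trimming and filtering is to compare read counts and read lengths before and after processing. The following is a minimal sketch and not part of cutadapt itself: `fastq_stats` is a hypothetical helper, the demo file is made up so that the block runs on its own, and the helper assumes standard four-line FASTQ records.

```shell
# Hypothetical helper: print the read count and mean read length of a gzipped FASTQ file.
# Assumes four lines per record, with the sequence on the second line of each record.
fastq_stats() {
    gunzip -c "$1" | awk '
        NR % 4 == 2 { n++; total += length($0) }
        END {
            if (n > 0) printf "%d reads, mean length %.1f\n", n, total / n
            else print "0 reads"
        }'
}

# Tiny made-up FASTQ file (two reads of 8 and 4 bases) so the sketch is self-contained
demo_dir=$(mktemp -d)
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGT\n+\nIIII\n' | gzip > "$demo_dir/demo.fastq.gz"

fastq_stats "$demo_dir/demo.fastq.gz"
# prints: 2 reads, mean length 6.0
```

Calling `fastq_stats` on an input file and then on its trimmed counterpart gives a quick before-and-after comparison.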
As mentioned above, cutadapt has many capabilities, and the parameters
we give determine which of them are used. Given our results from the
previous QC module, we know that we need to quality trim the reads in
our FASTQ files.
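To illustrate how different parameters select different functionality, here is a small sketch that builds three cutadapt commands and prints them without running them. The flags `-q`, `-m`, `-a`, and `-o` are standard cutadapt options. The adapter sequence is only an illustrative assumption, and the file paths follow the naming example in this lesson.

```shell
# File paths follow the naming example used in this lesson
input=reads/sample_A_R1.fastq.gz
output=out_trimmed/sample_A_R1.trimmed.fastq.gz

# Illustrative adapter sequence (an assumption; use the adapter for your own library prep)
adapter=AGATCGGAAGAGC

# -q: trim low-quality bases from the 3' end using this quality cutoff
quality_cmd="cutadapt -q 30 -o $output $input"
# -m: discard reads that are shorter than this length after trimming
length_cmd="cutadapt -m 20 -o $output $input"
# -a: remove a 3' adapter sequence
adapter_cmd="cutadapt -a $adapter -o $output $input"

# Print the commands rather than running them
printf '%s\n' "$quality_cmd" "$length_cmd" "$adapter_cmd"
```

These options can also be combined in a single command, so that trimming and filtering happen in one pass over the reads.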
Cutadapt Exercise
- Create a directory for our trimmed reads
- View the help page of the cutadapt tool
- Construct a cutadapt command to trim the reads in
sample_A_R1.fastq.gz
- View the output of cutadapt, and verify that it’s correct
# Create a directory for the trimmed reads
mkdir out_trimmed
# View the help page of Cutadapt
cutadapt --help
# Construct a cutadapt command to trim our reads
cutadapt -q 30 -m 20 -o out_trimmed/sample_A_R1.trimmed.fastq.gz ../reads/sample_A_R1.fastq.gz
# View the output of cutadapt (verify the presence of the output file and peek into it)
ls -l out_trimmed
gunzip -c out_trimmed/sample_A_R1.trimmed.fastq.gz | head -n 8
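In the cutadapt command above, `-q 30` trims low-quality bases from the 3' end of each read, and `-m 20` discards reads that are shorter than 20 bases after trimming. One simple way to check that the output is consistent with `-m 20` is to look for the shortest remaining read. This is a minimal sketch under stated assumptions: `shortest_read` is a hypothetical helper, the demo file is made up so that the block runs on its own, and four-line FASTQ records are assumed.

```shell
# Hypothetical helper: print the length of the shortest read in a gzipped FASTQ file
shortest_read() {
    gunzip -c "$1" | awk '
        NR % 4 == 2 { len = length($0); if (min == "" || len < min) min = len }
        END { print min }'
}

# Made-up stand-in for a trimmed file: two reads of 25 and 20 bases
demo_dir=$(mktemp -d)
{
    printf '@read1\nACGTACGTACGTACGTACGTACGTA\n+\nIIIIIIIIIIIIIIIIIIIIIIIII\n'
    printf '@read2\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n'
} | gzip > "$demo_dir/trimmed_demo.fastq.gz"

# Minimum length that was passed to cutadapt with -m
min_len=20
shortest=$(shortest_read "$demo_dir/trimmed_demo.fastq.gz")

if [ "$shortest" -ge "$min_len" ]; then
    echo "OK: shortest read is $shortest bases"
else
    echo "Unexpected: found a read of only $shortest bases"
fi
```

To apply the same check to real data, replace the demo path with the trimmed file produced by cutadapt.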
At this point, we’ve run cutadapt on one of our samples. We could
construct a series of similar commands by altering the sample names.
However, there’s an easier way. For this, we’ll use a bash variable.
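As a rough sketch of the idea, a variable can hold the sample name so that the rest of the command stays the same for every sample. The example below wraps this in a loop and only prints the commands. The sample names and the `print_trim_commands` function are hypothetical, and the linked exercise may take a different approach.

```shell
# Hypothetical function: print one cutadapt command per sample name given as an argument
print_trim_commands() {
    for sample in "$@"; do
        input="../reads/${sample}_R1.fastq.gz"
        output="out_trimmed/${sample}_R1.trimmed.fastq.gz"
        # "echo" prints the command instead of running it; remove it to trim for real
        echo cutadapt -q 30 -m 20 -o "$output" "$input"
    done
}

# Hypothetical sample names; substitute the samples in your own reads directory
print_trim_commands sample_A sample_B
```

Because only the value of `sample` changes between iterations, adding another sample means adding one more name to the list.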
Cutadapt All Samples Exercise
Before starting our cutadapt exercise, we should make sure that we
are on the same page. Follow the link below:
Link to Cutadapt exercise
Now that we’ve run cutadapt and trimmed our reads, we will quickly
re-run FastQC on the trimmed FASTQ files. This will confirm that we’ve
successfully trimmed the low quality sequence and that our FASTQ files
are ready for the next step of the analysis. Since we discussed FastQC’s
input/output and functionality in the previous module, we’ll go straight
to an exercise re-running FastQC on the trimmed read data.
Re-running FastQC Exercise:
- Construct and execute a FastQC command to evaluate the trimmed FASTQ
files
- View the output (filenames)
# We'll have to create an output directory first
mkdir out_fastqc_trimmed
# Construct and execute the fastqc command
fastqc -o out_fastqc_trimmed out_trimmed/*.fastq.gz
# Then verify that the output files are present
ls -l out_fastqc_trimmed
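FastQC names its outputs after the input file, replacing the `.fastq.gz` extension with `_fastqc.html` and `_fastqc.zip`. With many samples, a scripted check can be easier than reading a directory listing. The sketch below is one possible approach: `missing_reports` is a hypothetical helper, and the demo uses empty placeholder files rather than real data.

```shell
# Hypothetical helper: list trimmed FASTQ files that have no FastQC HTML report
missing_reports() {
    trimmed_dir=$1
    report_dir=$2
    for fq in "$trimmed_dir"/*.fastq.gz; do
        [ -e "$fq" ] || continue   # skip the unexpanded pattern when no files match
        name=$(basename "$fq" .fastq.gz)
        [ -f "$report_dir/${name}_fastqc.html" ] || echo "$fq"
    done
}

# Demo with empty placeholder files (not real data): sample_B has no report
demo=$(mktemp -d)
mkdir "$demo/out_trimmed" "$demo/out_fastqc_trimmed"
touch "$demo/out_trimmed/sample_A_R1.trimmed.fastq.gz"
touch "$demo/out_trimmed/sample_B_R1.trimmed.fastq.gz"
touch "$demo/out_fastqc_trimmed/sample_A_R1.trimmed_fastqc.html"

# Prints the path of the sample_B file, the only one without a report
missing_reports "$demo/out_trimmed" "$demo/out_fastqc_trimmed"
```

In the lesson's working directory, the equivalent call would be `missing_reports out_trimmed out_fastqc_trimmed`, which prints nothing when every trimmed file has a report.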
Opening one of the HTML reports, we see it is organized by the same
modules as before. The report confirms that the low quality bases have
been trimmed from our sequences.
These materials have been adapted and extended from materials created
by the Harvard Chan
Bioinformatics Core (HBC). These are open access materials
distributed under the terms of the Creative Commons
Attribution license (CC BY 4.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original
author and source are credited.