Aligning All Samples Exercise


15 Minutes


We just learned how to run RSEM with STAR on a single sample; now we need to align the rest of our samples to the reference genome. In this breakout exercise, we’ll build on concepts we’ve learned previously.


Instructions:


  • Work independently in the main room, posting any questions that arise to Slack.
  • Recommendations for writing your own code:
    • Read function documentation
    • Test out ideas - it’s okay to make mistakes and generate errors
    • Use a search engine to look up errors or recommended solutions using keywords
  • As a group, we’ll review possible solutions after time is up.


  • Review what we’ve learned about running RSEM + STAR, to determine an appropriate command for aligning our samples.
  • Using what we’ve learned previously, create a script using this command to quickly and easily align the rest of our samples.
  • Run the script, view the output, and verify that we have the files we need.


Solution - Aligning All Samples Exercise


Based on our earlier breakout exercise, a for-loop that uses a bash variable for the sample name would look something like this:

for SAMPLE in sample_B sample_C sample_D sample_E sample_F
do
    rsem-calculate-expression --star --num-threads 1 --star-gzipped-read-file \
    --star-output-genome-bam --keep-intermediate-files \
    out_trimmed/${SAMPLE}_R1.trimmed.fastq.gz \
    ../refs/GRCm38.102.chr19reduced \
    out_rsem/${SAMPLE}
done

Place the appropriate code into a file using the nano editor to create the script, then execute the script.

# Use the nano editor to create a script
nano aligning_B-F.sh # Insert commands into editor, then save and close the file
# Run the script
bash aligning_B-F.sh

Optional: Add execute permissions to the script before executing.

If you go this route, you can call the script directly, without typing bash first.

Note that because the script is in the current directory, which is typically not on the shell’s search path, you have to include its location when calling it (e.g. prefix the name with ./ to represent the current directory).

# Add execute permissions
chmod +x aligning_B-F.sh
# Run the script
./aligning_B-F.sh
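
Once the script has finished, one way to verify the output is to loop over the same sample names and check for a result file per sample. The following is a minimal sketch, not part of the workshop materials: the function name check_rsem_outputs is hypothetical, and it assumes that each sample's gene-level results are written to a file named after the output prefix with a .genes.results suffix.

```shell
# Sketch: report whether each sample has a non-empty RSEM gene-level result file.
# Assumption: results are written to <out_dir>/<sample>.genes.results.
check_rsem_outputs() {
    local out_dir="$1"
    shift
    local missing=0
    local sample
    for sample in "$@"
    do
        # -s is true only if the file exists and is not empty
        if [ -s "${out_dir}/${sample}.genes.results" ]
        then
            echo "OK: ${sample}"
        else
            echo "MISSING: ${sample}"
            missing=1
        fi
    done
    # Non-zero exit status if any sample is missing its result file
    return "$missing"
}

# Hypothetical usage, matching the directory and sample names used above
check_rsem_outputs out_rsem sample_B sample_C sample_D sample_E sample_F \
    || echo "Some outputs are missing"
```

Listing the directory with ls out_rsem is just as valid for a handful of samples; the loop mainly helps when the sample list grows.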


Helper Hints: When using a for-loop approach, it can be helpful to build up to the end result gradually, sometimes using a “dry-run” command as a test case, so that learners become more aware of what their code will do.

  • Echoing filenames is an easy place to start.
  • Iterating over a single sample might also be helpful when testing.

Example echoing filenames:

for SAMPLE in sample_B sample_C sample_D sample_E sample_F
do
    echo "in_file: out_trimmed/${SAMPLE}_R1.trimmed.fastq.gz"
    echo "out_prefix: out_rsem/${SAMPLE}"
done
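
The echo approach can also be folded into the real loop with a switch, so the same script either prints or runs each command. This is a sketch under stated assumptions: DRY_RUN and run_cmd are hypothetical names introduced here for illustration, not options of RSEM or STAR.

```shell
# Sketch: a hypothetical DRY_RUN switch (not a feature of RSEM or STAR).
# DRY_RUN=1 (the default here) prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

run_cmd() {
    if [ "$DRY_RUN" = "1" ]
    then
        # Print the command and its arguments instead of running them
        echo "[dry-run] $*"
    else
        "$@"
    fi
}

for SAMPLE in sample_B sample_C sample_D sample_E sample_F
do
    run_cmd rsem-calculate-expression --star --num-threads 1 --star-gzipped-read-file \
        --star-output-genome-bam --keep-intermediate-files \
        "out_trimmed/${SAMPLE}_R1.trimmed.fastq.gz" \
        ../refs/GRCm38.102.chr19reduced \
        "out_rsem/${SAMPLE}"
done
```

If this were saved as aligning_B-F.sh, running bash aligning_B-F.sh would only print the five commands, while DRY_RUN=0 bash aligning_B-F.sh would execute them.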

Example iterating over a single sample (sample_A, which we’ve already aligned prior to the breakout exercise):

for SAMPLE in sample_A
do
    rsem-calculate-expression --star --num-threads 1 --star-gzipped-read-file \
    --star-output-genome-bam --keep-intermediate-files \
    out_trimmed/${SAMPLE}_R1.trimmed.fastq.gz \
    ../refs/GRCm38.102.chr19reduced \
    out_rsem/${SAMPLE}
done
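
Another way to test a loop before running it is to derive the sample names from the input files instead of typing them out, then echo what would be used. The sketch below is illustrative only: list_samples is a hypothetical helper, and it assumes every input file follows the <sample>_R1.trimmed.fastq.gz naming pattern seen in the examples above.

```shell
# Sketch: list sample names found in a directory of trimmed reads.
# Assumption: every input file is named <sample>_R1.trimmed.fastq.gz.
list_samples() {
    local in_dir="$1"
    local fastq
    for fastq in "${in_dir}"/*_R1.trimmed.fastq.gz
    do
        # Skip the unexpanded pattern when no files match
        [ -e "$fastq" ] || continue
        # Strip the directory and the shared suffix, leaving the sample name
        basename "$fastq" _R1.trimmed.fastq.gz
    done
}

# Dry run: show the input file and output prefix for each discovered sample
for SAMPLE in $(list_samples out_trimmed)
do
    echo "in_file: out_trimmed/${SAMPLE}_R1.trimmed.fastq.gz"
    echo "out_prefix: out_rsem/${SAMPLE}"
done
```

Note that this picks up every matching file, so it would include sample_A as well; the explicit sample list used in the solution avoids re-aligning it.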

