Bulk Cut and Run

Cut and Run workflow

Methods

QC

We use FastQC (FastQC) (v0.11.8) to assess the overall quality of each sequenced sample.

We use trimmomatic (Bolger et al., 2014) (v0.39) with the following paramters: ILLUMINACLIP:{adapter_path}:2:15:4:4:true LEADING:20 TRAILING:20 SLIDINGWINDOW:4:15 MINLEN:25, followed by kseq (Zhu et al., 2019) to ensure read lengths of at least 50bp.

We align trimmed reads to the genome with Bowtie2 (Langmead and Salzberg, 2012) (v2.3.4.1) using default parameters with the exception of the following flags: --dovetail

Duplicate reads may be are marked with Picard (Picard) (v2.20.2).

Alignments are filtered with samtools (Li et al., 2009) (v1.2) using the flags: -F4 -F8 -f3 -q10 -F1024.

Alignments completely overlapping blacklisted regions (ENCODE Blacklist Regions) are removed with bedtools (Quinlan and Hall, 2010) (v2.28.0).

Following the Cut and Run Tools pipeline (Zhu et al., 2019), reads are filtered for fragment lengths less than 120 bp.

Sample-wise peaks are called with macs2 (Zhang et al., 2008) (v2.1.2) with flags: -B --SPMR --format BAM --gsize {macs_genome}.

Optionally, consensus peaks for replicates are determined by mutual intersection of sample-wise peaks with bedops –intersect. Consensus peaks are then annotated with annotatr (Cavalcante and Sartor, 2017).

Optionally, motifs are determined with the MEME suite (Machanick and Bailey, 2011) (v5.1.1) followed by footprinting with CENTIPEDE (Pique-Regi et al., 2011) (v1.2) as in the Cut and Run Tools pipeline (Zhu et al., 2019).

Peaks over all samples are merged with bedops (Neph et al., 2012) (v2.4.36) for the purpose of principal component analysis and unsupervised clustering to assess the similarity of samples. Read cross-correlations within samples are plotted using phantompeakqualtools (Landt et al., 2012). MultiQC (Ewels et al., 2016) (v1.7) generates a report combining FastQC, trimming, alignment, and duplicate calling over all the samples. The software DeepTools is used to calculate coverage and IP efficiency plots (Ramírez et al., 2016) (v3.3.0).

Differential Testing

We use the edgeR R Bioconductor package (McCarthy et al., 2012) to identify regions of differential binding. For each sample, the number of reads in the merged peaks is counted for each sample, and a library size normalization factor is determined. With no replicates, we manually tune the BCV (biological coefficient of variance) parameter. We then fit each model using the glmFit function, and test each contrast with a likelihood ratio test. With replicates, the common, trended, and tagwise negative binomial dispersions are calculated. We then fit each model using the glmQLFit function, and test each contrast with an empirical Bayes quasi-likelihood F-test. The differential peaks are then annotated to genic and CpG island annotations using the annotatr R Bioconductor package (Cavalcante and Sartor, 2017).

References

Bolger,A.M. et al. (2014) Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics, 30, 2114–2120.

Cavalcante,R.G. and Sartor,M.A. (2017) Annotatr: Genomic regions in context. Bioinformatics, 33, 2381–2383.

ENCODE Blacklist Regions https://sites.google.com/site/anshulkundaje/projects/blacklists.

Ewels,P. et al. (2016) MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32, 3047–3048.

FastQC https://github.com/s-andrews/FastQC.

Landt,S.G. et al. (2012) ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Research, 22, 1813–1831.

Langmead,B. and Salzberg,S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nature Methods, 9, 357–359.

Li,H. et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078–2079.

Machanick,P. and Bailey,T.L. (2011) MEME-ChIP: Motif analysis of large DNA datasets. Bioinformatics, 27, 1696–1697.

McCarthy,D.J. et al. (2012) Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research, 40, 4288–4297.

Neph,S. et al. (2012) BEDOPS: High-performance genomic feature operations. Bioinformatics, 28, 1919–1920.

Picard https://github.com/broadinstitute/picard.

Pique-Regi,R. et al. (2011) Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Research, 21, 447–455.

Quinlan,A.R. and Hall,I.M. (2010) BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics, 26, 841–842.

Ramírez,F. et al. (2016) deepTools2: A next generation web server for deep-sequencing data analysis. Nucleic Acids Research, 44, W160–W165.

Zhang,Y. et al. (2008) Model-based Analysis of ChIP-Seq (MACS). Genome Biology, 9, R137.

Zhu,Q. et al. (2019) CUT&RUNTools: A flexible pipeline for CUT&RUN processing and footprint analysis. Genome Biology, 20, 192.

Bulk Cut and Run

UM Bioinformatics Core

2023-08-29

Methods

QC

Differential Testing

References