Peak calling by Sparse Enrichment Analysis for CUT&RUN chromatin profiling
CUT&RUN is an efficient epigenome profiling method that identifies sites of DNA binding protein enrichment genome-wide with high signal to noise and low sequencing requirements. Currently, the analysis of CUT&RUN data is complicated by its exceptionally low background, which renders programs designed for analysis of ChIP-seq data vulnerable to oversensitivity in identifying sites of protein binding.
Here we introduce Sparse Enrichment Analysis for CUT&RUN (SEACR), an analysis strategy that uses the global distribution of background signal to calibrate a simple threshold for peak calling. SEACR discriminates between true and false-positive peaks with near-perfect specificity from “gold standard” CUT&RUN datasets and efficiently identifies enriched regions for several different protein targets. We also introduce a web server (http://seacr.fredhutch.org) for plug-and-play analysis with SEACR that facilitates maximum accessibility across users of all skill levels.
SEACR is a highly selective peak caller that definitively validates the accuracy of CUT&RUN for datasets with known true negatives. Its ease of use and performance in comparison with existing peak calling strategies make it an ideal choice for analyzing CUT&RUN data.
Keywords: CUT&RUN, Epigenome profiling, Peak calling
Eukaryotic DNA is wrapped in millions of nucleosomes that restrict access of thousands of DNA binding proteins, including transcription factors (TFs) that bind to enhancers and promoters to activate or repress gene expression and often specify cell fate. Determining where chromatin and DNA binding proteins localize in the genome is crucial for elucidating fundamental principles of genome regulation. Efforts to map DNA binding proteins to their targets in the genome originated with chromatin immunoprecipitation (ChIP), in which proteins are physically crosslinked to their targets, immunoprecipitated with protein-specific antibodies, and their crosslinks reversed for downstream analysis of associated DNA. In recent years, such techniques have been adapted for high-throughput sequencing readouts [3, 4, 5], which has enabled genome-wide identification of thousands of binding sites for hundreds of proteins. This proliferation of data is accompanied by a need for fast and accurate analysis tools to process them.
Standard ChIP-seq data analysis often involves “peak calling” algorithms, which identify genomic regions at which target ChIP-seq signal is enriched in comparison with background noise from a control DNA input or non-targeting antibody experiment. Such algorithms frequently employ Poisson or negative binomial models of global and local read distributions to derive statistical measures of signal enrichment over background, and extensive efforts have been dedicated to evaluating the merits of such design choices. Importantly, since ChIP-seq experiments are typically sequenced deeply and thus feature high background, most peak calling algorithms designed for the analysis of ChIP-seq data use models that are optimized primarily for high recall to distinguish signal from noise.
In contrast with ChIP-seq, CUT&RUN is an in situ epigenome profiling technique that uses an antibody-targeted micrococcal nuclease (MNase) fusion protein to selectively digest and liberate DNA fragments at sites of protein binding while leaving the remainder of the genome behind; thus, it features exceedingly low background in comparison with ChIP-seq [10, 11]. The low read depths and background levels of CUT&RUN data render standard peak callers vulnerable to reduced precision (i.e., a higher rate of false positives): because the background is so sparse, any spurious background read can be called as a peak. Thus, rather than requiring highly sensitive methods to distinguish signal from background noise, peak calling from CUT&RUN data requires high specificity for true positive peaks.
Here we introduce Sparse Enrichment Analysis for CUT&RUN (SEACR), a peak caller designed for the processing of paired-end CUT&RUN data. SEACR is model free and empirically data driven and therefore does not require arbitrary selection of parameters from a statistical model. Moreover, we show that SEACR retains superior selectivity versus common ChIP-seq peak callers from CUT&RUN data, while retaining competitive performance across a range of experimental configurations. Finally, we have made SEACR available to the community through a simple web interface. We conclude that SEACR is fast, accurate, scalable and simple to use for the analysis of CUT&RUN data.
Peak calling based on fragment block aggregation
We evaluated the effectiveness of using signal blocks as a metric for discrimination between potential true- and false-positive peaks in comparison with common strategies employed by ChIP-seq peak callers. For CUT&RUN datasets profiling K562 cells for H3K4me2, H3K4me3, H3K27me3, and CTCF at several different read subsampling levels, we used SEACR’s block aggregation utility to sort all signal blocks in the genome by total signal, and in parallel called peaks using MACS2 or HOMER with maximally relaxed peak cutoffs as detailed in the "Methods" section. We then used comparisons with validated ChIP-seq peak calls from ENCODE to plot precision–recall curves for each subsampled dataset for each of the three cutoff strategies (total signal in signal block for SEACR, -log10(FDR) for MACS2, and “Peak Score” for HOMER) and calculated areas under the curve (AUPR). SEACR AUPR was competitive with MACS2 and HOMER for all datasets interrogated and outperformed both of them across all read subsampling levels for CTCF (Fig. 1b, Additional file 1: Fig. S1A–C). Notably, though running MACS2 without a local lambda parameter to imitate the non-local peak identification of SEACR improved performance of MACS2 peak calling for H3K27me3 data, it made a negligible difference in performance for H3K4me2, H3K4me3 and CTCF. These results confirm that total signal in signal blocks is a valid metric for discriminating enriched regions from CUT&RUN data.
SEACR thresholding validates “gold standard” CUT&RUN data
To test how well SEACR can avoid false-positive peaks in a CUT&RUN dataset in which such distinctions are known, we used SEACR, MACS2, and HOMER to call peaks from CUT&RUN data for two transcription factors (TFs), Sox2 and FoxA2, in human embryonic stem cells (hESCs) and definitive endoderm (DE) cells. Sox2 expression is restricted to hESCs, and FoxA2 expression is restricted to DE cells. All three methods called a comparable number of peaks for Sox2 in hESCs or FoxA2 in DE cells. In contrast, SEACR called only 1–2 peaks for each factor when it is not expressed (Fig. 2a, “stringent”). HOMER and MACS2 called up to ~ 900 spurious peaks in these datasets using default peak calling thresholds for each program; these trends held when analyzing total bases covered by peaks or percentage of reads in peaks (Fig. 2, Additional file 2: Fig. S2A, B, “stringent”). Notably, this high selectivity of SEACR was maintained when running in “relaxed” mode (Fig. 2b, Additional file 2: Fig. S2A, B, “relaxed”). These results indicate that SEACR outperforms popular ChIP-seq peak callers in avoiding false-positive peak calls. Indeed, the combination of CUT&RUN data with SEACR peak calling results in nearly complete exclusion of false positives, which validates the trustworthiness of the CUT&RUN method.
SEACR default thresholds are robust over a wide range of read depths
To analyze the general optimality of combined recall and precision for each peak caller, we calculated the F1 score for each peak caller at each read subsampling level, such that larger F1 scores corresponded with higher performance in a combination of the two metrics. SEACR relaxed mode exhibited superior performance at all subsampling levels above ~ 7.5 million reads (Fig. 3b, blue curve). To account for the fact that peak callers such as MACS2 have parameters that can be optimized to adjust the desired precision–recall balance, we selected a stringent set of peaks from the MACS2 peak calls that meet a −log10(FDR) threshold of greater than 10, and recalculated F1 scores in comparison with SEACR. Although the more stringent MACS2 peak calls had improved performance above 10 million fragments, performance suffered at fragment subsampling levels below 10 million reads, rendering SEACR superior at those levels (Fig. 3b, magenta curve). Therefore, SEACR thresholds remain competitive with widely used ChIP-seq peak callers across multiple parameter selection strategies, even in the absence of arbitrary user input for the purposes of optimization. Although our conclusions are based on the presumption that high-scoring ENCODE peaks are true positives, the fact that they were originally called using MACS2 leads us to expect that the superior performance of SEACR on CUT&RUN data will generalize to any set of true positives. Thus, SEACR is an accurate peak caller for CUT&RUN data across a range of read depths and maintains a high percentage of true positive peak calls at low read depth.
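The F1 score used above is the harmonic mean of precision and recall. As a minimal sketch (the function name and the example values are illustrative assumptions, not values from this study), it can be computed as follows:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only (not taken from the paper): a peak set with
# high precision but modest recall scores lower than a balanced one.
balanced = f1_score(0.8, 0.8)
skewed = f1_score(1.0, 0.5)
```

Because the harmonic mean penalizes imbalance, a peak caller cannot obtain a high F1 score by maximizing precision or recall alone.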
SEACR retains broad domain structures
SEACR exhibits competitive run time and read–write memory allocation
A web server for rapid desktop analysis using SEACR
We have introduced a novel peak calling strategy that takes advantage of the precise position and fragment spanning information that is present in CUT&RUN data. Popular peak calling programs were designed around ChIP-seq data, where fragment spans are lacking owing to the widespread use of sonication and single-end sequencing. In contrast, our SEACR algorithm finds peaks in CUT&RUN data with a better precision–recall trade-off than the most popular ChIP-inspired peak callers. The near absence of false positives called by SEACR for Sox2 and FoxA2 transcription factors in cells that do not express them confirms the very high accuracy of CUT&RUN, in contrast to ChIP-seq, where reports of “Phantom Peaks” and other issues undermine confidence in peak calls [12, 17, 18]. SEACR is also likely to be useful for CUT&Tag and other epigenomic datasets that capture fragment position and length information with high signal to noise. We expect that as the value of precise fragment information becomes better appreciated, for example in inferring chromatin dynamics, our block aggregation strategy will become increasingly powerful. The fast run times and favorable read memory allocation requirements have made it possible for us to offer SEACR as a public web server, which enables researchers to conveniently analyze their own data without requiring computational skills or the availability of institutional resources.
SEACR is a highly specific peak caller for CUT&RUN data that outperforms common ChIP-seq peak calling algorithms in avoiding known false positives from a gold standard dataset, while exhibiting competitive or superior performance in calling peaks from diverse CUT&RUN datasets.
CUT&RUN was performed as previously described. Briefly, cells were washed with Wash Buffer (20 mM HEPES pH 7.5, 150 mM NaCl, 0.5 mM spermidine and one Roche Complete protease inhibitor tablet per 50 mL), bound to Concanavalin A-coated magnetic beads and incubated with primary antibody diluted in Wash Buffer containing 0.05% digitonin (Dig Wash) overnight at 4 °C. Cells were then washed and incubated with protein A-MNase (pA-MN) for 1 h at 4 °C. The slurry was washed again, placed on an ice-cold block and incubated with Dig Wash containing 2 mM CaCl2 to activate pA-MN digestion. After digestion for 30 min, one volume of 2× stop buffer (340 mM NaCl, 20 mM EDTA, 4 mM EGTA, 0.05% digitonin, 0.05 mg/mL glycogen, 5 µg/mL RNase A, 2 pg/mL heterologous spike-in DNA) was added to stop the reaction, and fragments were released by 30-min incubation at 37 °C. Samples were centrifuged for 5 min at 16,000×g, the supernatant was recovered, and DNA was extracted via phenol–chloroform extraction and ethanol precipitation. The resulting DNA was used as input for library preparation as previously described. Antibodies used for CUT&RUN in this study were as follows: rabbit anti-Sox2 (Abcam ab92494); rabbit anti-FoxA2 (Millipore 07-633); guinea pig anti-rabbit IgG (antibodies online ABIN101961); rabbit anti-H3K4me2 (Millipore 07-030); rabbit anti-H3K4me3 (Active Motif 39159); rabbit anti-H3K27me3 (Cell Signaling Technologies CST9733); and rabbit anti-CTCF (Millipore 07-729).
SEACR design and methodology
SEACR was designed to call enriched regions from sparse CUT&RUN data, in which background is dominated by “zeros” (i.e., regions with no read coverage). SEACR takes the following five fields as input: (1) a target data file in UCSC bedgraph format (https://genome.ucsc.edu/goldenpath/help/bedgraph.html) that omits regions containing 0 signal; (2) a control (IgG) data bedgraph file; (3) either “norm”, which normalizes the control to the target data, or “non”, which skips normalization; (4) either “relaxed”, which uses a total signal threshold between the knee and peak of the total signal curve and corresponds to the “relaxed” mode described in the text, or “stringent”, which uses the peak of the curve and corresponds to “stringent” mode; and (5) a prefix for the output file.
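Because the input bedgraph omits zero-signal regions, runs of abutting intervals delimit contiguous blocks of signal. The following is a minimal sketch of block aggregation under two assumptions that are not stated in this section: a signal block is a maximal run of abutting intervals on one chromosome, and its total signal is the sum of signal value times interval width. The function names and the threshold argument are hypothetical and do not reproduce the SEACR script itself.

```python
from typing import Iterable, List, Tuple

# (chromosome, start, end, signal); rows are assumed sorted by position
BedgraphRow = Tuple[str, int, int, float]
# (chromosome, start, end, total signal)
SignalBlock = Tuple[str, int, int, float]


def aggregate_signal_blocks(rows: Iterable[BedgraphRow]) -> List[SignalBlock]:
    """Merge abutting bedgraph intervals into blocks and sum signal * width."""
    blocks: List[SignalBlock] = []
    current = None  # mutable [chrom, start, end, total] for the open block
    for chrom, start, end, value in rows:
        area = value * (end - start)
        if current is not None and current[0] == chrom and current[2] == start:
            # Interval abuts the open block: extend it and add its signal.
            current[2] = end
            current[3] += area
        else:
            # Gap or new chromosome: close the open block and start a new one.
            if current is not None:
                blocks.append(tuple(current))
            current = [chrom, start, end, area]
    if current is not None:
        blocks.append(tuple(current))
    return blocks


def filter_blocks(blocks: List[SignalBlock], threshold: float) -> List[SignalBlock]:
    """Keep blocks whose total signal exceeds a (hypothetical) threshold."""
    return [block for block in blocks if block[3] > threshold]


rows = [
    ("chr1", 0, 10, 2.0),
    ("chr1", 10, 20, 1.0),   # abuts the previous interval: same block
    ("chr1", 50, 60, 0.5),   # gap of zero signal: new block
    ("chr2", 0, 5, 4.0),     # new chromosome: new block
]
blocks = aggregate_signal_blocks(rows)
ranked = sorted(blocks, key=lambda block: block[3], reverse=True)
```

Sorting the resulting blocks by total signal gives the ranking on which a single global threshold can then be applied.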
Peak calling and precision–recall analyses
MACS2 peaks were called using macs2 callpeak -f BEDPE --keep-dup all, with treatment and control files. For H3K27me3, the --broad flag was added. For local lambda-inactivated peak calling, --llocal 0 was added. HOMER peaks were called by generating tag directories for target and control datasets and then running findPeaks with -style factor for TFs, H3K4me2 or H3K4me3, and -style histone for H3K27me3.
For area under the precision–recall curve analysis (AUPR), we compared CUT&RUN peak calls from SEACR, MACS2, and HOMER to stringent sets of MACS2-called ChIP-seq peaks generated by the ENCODE consortium. The ENCODE accession numbers and thresholding used for each ENCODE peak file are as follows: H3K4me2: ENCFF099LMD, −log10(FDR) > 10; H3K4me3: ENCFF258PHY, −log10(FDR) > 10; CTCF: ENCFF002DDJ, no extra thresholding; H3K27me3: ENCFF126QYP, −log10(FDR) > 10. We called all peaks using loose stringency parameters in order to generate full peak files with nearly 100% recall that could be subset by ranking metrics (total signal in signal block for SEACR, −log10(FDR) for MACS2, and peak score for HOMER) in order to derive full precision–recall curves. For SEACR, we artificially set the total signal threshold to half the threshold corresponding to the knee of the curve described in the “relaxed” mode. For MACS2, we added the flag -p 0.05. For HOMER, we added the flags -F 1.0 -P 0.25 -L 1.0 -LP 0.25 -fdr 0.25. We then used custom bash and R scripts to calculate AUPR values for each dataset. Briefly, to generate values for precision calculations, we used bedtools intersect -u with the CUT&RUN peak set as the -a file and the ChIP-seq reference peak set as the -b file, and for each CUT&RUN peak reported its ranking metric and whether it overlapped a ChIP-seq peak. To calculate recall, we used bedtools intersect -u with the ChIP-seq reference peak set as the -a file and the CUT&RUN peak set as the -b file, and for each ChIP-seq peak reported the lowest ranking metric of any CUT&RUN peak that overlapped it (if any). We then calculated the percentage of CUT&RUN peaks that were overlapped by a ChIP-seq peak (precision) and the percentage of ChIP-seq peaks that were overlapped by a CUT&RUN peak (recall) for every value of the ranking metric that was recorded in our analysis, plotted the precision and recall values on a curve, and calculated the area under the curve.
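The precision–recall procedure described above can be sketched as follows. This is an illustrative Python version, not the custom bash and R scripts used in the study; it assumes the overlap tables have already been produced (for example with bedtools intersect), and the function names, the trapezoidal integration rule, and the toy values are assumptions.

```python
from typing import List, Optional, Sequence, Tuple


def precision_recall_curve(
    peak_scores: Sequence[float],
    peak_hits: Sequence[bool],
    ref_scores: Sequence[Optional[float]],
) -> List[Tuple[float, float]]:
    """Return (recall, precision) pairs, one per recorded ranking-metric value.

    peak_scores / peak_hits: ranking metric of each called peak and whether
    it overlaps a reference peak. ref_scores: the ranking metric recorded for
    each reference peak, or None if no called peak overlaps it.
    """
    if not ref_scores:
        raise ValueError("reference peak set is empty")
    curve = []
    # Walk thresholds from most to least stringent so recall never decreases.
    for threshold in sorted(set(peak_scores), reverse=True):
        kept = [hit for score, hit in zip(peak_scores, peak_hits) if score >= threshold]
        precision = sum(kept) / len(kept)
        recovered = sum(1 for s in ref_scores if s is not None and s >= threshold)
        recall = recovered / len(ref_scores)
        curve.append((recall, precision))
    return curve


def area_under_curve(curve: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under consecutive (recall, precision) points."""
    area = 0.0
    for (r0, p0), (r1, p1) in zip(curve, curve[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


# Toy data (not from the paper): four called peaks, four reference peaks.
curve = precision_recall_curve(
    peak_scores=[10, 8, 5, 2],
    peak_hits=[True, True, False, True],
    ref_scores=[10, 8, None, 2],
)
aupr = area_under_curve(curve)
```

In this sketch the area is integrated only over the recall range actually reached by the called peaks; other conventions, such as anchoring the curve at zero recall, would give different absolute values.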
All subsampling was performed in 10 replicates, and error bars throughout the figures represent the standard deviation of 10 trials.
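A replicate subsampling scheme of this kind can be sketched as below. The function name, the seeded random number generator, and the use of the sample (n − 1) standard deviation are assumptions made for illustration; the text does not specify these details.

```python
import random
import statistics
from typing import Callable, Sequence, Tuple


def subsample_replicates(
    fragments: Sequence,
    n_keep: int,
    metric: Callable[[Sequence], float],
    n_replicates: int = 10,
    seed: int = 0,
) -> Tuple[float, float]:
    """Mean and standard deviation of a metric over random subsamples.

    Each replicate draws n_keep fragments without replacement and applies
    `metric` (a stand-in for peak calling plus scoring) to the subsample.
    """
    rng = random.Random(seed)  # fixed seed so the replicates are reproducible
    values = [metric(rng.sample(fragments, n_keep)) for _ in range(n_replicates)]
    return statistics.mean(values), statistics.stdev(values)


# Toy usage: the "metric" is the mean fragment length of each subsample.
fragment_lengths = list(range(100, 300))
mean_value, sd_value = subsample_replicates(
    fragment_lengths, n_keep=50, metric=statistics.mean
)
```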
We thank Christine Codomo for help preparing sequencing libraries and Jorja Henikoff for help with data processing. We thank all members of the Henikoff laboratory for valuable discussions and Kami Ahmad, Brian Freie, and Bob Eisenman for comments on the manuscript.
MPM devised and wrote algorithms, performed analyses, and wrote the manuscript. DT built the web server. SH provided the funding and critical oversight of analyses and edited the manuscript. All authors read and approved the final manuscript.
This work was supported by the Howard Hughes Medical Institute, a grant from the National Institutes of Health (4DN TCPA A093) and the Chan-Zuckerberg Initiative. Funding was provided by National Human Genome Research Institute (Grant No. 1R01HG010492).
Ethics approval and consent to participate
Consent for publication
All authors consent to publication.
The authors declare that they have no competing interests.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.