CUT&RUNTools: a flexible pipeline for CUT&RUN processing and footprint analysis
We introduce CUT&RUNTools as a flexible, general pipeline for facilitating the identification of chromatin-associated protein binding and genomic footprinting analysis from antibody-targeted CUT&RUN primary cleavage data. CUT&RUNTools extracts endonuclease cut site information from sequences of short-read fragments and produces single-locus binding estimates, aggregate motif footprints, and informative visualizations to support the high-resolution mapping capability of CUT&RUN. CUT&RUNTools is available at https://bitbucket.org/qzhudfci/cutruntools/.
CUT&RUNTools takes paired-end sequencing read FASTQ files as the input and performs a set of analytical steps: trimming of adapter sequences, alignment to the reference genome, peak calling, estimation of cut matrix at single-nucleotide resolution, de novo motif searching, motif footprinting analysis, direct binding site identification, and data visualization (Fig. 1b). The outputs of the pipeline (Fig. 1c) are (1) an aggregate footprint capturing the characteristics of chromatin-associated protein binding (Fig. 1c, (i)), (2) binding log-odds values for individual motif sites informative for direct binding site identification (Fig. 1c, (ii)), and (3) visualization of a cut frequency profile at nucleotide resolution (Fig. 1c, (iii)).
Specifically, CUT&RUNTools performs sequence alignment with special attention to short-read trimming and read alignment (Fig. 1b, step 1) (the “Methods” section). Because CUT&RUN predominantly generates short fragments (25–50 bp), the typical settings for read trimming and sequence alignment do not perform well. We introduce a two-step read trimming process to improve the quality. First, the sequencing data are processed with Trimmomatic, a commonly used template-based trimmer. Next, a second trimming step removes any remaining adapter overhang sequences left behind because of fragment read-through. CUT&RUNTools further adjusts the default alignment settings by turning on dovetail alignment, which accepts alignments for paired-end reads when there is a large degree of overlap between the two mates of each pair. Together, this improved trimming and alignment procedure increased the alignment percentage from 68 to 98% compared to a setting with no trimming and alignment adjustments (Additional file 1: Table S1). With the reads aligned, CUT&RUNTools employs MACS to perform peak calling based on the coverage profile, followed by de novo motif searching within the peak regions with the MEME suite (Fig. 1b, step 2).
Cut matrix estimation
An important element of CUT&RUN analysis is the estimation of cut sites, which enables higher-resolution mapping of binding locations than peak calling. The cut sites derive from the two ends of individual DNA fragments generated upon cutting of chromatin by the pA-MN fusion recruited to the antibody binding sites. The regions of lower cut frequency tend to be protected due to chromatin-associated protein binding, whereas flanking regions without binding display higher cut frequencies (Additional file 2: Figure S1a). CUT&RUNTools accurately tabulates the frequency with which cleavage is observed at each base pair (the “Methods” section).
Using the cut matrix, footprinting analysis [13, 14] is then applied to identify high-resolution occupancy of sequence-specific binding factors such as TFs. To detect footprints from CUT&RUN data, CUT&RUNTools first generates an aggregated cut frequency profile based on all ±100-bp regions extending from each peak-embedded motif site. Then, CUT&RUNTools estimates a probabilistic bimodal clustering model derived from the CENTIPEDE package and assigns a binding probability score, expressed as log-odds, to each motif occurrence based on the model. This log-odds score quantifies the similarity between the cuts at each motif occurrence and the aggregate footprint pattern. By ranking the sites by the log-odds score, CUT&RUNTools generates a rank-ordered list of likely direct binding sites. Of note, this approach is only applicable to factors with distinct sequence specificity.
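The ranking step can be illustrated with a minimal Python sketch. It assumes that the score is the natural-log odds of a per-site binding probability; the function names, the clipping constant eps, and the example site labels are hypothetical and are not taken from CUT&RUNTools or CENTIPEDE.

```python
import math


def log_odds(prob, eps=1e-12):
    """Natural-log odds of a probability, clipped to avoid division by zero."""
    p = min(max(prob, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def rank_sites(site_probs):
    """Sort (site, probability) pairs by descending log-odds."""
    scored = [(site, log_odds(p)) for site, p in site_probs]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical motif occurrences with made-up binding probabilities.
example = [("site_a", 0.50), ("site_b", 0.99), ("site_c", 0.10)]
ranked = rank_sites(example)
```

Sites with a probability above 0.5 receive a positive score and those below 0.5 a negative one, so the sign of the log-odds gives a natural first cut of the ranked list.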
Application to GATA1 CUT&RUN dataset
Importantly, de novo analysis also identified an extended motif for co-binding of GATA1 and TAL1. GATA1 forms a multiprotein complex with TAL1 along with LMO2 and Ldb1 [18, 19]. The GATA1-TAL1 complex recognizes HGATAA and a half E-box (TAL1) separated by a gap of ten nucleotides. Despite the length of this motif, CUT&RUNTools displays a strong footprint for the extended motif. The high value of FSS indicates that this is a primary motif, as expected from the GATA1-TAL1 complex binding model (Additional file 2: Figure S5). The motif footprinting result is consistent between de novo and known GATA1-TAL1 motifs (Additional file 2: Figure S5). Therefore, in cases where the recognition sequence of a TF is not known in advance, de novo analysis in combination with genomic footprinting should be helpful in establishing the primary motif and searching for novel co-factors.
To validate these predictions, we applied CUT&RUN to profile TAL1 and KLF1 (Additional file 2: Figure S6). Of the 19,871 predicted GATA1-TAL1 co-binding sites from GATA1 CUT&RUN, 12,841 (64.6%, Jaccard index = 0.51, P < 10^−5, bootstrap test) are validated by the TAL1 CUT&RUN experiment. In the case of KLF1, 10,733 of 17,826 (60.2%, Jaccard index = 0.26, P < 10^−5, bootstrap test) predicted GATA1-KLF1 co-binding sites are confirmed by KLF1 CUT&RUN analysis. These results suggest that CUT&RUNTools is useful for uncovering combinatorial regulatory modules.
Applicability in additional CUT&RUN datasets
Comparison with existing software packages
Several tools are available for estimating cut matrices from ATAC-seq and DNase-seq data [22, 23]. However, the direct application of such tools to CUT&RUN data often leads to incorrect estimates due to the differences in the experimental protocol (Additional file 2: Figure S7, S8). One reason is that only one end of each mate in a read pair marks an end of the fragment (Additional file 2: Figure S9), which makes the accounting of cut positions challenging. Another important difference is that the Tn5 transposase in ATAC-seq leaves a 4-bp overhang in sequenced fragments, whereas the pA-MN enzyme in CUT&RUN cleaves around the binding site with no overhang. Specific adjustments are thus required and have been made in the enumeration of the cut matrix to take this feature of CUT&RUN into account (the “Methods” section). Recognizing these differences, we provide an option to tune the cut site offset, which makes CUT&RUNTools applicable to both CUT&RUN and ATAC-seq footprinting analyses (Additional file 2: Figure S10) and allows flexibility in experiment type.
Quality control measures such as alignment rate and fragment duplication rate may be used to evaluate CUT&RUN, but we note that due to the differences in the mappability and sequence composition of the antibody-bound DNA, some factors inherently have a low complexity of the binding regions and an increased fragment duplication rate. Users, therefore, need to make judgment calls, for example, whether or not to remove duplicates, on a factor-dependent basis. We also advise an interpretation of the data that is aided by motif and replicate analysis.
CUT&RUN experiments were carried out following the nuclei isolation version of the protocol as described [5, 7]. Nuclei from 2 × 10^6 cells were isolated with NE buffer that consisted of 20 mM HEPES-KOH pH 7.9, 10 mM KCl, 0.5 mM spermidine, 0.1% Triton X-100, 20% glycerol, and 1× protease inhibitor cocktails. The nuclei were captured with BioMagPlus Concanavalin A and incubated with 2 μg primary antibody (α-GATA1, ab11852, Abcam) in 200 μL wash buffer (20 mM HEPES-NaOH pH 7.5, 150 mM NaCl, 0.5 mM spermidine, 0.1% BSA, and 1× protease inhibitor cocktails) for 2 h. Unbound antibody was then washed away twice with 400 μL wash buffer. Next, pA-MN was added at a 1:1000 ratio to 200 μL wash buffer and incubated for 1 h. The nuclei were washed again and resuspended in 150 μL wash buffer. CaCl2 was next added at a final concentration of 2 mM to activate the enzyme. The reaction was carried out at 0 °C and stopped by 150 μL of 2× STOP buffer (200 mM NaCl, 20 mM EDTA, 50 μg/mL RNase A, and 40 μg/mL glycogen). The protein-DNA complex was released by centrifugation and digested by proteinase K at 50 °C overnight, followed by DNA precipitation by ethanol. The pellet was washed with 70% ethanol and dissolved in 25 μL 0.1× TE (1 mM Tris-HCl pH 8.0, 0.1 mM EDTA). Antibodies used for TAL1 and KLF1 CUT&RUN were ab155195 (Abcam) and HPA051850 (Sigma), respectively.
CUT&RUN library preparation and sequencing
The NEBNext Ultra II DNA Library Prep Kit was used with modifications described previously, which aim to preserve short DNA fragments (30–80 bp). Briefly, 6 ng of CUT&RUN DNA was treated with the endprep module at 20 °C for 30 min and 50 °C for 1 h to reduce the melting of short DNA. Ligation was performed by adding 5 pmol of NEB adapter and ligation mix and incubating at 20 °C for 15 min. To clean up the reaction, 1.75× volume of Agencourt AMPure XP beads (Beckman Coulter) was added to capture short ligation products. PCR amplification was performed for 12 cycles. The resulting libraries were purified with 1.2× volume of AMPure beads and then analyzed and quantified by Qubit and Tapestation. The detailed step-by-step protocol can be found at protocols.io (https://doi.org/10.17504/protocols.io.wvgfe3w). Libraries with different indexes were pooled, and Illumina paired-end sequencing was performed using the NextSeq 500 platform with the NextSeq 500/550 High Output Kit v2 (75 cycles) (2 × 42 bp, 6-bp index).
Broadly, CUT&RUNTools consists of trimming, alignment, peak calling, motif finding, cut matrix generation, and motif footprinting steps. The pipeline incorporates specific changes to some of the steps to accommodate the short-read and short fragment characteristics of CUT&RUN. Its cut matrix generation ensures an accurate accounting of cut positions for footprint analyses. These steps are described below.
Raw read trimming and alignment
An initial trimming was first performed with Trimmomatic, with settings optimized to detect adapter contamination in short-read sequences. Trimmomatic is a template-based trimmer. However, reads containing 6 bp or less of adapter sequence are not trimmed. Therefore, a separate tool, Kseq, was developed to trim up to 6 bp of adapter from the 3′ end of each read that was not effectively processed by Trimmomatic. Note that this trimming does not affect the cut site calculation, which counts only the 5′ end of sequences. After trimming, a minimum read length of 25 bp was imposed, as reads shorter than this were hard to align accurately.
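The second trimming step can be illustrated with a minimal Python sketch. This is not the Kseq implementation; it assumes exact matching of an adapter prefix at the 3′ end of the read, and the function name, the parameter names, and the example adapter string are hypothetical.

```python
def trim_adapter_overhang(read, adapter, max_overhang=6, min_length=25):
    """Remove a short adapter prefix left at the 3' end of a read.

    Returns the trimmed read, or None when the result is shorter than
    min_length and should be discarded.
    """
    # Try the longest candidate overhang first so that a 6-bp match is
    # preferred over a shorter one.
    longest = min(max_overhang, len(adapter), len(read))
    for k in range(longest, 0, -1):
        if read.endswith(adapter[:k]):
            read = read[:-k]
            break
    return read if len(read) >= min_length else None


# Illustrative adapter prefix; substitute the adapter used in the library.
ADAPTER = "AGATCGGAAGAGC"
trimmed = trim_adapter_overhang("ACGT" * 7 + "AGATC", ADAPTER)
```

Exact matching of very short overhangs (1–2 bp) will also remove some genuine genomic bases by chance. Because only the 3′ end is touched, the 5′ end used for cut site counting is unaffected, in line with the note above.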
Dovetail alignment policy. Bowtie2 aligns each mate of a pair separately and then discards any pairs that have been aligned inconsistently. Dovetailing refers to the situation in which the mates extend past each other. Under the default settings, these alignments are discarded. Dovetailing is unusual in general but is encountered in CUT&RUN experiments. The --dovetail setting was enabled so that such alignments are treated as normal, or “concordant,” rather than discarded.
Peak calling and motif finding
After alignment, fragments were divided into ≤ 120-bp and > 120-bp fractions. For the rest of the analyses, we used the ≤ 120-bp fraction, which is likely to contain TF binding sites. Then, MACS2 was applied with the default narrowPeak setting. Afterward, sequences within 100 bp of the summit of each peak were obtained, and any sequences containing a substantial amount of repeats (as reported by RepeatMasker) were removed. The remaining sequences were next used to perform de novo motif searching using MEME. The top 20 motifs were saved for subsequent analyses. FIMO (part of the MEME suite) was applied to enumerate all motif sites in the peak regions.
As with other techniques, some fraction of sequenced read pairs appears as duplicates (i.e., with identical start and end positions). However, it has been argued that nuclease cleavage of chromatin is stereotypical by nature, being influenced by the conformation of chromatin and/or nuclease bias, and that shorter DNA fragments increase the likelihood of identical reads that originate from different cells. Duplicate removal in CUT&RUN experiments should therefore be approached with caution unless the library complexity is very low (due to extremely low input and/or high PCR cycle numbers). Accordingly, the default action in CUT&RUNTools is to retain duplicate reads, and users can choose to remove duplicates at their own discretion. We recommend that users watch for low-complexity libraries with high duplication rates, as these may indicate a poor-quality preparation. Users may repeat the peak calling analysis on both the duplicate-retained and duplicate-removed data. By comparing the peak number, enrichment of expected motifs, and other quality metrics, users may decide whether it makes sense to use the duplicate-retained version for subsequent analysis.
Cut matrix generation
For any motif of interest, its corresponding cut matrix was generated as follows. The rows of the cut matrix are the motif sites. The columns are the individual nucleotides spanning the −100-bp flank, the motif, and the +100-bp flank. The cut matrix requires all motif sites to be in a consistent orientation. That is, if the motif occurrence is located on the minus strand in the reference genome, all the cut frequencies in that motif site are flipped, so that the −100-bp position from the old profile becomes the +100-bp position in the new profile. By convention, a value at the i-th nucleotide means the cut is situated just before the i-th nucleotide. The cut matrix tabulates the frequency of fragments ending at each nucleotide.
To compute strand-specific cut matrices, the ends of DNA fragments that overlap with the motif were assigned to forward and reverse strand cut matrices as follows. For each fragment, define R1 and R2 as the two mates. The ends of the fragment are the start of R1 (s1) and the end of R2 (e2). If a given motif occurrence appears on the positive strand of the reference genome, then s1 belongs to the “forward” strand cut and e2 belongs to the “reverse” strand cut. Otherwise, if the motif occurrence is on the negative strand, then s1 belongs to the “reverse” strand cut and e2 belongs to the “forward” strand cut. This tabulation was repeated for all paired reads and for all motif occurrences, each time separately for each strand.
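The tabulation described above can be sketched in Python as follows. This is a simplified illustration, not the CUT&RUNTools code. It assumes 0-based half-open coordinates, treats the fragment start as s1 and the fragment end as e2, and mirrors column indices for minus-strand motifs without modeling the 1-bp boundary convention. All function and variable names are hypothetical.

```python
from collections import defaultdict


def build_cut_matrices(motif_sites, fragments, flank=100):
    """Return (forward, reverse) cut matrices with one row per motif site.

    motif_sites: iterable of (chrom, start, end, strand)
    fragments:   iterable of (chrom, start, end)
    """
    # Index fragment ends by chromosome.
    left_ends = defaultdict(list)
    right_ends = defaultdict(list)
    for chrom, start, end in fragments:
        left_ends[chrom].append(start)
        right_ends[chrom].append(end)

    forward, reverse = [], []
    for chrom, m_start, m_end, strand in motif_sites:
        win_start = m_start - flank
        width = (m_end - m_start) + 2 * flank
        fwd_row = [0] * width
        rev_row = [0] * width
        # Left ends go to the forward matrix for plus-strand motifs and to
        # the reverse matrix for minus-strand motifs; right ends do the
        # opposite.
        assignments = (
            (left_ends[chrom], fwd_row, rev_row),
            (right_ends[chrom], rev_row, fwd_row),
        )
        for ends, same_row, swapped_row in assignments:
            for pos in ends:
                col = pos - win_start
                if not 0 <= col < width:
                    continue
                if strand == "+":
                    same_row[col] += 1
                else:
                    # Flip the profile so every site has the same orientation.
                    swapped_row[width - 1 - col] += 1
        forward.append(fwd_row)
        reverse.append(rev_row)
    return forward, reverse


def aggregate_profile(matrix):
    """Column sums of a cut matrix, i.e. the aggregate cut frequency profile."""
    return [sum(column) for column in zip(*matrix)]


sites = [("chr1", 1000, 1006, "+"), ("chr1", 2000, 2006, "-")]
frags = [("chr1", 995, 1012), ("chr1", 1992, 2010)]
fwd, rev = build_cut_matrices(sites, frags, flank=10)
```

Scanning every fragment end for every motif site is quadratic and is used here only for clarity; a production implementation would query an indexed BAM file by region instead.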
Motif footprinting analysis
A motif footprint is a plot that shows the frequency of enzyme cleavage around the motif region, in which a dip in cleavage is presumably due to the protection of TF-bound DNA. It is typically characterized by a low cut frequency (or low posterior probability of cut) in the motif core and a high cut frequency in the motif flanking regions. Prior to footprint analysis, blacklisted regions were excluded from the peak list. Any chromosome M peaks were also excluded. Next, CENTIPEDE was applied to fit a probabilistic bimodal clustering model on the strand-specific cut matrix data, in which all motif-containing regions have been aligned and centered. CENTIPEDE was run with default settings, specifying the length of the motif.
Footprint symmetry analysis for identification of primary and secondary motifs
CUT&RUNTools has a built-in feature to determine whether a motif footprint is primary or secondary, based on a “footprint symmetry score” (FSS) defined as follows. The footprint profile is first divided in the middle into two halves, and to capture shape information, each half is fitted by an exponential curve (of the form A_left × exp(B_left × x) and A_right × exp(B_right × x), respectively) (Fig. 3). The parameter B_left reflects the ascent rate for the left arm, and B_right reflects the descent rate for the right arm. The goodness of fit is quantified using the R^2 statistic, represented by R^2_left and R^2_right. The FSS is defined as FSS = B_left × R^2_left + (−1 × B_right) × R^2_right. Intuitively, the FSS measures the rate of increase of cut probabilities in the footprint plot as the position approaches the motif. This rate should match the respective rate of decrease of cut probabilities as the position moves further away from the center. An FSS of > 0.3 and a small difference between B_left and −1 × B_right indicate symmetry of the motif footprint. Such a motif is designated primary.
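As a worked illustration of the FSS definition, the following Python sketch fits each half of a profile with an exponential curve and combines the rates and R^2 values. It makes several assumptions that are not stated in the text: the fit is a log-linear least-squares regression, R^2 is computed on the log scale, x is measured in base pairs from the center, and all profile values are strictly positive. The function names are hypothetical.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = A * exp(B * x) by linear regression on log(y).

    Returns (A, B, r_squared), with r_squared computed on the log scale.
    """
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    b = sxy / sxx
    log_a = mean_l - b * mean_x
    ss_tot = sum((l - mean_l) ** 2 for l in logs)
    ss_res = sum((l - (log_a + b * x)) ** 2 for x, l in zip(xs, logs))
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return math.exp(log_a), b, r_squared


def footprint_symmetry_score(profile):
    """FSS = B_left * R2_left + (-1 * B_right) * R2_right."""
    mid = len(profile) // 2
    left, right = profile[:mid], profile[mid:]
    left_x = list(range(-mid, 0))            # positions left of the center
    right_x = list(range(1, len(right) + 1))  # positions right of the center
    _, b_left, r2_left = fit_exponential(left_x, left)
    _, b_right, r2_right = fit_exponential(right_x, right)
    return b_left * r2_left + (-1 * b_right) * r2_right


# Synthetic, perfectly symmetric profile with rate 0.2 on both arms.
synthetic = [math.exp(0.2 * x) for x in range(-10, 0)] + [
    math.exp(-0.2 * x) for x in range(1, 11)
]
fss = footprint_symmetry_score(synthetic)
```

For this synthetic profile both fits are exact, so the score reduces to 0.2 + 0.2 = 0.4. Because the value of B depends on how x is scaled, the 0.3 threshold quoted above should not be assumed to transfer to this sketch without checking the scaling used by the pipeline.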
Determining direct binding sites
CUT&RUNTools was implemented using Python, R, and BASH scripts. Visualizations of motif footprints were implemented using the matplotlib library in Python. Visualization of the single-locus cut profile was implemented using the Gviz R package. Integration of next-generation sequencing tools was achieved using Python and BASH scripts. Configuration of the pipeline, including inputs/outputs and prerequisite paths, is specified by a JSON-formatted file. CUT&RUNTools works under the SLURM job submission environment. A usage manual is provided online at the repository link: https://bitbucket.org/qzhudfci/cutruntools.
Comparison with existing tools
There are two currently available tools for enumerating cut matrices from enzyme cleavage data. One is Atactk, designed for ATAC-seq data, and the other is CENTIPEDE.tutorial, targeted towards DNase-seq. These tools were each applied to CUT&RUN data for the purpose of showing the advantage of CUT&RUNTools. The make-cut-matrix tool from the Atactk package v0.1.5 was downloaded from https://github.com/ParkerLab/atactk, and the CENTIPEDE.tutorial package v1.0 was downloaded from https://github.com/slowkow/CENTIPEDE.tutorial. Make-cut-matrix was run with default settings on GATA1 CUT&RUN data, using HGATAA as the motif. The centipede_data() function of the CENTIPEDE.tutorial package was used to generate the cut matrix with default parameters. To evaluate the quality of the cut matrix generated by these tools, CENTIPEDE motif footprinting was performed on the generated cut matrices, and the quality of the motif footprint plot was inspected for differences. Two loci were selected to more specifically compare the cut frequency profile estimated by these tools and CUT&RUNTools and illustrate their differences.
To make sure that the cut matrix is accurately estimated for CUT&RUN data, CUT&RUNTools adopts the following changes, starting from the make-cut-matrix implementation. The adjustments are written in the form of a “patch,” which is available in the pipeline. First, the default setting of a 4-bp cut site offset was removed, as this offset is specific to ATAC-seq data (due to the Tn5 transposase imposing a 4-bp overhang on the sequences). CUT&RUN cuts approximately at the TF binding site, so no cut site offset is required (offset = 0). Second, the position of the reverse strand cut site was found to be shifted by 1 bp even after setting the cut site offset to 0 (Additional file 2: Figure S11a). This shift is a remnant of the ATAC-seq convention, in which the forward strand has a cut offset of 4 bp while the reverse strand has a cut offset of 5 bp. A further adjustment of the cut position was therefore made to correct this behavior (Additional file 2: Figure S11b). With both of these changes adopted, the cut matrix was independently verified against the fragment end positions produced by the bamtobed tool from BEDTools to ensure its accuracy.
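The effect of the cut site offset can be shown with a small Python sketch. It is an illustration of the offsets named in the text (4 bp forward and 5 bp reverse for ATAC-seq, 0 for CUT&RUN) rather than the actual patch. It assumes half-open fragment coordinates and that offsets shift each cut toward the fragment interior; the function and parameter names are hypothetical.

```python
def cut_positions(frag_start, frag_end, fwd_offset=0, rev_offset=0):
    """Return (forward_cut, reverse_cut) for a half-open fragment interval.

    Offsets move each cut toward the interior of the fragment.
    """
    return frag_start + fwd_offset, frag_end - rev_offset


# CUT&RUN: fragment ends are used as-is (offset = 0 on both strands).
cut_run_cuts = cut_positions(100, 200)

# ATAC-seq style offsets as described in the text (4 bp forward, 5 bp reverse).
atac_cuts = cut_positions(100, 200, fwd_offset=4, rev_offset=5)
```

Keeping the two offsets as separate parameters makes the asymmetric ATAC-seq convention explicit, which is the source of the residual 1-bp shift described above when only a single offset is set to zero.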
Quality control metrics
CUT&RUNTools reports a number of metrics to evaluate the quality of a CUT&RUN dataset, including fragment size distribution, adapter content percentage, library size, read duplication rate, alignment percentage, number of peaks, and enrichment of the expected motif. The fragment size is measured from the start and end positions of a pair of reads in paired-end sequencing. Since the experimental protocol enriches short fragments, it is routine to check that the fragment size is within the expected range (e.g., ≤ 120 bp). The quality of sequence reads is evaluated by the adapter content percentage, which is the percentage of reads retained after the read trimming step. For a good-quality dataset, the number of reads removed by trimming should be less than 10–15%, mostly corresponding to short fragments. A substantially higher number may indicate technical problems such as self-ligation. The library size, which is defined as the number of reads in the sample library, should be at minimum 10 million and ideally at least ~15–20 million. The read duplication rate is defined as the fraction of paired reads that have identical starts for the first mate and identical ends for the second mate. A good-quality dataset should typically have a low read duplication rate (10–15%), although the rate may be higher for factors with an affinity for low-complexity regions. The alignment percentage is computed as the percentage of reads that can be mapped concordantly to the reference genome. For a good dataset, the alignment percentage should be high (e.g., > 90%). CUT&RUNTools detects peaks by applying MACS2 after filtering out a number of uninteresting regions (including RepeatMasked regions, chromosome M, and any blacklisted regions). In case there is prior knowledge regarding the expected number of peaks, this may also serve as a guide to evaluate the quality of the data.
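The read duplication rate defined above can be computed with a short Python sketch. It assumes that each read pair has already been reduced to a (chromosome, first-mate start, second-mate end) tuple and that the rate is the share of pairs beyond the first copy of each distinct tuple; this counting convention and the function name are assumptions, not taken from the pipeline.

```python
from collections import Counter


def duplication_rate(fragments):
    """Fraction of read pairs that repeat an already seen (chrom, start, end)."""
    counts = Counter(fragments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    duplicates = total - len(counts)  # every copy beyond the first
    return duplicates / total


# Three copies of one fragment plus one unique fragment.
example_pairs = [("chr1", 100, 150)] * 3 + [("chr1", 200, 260)]
rate = duplication_rate(example_pairs)
```

In this example two of the four pairs are extra copies, so the rate is 0.5, well above the 10–15% range suggested above for a good-quality dataset.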
For transcription factors with known sequence specificity, the enrichment of the expected motif should be high at the detected peaks. As there is no single score that captures the overall quality, the users are encouraged to make their own judgment call by considering the collective information.
Installation and usage
Installation instructions are provided at https://bitbucket.org/qzhudfci/cutruntools/src/default/. To use the pipeline, users first create a new job, which entails modifying the provided JSON configuration file with information about the sample FASTQ file path, output path, SLURM resource requirements, and various settings. Then, users execute ./create_scripts.py config.json to create a working directory and a set of tailored SLURM submission scripts. Finally, to start the analysis for a sample of interest, users simply execute ./integrated.all.steps.sh GATA1_R1_001.fastq.gz. This script performs the entire analysis pipeline via a single-command interface. Options are also available for running the steps of the pipeline individually (see the manual on the website for details).
Public dataset analysis
In the GATA1 study, GATA1, TAL1, KLF1, and NFE2 ChIP-seq experiments were downloaded from GEO. In the MAX and MYC example, public CUT&RUN samples were downloaded from GEO and compared against ChIP-seq experiments from the ENCODE consortium (see the “Availability of data and materials” section for accession IDs). ChIP-seq raw reads were trimmed, aligned, and subjected to peak calling following standard MACS2 narrow peak settings (-q 0.01 -B --SPMR) [9, 10, 11]. CUT&RUN datasets were processed with CUT&RUNTools using the default trimming and alignment settings. For MYC and MAX CUT&RUN, fragments of all sizes were kept so as to capture both free DNA and nucleosomal DNA binding. For TAL1 and KLF1 CUT&RUN, fragments of sizes ≤ 120 bp were selected for downstream analysis. Then, fragments in BAM files were subjected to peak calling with MACS2 (default narrow peak settings). To compare with ChIP-seq, we subset the ChIP-seq experiment to its most significant peaks so that the resulting peak number was similar to the total peak number in the corresponding CUT&RUN experiment. Where the peak coverage was higher in CUT&RUN than in ChIP-seq, subsetting was done in CUT&RUN instead. Then, we performed motif scanning using FIMO to locate peaks containing the enriched motif for the factor. The motif instances within the peaks were next overlapped between CUT&RUN and ChIP-seq, and a Venn diagram was drawn. The significance of the overlap was computed using the Jaccard R package with the bootstrap method and the number of bootstrap iterations set to 100,000. Motif scanning and footprinting analyses used the following reference motifs from the JASPAR database: MA0140.2 (GATA1.TAL1), MA0493.1 (KLF1), MA0841.1 (NFE2), MA0035.2 (GATA1), MA0058.3 (MAX), and MA0147.3 (MYC). A FIMO motif scanning P value of 0.0005 was used for all motifs, except MA0035.2, for which P = 0.001 was used due to the motif’s short length.
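The overlap statistic used in this comparison can be sketched in Python. The block computes only the Jaccard index between two sets of motif instances; the bootstrap significance test performed by the Jaccard R package is not reproduced here. Representing motif instances as hashable identifiers, and the function name, are assumptions made for illustration.

```python
def jaccard_index(sites_a, sites_b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two collections of sites."""
    a, b = set(sites_a), set(sites_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical motif instances keyed by (chromosome, start, strand).
cut_and_run = {("chr1", 100, "+"), ("chr1", 500, "-"), ("chr2", 40, "+")}
chip_seq = {("chr1", 100, "+"), ("chr2", 40, "+"), ("chr3", 7, "-")}
overlap = jaccard_index(cut_and_run, chip_seq)
```

Here two instances are shared out of four distinct instances in total, giving an index of 0.5.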
In summary, CUT&RUNTools provides a means of directly detecting TF binding through assessment of the protection of TF-bound DNA from enzyme cleavages and should enable biologists to realize advantages provided by CUT&RUN. Thus, CUT&RUNTools represents a valuable enabling tool for genomic biologists to analyze and interpret CUT&RUN data and extend insights into the regulatory mechanisms.
We thank Peter Skene and Steven Henikoff for the advice on CUT&RUN protocols, Birgit Knoechel of Dana-Farber Molecular Biology Core Facility for the DNA sequencing, Harvard Medical School Research Computing for providing the computing resource for sequencing data analysis, and the members of Stuart Orkin lab meeting, Daniel Bauer, and Alan Cantor for the useful feedback.
The review history is available as Additional file 3.
QZ, NL, SHO, and G-CY conceived the project and wrote the paper. QZ implemented the CUT&RUNTools. NL performed the CUT&RUN experiments. All authors read and approved the final manuscript.
This work was supported by the Howard Hughes Medical Institute (HHMI to SHO); National Heart, Lung, and Blood Institute (NHLBI) (R01 HL119099 to G-CY; R01 HL032259 to SHO); and National Human Genome Research Institute (NHGRI) (HG009663 to G-CY).
Ethics approval and consent to participate
The authors declare that they have no competing interests.
- 5. Skene PJ, Henikoff S. An efficient targeted nuclease strategy for high-resolution mapping of DNA binding sites. Elife. 2016;6:1–35.
- 10. Ben L, Steven S. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2013;9:357–9.
- 11. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2015:1–9.
- 17. Hasegawa A, Shimizu R. GATA1 activity governed by configurations of cis-acting elements. Front Oncol. 2017;6:1–7.
- 25. Yoo AB, Jette MA, Grondona M. SLURM: simple linux utility for resource management; 2003. p. 44–60.
- 27. Fu Y, Wu PH, Beane T, Zamore PD, Weng Z. Elimination of PCR duplicates in RNA-seq and small RNA-seq using unique molecular identifiers. BMC Genomics. 2018;19:1–14.
- 30. Chen H, Boutros PC. VennDiagram: a package for the generation of highly-customizable Venn and Euler diagrams in R. BMC Bioinformatics. 2011;12:1–7.
- 31. Khan A, Fornes O, Stigliani A, Gheorghe M, Castro-Mondragon JA, van der Lee R, et al. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 2017; Available from: http://academic.oup.com/nar/article/doi/10.1093/nar/gkx1126/4621338.
- 33. Zhu Q, Liu N, Yuan G, Orkin S. CUT&RUNTools: a flexible pipeline for CUT&RUN processing and footprint analysis. Raw sequencing reads. Gene Expression Omnibus. 2019; https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE136251. Accessed 24 Aug 2019.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.