Pinpointing transcription factor binding sites from ChIP-seq data with SeqSite
- 4.3k Downloads
Chromatin immunoprecipitation combined with the next-generation DNA sequencing technologies (ChIP-seq) becomes a key approach for detecting genome-wide sets of genomic sites bound by proteins, such as transcription factors (TFs). Several methods and open-source tools have been developed to analyze ChIP-seq data. However, most of them are designed for detecting TF binding regions instead of accurately locating transcription factor binding sites (TFBSs). It is still challenging to pinpoint TFBSs directly from ChIP-seq data, especially in regions with closely spaced binding events.
With the aim to pinpoint TFBSs at a high resolution, we propose a novel method named SeqSite, implementing a two-step strategy: detecting tag-enriched regions first and pinpointing binding sites in the detected regions. The second step is done by modeling the tag density profile, locating TFBSs on each strand with a least-squares model fitting strategy, and merging the detections from the two strands. Experiments on simulation data show that SeqSite can locate most of the binding sites more than 40-bp from each other. Applications on three human TF ChIP-seq datasets demonstrate the advantage of SeqSite for its higher resolution in pinpointing binding sites compared with existing methods.
We have developed a computational tool named SeqSite, which can pinpoint both closely spaced and isolated binding sites, and consequently improves the resolution of TFBS detection from ChIP-seq data.
KeywordsBinding Region Canonical Motif Fragment Size Estimation Transcription Factor Binding Region Effective Genome Size
Exploring protein-DNA binding events in a genome-wide manner is a key step in studying transcription regulation. Chromatin immunoprecipitation (ChIP)  followed by hybridization to DNA tiling arrays (ChIP-chip) [2, 3, 4] or by next-generation high-throughput sequencing (ChIP-seq) [5, 6, 7, 8, 9] are major techniques for experimentally profiling the binding events. Due to many advantages of next-generation sequencing [10, 11], ChIP-seq measures immunoprecipitated DNA fragments at a higher signal-to-noise ratio than ChIP-chip, and provides the potential to detect protein-DNA binding locations at a higher resolution . ChIP-seq is being widely used on large mammalian genomes to comprehensively map in vivo transcription factor (TF)-binding events [6, 12] and histone marks [5, 13].
In ChIP-seq, DNA molecules are isolated after protein-DNA cross-linking, and randomly sheared into fragments by sonication, followed by immunoprecipitation. After cross-link reversal, immunoprecipitated (IPed) DNA fragments are sequenced with next-generation sequencing machines, such as the Illumina Genome Analyzer. Details of the experimental procedures can be found in recent literature such as [14, 15]. Supplementary Figure S1 in Additional File 1 shows a typical workflow of ChIP-seq experiment. The experiment produces large amounts of short reads measured from 5’end of each IPed double-strand DNA fragment. Typical read lengths are 25-36 nucleotides (nt) as in profiling of transcription factors binding events on the human genome by the ENCODE project [16, 17]. The sequenced reads are then mapped to the reference genome. Unmapped reads are excluded in following analyses, and those reads that can be mapped to multiple locations are also filtered out in most applications. The remaining uniquely mapped reads (called tags) assigned with genomic positions and orientations that provide the information of the starting sites and directions of the IPed DNA fragments.
Transcription factors recognize and bind to specific DNA sequences (named motifs, typically about 5~20 base-pair (bp) long) [18, 19]. The binding locations are called transcription factor binding sites (TFBSs). A major purpose of ChIP-seq experiments with transcription factors is to accurately identify TFBSs on the genome. Several computational methods and tools have been developed to detect TFBSs from ChIP-seq data. The most straight-forward method is to count tags along the genome and detect TFBSs or binding regions that cover them with highly accumulated reads (e.g., ). However, as ChIP-ed DNA fragments are of ~150-300-bp long but the sequencing reads are much shorter, the location detected in this way will be very imprecise. A better strategy is to extend each tag to a certain length (usually the estimated fragment length) along its strand orientation to generate “pseudo fragments” (composing the “extended set” or “XSET” ). Such pseudo fragments are then profiled along the genomic coordinates, and TFBSs can be detected from profile peaks. Methods taking this type of strategy include PeakSeq , FindPeaks , F-Seq , GLITR  and HPeak . The extension step works similar to a low-pass filtering in signal processing, which therefore results in a low resolution in detection of peaks. More sophistical methods were developed to improve the resolution of TFBS detection without extending tags . For example, CisGenome , QuEST , MACS  and USeq  take similar strategies to shift both forward and reverse tag profiles toward the center to form an integrated tag profile for peak calling; SISSRs  calculates read counts of the forward and reverse strands as positive and negative signals, respectively, and takes the transition point as the candidate site; spp  takes into account the symmetry of the two-stand tag profiles and detects paired peaks. Pepke et al  gives a comprehensive review on these methods and discussed the need of peak deconvolution when there may be multiple binding sites in one peak region. Laajala et al.  and Wilbanks & Facciotti  conducted series of comparative experiments with most of the open-source methods. They observed no substantial difference between the compared methods in the sensitivity and specificity of binding region detection, but noticeable divergence was observed in their spatial resolutions for precisely locating binding sites.
In many cases, TFs can bind to a genomic region at closely spaced adjacent sites. Pinpointing TFBSs from ChIP-seq data is still quite challenging in such cases. For example, taking the case of two adjacent binding sites as an example, many existing methods tend to identify a single site around the two binding sites’ center, or report one region covering both sites , leaving binding site detection for a secondary step of sequence motif analysis. Due to the possible indirect binding of some TFs and the influence of other factors such as the chromatin structure, there are cases that TFBS sequence motifs cannot be found in the detected region and also cases when sequence motifs do not imply real binding events. There is a need to develop methods that can locate multiple adjacent TFBSs from ChIP-seq data at a higher resolution . In this paper, we developed a method called SeqSite to pinpoint TFBSs for both isolated and close adjacent binding events from ChIP-seq data. The method includes a second step for locating binding sites by modeling the tag density profile after binding regions are detected in the first step. A series of simulation experiments were conducted to study the property and performance of the method. The method SeqSite was applied on three published datasets of transcription factors GABP , STAT1  and NRSF . Experiments show that it performs better than or similarly with existing methods on the accuracy of binding site detection, but SeqSite can locate binding sites at a higher resolution and can pinpoint adjacent binding sites even if they are as close as 40-bp from each other. Recent publications reported three new methods, CSDeconv , PICS  and GPS , for similar questions. We compared them and some other classical tools with SeqSite in our experiments and discussed the characteristics and advantages of our method. The software tool SeqSite implemented in C/C++ code is freely available at SeqSite website .
Characteristics of ChIP-seq data for multiple adjacent TFBSs
We further demonstrated the ubiquitous phenomenon of closely spaced binding sites by investigating ChIP-seq datasets of transcription factors GABP, STAT1 and NRSF (see Methods). We used a tag-clustering method to get pairs of neighboring tag clusters on the specific strands, and defined tag density as a measure of the signal intensity to rank all tag clusters (see Methods for details). We focused on genomic regions ranked high to ensure that they are truly binding regions. We defined three variables to characterize the overlapping situation of tag stacks on the two strands. As shown in Figure 1C, variable d1 measures the genomic distance between the density centers of two tag stacks on opposite strands. Variable d2 measures the lag after the center on the forward strand (the distance from the center to the ending of the tag stack along the sequencing direction) and d3 measures the lag on the reverse strand. We estimated the boundary of a tag stack by comparing with the background signal (see Methods for details). Figure 1D shows average d1, d2 and d3 with standard deviations in the top 5% densest tag clusters in each dataset, and Supplementary Figure S2 in Additional File 1 provides the results based on other parameter settings. We can see that the averages of d2 and d3 in all three datasets are larger than the averages of d1, which indicates that many TF-binding regions must contain multiple adjacent binding sites.
The SeqSite method
We developed a method called SeqSite for detecting binding regions and pinpointing closely spaced adjacent binding sites in detected binding regions. It first aggregates tags to form tag clusters on each strand. Then the Poisson models with dynamic parameters are adopted to assess the statistical significance of tag enrichment comparing to the control data or local background. SeqSite filters out tag clusters that are not significant according to a given false discovery rate. Tag clusters with fewer than 10 tags or shorter than 100-nt are also excluded, for those tag cluster regions are probably artifacts  and too noisy for the secondary binding site detection. The remaining tag clusters are reported as TF binding regions with refined boundaries. This is the first step of the method.
Supplementary Figure S3 in Additional File 1 gives the diagram of the whole system. We implemented SeqSite in C/C++ code with efficient algorithm design. The software tool SeqSite can be downloaded for free academic use at SeqSite website .
Binding site detection on simulated data
Summary of the SeqSite results on the three ChIP-seq datasets at FDR 10%.
# BRs with multi-BS (%)
We first studied the effect of sequencing depth on the resolution of detecting the binding site locations. We generated 10 to 200 tags for each binding region on one strand, which covers most (82.8%) of the situations in the real data (Supplementary Figure S4 in Additional File 1 shows the distribution on the real data). Figure 3B shows the distribution of the detected binding sites at different sequencing depths for a single binding site. We can see that the detected binding sites center at the true binding site location. Increasing the sequencing depth can improve the resolution but SeqSite can work reasonably well even when the coverage is low. When there are more than 70 tags in the binding region on one strand, most (>97%) of detected binding sites are within 5 bp from the true binding site. When the coverage drops to only 10 tags on each strand, the variance in the estimated site locations becomes larger, but still in 78% of the experiments, the distance between the detected position and the true binding site is less than 10 bp. For cases when there are two adjacent binding sites apart by 60 bp as the example shown in Figure 3C, we observed that the SeqSite successfully distinguish two binding sites except for the lowest sequencing depth, although the locations tend to be closer to each other than the true sites. The effect of sequencing depth is minor unless it is too low. Except for the cases with only 10 tags in the region, more than 70% of the detected sites are within 20 bp from the true binding sites.
We then fixed the number of tags in each binding region as 100 tags on each strand and experimented on different distances between two adjacent binding sites, from 10 bp to 100 bp. As shown in Figure 3D, the proposed method can successfully identify the two sites when their distance is as close as 40 bp. The further apart the two sites are, the more accurate the detections are. When the distance is 40-60 bp, 75.7% of the detected sites are within 20 bp around the true sites on average; when the distance increases to more than 60 bp, 77.1% of the sites are within 10 bp on average. When distance is less than 20 bp, the method tends to report one binding sites at approximately the center of the two true binding sites. Also, considering the length of binding motifs, it is less likely that the same TF can bind to two motifs too close to each other. Therefore, we set the SeqSite software to combine two identified sites into one if they are closer than 20 bp from each other (see Methods).
Application to the GABP, STAT1 and NRSF ChIP-seq datasets
As the true answers of accurate binding sites are not available on real data, we adopted the strategy to use the occurrence of known TF-associated motifs as an indicator of true detection and use the distance between the detected binding sites and motif centers as an indicator for the detection resolution. GABP, STAT1 and NRSF are all well-studied transcription factors. Previous studies on these ChIP-seq data have shown that many detected binding regions are associated known canonical motifs of these factors [6, 20, 21, 28, 31]. We extracted the position weight matrixes or PWMs of these binding motifs from the TRANSFAC database  and studied their occurrence in the detected binding regions. The sequence LOGOs of the TF-associated motifs are shown in Supplementary Figure S5 in Additional File 1. GABP is associated with a 12-bp canonical motif . STAT1 is associated with three motifs under the data-generation condition: the GAS (gamma-activated sequence) motif and the ISRE motifs (ISRE-3 of 15 bp and ISRE-2 of 14 bp) [20, 31]. We took these motifs together to scan for their occurrence in the STAT1 binding regions. NRSF is associated with a 21-bp canonical motif, and its left and right half-site motifs can also be enriched in detected binding regions [6, 28, 31]. Therefore, we scanned for the occurrence of both the canonical motif and half-site motifs in the detected NRSF binding regions. In the scanning, we used MAST  to identify the sequence matches with P-value < 0.001.
Two examples of the detected binding regions are given in Figure 4D,E to demonstrate the performance of SeqSite. SeqSite detected two adjacent binding sites at each promoter region of genes ITGB2 (also known as CD18) and TOMM22. We can also find two close occurrences of the corresponding sequence motif in each region. It has been reported that the two close GABP binding sites at gene ITGB2 are both essential for transcriptional activation [39, 40, 41]. More examples of the three datasets can be found in Supplementary Figures S6, S7 and S8 in Additional File 1.
Comparison with other methods on binding site detection accuracy
We compared the performance of SeqSite on binding region/site detection with other existing methods. Three classic methods QuEST , MACS , SISSRs  and two recent methods GPS  and PICS  were compared. We chose FDR at 10% for MACS and SISSRs, Q-value cutoff at 0.01 for GPS, and used the default parameters for other programs (see Methods for details), and applied them to the three ChIP-seq datasets for detecting genome-wide sets of TF binding regions and/or TFBSs. Supplementary Table S1 in Additional File 1 summarizes the results.
We first compared SeqSite with other tools on the accuracy of detected binding sites. Since no true answers are available on the real data, we adopted the motif occurrence as an indicator. Although it must under-estimate positive detections, this evaluation is fair for any particular method. Besides, the detected binding sites and sequence motifs may not necessarily overlap with each other so we extend binding sites to standardized binding regions or SBRs for comparison. We did not directly use the binding regions provided by some of the tools compared, because the sizes of those regions were not comparable (Supplementary Figure S9 in Additional File 1). A SBR is a 200-bp genomic region centered at the detected binding site (SeqSite, SISSRs, GPS and PICS) or the binding region summit (QuEST and MACS, as did in literature ). We define the positive detection rate or PDR as the fraction of SBRs with corresponding motif occurrence.
Comparison on binding site detection resolution
The major merit of the proposed SeqSite method is that it can detect binding sites at a high resolution, and can locate closely spaced binding sites. We demonstrated this merit by comparing SeqSite with other tools on detection resolution.
On the NRSF data, however, we found that SeqSite’s performance is behind that of MACS, QuEST and SISSRs, although the investigated fractions given by SeqSite (all above 0.51) are higher than those on STAT1. To understand why this happens, we looked into some examples in more details. We found that there are cases where SeqSite gives reasonable multiple binding site detections but doesn’t gain on the resolution measured in this way. For example, in Supplementary Figure S8D in Additional File 1, the long overlap between tag clusters on both strands indicates there must be two adjacent binding sites, which is also cross-validated by the occurrence of the canonical sequence motif. However, because the two motif centers are too close to each other and the canonical motif is relatively long, the two binding sites identified by SeqSite do not result in shorter distances to each motif center, compared with the distance from the single site in the middle to the nearest motif center. We noticed that the recent methods PICS and GPS also perform less well on NRSF data. This reflects that there must be different patterns in NRSF binding that need further investigations comparing to the other TFs.
The interaction of transcription factors and other proteins with genomic DNA is a major knot in the complex system of molecular regulation in cells. Profiling DNA binding sites of these factors at a high resolution is an important step in understanding their functions and regulation networks. It has been reported that some transcription factors have multiple binding sites on DNA at close locations [39, 40, 41, 44]. We developed the method SeqSite that can detect such clustered multiple binding sites at a high resolution from ChIP-seq data. Experiments on the GABP, STAT1 and NRSF ChIP-seq data show that a large portion of TF binding events are associated with multiple adjacent binding sites. In recent years, understanding on the function of transcription factors has been widened. They not only play the role of turning on or off their target genes, but also regulate the gene expression in a quantitative manner. Pinpointing the details of their binding on DNA is crucial for better understanding of the mechanism.
Many existing work depends on a secondary step of sequence motif analysis for locating binding sites in detected binding regions. However, due to the complexity of the protein-DNA interaction mechanism such as the involvement of co-factors, sequence motifs are not sufficient for locating binding sites precisely [47, 48, 49]. It is also widely observed that many genomic regions with sequence motifs are not bind or not always bind by the corresponding proteins [48, 50]. Chromatin-IP followed by next-generation sequencing provides protein-DNA binding tags at a single-base resolution. By modeling the ChIP-seq signal, methods like SeqSite provide a direct way in pinpointing protein binding sites accurately. The detected high-resolution binding sites can also help analyze sequence features of binding sites where no known motif can be found, and therefore identify possible co-factors.
The current version of SeqSite needs to estimate the fragment size in modeling tag profiles from a binding site. This estimation is assumed to be the same for all binding sites in the same experiment. However, there is a randomness in the size of ChIP-ed DNA fragments from different genomic regions with respect to nucleosome position . This partially explains cases where SeqSite may lose resolution and points to a possible direction for further improving the method. Integrating ChIP-seq data with other kinds of data such as nucleosome positioning data would be a promising approach to improve the resolution of binding site pinpointing and to better understand the data. Another possible improvement is utilizing multi-hit reads by some probabilistic methods such as Gibbs sampling , or integrating multi-hit reads in the model-fitting procedure in SeqSite.
The three TF ChIP-seq datasets used in this study are: human transcription activator GABP (growth-associated binding protein) in Jurkat T lymphoblast cell , transcription activator STAT1 (signal transducers and activator of transcription) in interferon γ-stimulated (IFN-γ) HeLa S3 cell , and transcription repressor NRSF (neuron-restrictive silencer factor) in Jurkat T lymphoblast cell with a polyclonal antibody . We downloaded the GABP and NRSF datasets from , and the STAT1 data from . The three datasets were all sequenced by the Illumina Genome Analyzer, using single-end sequencing with read length of 25~27 nt. All the datasets provide mapped results of short reads against the human genome (hg18 or NCBI build 36), and total numbers of mapped read-tags are 7,862,231, 26,731,492 and 8,813,398 for GABP, STAT1 and NRSF, respectively. The corresponding control data for GABP and NRSF are of 17,404,922 tags, and the control data for STAT1 contain 23,435,631 tags. Unmapped reads and non-uniquely mapped reads are excluded in this study.
Detecting tag-enriched regions by tag clustering
Known from the ChIP-seq protocol, tags will be enriched around binding sites of the studied protein. Usually, in large mammalian genomes such as the human genome, there are lots of desert regions with no or few tags, so efficient ChIP-seq analysis algorithms first filter out those regions, only leaving tag-enriched regions for further analysis.
Instead of scanning the genome with a sliding window for tag-enriched region, SeqSite clusters nearby tags within d-bp (d = 30 by default) to form tag clusters, and the corresponding regions are regarded as tag-enriched regions. This avoids the extra step for merging neighboring enriched regions with a fixed sliding window-size. To remove experimental noise and/or artifacts  and to avoid possible false positives in binding site detecting, SeqSite filters clusters with fewer than 10 tags or of less than 100-nt length. We noticed on real data that there are some binding regions with jagged tag profiles, which will poison binding site detection, so SeqSite filters those short tag clusters by setting the cluster length threshold (100-nt).
Hypothesis testing for tag enrichment
where λcontrol is estimated in the control data from the same region where the tags are enriched in the ChIP-seq data, λcontrol_1k is estimated from the 1-kb window centered at its midpoint, and Rcontrol2chip is the normalization ratio for control against the ChIP-seq sample. As λlocal also reflects the regional noise level, only the bases with tag density profile above the noise level are regarded as binding region. Thus, we can define detected binding regions’ boundary accordingly.
To calculate λBG , we use the efficient genome size as 80% of the whole human genome when the short read length is ~30 nt, as did in SISSRs . In SeqSite, we set 2.4 G-bp (0.8× 3 G-bp) as the default effective genome size for the human genome. But as the read length changes, the effective genome size should be changed correspondingly .
Correction for multiple testing
where Count is the total number of tag clusters tested, and Rank is each p-value’s rank from the smallest to the largest. Tag-enriched regions are then selected by a Q-value threshold (the given FDR) instead of a P-value threshold.
The density of tag-enriched regions
For any tag-enriched region, we define its density as the total tag count divided by the length of this region. We sort all tag-enriched regions by density from high to low. A proportion (say 5%) of top-ranking regions are used for estimating the DNA fragment size, which is a parameter of our tag profile model, and also used for investigating the distances in Figure 1. For example, in a mammalian genome, there are about 10,000 binding regions for a TF. By default, we choose the top 5% (~500 regions) with the strongest tag signals to achieve a reliable and robust estimation. Other choices on the parameter of top-percentage were also examined (for fragment size estimation, see d1 in Figure 1D and Supplementary Figure S2 in Additional File 1).
DNA fragment size estimation
The estimation of average DNA fragment size is a key step in modeling tag profiles. This is done within the densest tag-enriched regions in the data. For each selected tag-enriched region, the distance between the density centers of the forward tag cluster and the reverse tag cluster is calculated (d1 in Figure 1C). The average distance is taken as the estimate of average DNA fragment size. If the DNA fragment length mean is known from the experiment protocol, one can set this parameter with this value instead of estimating. But it is recommended to estimate this parameter from data .
Modeling the tag profile
Following previous work on ChIP-chip studies [57, 58], we assume that the ChIP-ed DNA fragment size (length) l in an experiment follows a gamma distribution p(l) = Gamma(l|α,β). Based on our observation on the real data, we set the shape parameter α = 10. In binding site detection (see below), the largest maximum R2 value in least-squares fitting is achieved when α = 10, comparing with other α values (1, 2, 5, 20, 50, and 100) on both the GABP and NRSF ChIP-seq datasets. The other parameter can be written as αβ, which is the mean of the gamma distribution, and can be estimated as the average DNA fragment size L from data.
Detecting binding sites by least-squares fitting
Given the modeled tag density profile from a binding site, we adopt a least-squares fitting technique to identify TF binding sites. In this case, the least-squares fitting strategy measures the difference between the theoretic tag profile and the observed signal. A sliding window is adopted to scan each tag-enriched region, and sites supporting the least difference are reported as binding sites.
In the least-squares fitting, the independent variables are x i = ftag (i – 1), i = 1,2,…,W, where W is the width of the sliding window we will use to scan for binding sites. We set W as less than the average fragment length L, which makes the d’s in ftag (d) approximation better satisfying the restriction that d is smaller than L. This means that we use only a part of the tag density profile when looking for binding sites.
Meanwhile, we prepare the dependent variable as follows. Given a tag-enriched region, we first count the number of tags starting at each position n denoted as t n (n = 1,2,…,C; C is the size of the tag-enriched region). We then smooth the counts to reduce possible noises introduced in the procedures of fragmentation by sonication and DNA sequencing, and get the smoothed tag density Open image in new window , where b is the size of the smoothing window. We set b = 20 in our experiments. A proper b will reduce noises but not remove adjacent binding information. We use the sliding window with 1-nt step to scan the smoothed tag density s n ’s along the tag-enriched region. In j-th (j = 1,2,…,C-W+1) sliding window, we take the tag density s j + i –1 as dependent variables y ij (i = 1,2,…,W).
The slope β j can be used to represent the relative binding affinity. We move the sliding window along the tag-enriched region, and get a goodness-of-fit curve as shown in the example of Figure 2B. A local maximum on this curve indicates a location where the tag density signal best fits the derived model, and is therefore detected as a possible binding site. If there are more than one binding sites in a tag-enriched region, multiple peaks would be observed. We report multiple binding sites from multiple peaks if they are at least 20 bp away from each other. A binding affinity threshold is set according to the signal level of each tag-enriched region to allow only the “strong peaks” detected. In our experiments, we only extract those peaks with regression slope β j no smaller than half of the maximum slope obtained in each tag-enriched region.
The above procedure is carried out on each DNA strand separately. We take a further step to combine binding sites detected from both strands. Binding sites detected from different strands are reserved unless two sites are too close (less than 20 bp from each other). When two binding sites are close to each other, their tag signal may overlap and shows only one peak on a single strand. But as the signal directions on the two strands are opposite, each strand will detect each of the two sites, and the combination will recover both binding sites. However, considering possible noises and the widths of binding sites, if the locations of the two detected sites from the two strands are within 20 bp from each other, we merge them as one binding sites at the middle. This combination can also rescue some binding sites supported only by one-strand signal because of the exclusion of multiply mapped reads due to mappability.
We used the derived theoretical tag profile to generate a series of simulated data. For single binding sites, given the position of a binding site (denoted as pbs ) and w tags in a tag-enriched region for each strand, we randomly placed w tags each at a position p i with the probability ftag (pbs – p i ) on each strand. For multiple adjacent binding sites, we generated the tags for each single binding site, and accumulated the tag counts at all positions from all binding sites.
We generated simulation data of different situations: (1) Single binding site, with tag counts varying from 10 to 200 at step-size 10; (2) Two adjacent binding sites of distance 60 bp, with tag counts varying from 10 to 200 at step-size 10; (3) Two adjacent binding sites with tag count 100, and with the distance between the two sides varying from 10 to 100 bp at step-size 10 bp. All datasets were generated in 100 duplications.
Existing software tools compared
QuEST  combines forward tags and reverse tags based on a kernel-density-estimation approach and a “peak-shift” strategy, and then identifies significant peaks and provides peak summits an estimate for TFBSs. MACS  models the peak shift size to merge forward and reverse tags for peak detection. The most updated version of MACS can identify sub-peaks and reports the summits of sub-peaks as TFBSs. SISSRs  tries to detect the switch points of the forward and the reverse tag profiles and takes the detected points as binding locations. To locate binding sites, PICS  adopts a t-distribution to model the single-event tag profile , while GPS  empirically models the tag distribution and refines it during iterations. Another recent method CSDeconv  uses a blind deconvolution strategy to locate TFBSs. Because it cannot be applied to large mammalian genomes , it isn’t included in the comparison with SeqSite.
We downloaded QuEST-2.4, MACS-1.4.0beta, SISSRs-1.4 and GPS-0.10.1 from their respective websites, PICS release version 1.4.0 from Bioconductor. We chose the FDR for MACS and SISSRs at 10%, Q-value cutoff for GPS at 0.01. We applied them on the three ChIP-seq datasets without changing the default setting of other parameters.
We developed a software tool called SeqSite of the method presented. It is written in C/C++ and can run on all major computer platforms with Windows, Unix/Linux. The software tool is available at SeqSite website  for free academic use.
We thank Dr. Xiaowo Wang for his helpful discussion. This work is partially supported by the National Basic Research Program of China (2012CB316504) and NSFC grants (Nos. 30625012, 60721003 and 60928007).
This article has been published as part of BMC Systems Biology Volume 5 Supplement 2, 2011: 22nd International Conference on Genome Informatics: Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1752-0509/5?issue=S2.
- 16.Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, et al: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007, 447: 799-816. 10.1038/nature05874.CrossRefPubMedGoogle Scholar
- 20.Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A, et al: Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods. 2007, 4: 651-657. 10.1038/nmeth1068.CrossRefPubMedGoogle Scholar
- 38.SeqSite website. [http://bioinfo.au.tsinghua.edu.cn/seqsite]
- 44.Bruce AW, Donaldson IJ, Wood IC, Yerbury SA, Sadowski MI, Chapman M, Gottgens B, Buckley NJ: Genome-wide analysis of repressor element 1 silencing transcription factor/neuron-restrictive silencing factor (REST/NRSF) target genes. Proc Natl Acad Sci U S A. 2004, 101: 10458-10463. 10.1073/pnas.0401827101.PubMedCentralCrossRefPubMedGoogle Scholar
- 53.QuEST website. [http://mendel.stanford.edu/sidowlab/downloads/quest]
- 54.PeakSeq website. [http://archive.gersteinlab.org/proj/PeakSeq]
- 56.Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B. 1995, 57: 289-300.Google Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.