iSeg: an efficient algorithm for segmentation of genomic and epigenomic data
Identification of functional elements of a genome often requires dividing a sequence of measurements along a genome into segments where adjacent segments have different properties, such as different mean values. Despite dozens of algorithms developed to address this problem in genomics research, methods with improved accuracy and speed are still needed to effectively tackle both existing and emerging genomic and epigenomic segmentation problems.
We designed an efficient algorithm, called iSeg, for segmentation of genomic and epigenomic profiles. iSeg first utilizes dynamic programming to identify candidate segments and test for significance. It then uses a novel data structure based on two coupled balanced binary trees to detect overlapping significant segments and update them simultaneously during searching and refinement stages. Refinement and merging of significant segments are performed at the end to generate the final set of segments. By using an objective function based on the p-values of the segments, the algorithm can serve as a general computational framework to be combined with different assumptions on the distributions of the data. As a general segmentation method, it can segment different types of genomic and epigenomic data, such as DNA copy number variation, nucleosome occupancy, nuclease sensitivity, and differential nuclease sensitivity data. Using simple t-tests to compute p-values across multiple datasets of different types, we evaluate iSeg using both simulated and experimental datasets and show that it performs satisfactorily when compared with some other popular methods, which often employ more sophisticated statistical models. Implemented in C++, iSeg is also very computationally efficient, well suited for large numbers of input profiles and data with very long sequences.
We have developed an efficient general-purpose segmentation tool and showed that its results are comparable to, or more accurate than, those of many of the most popular segment-calling algorithms used in contemporary genomic data analysis. iSeg is capable of analyzing datasets that contain both positive and negative values. Tunable parameters allow users to readily adjust the statistical stringency to best match the biological nature of individual datasets, including widely or sparsely mapped genomic datasets or those with non-normal distributions.
aCGH: (micro)array-based comparative genomic hybridization
BBT: Balanced binary tree
DBN: Dynamic Bayesian Network
FDR: False discovery rate
FPM: Fragments per million
FWER: Family-wise error rate
MACS: Model-based Analysis of ChIP-Seq
NGS: Next-generation sequencing
TCGA: The Cancer Genome Atlas
High-throughput genomic assays, such as microarrays and next-generation sequencing, are powerful tools for studying genetic and epigenetic functional elements at the genome scale. A large number of approaches have been developed to exploit these technologies to identify and characterize the distribution of genomic and epigenomic features, such as nucleosome occupancy, chromatin accessibility, histone modifications, transcription-factor binding, replication timing, and DNA copy-number variations (CNVs). These approaches are often applied to multiple samples to identify differences in such features among different biological contexts. When detecting changes in such features, one must consider a very large number of segments that may undergo changes, and robustly calculating statistics for all possible segments is usually not feasible. As a result, heuristic algorithms are often needed to find optimal or near-optimal solutions to the objective function adopted by an approach. This problem is often called the segmentation problem in genomics and the change-point problem in other scientific disciplines.
Solving the segmentation problem typically involves dividing a sequence of measurements along the genome into segments such that adjacent segments differ according to a predefined criterion. For example, if segments without changes have a mean value of zero, then the goal could be to identify those segments of the genome whose means are significantly above or below zero. A large number of such methods have been developed for different types of genomic and epigenomic data [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. Many methods are designed for specific data types or structures, and it is challenging to find versatile programs that handle data with different properties and different underlying statistical assumptions. Previous methods fall into several categories, including change-point detection [2, 3, 9, 10, 12, 14, 18, 19, 20, 21, 22, 23, 24, 25], Hidden Markov models [5, 15, 26, 27, 28], Dynamic Bayesian Network (DBN) models [29, 30], signal smoothing [31, 32, 33, 34], and variational models [35, 36]. For reviews and comprehensive comparisons, see [16, 37, 38, 39, 40].
Many currently available segmentation tools have poor performance, run slowly on large datasets, or are not straightforward to use. To address these challenges, we developed a general method for segmentation of sequence-indexed genome-wide data. Our method, called iSeg, is based on a simple formulation of the optimization problem. Assuming that the significance (i.e., p-value) of a segment can be computed under a given parametric or non-parametric model, iSeg identifies the most significant segments, i.e., those with the smallest p-values. Once the segment with the smallest p-value is found, it is removed from the dataset, and the segment with the second smallest p-value is then sought in the remaining data. The procedure repeats until no remaining segment passes a predefined significance threshold. This simple objective function is intuitive from a biological perspective, since the most statistically significant segments are often also biologically significant.
iSeg has several noteworthy features. First, its simple formulation allows it to serve as a general framework that can be combined with different assumptions about the underlying probability distribution of the data, such as Gaussian, Poisson, negative binomial, or non-parametric models. As long as p-values can be calculated for the segments, the corresponding statistical model can be incorporated into the framework. Second, iSeg is a general segmentation method that can handle both positive and negative signals. Many existing methods cannot deal with negative values because they are designed for specific data types, such as genome-wide read densities, and assume that the data take only non-negative values. Negative values in genomic datasets can occur when analyzing relationships between pairs of profiles, such as difference values or log2 ratios commonly used in fold-change analysis. Currently, most methods segment profiles separately and compare the resulting segmentations. The drawback of this approach is that peaks with different starting and ending positions in different segmentations cannot be conveniently compared, and peaks with different magnitudes may not always be distinguished. Taking the difference of two profiles to generate a single profile overcomes these drawbacks. Third, to deal with cases where segments are statistically significant but their biological significance may be weak, we apply a biological significance threshold, giving practitioners the flexibility to incorporate domain knowledge when calling "significant" segments. Fourth, iSeg is implemented in C++ with carefully designed data structures to minimize computation time and accommodate, for example, multiple or very long whole-genome profiles. Fifth, iSeg is relatively easy to use, with few parameters to be tuned by the user.
Here we describe the method in detail, followed by performance analysis using multiple data types, including simulated data, benchmark datasets, and our own experimental data. The data types include DNA copy-number variations (CNVs) from (micro)array-based comparative genomic hybridization (aCGH) data, copy-number variations from next-generation sequencing (NGS) data, and nucleosome occupancy data from NGS. From these tests, we found that iSeg performs at least as well as popular, contemporary methods.
We adopted a formulation of the segmentation problem from previous methods. The goal is to find segments whose statistical significance, measured by p-values under a given probability distribution assumption, exceeds a predefined level. Priority is given to segments with higher significance: the most significant segment is identified first, followed by the second most significant, and so on. We implemented the method using Gaussian-based tests (i.e., t-test and z-test), as used in other existing methods [3, 7, 9]. Our method achieved satisfactory performance for both microarray and next-generation sequencing data without modifying the hypothesis test. A more common formulation of change-point problems can be found in the change-point detection literature.
To illustrate, Figure 2a shows segments generated from Normal distributions with non-zero means, while the rest of the data are generated from a standard Normal distribution. Two computational challenges are associated with our approach, and they also arise in many previous methods. First, the number of segments that must be examined is very large. Second, overlaps among significant segments need to be detected so that the significance of the overlapping segments can be adjusted accordingly. To deal with the first challenge, we apply dynamic programming combined with exponentially increasing segment scales to speed up the scanning of a long sequence of data points. To deal with the second challenge, we designed an algorithm that couples two balanced binary trees to quickly detect overlaps and update the list of the most significant segments. Segment refinement and merging allow iSeg to detect segments of arbitrary length.
Computing p-values using dynamic programming
iSeg scans a large number of segments, starting with a minimum window length, W_min, and going up to a maximum window length, W_max. The minimum and maximum window lengths have default values of 1 and 300, respectively. The window length increases by a fixed multiplicative factor, called the power factor (ρ), at every iteration. For example, the shortest window length is W_min, and the next shortest window length is ρW_min. The default value of ρ is 1.1. When scanning with a particular window length W, we use overlapping windows spaced W/5 apart; when W is not a multiple of 5, the spacing is rounded up (ceiling). All of these parameters can be changed by the user, but we found the default values to work robustly for all the datasets we have analyzed. The algorithm computes p-values for the candidate segments and detects the set of non-overlapping segments that are most significant among all possible segments.
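To make the scanning scheme concrete, the sketch below enumerates the window lengths and start positions implied by the defaults above, and shows how a prefix-sum array allows the sum (and hence the mean) of any candidate segment to be computed in constant time. The function and type names, and the exact rounding of the window schedule, are our own illustrative assumptions, not the iSeg source code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch (not the iSeg source): enumerate the window lengths and
// start positions scanned with the default parameters described above.
struct Window { std::size_t start, length; };

std::vector<Window> enumerateWindows(std::size_t n, std::size_t wMin = 1,
                                     std::size_t wMax = 300, double rho = 1.1) {
    std::vector<Window> windows;
    std::size_t prevW = 0;
    for (double w = static_cast<double>(wMin); w <= static_cast<double>(wMax); w *= rho) {
        std::size_t W = static_cast<std::size_t>(std::ceil(w));
        if (W == prevW) continue;                       // skip duplicate lengths after rounding
        prevW = W;
        std::size_t step = static_cast<std::size_t>(std::ceil(W / 5.0)); // spacing of W/5, rounded up
        for (std::size_t s = 0; s + W <= n; s += step)
            windows.push_back({s, W});
    }
    return windows;
}

// Prefix sums: sum of x[i..j] = S[j+1] - S[i] in O(1), so the mean (and hence a
// z- or t-statistic) of every candidate window can be computed cheaply.
std::vector<double> prefixSums(const std::vector<double>& x) {
    std::vector<double> S(x.size() + 1, 0.0);
    for (std::size_t i = 0; i < x.size(); ++i) S[i + 1] = S[i] + x[i];
    return S;
}
```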
The median absolute deviation (MAD) is defined as MAD = median(|x_i − median(x)|).
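As an illustration of how a robust scale estimate and a two-sided p-value for a candidate segment could be computed under the Normal assumption described above, the sketch below scales the MAD by 1.4826 (the consistency constant for Normal data) to estimate the standard deviation and applies a z-test to the segment mean. The names and details are our own assumptions, not the actual iSeg implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hedged sketch: robust scale via the MAD and a two-sided Normal p-value for a
// candidate segment, assuming a zero-mean background.
double median(std::vector<double> v) {
    std::size_t m = v.size() / 2;
    std::nth_element(v.begin(), v.begin() + m, v.end());
    double hi = v[m];
    if (v.size() % 2 == 1) return hi;
    std::nth_element(v.begin(), v.begin() + m - 1, v.end());
    return 0.5 * (hi + v[m - 1]);
}

double madSigma(const std::vector<double>& x) {
    double med = median(x);
    std::vector<double> dev(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) dev[i] = std::fabs(x[i] - med);
    return 1.4826 * median(dev);                        // consistent with sd for Normal data
}

// Two-sided p-value for the mean of a length-W segment whose sum is segSum,
// assuming zero-mean Normal background with robust scale sigma.
double segmentPValue(double segSum, std::size_t W, double sigma) {
    double z = (segSum / W) / (sigma / std::sqrt(static_cast<double>(W)));
    return std::erfc(std::fabs(z) / std::sqrt(2.0));    // = 2 * (1 - Phi(|z|))
}
```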
Detecting overlapping segments and updating significant segments using coupled balanced binary trees
Once the p-values of all the segments have been computed, we rank the segments by their p-values from smallest to largest. All segments with p-values smaller than a threshold value, p_s, are kept in a balanced binary tree (BBT1). The default value of p_s is 0.001; with a Bonferroni (or Šidák) correction, this per-test cutoff maintains a family-wise error rate (FWER) of at most a significance level (α) of 0.1 across 100 simultaneous tests, so it is an acceptable, conservative cutoff for multiple testing. It can be changed by the user if necessary. The procedure for detecting overlapping segments is described below as pseudocode. BBT1 stores all significant segments passing the initial significance cutoff (default 0.001). A second balanced binary tree (BBT2) stores the boundaries of the significant segments. After the procedure, SS contains all the detected significant segments. Selecting segments using the balanced binary trees ensures that segments with smaller p-values are kept, while overlapping segments with larger p-values are removed.
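The pseudocode itself is not reproduced here, but the following minimal sketch conveys the coupled-tree idea using the C++ standard library's ordered containers (typically implemented as red-black trees, a form of balanced binary tree): candidates are visited in order of increasing p-value (BBT1), and the boundaries of accepted segments are kept in a second ordered container (BBT2) so that each overlap query takes logarithmic time. Names and details are illustrative assumptions, not the iSeg source.

```cpp
#include <cstddef>
#include <iterator>
#include <map>
#include <vector>

// Illustrative sketch of the coupled-tree selection (not the iSeg source).
struct Segment { std::size_t start, end; double p; };   // inclusive coordinates

std::vector<Segment> selectNonOverlapping(const std::vector<Segment>& candidates) {
    std::multimap<double, Segment> bbt1;                 // BBT1: p-value -> segment
    for (const auto& s : candidates) bbt1.insert({s.p, s});

    std::map<std::size_t, std::size_t> bbt2;             // BBT2: accepted start -> end
    std::vector<Segment> SS;                             // selected significant segments

    for (const auto& kv : bbt1) {                        // smallest p-value first
        const Segment& seg = kv.second;
        // Accepted segments are disjoint, so it suffices to check the accepted
        // segment with the largest start position not exceeding seg.end.
        bool overlaps = false;
        auto it = bbt2.upper_bound(seg.end);
        if (it != bbt2.begin() && std::prev(it)->second >= seg.start)
            overlaps = true;
        if (!overlaps) {
            bbt2[seg.start] = seg.end;
            SS.push_back(seg);
        }
    }
    return SS;
}
```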
Refinement of significant segments
The significant segments are refined further by expansion and shrinkage. Without loss of generality, the procedure (see the SegmentExpansion text box) describes expansion on the left side of a segment only; expansion on the right side and shrinkage are done similarly. During expansion and shrinkage, a check for overlap with other segments is applied so that the algorithm yields only disjoint segments.
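A minimal sketch of left-side expansion follows, reusing the Segment type and segmentPValue() helper from the sketches above; it is an illustration under the same Normal assumption, not the actual SegmentExpansion procedure. The left boundary is moved one position at a time as long as the segment's p-value improves and the segment does not run into its left neighbour.

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch of left-side expansion. S is the prefix-sum array of the
// data, sigma the robust scale estimate, and leftLimit is one past the end of
// the previous significant segment (or 0 if there is none).
void expandLeft(Segment& seg, const std::vector<double>& S, double sigma,
                std::size_t leftLimit) {
    while (seg.start > leftLimit) {
        std::size_t newStart = seg.start - 1;
        std::size_t W = seg.end - newStart + 1;
        double sum = S[seg.end + 1] - S[newStart];
        double pNew = segmentPValue(sum, W, sigma);
        if (pNew >= seg.p) break;            // stop once expansion no longer improves the p-value
        seg.start = newStart;
        seg.p = pNew;
    }
}
```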
Merging adjacent significant segments
Once all the significant non-overlapping segments have been detected and refined in the previous steps, iSeg performs a final merging step for adjacent segments (those with no other significant segment in between). The procedure is straightforward: for each pair of adjacent segments, if the merged segment, whose range is defined by the left boundary of the first segment and the right boundary of the second segment, has a p-value smaller than those of the individual segments, the two segments are merged. The new segment is then tested iteratively for merging with its own adjacent segments, and the procedure continues until no more segments can be merged. With refinement and merging, iSeg can detect segments of arbitrary length, both long and short. We also added an option to merge only segments whose distance is no more than a given threshold, where distance is measured as the difference between the starting position of the second segment and the ending position of the first.
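An illustrative sketch of this merging step is given below, again reusing the Segment type, the prefix-sum array S, and segmentPValue() from the sketches above; maxGap is our name for the optional distance threshold described in the text.

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch of the merging step. segs must be sorted by position and
// non-overlapping.
std::vector<Segment> mergeAdjacent(std::vector<Segment> segs,
                                   const std::vector<double>& S, double sigma,
                                   std::size_t maxGap) {
    bool merged = true;
    while (merged) {
        merged = false;
        for (std::size_t i = 0; i + 1 < segs.size(); ++i) {
            std::size_t start = segs[i].start, end = segs[i + 1].end;
            if (segs[i + 1].start - segs[i].end > maxGap) continue;   // too far apart
            double sum = S[end + 1] - S[start];
            double pMerged = segmentPValue(sum, end - start + 1, sigma);
            if (pMerged < segs[i].p && pMerged < segs[i + 1].p) {
                segs[i] = {start, end, pMerged};                       // merge the pair
                segs.erase(segs.begin() + i + 1);
                merged = true;
                break;                                   // rescan after each merge
            }
        }
    }
    return segs;
}
```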
In iSeg, p-values are calculated for the potentially significant segments. Using a common p-value cutoff, for example 0.05, to determine significant segments can produce a large number of false positives due to multiple comparisons. To cope with the multiple-comparison issue, which can be severe when the sequence of measurements is long, we use false discovery rate (FDR) control. Specifically, we employ the Benjamini-Hochberg (B-H) procedure to obtain a cutoff value for a predefined false discovery rate (α), which has a default value of 0.01 and can also be set by the user. Other types of cutoff can be used to select significant segments, such as a fixed number of most significant segments.
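For reference, a compact sketch of the Benjamini-Hochberg step as it could be applied here is shown below (the function name and interface are illustrative): given the candidate segments' p-values and a target FDR α, it returns the largest p-value cutoff at which all smaller or equal p-values are called significant.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hedged sketch of the Benjamini-Hochberg cutoff for a target FDR alpha.
double bhCutoff(std::vector<double> pvals, double alpha = 0.01) {
    std::sort(pvals.begin(), pvals.end());
    const std::size_t m = pvals.size();
    double cutoff = 0.0;                                  // nothing significant by default
    for (std::size_t k = 0; k < m; ++k) {
        // BH condition: p_(k+1) <= (k+1)/m * alpha, using 1-based rank k+1.
        if (pvals[k] <= static_cast<double>(k + 1) / m * alpha)
            cutoff = pvals[k];
    }
    return cutoff;
}
```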
Often in practice, biologists prefer to call only signals above a certain threshold. For example, in gene expression analysis, a minimum two-fold change may be required to call differentially expressed genes. Here we add a parameter, bc, which can be tuned by the user to allow more flexible and accurate calling of significant segments. The default output gives four bc cutoffs: 1.0, 1.5, 2.0 and 3.0. A biological cutoff of 1.0 means that the height of a segment has to be greater than 1.0 times the standard deviation of the data for it to be called significant, regardless of the length of the segment. The biological cutoff parameter allows users to select the segments that are more likely to be biologically significant, based on their knowledge of the problem they are studying.
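A sketch of how such a filter could be applied after segmentation is given below; here the segment "height" is interpreted as the segment mean, which is our assumption, and the Segment type and prefix-sum array S are reused from the earlier sketches.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch of the biological cutoff bc: keep a segment only if the
// absolute value of its mean exceeds bc times the (robust) standard deviation
// sigma of the data.
std::vector<Segment> applyBiologicalCutoff(const std::vector<Segment>& segs,
                                           const std::vector<double>& S,
                                           double sigma, double bc = 1.0) {
    std::vector<Segment> kept;
    for (const Segment& s : segs) {
        double mean = (S[s.end + 1] - S[s.start]) / (s.end - s.start + 1);
        if (std::fabs(mean) > bc * sigma) kept.push_back(s);
    }
    return kept;
}
```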
Processing of the raw NGS data from maize
Raw fastq files were clipped of 3′ Illumina adapters with cutadapt 1.9.1. Reads were aligned to B73 AGPv3 with bowtie2 v2.2.8; alignments with quality < 20 were removed, and fragment intervals were generated with bedtools (v2.25) bamtobed. Fragments were optionally subset based on their size. Read counts in 20-bp non-overlapping windows across the entire genome were calculated with bedtools genomecov and normalized for sequencing depth (to fragments per million, FPM). Difference profiles were calculated by subtracting heavy FPM from light FPM. Quantile normalization was performed using the average score distributions within a given combination of digestion level (or difference) and fragment size class.
We compared our method with several previous methods for which we were able to obtain executable programs: HMMSeg, CGHSeg, DNAcopy [9, 24], fastseg, cghFLasso, BioHMM-snapCGH, mBPCR, SICER, PePr and MACS. Among them, CGHSeg, DNAcopy, BioHMM-snapCGH, mBPCR and cghFLasso are designed specifically for DNA copy-number variation data; MACS, SICER and PePr are designed for ChIP-seq data; and HMMSeg is a general method for segmentation of genomic data. Each method has parameters that can be tuned by the user to achieve better performance. In our comparative study, we carefully selected parameters based on the recommendations provided by the authors of each method. For each method, including iSeg, a single set of parameters was used for all datasets except where specified. Some of the methods require post-processing to identify significant segments.
The methods CGHSeg, DNAcopy and fastseg depend on random seeds given by the user (or set automatically at run-time), and the F1-scores from different runs are very similar but not identical. These methods were therefore run with three different random seeds, and the average F1-scores were used to measure their performance.
Performance on simulated data
The simulated profiles were generated under varying noise conditions, with signal-to-noise ratios (SNR) of 0.5, 1.0 and 2.0, corresponding to poor, realistic and best-case scenarios, respectively. Ten different profiles of length 5000 were simulated.
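As an illustration of this setup, a profile of the kind described above can be generated as standard Normal background noise with a few embedded segments whose mean is shifted by SNR standard deviations; the segment positions, lengths and seed below are arbitrary choices of ours, not those used in the paper.

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Illustrative sketch of generating one simulated profile: standard Normal
// noise plus embedded segments whose mean is shifted by SNR * sd.
std::vector<double> simulateProfile(std::size_t n, double snr, unsigned seed) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> noise(0.0, 1.0);
    std::vector<double> x(n);
    for (double& v : x) v = noise(gen);

    const std::pair<std::size_t, std::size_t> segments[] = {   // (start, length), illustrative
        {500, 50}, {2000, 200}, {4000, 20}};
    for (const auto& seg : segments)
        for (std::size_t i = seg.first; i < seg.first + seg.second && i < n; ++i)
            x[i] += snr;                     // shift the segment mean by SNR * sd
    return x;
}

// Example usage: ten profiles of length 5000 at SNR = 1.0 could be generated by
// calling simulateProfile(5000, 1.0, k) for k = 0..9.
```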
Performance on experimental data
DNA copy-number variation (CNV) data
To assess the performance of iSeg on experimental data, we used three different datasets: the Coriell dataset with 11 profiles, the BACarray dataset with three profiles, and a dataset from The Cancer Genome Atlas (TCGA) with two profiles. The 11 profiles in the Coriell dataset correspond to 11 cell lines: GM03563, GM05296, GM01750, GM03134, GM13330, GM01535, GM07081, GM13031, GM01524, S0034 and S1514. We constructed "gold standard" annotations using a consensus approach. We first ran all the methods using several different parameter settings for each method. The resulting segments from all parameter settings of all methods were combined to give an initial set of potential segments. Test statistics and p-values were then calculated for all these segments using the same probability distribution assumption described in Methods. The Benjamini-Hochberg procedure was then used to correct for multiple comparisons, and an adjusted p-value cutoff of 0.05 was used to select the set of segments serving as the gold standard. The annotations derived using this consensus approach are provided as Additional file 1.
Comparison of computational times (in seconds) on simulated data and Coriell data. These are total times required to process 10 simulated and 11 Coriell profiles
Simulation (SNR ≃ 1.0, n = 5000)
Simulation (SNR ≃ 1.0, n = 100 K)
Differential nuclease sensitivity profiling (DNS-seq) data
We then tested our method on next-generation sequencing data, for which discrete probability distribution models have been used in most previous methods. The dataset profiles were genome-wide read counts from light or heavy digests (with zero or positive values) or difference profiles (light minus heavy, with positive or negative values). The difference profiles are also referred to as sensitivity or differential nuclease sensitivity (DNS) profiles.
Segmentation of single nuclease sensitivity profiles
Segmentation of difference profiles with both positive and negative values
In this study, we designed an efficient method, iSeg, for the segmentation of large-scale genomic and epigenomic profiles. When compared with existing methods using both simulated and experimental data, iSeg showed comparable or improved accuracy and speed. iSeg performed equally well when tested on very long profiles, making it suitable for real-time deployment, including in online web servers that handle large-scale genomic datasets.
In this study, we assumed that the data follow a Gaussian (Normal) distribution. The algorithm, however, is not limited to this assumption. Other hypothesis tests, such as Poisson, negative binomial and non-parametric tests, can be used to compute p-values for the segments. Data generated by next-generation sequencing (NGS) are often assumed to follow either a Poisson or a negative binomial distribution [53, 54]. However, a recent study of ours (to be published) showed that normality-based tests such as the t-test or Welch's t-test can perform equally well for NGS data after appropriate normalization, indicating that a Normal approximation is likely adequate for NGS data, at least for some applications. For segmentation problems, when the segments tend to be relatively large, the averages of signals within segments are well approximated by Normal distributions, which likely explains why iSeg performed well on NGS data. Alternatively, p-values of the segments can be computed based on Poisson or negative binomial distributions and used directly in the rest of the segmentation procedure. Such flexibility and general applicability make iSeg especially useful for segmenting a variety of genomic and epigenomic data.
The use of dynamic programming and a power factor makes iSeg computationally more efficient when analyzing very large data sets, especially with sparse signals as is typical for many types of NGS data sets. The refinement step identifies the exact boundaries of segments found by the scanning step. Merging allows iSeg to detect segments of any length. Together, these steps make iSeg an accurate and efficient method for segmentation of sequential data.
The statistic used in our method is very similar to that described for optimal sparse segment identification. In that study, however, the segments are identified using an exhaustive approach, which is not efficient for segmenting very large profiles. To speed up computation, the optimal sparse segment method assumes that segments have relatively short lengths, which is not true for some datasets. In contrast, the algorithm designed in this study allows us to detect segments of any length with demonstrably greater efficiency.
The gold standard approach is desirable, but deriving a true gold standard is complex and possibly subjective. A gold standard generated using a consensus approach does not guarantee that the truly optimal segments are identified. In addition, the F1-scores may favor iSeg because the test statistic used to generate the gold standard is not employed by the other methods; however, the statistic we used is based on model assumptions shared by many existing methods. Visual inspection of segmentation results clearly shows that iSeg performs well in direct comparisons with other popular, contemporary methods. We expect that future research will benefit from the application of iSeg to compare multiple profiles simultaneously.
We have designed the method to be flexible and versatile, which results in a number of parameters that users can tune. However, the default values work well for all the simulated and experimental datasets; in practice, users are not expected to modify any parameters to obtain satisfactory results. The speed of iSeg would also allow us and fellow researchers to implement it as an online tool that delivers segmentation results in real time.
In this work, we developed a new method, iSeg, for the segmentation of large-scale genomic and epigenomic profiles. The performance of iSeg was demonstrated by comparison with existing segmentation methods using both simulated and real datasets. The computational framework used in iSeg can readily accommodate different assumptions about the underlying probability distribution of the data. The algorithms designed in iSeg allow it to search for, grow and refine segments comprehensively for very long profiles without compromising computational speed too much. iSeg also provides a biological cutoff parameter, allowing researchers to select segments that are more likely to be biologically significant by setting a suitable cutoff based on the problem at hand and/or their prior knowledge. We believe that iSeg will be a very useful tool for segmentation problems in biology.
We would like to thank the researchers who have generated the data used in this study.
This work was supported by a grant from NSF (Plant Genome Research Program, IOS Award 1444532). JZ was also supported by the National Institute of General Medical Sciences of the National Institute of Health under award number R01GM126558.
JZ designed the core iSeg algorithm. JZ, HWB, DLV, and JHD were responsible for the study design and result interpretation. SBG, YL, and PYL performed data analyses. SBG wrote the first draft of the manuscript. All authors revised the manuscript and approved the final version.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 2. Baldi P, Brunak S. Bioinformatics: the machine learning approach. Cambridge: MIT Press; 2001.
- 24. Sen A, Srivastava MS. On tests for detecting change in mean. Ann Stat. 1975;3(1):98–108.
- 25. Yao Q. Tests for change-points with epidemic alternatives. Biometrika. 1993:179–91.
- 26. Jaschek R, Tanay A. Spatial clustering of multivariate genomic and epigenomic information. In: Batzoglou S, editor. Research in Computational Molecular Biology (RECOMB 2009). Lecture Notes in Computer Science, vol. 5541. Berlin, Heidelberg: Springer; 2009.
- 30. Hoffman MM, et al. Integrative annotation of chromatin elements from ENCODE data. Nucleic Acids Res. 2012:gks1284.
- 41. Brodsky E, Darkhovsky BS. Nonparametric methods in change point problems, vol. 243. Springer Science & Business Media; 2013. https://play.google.com/store/books/details?id=GLvwCAAAQBAJ.
- 43. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995:289–300.
- 47. Xu S, et al. Spatial clustering for identification of ChIP-enriched regions (SICER) to map regions of histone methylation patterns in embryonic stem cells. In: Stem cell transcriptional networks: methods and protocols; 2014. p. 97–111.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.