Unraveling overlapping deletions by agglomerative clustering
Abstract
Background
Structural variations in human genomes, such as deletions, play an important role in cancer development. Next-generation sequencing technologies have been central in providing ways to detect such variations. Methods like paired-end mapping allow data from several samples to be analyzed simultaneously in order to, e.g., distinguish tumor-specific from patient-specific variations. However, it has been shown that, especially in this setting, there is a need to explicitly take overlapping deletions into consideration. Existing tools have only minor capabilities to call overlapping deletions and are unable to unravel complex signals to obtain consistent predictions.
Result
We present a first approach specifically designed to cluster short-read paired-end data into possibly overlapping deletion predictions. The method does not make any assumptions on the composition of the data, such as the number of samples, heterogeneity, polyploidy, etc. Taking paired ends mapped to a reference genome as input, it iteratively merges mappings into clusters based on a similarity score that takes both the putative location and size of a deletion into account.
Conclusion
We demonstrate that agglomerative clustering is suitable to predict deletions. Analyzing real data from three samples of a cancer patient, we found putatively overlapping deletions and observed that, as a side effect, erroneous mappings are mostly identified as singleton clusters. An evaluation on simulated data shows high accuracy in separating overlapping from single deletions, compared to other methods which can output overlapping clusters.
Keywords
Similarity Score, Maximal Clique, Agglomerative Cluster, Donor Genome, Deletion Size
Introduction
It is well known that mutations in the human genome are associated with diseases such as cancer. Besides small-scale mutations like single nucleotide variants, larger events such as deletions, insertions, inversions, or interchromosomal rearrangements can have a crucial impact on the initiation and development of cancer. The detection and analysis of these structural variations play an important role in understanding the underlying mechanisms of cancer, its diagnosis and treatment [1, 2, 3, 4].
Current sequencing technologies make it possible to obtain high data volumes at low cost. It has now become affordable to sequence several samples of the same patient, enabling comparative analyses of, e.g., tumor cells versus healthy blood cells, or samples taken before versus after treatment, to distinguish tumor-specific from patient-specific variations, or to observe structural variations over time [5, 6].
In the analysis of such complex data, it is important to consider heterogeneity of various kinds [7]. Besides the differences between several tissues or time points, in cancer one always has to face heterozygosity (mutations only affecting one allele), loss of heterozygosity (mutation inactivating the second allele), aneuploidy (different copy numbers for some chromosomes), copy number alterations (different copy numbers for parts of chromosomes), differentiation of tumor cell lines developing different variations, etc. An additional challenge is that a cancer sample is most likely a mixed sample, i.e., although taken from tumor tissue, it usually also contains normal cells [2, 8, 9].
For the detection of single nucleotide variants (SNVs), several approaches exist, some of which address the above issues. For instance, SomaticSniper [10], JointSNVMix [11] and MutationSeq [12] call somatic SNVs from pairs of normal and tumor samples, where the first two methods follow a Bayesian approach to distinguish tumor-specific from patient-specific SNVs, and the latter builds on clustering by support vector machines. Strelka [13] explicitly models mixtures of tumor and normal cells and can also call small indels. Also, several tools exist to accurately detect SNVs in pooled data [14, 15, 16, 17], even mutations of low abundance. Apart from analyzing single SNVs, haplotype inference and assembly have also been addressed [18, 19, 20]. For the analysis of gene expression data, Bayesian approaches have also been proposed, even considering subtypes of cancer [21].
In contrast to the analysis of SNVs, for the detection of somatic deletions and other larger structural variations, one usually has to process the different samples separately and compare the results afterwards, e.g., subtract deletions found in the healthy sample from those found in the tumor sample. Alternatively, one can pool the data and afterwards select only those calls that are solely based on tumor data [22]. Only recently, joint analyses of several data sets have been proposed [5, 6].
As shown in a preliminary study [5] to detect deletions by a combined analysis of samples from tumor and healthy tissue, there were regions in the tumor genome for which existing tools predicted more deletions than there could actually be on a diploid genome. When instead two diploid sets of chromosomes were assumed, i.e., the tumor sample is actually a mixture of cancerous and healthy cells, almost all data could be explained consistently by explicitly modeling heterozygous deletions on different alleles to be overlapping. These observations have been made particularly in regions where deletions were found in the healthy cells and additional deletions have been predicted in the tumor sample, and thus in regions where it is especially difficult to distinguish cancer-specific from patient-specific mutations. However, the scope of the presented method has been to show that a consistent scenario of overlapping deletions can be found. It greedily constructs a "possible" solution, without raising the claim of reporting a "reasonable" result. Furthermore, the model is restricted to the very specific case of analyzing a mixture of cancer and normal cells, including some additional technical, combinatorial assumptions.
In this paper, we present a method to detect deletions that is particularly designed to handle overlapping deletions. For the sake of flexibility, no assumption on the composition of the data is made: it may come from just one sample or be pooled data, and it may contain different cell lines, aneuploid cells, etc. Being aware of such heterozygosity and specially designed for such data, we refrain from predicting a deletion being "heterozygous or homozygous", or "tumor or patient specific". Instead, besides a tabular listing of the results, a rich visual output for each set of overlapping deletions is provided, allowing an easy inspection of the results.
The method takes as input paired ends that have been mapped to a reference genome and collects those mappings likely originating from the same deletion in clusters. Agglomerative clustering is utilized to cluster mappings by similarity. The similarity score is based on a three-dimensional representation of mappings and deletions similar to the two-dimensional representation introduced by Dew et al. [23] and used for structural variation detection by Sindi et al. [22] in the tool GASV. With this representation, we obtain a score that captures both the putative location and size of the deletion.
We applied our method to a data set from several samples taken from an acute lymphoblastic leukemia patient. Besides examples for predicted overlapping deletions, we find that single, putatively erroneous mappings not assigned to another cluster can nicely be identified as outliers. Since overlapping deletions are rare events and their verification is difficult, we performed an evaluation on simulated data showing good accuracy.
After providing the necessary background in the following section, we introduce our approach in Section "Method". In Section "Results and discussion", we present a simulation-based evaluation and results on real data, before we conclude our study.
Background
We will now give a brief overview of existing approaches to identify structural variations and then introduce the technique of paired-end mapping, which our method is based on.
Structural variation detection
Besides different experimental techniques, there are many computational approaches for structural variation detection [24]. A straightforward idea to detect mutations would be to fully assemble the genome under consideration, the so-called donor genome, and to align it to a reference sequence. To save the time- and cost-intensive assembly and finishing steps which would be necessary to determine the full genome sequence, one usually follows other approaches. Instead of performing a full assembly, one can restrict the process to only reads from regions suspected to harbor a variation, as for instance done in [25]. The tool fermi [26] allows both full assembly and a prefiltering for reads unique to one sample. Other recent methods simultaneously assemble several genomes into a single graph data structure allowing for variation calling [27, 28].
Another, more common approach is to omit any assembly of the donor genome and instead utilize the reads directly to detect differences by mapping them to the finished reference sequence. Basically, there are three classes of methods to identify structural variations from those mappings. (See [24] or [29] for reviews.) (1) Significant fluctuations of the coverage of the reference by mappings can indicate copy number changes. If part of a chromosome is lost or duplicated, the coverage drops or increases, respectively. (2) If a read has not been mapped completely, but in parts, this can indicate different types of mutations. E.g., if one half of the split read is mapped with some space to the remaining part, the segment in between might be absent in the donor genome. If parts are mapped to different chromosomes, this can indicate interchromosomal rearrangements. (3) If the donor genome has been sequenced using a certain technology, pairs of reads (paired ends) are obtained. The orientation and the distance of the reads within a pair are known in the donor, and when the corresponding mappings on the reference do not agree with this pattern, structural variations can be called similar to the split read approach. This paired-end mapping technique, introduced by Korbel et al. [30], will be explained in detail in the following section.
Since all techniques have their advantages and disadvantages, in general, a combination of different techniques is advisable. Different tools can be applied separately and the results are combined in a postprocessing step. There are also tools which combine techniques already during the detection, e.g. inGAPsv [31], CNVer [32], GASVPro [33], SVseq2 [34], or the method by Nord et al. [35]. The method presented here is based on paired-end mapping. It is particularly designed for regions harboring several overlapping deletions. Due to the complexity of such regions, a partial assembly or coverage analysis is hardly possible. However, integrating a split read approach is planned for future work.
Deletion detection by paired-end mapping
When genomic material from a sample is sequenced by short-read sequencing, many overlapping DNA fragments are produced and a certain number of bases are read from both ends of each fragment, resulting in a pair of reads, so-called paired ends. The reading direction and the length of the reads are known. Further, since the fragment size is fixed (in practice, distributed around the desired length), their approximate distance is known as well.
The paired ends from a newly sequenced donor genome can be mapped to a reference genome which is already assembled to a complete DNA sequence. In a region where the two genomes do not differ, the mapped reads have the original direction and their distance coincides with the fragment length. Such a mapping is called concordant. If however a mapping is discordant, i.e., either the orientation is inconsistent or the distance differs significantly from the expected fragment size, this indicates a structural variation in the donor w.r.t. the reference.
In this paper, we only focus on deletions, i.e., a segment present in the reference is not present in (we say deleted from) the donor genome. Hence, paired ends spanning the deletion breakpoint in the donor genome will be mapped to the reference with a distance increased by the size of the deletion but with proper orientation. We call such a mapping stretched.
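The distinction between concordant, discordant and stretched mappings can be sketched as follows. This is a minimal illustration, not the tool's code: the `Mapping` type, its field names and the cutoff factor `k` are assumptions made here. The default of three standard deviations mirrors the length filter mentioned in the "Implementation" section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """Hypothetical container for one mapped read pair."""
    distance: int                    # mapping distance d on the reference
    proper_orientation: bool = True  # reads point towards each other as expected


def classify(m: Mapping, mean: float, sd: float, k: float = 3.0) -> str:
    """Label a mapping given the fragment length mean and standard deviation."""
    if not m.proper_orientation:
        return "discordant"
    if m.distance > mean + k * sd:
        # Properly oriented but too far apart: candidate deletion.
        return "stretched"
    if m.distance < mean - k * sd:
        # Too close together: discordant, but not a deletion signal.
        return "discordant"
    return "concordant"


def implied_deletion_size(m: Mapping, mean: float) -> float:
    """A stretched mapping is longer than expected by roughly the deletion size."""
    return m.distance - mean
```

For a library with mean fragment length 300 and standard deviation 30, a properly oriented pair mapped at distance 900 would be labeled stretched and would imply a deletion of about 600 bases.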
Method
In a region that possibly harbors several deletions, we obtain paired ends from the different alleles which are then all mapped to the same reference sequence. Depending on how similar the deletions are in terms of location and size, it is difficult to separate the mappings and to recover the different deletions. Our goal is to partition the set of mappings into clusters of similar mappings probably belonging to the same deletion. To this end, we utilize the technique of agglomerative clustering, which in general works as follows. At the beginning, each object (in our case stretched mapping) is a singleton cluster, and a similarity score is computed for all pairs of clusters. Then, iteratively, a pair of clusters of maximum similarity is merged, the two original clusters are replaced by this new cluster, and the similarity between the new cluster and all others is recomputed. These merging steps are repeated until either only one cluster is left, or the maximum similarity is below a certain threshold.
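The generic procedure described above can be sketched in a few lines. The function below is a naive illustration that recomputes all pairwise scores in every round; the names `agglomerate`, `similarity`, `merge` and `min_similarity` are placeholders chosen here, and the similarity and merge functions are passed in rather than fixed.

```python
def agglomerate(items, similarity, merge, min_similarity=0.0):
    """Naive agglomerative clustering.

    items          -- initial objects, each treated as a singleton cluster
    similarity     -- function (cluster, cluster) -> float
    merge          -- function (cluster, cluster) -> cluster
    min_similarity -- stop when no pair scores above this threshold
    """
    clusters = list(items)
    while len(clusters) > 1:
        # Find a pair of clusters of maximum similarity.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = similarity(clusters[i], clusters[j])
                if best is None or score > best[0]:
                    best = (score, i, j)
        score, i, j = best
        if score <= min_similarity:
            break  # no remaining pair is similar enough
        # Replace the two original clusters by the merged one.
        merged = merge(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

This version is cubic in the number of objects and is only meant to show the control flow. Keeping the scores in a priority queue, as discussed in the "Implementation" section, avoids rescanning all pairs in every round.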
In the following, we will first explain how we model deletion clusters, and then introduce our similarity score, which is crucial for accurate clustering. Finally, we give some details about our implementation.
Deletion clusters
For our similarity score, we build on this two-dimensional representation of deletion coordinates, as it nicely incorporates both the location and the size of a deletion. Again, we use the x- and y-axis for the deletion start and end coordinates, respectively. But to model the deletion size, instead of the sharply bounded interval, we use a continuous distribution. As mentioned before, assuming that the fragment length follows a normal distribution with mean μ and standard deviation σ, the expected deletion size implied by a mapping with distance d is normally distributed around d − μ with the same standard deviation σ. In the geometric interpretation, for each possible start coordinate for a deletion on the x-axis, this distribution describes the expected end coordinate on the y-axis. Since the deletion has to be located in between the mappings of the paired ends, these, together with the main diagonal, define a triangle of "allowed" deletion coordinates, which is covered in the third dimension by the normal distribution as shown in Figure 3(b). In this diagram and all following, the probability will be represented by shaded gray toning.
With our choice of l and r as the maximum and minimum, respectively, we follow an "average link" strategy as it corresponds to taking the intersection of the triangles. This is the most intuitive way since the new cluster should represent the "area" of deletion coordinates that are compatible with both of the original clusters. The equation for the joint standard deviation is borrowed from population-based statistics for aggregating non-overlapping sub-populations. If the two clusters agree in the location of the deletion, i.e., μ_{1} − μ_{2} = 0, the resulting ${\sigma}_{1\cup 2}^{2}$ is the mean of the two standard deviations. The more the two clusters disagree, the broader the distribution becomes. For reasons we will explain later, this way of joining the deviations turned out to be better suited to our similarity score and the resulting clustering performance than using the actual joint variance.
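Written out, the height of the volume over the triangle might be expressed as follows. This is a sketch under assumptions: $h$ and $\varphi$ are names introduced here for illustration, and $l$ and $r$ are taken to be the left and right bounds of the triangle given by the two mapped reads.

```latex
h(x, y) =
\begin{cases}
\varphi\bigl(y - x;\; d - \mu,\; \sigma\bigr) & \text{if } l \le x \le y \le r,\\[2pt]
0 & \text{otherwise,}
\end{cases}
\qquad
\varphi(s;\; m,\; \sigma) = \frac{1}{\sigma\sqrt{2\pi}}
\exp\!\left(-\frac{(s - m)^{2}}{2\sigma^{2}}\right).
```

Here $x$ is a candidate deletion start, $y$ a candidate deletion end, and $y - x$ the corresponding deletion size, so the height is constant along every line parallel to the main diagonal.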
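The equation for the joint deviation is not reproduced in this text. For orientation only, the textbook formula for the variance of two equally weighted sub-populations has the properties described above (reading "the mean of the two standard deviations" as the mean of the two variances); whether it coincides with the definition actually used by the method is an assumption.

```latex
\sigma_{1\cup 2}^{2}
  = \frac{\sigma_{1}^{2} + \sigma_{2}^{2}}{2}
  + \left(\frac{\mu_{1} - \mu_{2}}{2}\right)^{2}
```

For $\mu_{1} - \mu_{2} = 0$ the second term vanishes and the result is the mean of the two variances. The term grows with the disagreement between the clusters, and no cluster cardinality appears in the expression.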
Similarity score
This score has the following properties.

The score is zero if and only if the triangles do not overlap, which means that there is no location for a deletion compatible with both clusters.

The score is one if and only if the two clusters are equal, because, only if all parameters of the two clusters are equal, the intersection volume equals the maximum.

It is more sensitive to differences in the deletion size than in the location. The reason for this is that a shift parallel to the ridge of the volume does not affect the intersection volume to such an extent as a shift perpendicular to it. This is a particularly desired behavior since we expect a deletion breakpoint being covered by several mappings. In some mappings the breakpoint lies more to the left and in others more to the right, which, even if all mappings are correct, corresponds to staggered triangles. Such triangles, even if staggered, would have to be clustered and should thus not be punished too hard by a low score. In contrast, a difference in the deletion size indicates a disagreement of the mappings.
We now come back to the issue how to combine the standard deviations of several mappings. As already mentioned, we did not choose the more intuitive combination by computing the standard deviation of the means, i.e. ${\sigma}_{1\cup 2}^{2}:={\sigma}^{2}/{n}_{1\cup 2}$, where σ is the standard deviation of the fragment length. This results in narrower distributions for clusters containing more mappings. On the one hand, this would make sense since the prediction is more accurate, the more mappings support it. On the other hand, besides computational problems due to score values close to zero, this also yields an artifact: Clusters containing only a few mappings have a broader volume and are thus more likely to be clustered than larger clusters with a narrow volume. The order in which clusters are aggregated would be dominated by their cardinality, rather than by their similarity. To avoid this, we chose our definition of ${\sigma}_{1\cup 2}^{2}$ to be independent of the cardinality of the clusters.
It remains to describe the stopping criterion for the clustering process. Since the normal distribution is never exactly zero, any two clusters that overlap in their base (triangle) have a nonempty intersection volume, no matter how weak the overlap is. We thus set a minimum threshold S_{ min } for the similarity score. If no pair of clusters has similarity larger than S_{ min }, the clustering process is stopped. To avoid the introduction of a parameter which is arbitrarily fixed or has to be given by the user, we examine the data in a preprocessing step to determine a threshold that separates significant similarity values from noise. For each mapping, we determine the smallest nonzero similarity score to any other (overlapping) mapping. Setting S_{ min } to the median of these minima showed robust and accurate behavior in practice, which was also confirmed in our experiment as explained at the end of "Simulation of overlapping deletions".
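The threshold estimation described above can be sketched as follows. The function name `estimate_min_similarity` and its arguments are placeholders chosen here; any pairwise similarity function can be plugged in. Returning 0.0 when no mapping has an overlapping partner is a choice made for this sketch, since that case is not covered in the text.

```python
from statistics import median


def estimate_min_similarity(mappings, similarity):
    """Median over all mappings of the smallest nonzero score to any other mapping."""
    minima = []
    for i, a in enumerate(mappings):
        nonzero = [
            s
            for j, b in enumerate(mappings)
            if j != i
            for s in [similarity(a, b)]
            if s > 0.0
        ]
        if nonzero:  # mappings without any overlapping partner are skipped
            minima.append(min(nonzero))
    return median(minima) if minima else 0.0
```

This sketch evaluates all pairs and is therefore quadratic in the number of mappings. Restricting the comparison to mappings within the same region keeps the preprocessing cheap.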
Implementation
In general, the run time complexity of agglomerative clustering depends on the similarity score. Using priority queues to store the scores, O(N^{2} log N) pairwise score computations have to be performed to cluster N objects. Our score can be computed as follows. In each triangle, the height of the volume is constant on a line parallel to the main diagonal and thus parallel to the hypotenuse of the triangle. This allows us to compute the volume by traversing a triangle (or the intersection of two triangles, which is again a triangle) starting from its hypotenuse towards the upper left corner, and summing up the product of the length of the line and the height of the volume given by the normal distribution (or the minimum of two, respectively). Since this takes O(r − l) time and the score computation consists of a constant number of volume computations, the overall run time complexity of the agglomerative clustering is in O(L N^{2} log N), where L is the maximum length of a mapping.
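The traversal described above can be sketched as follows. Several details are assumptions made for illustration and are not taken from the tool: the `Cluster` type and its field names, the use of discrete integer coordinates, and in particular the normalization of the score by the larger of the two cluster volumes, which is one choice consistent with the stated properties of the score.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical cluster: triangle bounds plus deletion size distribution."""
    l: int        # leftmost allowed deletion start
    r: int        # rightmost allowed deletion end
    mu: float     # expected deletion size
    sigma: float  # standard deviation of the deletion size


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def volume(l, r, height):
    """Sum line length times height over all lines parallel to the diagonal."""
    total = 0.0
    for size in range(0, r - l + 1):  # size 0 is the hypotenuse
        line_length = r - l - size + 1  # number of (start, end) pairs with this size
        total += line_length * height(size)
    return total


def cluster_volume(c):
    return volume(c.l, c.r, lambda s: normal_pdf(s, c.mu, c.sigma))


def similarity(a, b):
    """Intersection volume over the larger cluster volume (assumed normalization)."""
    l, r = max(a.l, b.l), min(a.r, b.r)
    if l >= r:
        return 0.0  # triangles do not overlap
    inter = volume(
        l, r,
        lambda s: min(normal_pdf(s, a.mu, a.sigma), normal_pdf(s, b.mu, b.sigma)),
    )
    denom = max(cluster_volume(a), cluster_volume(b))
    return inter / denom if denom > 0.0 else 0.0
```

Each call to `volume` takes time linear in r − l, matching the bound given above. Under these assumptions, moving a triangle by 50 positions along the diagonal lowers the score less than changing the expected deletion size by 50, in line with the third property listed in the "Similarity score" section.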
In practice, the run time is dominated by reading the input and thus approximately linear in the number of mappings.
We implemented the method in Java. The input consists of one or several BAM files and a simple tab-separated file listing the mean segment length and standard deviation for each file. Additionally, a color can be specified for each BAM file, which is used to visualize the mappings in the graphical output. The mappings are read from the BAM files using SAMtools [39] and applying several filters. Only mappings of high quality (quality value at least 20), without gaps, uncapped and without co-optimal mapping locations are used. Further, the mappings can be filtered by their length. Here, a minimum length of mean plus three times standard deviation showed good performance in practice.
Before the actual clustering, we partition the mappings into so-called regions, maximal subsets such that no mapping from one set overlaps with any mapping in another subset. These regions are then clustered independently.
The tool, including the source code and example data, is available from the Bielefeld University Bioinformatics Server: http://bibiserv.cebitec.uni-bielefeld.de/agglodel/
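Such a partition can be obtained with a single sweep over the mappings sorted by start coordinate. The sketch below represents each mapping as a hypothetical `(start, end)` tuple with inclusive coordinates; this representation is an assumption made for illustration.

```python
def partition_into_regions(mappings):
    """Group (start, end) intervals into maximal sets of transitively overlapping ones."""
    regions = []
    current = []
    current_end = None
    for start, end in sorted(mappings):
        if current and start > current_end:
            # Gap between this mapping and everything seen so far: close the region.
            regions.append(current)
            current = []
            current_end = None
        current.append((start, end))
        current_end = end if current_end is None else max(current_end, end)
    if current:
        regions.append(current)
    return regions
```

For example, the intervals (1, 5), (4, 8), (7, 9) and (10, 12) fall into two regions: the first three are chained together by pairwise overlaps, while (10, 12) starts after all of them have ended.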
Results and discussion
Before we present results on real data from three samples of a cancer patient, we first investigate the accuracy of the new clustering method in distinguishing overlapping from single deletions in simulated data sets.
Simulation of overlapping deletions
Since, to the best of our knowledge, no other method exists that aims at identifying overlapping deletions, and because such instances are difficult to detect or verify in the wet lab, there is no gold standard available we could use for an evaluation. Instead, we created a simulated data set which is based on previously detected single deletions.
We used our agglomerative clustering approach with the filtering of mappings as explained in "Method", and also ran two other tools on the same set of filtered mappings. First of all, we included GASV [22] (release 2.01) in our evaluation, as our similarity score is based on an extension of the geometric interpretation used there. We set the minimum cluster cardinality to 2 and used both the option to output all maximal sets of overlapping trapezoids (see "Method"), which we will refer to as GASV_max, and the default to output merged clusters, referred to as GASV. Although a follow-up version, GASVPro [33], has been published recently, the available software does not yet include all necessary preprocessing tools to process BAM files. Secondly, we also ran CLEVER [37] as one of the most recent and accurate variation detection tools. This method computes a probability for mappings arising from the same deletion, and builds a graph with the mappings as vertices and edges for pairs with significantly high probability. Clusters are determined as maximal cliques in this graph, which in general allows overlapping deletions in the output.
On average, CLEVER detected 92% of all deletions and made only about 1% wrong predictions. Even though maximal cliques output by CLEVER can be overlapping and the output can thus contain overlapping deletion predictions, this has to be understood as a technical consequence of the clustering procedure, as CLEVER does not aim at predicting overlapping clusters. Nevertheless, we tried to distinguish between non-overlapping and overlapping deletions by simply considering non-overlapping cliques and pairs of overlapping cliques. The true and false positive rates were below 0.3 for all scenarios (not shown in the figure).
In all settings, agglomerative clustering with the proposed similarity score proves to be accurate. GASV performs well, where using the option to output all clusters before merging turned out to be slightly more effective in this setting.
We also investigated, for cases in which two overlapping deletions have been called correctly, how accurately the mappings originating from the two different alleles have been assigned to the two clusters. Ideally, one cluster should only contain mappings from the first deletion and the other cluster only those from the second deletion. On those pairs correctly identified by all three methods, GASV clusters contained 0.041% misassigned mappings, GASV_max 0.019%, and our clusters contained 0.006%. In total, i.e., including the more difficult cases GASV did not detect as pairs at all, we observed 1.3% misassignments in our clusters and 4.0% in the GASV_max clusters (averaged over all four simulation settings).
Real data
The Department of Paediatric Oncology, Haematology and Immunology at the Düsseldorf University Hospital, Germany, provided sequencing data of an acute lymphoblastic leukemia patient. Three samples (before treatment, after treatment, and relapse) have been sequenced on an Illumina HiSeq 2000 with read length 51, segment length around 300, and sequence coverage of 6×, 8× and 8×, respectively. At the Institute of Medical Informatics at University of Münster, Germany, the reads have been mapped to hg19 with BWA [42] (version 0.5.9, at most 3 mismatches and standard parameters otherwise) and Picard [43] has been used to remove duplicates. As an example, Figure 1 shows the fragment length distribution for chromosome 21 from one of the data sets, estimated from the mapping distances. It confirms that the assumption of a normal distribution is approximately met.
Number of overlapping deletions.
Size (# deletions)        0        1      2   3   4   5   6   7   8   9   ≥ 10    Total
# regions           631,769   97,869  1,768  68  25   8   2   2   3   4      4  731,522
# components              -  101,615    104   9   4   3   1   2   0   1      0  101,739
Many highly overlapping deletions were found in telomeric or centromeric regions, where mappings are not very reliable. Figure 4 exemplarily shows a region where we found overlapping clusters. The shown region is also an example for a general observation we made: Agglomerative clustering with our similarity score nicely identifies outliers among the discordant mappings, as those remain as singleton clusters. In our approach, such putatively erroneous mappings do not distort the main deletion prediction, whereas, for instance in GASV, a single mapping can drastically affect a cluster if the corresponding trapezoid overlaps it, or can result in many maximal sets of overlapping trapezoids.
An experimental validation of this and further findings is currently being performed. Here, we wanted to give an overview of the general performance of the presented method on a real world instance. In cooperation with the Düsseldorf University Hospital, further analyses of this and other data sets are planned.
Conclusions
It is well known that structural variations in the human genome play an important role in the development of diseases, especially cancer. In particular, the accurate detection of deletions still remains a challenging task. A preliminary study has motivated that detection methods should be capable of handling overlapping deletions to be able to draw a clearer picture of variations in heterogeneous samples, e.g., to distinguish cancer-specific from patient-specific mutations.
In the present study, we have demonstrated that agglomerative clustering is suitable for this task. We have introduced a similarity score that is based on a geometric and probabilistic interpretation of paired ends which have been mapped to a reference sequence. Taking into account both the location and the length of a deletion, this scoring allows mappings to be clustered effectively into possibly overlapping clusters. The method has successfully been applied to real data, and has proven to be accurate according to a simulation-based evaluation.
Here, we have investigated the performance of an intuitive clustering approach. On the one hand, the simplicity of agglomerative clustering offered us insight into the clustering process and understanding of its behavior. On the other hand, computationally more sophisticated clustering strategies could perform even better, if they are computationally not too expensive for such large data sets.
We believe that there is still more potential to be explored in the detection of structural variants, especially deletions. Besides the combination with other approaches, such as split read or coverage analysis, the increasing availability of several samples from one patient offers a dimension that has to be investigated.
Note added in proof
Another clustering-based method to call deletions (and insertions) from paired-end mapping data from heterogeneous samples has been published very recently. The tool SVM^{2} by Chiara et al. [44] utilizes a support vector machine incorporating, similar to our tool, mapping location and mapping distance, but also coverage information. Including SVM^{2} in our simulation-based evaluation could not be finished during revision of the present manuscript.
Declarations
The publication costs for this article were funded by the Deutsche Forschungsgemeinschaft and the Open Access Publication Funds of Bielefeld University Library.
This article has been published as part of BMC Genomics Volume 14 Supplement 1, 2013: Selected articles from the Eleventh Asia Pacific Bioinformatics Conference (APBC 2013): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S1.
Notes
Acknowledgements
The author wishes to thank Tobias Marschall for many discussions and valuable support in the evaluation; Peter Husemann, Linda Sundermann, Cedric Chauve and Jens Stoye for fruitful discussions; the Department of Paediatric Oncology, Haematology and Immunology at the Düsseldorf University Hospital for providing the data; Christoph Bartenhagen for performing the preprocessing of the data; and the reviewers for their helpful comments.
References
 1.Mardis E: Cancer genomics identifies determinants of tumor biology. Genome Biol. 2010, 11 (5): 21110.1186/gb2010115211.PubMedCentralCrossRefPubMedGoogle Scholar
 2.Robinson K: Application of secondgeneration sequencing to cancer genomics. Brief Bioinform. 2010, 11 (5): 524534. 10.1093/bib/bbq013.CrossRefGoogle Scholar
 3.Pleasance E, Cheetham R, Stephens P, McBride D, Humphray S, Greenman C, Varela I, Lin M, Ordóñez G, Bignell G et al: A comprehensive catalogue of somatic mutations from a human cancer genome. Nature. 2010, 463 (7278): 191196. 10.1038/nature08658.PubMedCentralCrossRefPubMedGoogle Scholar
 4.Dalgliesh G, Furge K, Greenman C, Chen L, Bignell G, Butler A, Davies H, Edkins S, Hardy C, Latimer C et al: Systematic sequencing of renal carcinoma reveals inactivation of histone modifying genes. Nature. 2010, 463 (7279): 360363. 10.1038/nature08672.PubMedCentralCrossRefPubMedGoogle Scholar
 5.Wittler R, Chauve C: Consistencybased detection of potential tumorspecific deletions in matched normal/tumor genomes. BMC Bioinformatics. 2011, 12 (Suppl 9): S2110.1186/1471210512S9S21. [Proc. of RECOMBCG 2011]PubMedCentralCrossRefPubMedGoogle Scholar
 6.Hormozdiari F, Hajirasouliha I, McPherson A, Eichler E, Sahinalp S: Simultaneous structural variation discovery among multiple pairedend sequenced genomes. Genome Res. 2011, 21 (12): 22032212. 10.1101/gr.120501.111.PubMedCentralCrossRefPubMedGoogle Scholar
 7.Shibata D: Heterogeneity and Tumor History. Science. 2012, 336 (6079): 304305. 10.1126/science.1222361.CrossRefPubMedGoogle Scholar
 8.Meyerson M, Gabriel S, Getz G: Advances in understanding cancer genomes through secondgeneration sequencing. Nat Rev Genet. 2010, 11 (10): 685696. 10.1038/nrg2841.CrossRefPubMedGoogle Scholar
 9.Koboldt D, Ding L, Mardis E, Wilson R: Challenges of sequencing human genomes. Brief Bioinform. 2010, 11 (5): 484498. 10.1093/bib/bbq016.PubMedCentralCrossRefPubMedGoogle Scholar
 10.Larson D, Harris C, Chen K, Koboldt D, Abbott T, Dooling D, Ley T, Mardis E, Wilson R, Ding L: SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics. 2012, 28 (3): 311317. 10.1093/bioinformatics/btr665.PubMedCentralCrossRefPubMedGoogle Scholar
 11.Roth A, Ding J, Morin R, Crisan A, Ha G, Giuliany R, Bashashati A, Hirst M, Turashvili G, Oloumi A et al: JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired nextgeneration sequencing data. Bioinformatics. 2012, 28 (7): 907913. 10.1093/bioinformatics/bts053.PubMedCentralCrossRefPubMedGoogle Scholar
 12. Ding J, Bashashati A, Roth A, Oloumi A, Tse K, Zeng T, Haffari G, Hirst M, Marra M, Condon A et al: Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data. Bioinformatics. 2012, 28 (2): 167-175. 10.1093/bioinformatics/btr629.
 13. Saunders C, Wong W, Swamy S, Becq J, Murray L, Cheetham R: Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics. 2012, 28 (14): 1811-1817. 10.1093/bioinformatics/bts271.
 14. Druley T, Vallania F, Wegner D, Varley K, Knowles O, Bonds J, Robison S, Doniger S, Hamvas A, Cole F et al: Quantification of rare allelic variants from pooled genomic DNA. Nat Methods. 2009, 6 (4): 263-265. 10.1038/nmeth.1307.
 15. Bansal V: A statistical method for the detection of variants from next-generation resequencing of DNA pools. Bioinformatics. 2010, 26 (12): i318-i324. 10.1093/bioinformatics/btq214.
 16. Zhou B: An Empirical Bayes Mixture Model for SNP Detection in Pooled Sequencing Data. Bioinformatics. 2012, 28: 2569-2575. 10.1093/bioinformatics/bts501.
 17. Wilm A, Aw P, Bertrand D, Yeo G, Ong S, Wong C, Khor C, Petric R, Hibberd M, Nagarajan N: LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 2012, 40: 11189-11201. 10.1093/nar/gks918.
 18. Ramadhani S, Mousavi S, Talebi M: An improved heuristic for haplotype inference. Gene. 2012, 507: 177-182. 10.1016/j.gene.2012.06.032.
 19. Lim H, Jeong I, Kang S: Individual haplotype assembly of Apis mellifera (honeybee) by using a practical branch and bound algorithm. J Asia Pac Entomol. 2012, 15: 375-381. 10.1016/j.aspen.2012.05.012.
 20. Aguiar D, Istrail S: HapCompass: A Fast Cycle Basis Algorithm for Accurate Haplotype Assembly of Sequence Data. J Comput Biol. 2012, 19 (6): 577-590. 10.1089/cmb.2012.0084.
 21. Hong Z, Lian H: BOPA: A Bayesian hierarchical model for outlier expression detection. Comput Stat Data Anal. 2012, 56: 4146-4156. 10.1016/j.csda.2012.05.003.
 22. Sindi S, Helman E, Bashir A, Raphael B: A geometric approach for classification and comparison of structural variants. Bioinformatics. 2009, 25 (12): i222-i230. 10.1093/bioinformatics/btp208. [Proc. of RECOMB 2009]
 23. Dew I, Walenz B, Sutton G: A tool for analyzing mate pairs in assemblies (TAMPA). J Comput Biol. 2005, 12 (5): 497-513. 10.1089/cmb.2005.12.497.
 24. Alkan C, Coe B, Eichler E: Genome structural variation discovery and genotyping. Nat Rev Genet. 2011, 12: 363-376. 10.1038/nrg2958.
 25. Ye K, Schulz M, Long Q, Apweiler R, Ning Z: Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics. 2009, 25 (21): 2865-2871. 10.1093/bioinformatics/btp394.
 26. Li H: Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics. 2012, 28 (14): 1838-1844. 10.1093/bioinformatics/bts280.
 27. Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G: De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet. 2012, 44: 226-232. 10.1038/ng.1028.
 28. Sirén J, Välimäki N, Mäkinen V: Indexing finite language representation of population genotypes. Algorithms in Bioinformatics. 2011, 6833: 270-281. 10.1007/978-3-642-23038-7_23.
 29. Medvedev P, Stanciu M, Brudno M: Computational methods for discovering structural variation with next-generation sequencing. Nat Methods. 2009, 6: S13-S20. 10.1038/nmeth.1374.
 30. Korbel J, Urban A, Affourtit J, Godwin B, Grubert F, Simons J, Kim P, Palejev D, Carriero N, Du L et al: Paired-end mapping reveals extensive structural variation in the human genome. Science. 2007, 318 (5849): 420-426. 10.1126/science.1149504.
 31. Qi J, Zhao F: inGAP-sv: a novel scheme to identify and visualize structural variation from paired end mapping data. Nucleic Acids Res. 2011, 39 (suppl 2): W567-W575.
 32. Medvedev P, Fiume M, Dzamba M, Smith T, Brudno M: Detecting copy number variation with mated short reads. Genome Res. 2010, 20 (11): 1613-1622. 10.1101/gr.106344.110.
 33. Sindi S, Onal S, Peng L, Wu H, Raphael B: An integrative probabilistic model for identification of structural variation in sequencing data. Genome Biol. 2012, 13 (3): R22. 10.1186/gb-2012-13-3-r22.
 34. Zhang J, Wang J, Wu Y: An improved approach for accurate and efficient calling of structural variations with low-coverage sequence data. BMC Bioinformatics. 2012, 13 (Suppl 6): S6.
 35. Nord A, Lee M, King M, Walsh T: Accurate and exact CNV identification from targeted high-throughput sequence data. BMC Genomics. 2011, 12: 184. 10.1186/1471-2164-12-184.
 36. Lee S, Hormozdiari F, Alkan C, Brudno M: MoDIL: Detecting small indels from clone-end sequencing with mixtures of distributions. Nat Methods. 2009, 6 (7): 473-474. 10.1038/nmeth.f.256.
 37. Marschall T, Costa I, Canzar S, Bauer M, Klau G, Schliep A, Schönhuth A: CLEVER: Clique-Enumerating Variant Finder. Bioinformatics. 2012.
 38. Hormozdiari F, Hajirasouliha I, Dao P, Hach F, Yorukoglu D, Alkan C, Eichler E, Sahinalp S: Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics. 2010, 26 (12): i350-i357. 10.1093/bioinformatics/btq216.
 39. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R et al: The sequence alignment/map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.
 40. Zhang J, Feuk L, Duggan G, Khaja R, Scherer S: Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome. Cytogenet Genome Res. 2006, 115 (3-4): 205-214. 10.1159/000095916.
 41. John SJ: SimSeq. accessed 24-April-2011, [https://github.com/jstjohn/SimSeq]
 42. Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.
 43. Wysoker A, Tibbetts K, Fennell T: Picard. accessed 24-April-2011, [http://picard.sourceforge.net/]
 44. Chiara M, Pesole G, Horner D: SVM^{2}: an improved paired-end-based tool for the detection of small genomic structural variations using high-throughput single-genome resequencing data. Nucleic Acids Res. 2012, 40: e145. 10.1093/nar/gks606.
Copyright information
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.