Abstract
We present CAGe, a statistical algorithm which exploits high sequence identity between sampled genomes and a reference assembly to streamline the variant calling process. Using a combination of changepoint detection, classification, and online variant detection, CAGe is able to call simple variants quickly and accurately on the 90-95% of a sampled genome which differs little from the reference, while correctly learning the remaining 5-10% that must be processed using more computationally expensive methods. CAGe runs on a deeply sequenced human whole genome sample in approximately 20 minutes, potentially reducing the burden of variant calling by an order of magnitude after one memory-efficient pass over the data.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Tishkoff, S.A., Kidd, K.K.: Implications of biogeography of human populations for ‘race’ and medicine. Nature Genetics 36, S21–S27 (2004)
Cox, A.J., Bauer, M.J., Jakobi, T., Rosone, G.: Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics 28(11), 1415–1419 (2012)
Hsi-Yang, F.M., Leinonen, R., Cochrane, G., Birney, E.: Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Research 21(5), 734–740 (2011)
Jones, D.C., Ruzzo, W.L., Peng, X., Katze, M.G.: Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Research 40(22), e171 (2012)
Li, H., et al.: The sequence alignment/map format and samtools. Bioinformatics 25(16), 2078–2079 (2009)
DePristo, M.A.: et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43(5), 491–498 (2011)
Zaharia, M., Bolosky, W., Curtis, K., Fox, A., Patterson, P., Shenker, S., Stoica, I., Karp, R., Sittler, T.: Faster and more accurate sequence alignment with SNAP (2011), http://arxiv.org/abs/1111.5572
Popitsch, N., von Haeseler, A.: NGC: lossless and lossy compression of aligned high-throughput sequencing data. Nucleic Acids Research 41(1), e27 (2013)
Shen, J.J., Zhang, N.R.: Change-point model on nonhomogeneous Poisson processes with application in copy number profiling by next-generation DNA sequencing. The Annals of Applied Statistics 40(6), 476–496 (2012)
Shen, Y., Gu, Y., Pe’er, I.: A Hidden Markov Model for Copy Number Variant prediction from whole genome resequencing data. BMC Bioinformatics 12(suppl. 6), S4 (2011)
Wang, K., et al.: PennCNV: An integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Research 17(11), 1665–1674 (2007)
Lander, E.S., Waterman, M.S.: Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2(3), 231–239 (1988)
Evans, S.N., Hower, V., Pachter, L.: Coverage statistics for sequence census methods. BMC Bioinformatics 11, 430 (2010)
Hower, V., Starfield, R., Roberts, A., Pachter, L.: Quantifying uniformity of mapped reads. Bioinformatics 28(20), 2680–2682 (2012)
Medvedev, P., Stanciu, M., Brudno, M.: Computational methods for discovering structural variation with next-generation sequencing. Nature Methods 6, S13–S20 (2009)
Sherry, S.T., et al.: dbSNP: the NCBI database of genetic variation. Nucleic Acids Research 29(1), 308–311 (2001)
Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009)
Jackson, B., et al.: An algorithm for optimal partitioning of data on an interval. IEEE Signal Processing Letters 12, 105–108 (2005)
Killick, R., Fearnhead, P., Eckley, I.A.: Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association 107(500), 1590–1598 (2012)
Talwalkar, A., et al.: SMaSH: A benchmarking toolkit for variant calling (2013), http://arxiv.org/abs/1310.8420
Levy, S., et al.: The diploid genome sequence of an individual human. PLoS Biology 5(10), e254 (2007)
Illumina Corporation. Platinum genomes project (2013), http://www.platinumgenomes.org
Zhao, Z., Fu, Y., Hewett-Emmett, D., Boerwinkle, E.: Investigating single nucleotide polymorphism (SNP) density in the human genome and its implications for molecular evolution. Gene 312, 207–213 (2003)
Derrien, T., et al.: Fast computation and applications of genome mappability. PLoS ONEÂ 7(1), e30377 (2012)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Bloniarz, A. et al. (2014). Changepoint Analysis for Efficient Variant Calling. In: Sharan, R. (eds) Research in Computational Molecular Biology. RECOMB 2014. Lecture Notes in Computer Science(), vol 8394. Springer, Cham. https://doi.org/10.1007/978-3-319-05269-4_3
Download citation
DOI: https://doi.org/10.1007/978-3-319-05269-4_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-05268-7
Online ISBN: 978-3-319-05269-4
eBook Packages: Computer ScienceComputer Science (R0)