Changepoint Analysis for Efficient Variant Calling

Bloniarz, Adam; Talwalkar, Ameet; Terhorst, Jonathan; Jordan, Michael I.; Patterson, David; Yu, Bin; Song, Yun S.

doi:10.1007/978-3-319-05269-4_3

Adam Bloniarz²⁰,
Ameet Talwalkar²¹,
Jonathan Terhorst²⁰,
Michael I. Jordan^20,21,
David Patterson²¹,
Bin Yu²⁰ &
…
Yun S. Song^20,21,22

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 8394))

Included in the following conference series:

International Conference on Research in Computational Molecular Biology

3015 Accesses

Abstract

We present CAGe, a statistical algorithm which exploits high sequence identity between sampled genomes and a reference assembly to streamline the variant calling process. Using a combination of changepoint detection, classification, and online variant detection, CAGe is able to call simple variants quickly and accurately on the 90-95% of a sampled genome which differs little from the reference, while correctly learning the remaining 5-10% that must be processed using more computationally expensive methods. CAGe runs on a deeply sequenced human whole genome sample in approximately 20 minutes, potentially reducing the burden of variant calling by an order of magnitude after one memory-efficient pass over the data.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Tishkoff, S.A., Kidd, K.K.: Implications of biogeography of human populations for ‘race’ and medicine. Nature Genetics 36, S21–S27 (2004)
Google Scholar
Cox, A.J., Bauer, M.J., Jakobi, T., Rosone, G.: Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics 28(11), 1415–1419 (2012)
Article Google Scholar
Hsi-Yang, F.M., Leinonen, R., Cochrane, G., Birney, E.: Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Research 21(5), 734–740 (2011)
Article Google Scholar
Jones, D.C., Ruzzo, W.L., Peng, X., Katze, M.G.: Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Research 40(22), e171 (2012)
Google Scholar
Li, H., et al.: The sequence alignment/map format and samtools. Bioinformatics 25(16), 2078–2079 (2009)
Article Google Scholar
DePristo, M.A.: et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43(5), 491–498 (2011)
Article Google Scholar
Zaharia, M., Bolosky, W., Curtis, K., Fox, A., Patterson, P., Shenker, S., Stoica, I., Karp, R., Sittler, T.: Faster and more accurate sequence alignment with SNAP (2011), http://arxiv.org/abs/1111.5572
Popitsch, N., von Haeseler, A.: NGC: lossless and lossy compression of aligned high-throughput sequencing data. Nucleic Acids Research 41(1), e27 (2013)
Google Scholar
Shen, J.J., Zhang, N.R.: Change-point model on nonhomogeneous Poisson processes with application in copy number profiling by next-generation DNA sequencing. The Annals of Applied Statistics 40(6), 476–496 (2012)
Article MathSciNet Google Scholar
Shen, Y., Gu, Y., Pe’er, I.: A Hidden Markov Model for Copy Number Variant prediction from whole genome resequencing data. BMC Bioinformatics 12(suppl. 6), S4 (2011)
Google Scholar
Wang, K., et al.: PennCNV: An integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Research 17(11), 1665–1674 (2007)
Article Google Scholar
Lander, E.S., Waterman, M.S.: Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2(3), 231–239 (1988)
Article Google Scholar
Evans, S.N., Hower, V., Pachter, L.: Coverage statistics for sequence census methods. BMC Bioinformatics 11, 430 (2010)
Article Google Scholar
Hower, V., Starfield, R., Roberts, A., Pachter, L.: Quantifying uniformity of mapped reads. Bioinformatics 28(20), 2680–2682 (2012)
Article Google Scholar
Medvedev, P., Stanciu, M., Brudno, M.: Computational methods for discovering structural variation with next-generation sequencing. Nature Methods 6, S13–S20 (2009)
Google Scholar
Sherry, S.T., et al.: dbSNP: the NCBI database of genetic variation. Nucleic Acids Research 29(1), 308–311 (2001)
Article MathSciNet Google Scholar
Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009)
Article Google Scholar
Jackson, B., et al.: An algorithm for optimal partitioning of data on an interval. IEEE Signal Processing Letters 12, 105–108 (2005)
Article Google Scholar
Killick, R., Fearnhead, P., Eckley, I.A.: Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association 107(500), 1590–1598 (2012)
Article MATH MathSciNet Google Scholar
Talwalkar, A., et al.: SMaSH: A benchmarking toolkit for variant calling (2013), http://arxiv.org/abs/1310.8420
Levy, S., et al.: The diploid genome sequence of an individual human. PLoS Biology 5(10), e254 (2007)
Google Scholar
Illumina Corporation. Platinum genomes project (2013), http://www.platinumgenomes.org
Zhao, Z., Fu, Y., Hewett-Emmett, D., Boerwinkle, E.: Investigating single nucleotide polymorphism (SNP) density in the human genome and its implications for molecular evolution. Gene 312, 207–213 (2003)
Article Google Scholar
Derrien, T., et al.: Fast computation and applications of genome mappability. PLoS ONE 7(1), e30377 (2012)
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Statistics, University of California, Berkeley, USA
Adam Bloniarz, Jonathan Terhorst, Michael I. Jordan, Bin Yu & Yun S. Song
Computer Science Division, University of California, Berkeley, USA
Ameet Talwalkar, Michael I. Jordan, David Patterson & Yun S. Song
Department of Integrative Biology, University of California, Berkeley, USA
Yun S. Song

Authors

Adam Bloniarz
View author publications
You can also search for this author in PubMed Google Scholar
Ameet Talwalkar
View author publications
You can also search for this author in PubMed Google Scholar
Jonathan Terhorst
View author publications
You can also search for this author in PubMed Google Scholar
Michael I. Jordan
View author publications
You can also search for this author in PubMed Google Scholar
David Patterson
View author publications
You can also search for this author in PubMed Google Scholar
Bin Yu
View author publications
You can also search for this author in PubMed Google Scholar
Yun S. Song
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

School of Computer Science, Tel Aviv University, 69978, Tel Aviv, Israel
Roded Sharan

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Bloniarz, A. et al. (2014). Changepoint Analysis for Efficient Variant Calling. In: Sharan, R. (eds) Research in Computational Molecular Biology. RECOMB 2014. Lecture Notes in Computer Science(), vol 8394. Springer, Cham. https://doi.org/10.1007/978-3-319-05269-4_3

Download citation

DOI: https://doi.org/10.1007/978-3-319-05269-4_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-05268-7
Online ISBN: 978-3-319-05269-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics