Abstract
We present a computational method for analyzing deep sequencing data obtained from a genetically diverse sample. The set of reads obtained from a deep sequencing experiment represents a statistical sample of the underlying population. We develop a generative probabilistic model for assigning observed reads to unobserved haplotypes in the presence of sequencing errors. This clustering problem is solved in a Bayesian fashion using the Dirichlet process mixture to define a prior distribution on the unknown number of haplotypes in the mixture. We devise a Gibbs sampler for sampling from the joint posterior distribution of haplotype sequences, assignment of reads to haplotypes, and error rate of the sequencing process to obtain estimates of the local haplotype structure of the population. The method is evaluated on simulated data and on experimental deep sequencing data obtained from HIV samples.
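To make the clustering step concrete, the following is a minimal sketch of how a single read might be reassigned during one Gibbs sweep over a Dirichlet process mixture. It is illustrative only and not the authors' implementation. The function names, the uniform substitution error model, the uniform base measure over new haplotypes, and the parameters `alpha` and `error_rate` are all assumptions introduced here. Reads are assumed to be aligned and of equal length to the haplotypes.

```python
import math

ALPHABET = "ACGT"


def read_log_likelihood(read, haplotype, error_rate):
    """Log-probability of a read given the haplotype it was sampled from.

    Assumed error model: each base is copied correctly with probability
    1 - error_rate, otherwise replaced by one of the other three bases
    chosen uniformly.
    """
    matches = sum(r == h for r, h in zip(read, haplotype))
    mismatches = len(read) - matches
    return (matches * math.log(1.0 - error_rate)
            + mismatches * math.log(error_rate / (len(ALPHABET) - 1)))


def assignment_probabilities(read, haplotypes, cluster_sizes, alpha, error_rate):
    """Conditional probabilities of assigning one read to each existing
    haplotype, with the probability of opening a new one as the last entry.

    cluster_sizes must exclude the read being reassigned.
    """
    # Existing haplotype k: weight proportional to n_k * P(read | haplotype_k).
    log_weights = [
        math.log(n) + read_log_likelihood(read, h, error_rate)
        for h, n in zip(haplotypes, cluster_sizes)
    ]
    # New haplotype: weight proportional to alpha * P(read), where the unknown
    # haplotype is integrated out under a uniform prior over sequences. Under
    # the error model above this marginal is (1/4) per base.
    log_weights.append(math.log(alpha) + len(read) * math.log(1.0 / len(ALPHABET)))

    # Normalize in log space for numerical stability.
    top = max(log_weights)
    weights = [math.exp(w - top) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    haplotypes = ["ACGT", "ACGA"]
    cluster_sizes = [3, 1]
    for read in ("ACGT", "TTTT"):
        probs = assignment_probabilities(read, haplotypes, cluster_sizes,
                                         alpha=1.0, error_rate=0.01)
        print(read, [round(p, 4) for p in probs])
```

In a full sampler this step would alternate with resampling each haplotype sequence from the reads currently assigned to it and resampling the error rate. A read that matches a well-populated haplotype is drawn to it, while a read far from every existing haplotype is most likely to open a new cluster, which is how the number of haplotypes can grow with the data.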
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zagordi, O., Geyrhofer, L., Roth, V., Beerenwinkel, N. (2009). Deep Sequencing of a Genetically Heterogeneous Sample: Local Haplotype Reconstruction and Read Error Correction. In: Batzoglou, S. (eds) Research in Computational Molecular Biology. RECOMB 2009. Lecture Notes in Computer Science, vol 5541. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02008-7_21
DOI: https://doi.org/10.1007/978-3-642-02008-7_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02007-0
Online ISBN: 978-3-642-02008-7
eBook Packages: Computer Science, Computer Science (R0)