Deep Sequencing of a Genetically Heterogeneous Sample: Local Haplotype Reconstruction and Read Error Correction

  • Osvaldo Zagordi
  • Lukas Geyrhofer
  • Volker Roth
  • Niko Beerenwinkel
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5541)


We present a computational method for analyzing deep sequencing data obtained from a genetically diverse sample. The set of reads obtained from a deep sequencing experiment represents a statistical sample of the underlying population. We develop a generative probabilistic model for assigning observed reads to unobserved haplotypes in the presence of sequencing errors. This clustering problem is solved in a Bayesian fashion using the Dirichlet process mixture to define a prior distribution on the unknown number of haplotypes in the mixture. We devise a Gibbs sampler for sampling from the joint posterior distribution of haplotype sequences, assignment of reads to haplotypes, and error rate of the sequencing process to obtain estimates of the local haplotype structure of the population. The method is evaluated on simulated data and on experimental deep sequencing data obtained from HIV samples.
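
As an illustration of the clustering step, the following minimal Python sketch computes the conditional probability of assigning one read to each existing haplotype or to a new one, the kind of quantity a single Gibbs update under a Dirichlet process mixture would need. It is not the authors' implementation. The function names, the uniform substitution error model, the uniform base measure over new haplotypes, and the concentration parameter `alpha` are all assumptions made for this example.

```python
def read_likelihood(read, haplotype, error_rate, alphabet_size=4):
    """P(read | haplotype), assuming independent, uniform substitution errors.

    Assumption: each base is read correctly with probability 1 - error_rate
    and replaced by any specific other base with probability
    error_rate / (alphabet_size - 1).
    """
    mismatches = sum(1 for r, h in zip(read, haplotype) if r != h)
    matches = len(read) - mismatches
    return ((1.0 - error_rate) ** matches
            * (error_rate / (alphabet_size - 1)) ** mismatches)


def assignment_probabilities(read, haplotypes, counts, alpha, error_rate,
                             alphabet_size=4):
    """Conditional probabilities of assigning `read` to each cluster.

    haplotypes: current haplotype sequences (same length as `read`)
    counts:     number of other reads currently assigned to each haplotype
    alpha:      Dirichlet process concentration parameter (hypothetical value)

    Returns one probability per existing haplotype plus a final entry for
    opening a new haplotype.
    """
    # Existing clusters: prior weight proportional to cluster size.
    weights = [n * read_likelihood(read, h, error_rate, alphabet_size)
               for h, n in zip(haplotypes, counts)]
    # New cluster: prior weight proportional to alpha. Assuming a uniform
    # base measure over sequences, the marginal likelihood of the read is
    # (1 / alphabet_size) ** len(read).
    weights.append(alpha * (1.0 / alphabet_size) ** len(read))
    total = sum(weights)
    return [w / total for w in weights]


probs = assignment_probabilities(
    read="ACGT",
    haplotypes=["ACGT", "ACGA"],
    counts=[5, 5],
    alpha=1.0,
    error_rate=0.01,
)
```

In this toy call the read matches the first haplotype exactly, so almost all of the probability mass falls on that cluster. A full sampler would also resample the haplotype sequences and the error rate, which this sketch leaves out.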


Keywords (added by machine, not by the authors): Deep Sequencing; Heterogeneous Sample; Technical Noise; Deep Sequencing Data; Dirichlet Process Mixture





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Osvaldo Zagordi (1)
  • Lukas Geyrhofer (1)
  • Volker Roth (2)
  • Niko Beerenwinkel (1)
  1. Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
  2. Department of Computer Science, University of Basel, Switzerland
