Estimating Viral Haplotypes in a Population Using k-mer Counting

  • Raunaq Malhotra
  • Shruthi Prabhakara
  • Mary Poss
  • Raj Acharya
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7986)


Estimating the viral haplotypes in a population is an important problem in virology. Viruses undergo many mutations and recombinations during replication to survive in host cells, and so exist as a population of closely related genetic variants. This makes it challenging to estimate the number of haplotypes and their relative frequencies in the population. Relying on a sequenced reference genome has its limitations because of the high mutation rates in viruses. We propose a method for estimating viral haplotypes that uses only the counts of k-mers present in the viral population, without a reference genome. We identify pairs of k-mers that are related to each other by one mutation and, from these pairs, compute a minimal set of viral haplotypes that explains the whole population. We compare our method with the software ShoRAH (which uses a reference genome) on a simulated dataset and obtain comparable results, even without a reference genome.
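
As an illustration of the first two steps described in the abstract (counting k-mers, then finding k-mer pairs related by one mutation), here is a minimal Python sketch. It is not the authors' implementation: the function names `count_kmers` and `one_mutation_pairs` are hypothetical, "one mutation" is assumed to mean a single substitution (Hamming distance 1), and sequencing errors are ignored. The step that selects a minimal set of haplotypes is not shown because the abstract does not specify it.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def one_mutation_pairs(counts):
    """Return pairs of observed k-mers that differ at exactly one position.

    Assumption: a "mutation" is a single-base substitution over {A, C, G, T}.
    Each pair is reported once, in lexicographic order.
    """
    kmers = set(counts)
    pairs = []
    for kmer in sorted(kmers):
        for pos, base in enumerate(kmer):
            for alt in "ACGT":
                if alt == base:
                    continue
                neighbor = kmer[:pos] + alt + kmer[pos + 1:]
                # kmer < neighbor keeps only one orientation of each pair
                if neighbor in kmers and kmer < neighbor:
                    pairs.append((kmer, neighbor))
    return pairs


# Two toy "reads" that differ by a single substitution (T -> A at index 3)
reads = ["ACGTAC", "ACGAAC"]
counts = count_kmers(reads, k=4)
pairs = one_mutation_pairs(counts)
```

Enumerating the 3k substitution neighbors of each k-mer and looking them up in a set avoids comparing every k-mer against every other, which matters once the number of distinct k-mers becomes large.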


viral haplotype estimation; structural variants detection; k-mer counting; variant detection; greedy generating set algorithm



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Raunaq Malhotra (1)
  • Shruthi Prabhakara (1)
  • Mary Poss (2)
  • Raj Acharya (1)

  1. Department of Computer Science and Engineering, Pennsylvania State University, University Park, USA
  2. Department of Biology, Center for Infectious Disease Dynamics, Pennsylvania State University, University Park, USA
