A Two-Way Bayesian Mixture Model for Clustering in Metagenomics

  • Shruthi Prabhakara
  • Raj Acharya
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7036)


We present a new and efficient Bayesian mixture model, based on Poisson and Multinomial distributions, for clustering metagenomic reads by their species of origin. The model uses the relative abundance of different words along a genome to distinguish reads from different species. The distribution of word counts within a genome is accurately represented by a Poisson distribution, and the Multinomial mixture model is derived as a standardized Poisson mixture model. The Bayesian network efficiently encodes the conditional dependencies between word counts in a DNA sequence that arise from overlapping words, and hence is most consistent with the data. We present a two-way mixture model that captures the high dimensionality and sparsity of the data. Our method can cluster reads as short as 50 bp with accuracy over 80%. The Bayesian mixture models clearly outperform their Naive Bayes counterparts on datasets of varying abundances, divergences and read lengths. Our method attains accuracy comparable to that of the state-of-the-art Scimm and converges at least 5 times faster than Scimm in all cases tested. This reduction in running time at comparable accuracy justifies the use of the proposed method to evaluate large metagenome datasets.
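To make the word-count idea concrete, the following is a minimal sketch of the simpler Naive Bayes style baseline mentioned in the abstract, not the authors' Bayesian network or two-way model. It counts overlapping words in each read and fits a plain Multinomial mixture by EM, treating word counts as independent given the cluster. The function names (`word_counts`, `multinomial_mixture_em`, `simulate_reads`), the word length `k=2`, the smoothing constant `alpha`, and the synthetic AT-rich and GC-rich "species" are all assumptions made for illustration.

```python
import itertools
import math
import random


def word_counts(read, k=2):
    """Count overlapping k-letter words in a read, in a fixed word order."""
    words = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    index = {w: i for i, w in enumerate(words)}
    counts = [0] * len(words)
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])
        if j is not None:  # skip words containing N or other symbols
            counts[j] += 1
    return counts


def multinomial_mixture_em(X, n_clusters, n_iter=50, seed=0, alpha=1.0):
    """EM for a mixture of Multinomials over count vectors X.

    Assumes word counts are independent given the cluster (the Naive Bayes
    simplification). Returns (weights, word_probs, responsibilities).
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Random soft initialisation of cluster responsibilities.
    resp = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(row)
        resp.append([v / total for v in row])
    for _ in range(n_iter):
        # M-step: mixing weights and smoothed per-cluster word probabilities.
        weights = [sum(resp[i][c] for i in range(n)) / n
                   for c in range(n_clusters)]
        probs = []
        for c in range(n_clusters):
            mass = [alpha + sum(resp[i][c] * X[i][j] for i in range(n))
                    for j in range(d)]
            total = sum(mass)
            probs.append([m / total for m in mass])
        # E-step in log space to avoid underflow on long reads.
        for i in range(n):
            logp = [math.log(max(weights[c], 1e-300))
                    + sum(X[i][j] * math.log(probs[c][j]) for j in range(d))
                    for c in range(n_clusters)]
            top = max(logp)
            expd = [math.exp(v - top) for v in logp]
            total = sum(expd)
            resp[i] = [v / total for v in expd]
    return weights, probs, resp


def simulate_reads(n_per_species=20, length=100, seed=1):
    """Hypothetical toy data: one AT-rich and one GC-rich 'species'."""
    rng = random.Random(seed)
    compositions = [(0.4, 0.1, 0.1, 0.4), (0.1, 0.4, 0.4, 0.1)]  # A, C, G, T
    reads, labels = [], []
    for label, comp in enumerate(compositions):
        for _ in range(n_per_species):
            reads.append("".join(rng.choices("ACGT", weights=comp, k=length)))
            labels.append(label)
    return reads, labels


if __name__ == "__main__":
    reads, labels = simulate_reads()
    X = [word_counts(r, k=2) for r in reads]
    weights, probs, resp = multinomial_mixture_em(X, n_clusters=2)
    assignment = [max(range(2), key=lambda c: r[c]) for r in resp]
    print("mixing weights:", [round(w, 3) for w in weights])
    print("cluster assignments:", assignment)
```

Because overlapping words share letters, their counts are not actually independent; the abstract states that the proposed Bayesian network encodes these dependencies, which this sketch deliberately ignores.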


Keywords: Clustering, Mixture Modeling, Metagenomics



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Shruthi Prabhakara (1)
  • Raj Acharya (1)
  1. Pennsylvania State University, University Park, USA
