SIMCOMP: A Hybrid Soft Clustering of Metagenome Reads

  • Shruthi Prabhakara
  • Raj Acharya
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6282)


A major challenge facing metagenomics is the development of tools for characterizing the functional and taxonomic content of vast amounts of short metagenome reads. In this paper, we present SimComp, a two-pass, semi-supervised algorithm for soft clustering of short metagenome reads that is a hybrid of comparative and composition-based methods. In the first pass, a comparative analysis of the metagenome reads using BLASTx against a reference database extracts the reference sequences from within the metagenome to form an initial set of seeded clusters. Reads that have a significant match to the database are clustered by their phylogenetic provenance. In the second pass, the remaining reads are characterized by their species-specific, composition-based features. SimComp groups the reads into overlapping clusters, each with its own read leader. We make no assumptions about the taxonomic distribution of the dataset. The overlap between the clusters elegantly handles the challenges posed by the nature of metagenomic data. The resulting cluster leaders can be used as an accurate estimate of the phylogenetic composition of the metagenomic dataset. Our method enriches the dataset into a small number of clusters while accurately assigning fragments as small as 100 base pairs.
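The two-pass idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the composition feature, the distance measure, or how leaders are chosen. The sketch assumes tetranucleotide (k = 4) frequency profiles, Euclidean distance, a centroid profile as each seeded cluster's leader, and a hypothetical `radius` threshold. The `seeded` dictionary stands in for the output of the first (BLASTx) pass, which is not run here, and all function names are illustrative.

```python
from collections import Counter
from itertools import product
import math

K = 4  # assumed k-mer length; the abstract does not name the feature used
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_profile(seq, k=K):
    """Return the normalised k-mer frequency vector of a read."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in KMERS)  # ignores k-mers with non-ACGT symbols
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[m] / total for m in KMERS]


def distance(p, q):
    """Euclidean distance between two profiles (an assumed choice of metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def soft_cluster(seeded, unassigned, radius=0.5):
    """Assign reads to overlapping clusters around leader profiles.

    seeded     -- dict: taxon label -> reads matched in pass one (stand-in data)
    unassigned -- reads with no significant database match
    radius     -- hypothetical distance threshold for joining a cluster
    Returns a dict: label -> reads; a read may appear under several labels.
    """
    leaders, clusters = {}, {}
    for label, reads in seeded.items():
        if not reads:
            continue
        profiles = [kmer_profile(r) for r in reads]
        # Assumption: the leader of a seeded cluster is its mean profile.
        leaders[label] = [sum(col) / len(col) for col in zip(*profiles)]
        clusters[label] = list(reads)

    for read in unassigned:
        profile = kmer_profile(read)
        hits = [lab for lab, lead in leaders.items()
                if distance(profile, lead) <= radius]
        if not hits:
            # No leader is close enough: the read founds a new cluster.
            label = f"novel_{len(leaders)}"
            leaders[label] = profile
            clusters[label] = []
            hits = [label]
        for label in hits:  # soft assignment: every close leader gets the read
            clusters[label].append(read)
    return clusters
```

Calling `soft_cluster({"taxA": [...], "taxB": [...]}, reads)` returns the seeded clusters extended with composition-matched reads, plus any newly founded clusters. A read whose profile lies within `radius` of two leaders is placed in both clusters, which is the overlap that soft clustering allows.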


Keywords: Taxonomic Rank; Metagenomic Data; Mode Cluster; Metagenome Dataset; Soft Cluster



Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Shruthi Prabhakara (1)
  • Raj Acharya (1)
  1. Department of Computer Science and Engineering, Pennsylvania State University, University Park, State College
