Evidence-Based Clustering of Reads and Taxonomic Analysis of Metagenomic Data

  • Gianluigi Folino
  • Fabio Gori
  • Mike S. M. Jetten
  • Elena Marchiori
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5780)


The rapidly emerging field of metagenomics seeks to examine the genomic content of communities of organisms to understand their roles and interactions in an ecosystem. In this paper we focus on clustering methods and their application to taxonomic analysis of metagenomic data. Clustering analysis for metagenomics amounts to grouping similar partial sequences, such as raw sequence reads, into clusters in order to discover information about the internal structure of the dataset under consideration, or the relative abundance of protein families. Different methods for clustering analysis of metagenomic datasets have been proposed. Here we focus on evidence-based clustering methods, which employ knowledge extracted from proteins identified by a BLASTx search (proxygenes). We consider two clustering algorithms introduced in previous work and a new one. We discuss the advantages and drawbacks of the algorithms, and use them to perform taxonomic analysis of metagenomic data. To this end, we use three real-life benchmark datasets from previous work on metagenomic data analysis. Comparison of the results indicates satisfactory coherence of the taxonomies output by the three algorithms, with respect to phylogenetic content at the class level and taxonomic distribution at the phylum level. In general, the experimental comparative analysis substantiates the effectiveness of evidence-based clustering methods for taxonomic analysis of metagenomic data.
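As a rough illustration of what evidence-based clustering of reads can look like, the sketch below groups reads by their highest-scoring protein hit and then computes the relative size of each cluster. It is not one of the three algorithms discussed in the paper, whose details are not given in this abstract. The best-hit rule, the function names, the input tuple layout (read, protein, bit score) and the example data are all assumptions made for illustration.

```python
from collections import defaultdict


def cluster_reads_by_best_hit(hits):
    """Group reads by their highest-scoring protein hit.

    hits: iterable of (read_id, protein_id, bit_score) tuples, for example
    parsed from tabular BLASTx output (parsing is not shown here).
    Returns a dict mapping protein_id -> sorted list of read_ids.
    Hypothetical helper: a simple best-hit rule, not the paper's method.
    """
    best = {}  # read_id -> (bit_score, protein_id)
    for read_id, protein_id, bit_score in hits:
        # Keep only the strongest hit seen so far for each read.
        if read_id not in best or bit_score > best[read_id][0]:
            best[read_id] = (bit_score, protein_id)

    clusters = defaultdict(list)
    for read_id, (_, protein_id) in best.items():
        clusters[protein_id].append(read_id)
    return {protein: sorted(reads) for protein, reads in clusters.items()}


def relative_abundance(clusters):
    """Fraction of clustered reads that fall in each protein's cluster."""
    total = sum(len(reads) for reads in clusters.values())
    return {protein: len(reads) / total for protein, reads in clusters.items()}


# Made-up hits, for illustration only.
example_hits = [
    ("read1", "protA", 55.0),
    ("read1", "protB", 40.0),
    ("read2", "protA", 61.5),
    ("read3", "protB", 48.2),
    ("read3", "protC", 30.1),
    ("read4", "protC", 70.0),
]

clusters = cluster_reads_by_best_hit(example_hits)
abundance = relative_abundance(clusters)
```

In this sketch each cluster is represented by a single protein, so taxonomic information attached to that protein could, in principle, be propagated to every read in the cluster. Reads with no hit are simply left unclustered.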


Keywords (added by machine, not by the authors): BLASTx Search; Taxonomic Analysis; Metagenomic Data; Taxonomic Information; Taxonomic Distribution



Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Gianluigi Folino (1)
  • Fabio Gori (2)
  • Mike S. M. Jetten (2)
  • Elena Marchiori (2)
  1. ICAR-CNR, Rende, Italy
  2. Radboud University, Nijmegen, The Netherlands
