SIMCOMP: A Hybrid Soft Clustering of Metagenome Reads

Prabhakara, Shruthi; Acharya, Raj

doi:10.1007/978-3-642-16001-1_10

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 6282)

Included in the following conference series:

IAPR International Conference on Pattern Recognition in Bioinformatics

Abstract

A major challenge facing metagenomics is the development of tools that characterize the functional and taxonomic content of vast amounts of short metagenome reads. In this paper, we present SimComp, a two-pass semi-supervised algorithm for soft clustering of short metagenome reads that is a hybrid of comparative and composition-based methods. In the first pass, a comparative analysis of the metagenome reads using BLASTx extracts the reference sequences from within the metagenome to form an initial set of seeded clusters. Those reads that have a significant match to the database are clustered by their phylogenetic provenance. In the second pass, the remaining fraction of reads is characterized by species-specific, composition-based characteristics. SimComp groups the reads into overlapping clusters, each with its read leader. We make no assumptions about the taxonomic distribution of the dataset. The overlap between the clusters elegantly handles the challenges posed by the nature of the metagenomic data. The resulting cluster leaders can be used as an accurate estimate of the phylogenetic composition of the metagenomic dataset. Our method enriches the dataset into a small number of clusters, while accurately assigning fragments as small as 100 base pairs.
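The two-pass idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the composition features, the similarity measure, the threshold, or how a read leader is chosen. The sketch assumes tetranucleotide frequencies, cosine similarity, a hypothetical `threshold` parameter, and that the first read seen in a cluster acts as its leader. The `hits` mapping is a stand-in for parsed BLASTx output; no BLAST call is made.

```python
from collections import Counter
from itertools import product
from math import sqrt

K = 4  # assumed k-mer size (tetranucleotides); not given in the abstract
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def composition(seq):
    """Return the normalised k-mer frequency vector of a read."""
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    total = sum(counts[k] for k in KMERS) or 1  # avoid division by zero
    return [counts[k] / total for k in KMERS]


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def soft_cluster(reads, hits, threshold=0.8):
    """Two-pass soft clustering sketch.

    reads: {read_id: sequence}
    hits:  {read_id: taxon} for reads with a significant database match
           (hypothetical stand-in for parsed BLASTx results)
    Returns {leader_id: {"taxon", "vec", "members"}}.
    """
    clusters = {}
    leader_of = {}  # taxon -> id of the read leading that seeded cluster

    # Pass 1 (comparative): seed one cluster per taxon from reads with hits.
    for rid, taxon in hits.items():
        if taxon not in leader_of:
            leader_of[taxon] = rid
            clusters[rid] = {"taxon": taxon,
                             "vec": composition(reads[rid]),
                             "members": {rid}}
        else:
            clusters[leader_of[taxon]]["members"].add(rid)

    # Pass 2 (composition): a remaining read joins EVERY cluster whose
    # leader is similar enough, so clusters may overlap (soft clustering).
    for rid, seq in reads.items():
        if rid in hits:
            continue
        vec = composition(seq)
        close = [lid for lid, c in clusters.items()
                 if cosine(vec, c["vec"]) >= threshold]
        if close:
            for lid in close:
                clusters[lid]["members"].add(rid)
        else:
            # No similar leader: the read starts a new, unlabeled cluster.
            clusters[rid] = {"taxon": None, "vec": vec, "members": {rid}}
    return clusters


if __name__ == "__main__":
    reads = {"r1": "ACGT" * 10, "r2": "ACGT" * 8, "r3": "AAAA" * 10}
    hits = {"r1": "taxon_A"}  # only r1 has a database match
    for leader, c in soft_cluster(reads, hits).items():
        print(leader, c["taxon"], sorted(c["members"]))
```

Letting a read join every sufficiently similar cluster, rather than only the best one, is what makes the clustering soft in this sketch. The threshold controls how much the clusters overlap.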

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Pennsylvania State University, University Park, State College, PA, 16801
Shruthi Prabhakara & Raj Acharya

Editor information

Editors and Affiliations

Institute for Computing and Information Sciences, Radboud University Nijmegen, Heyendaalseweg 135, 6525AJ, Nijmegen, The Netherlands
Tjeerd M. H. Dijkstra, Elena Marchiori & Tom Heskes
Institute for Computing and Information Sciences, Turku Centre for Computer Science, Radboud University Nijmegen, Heyendaalseweg 135, 6525AJ, Nijmegen, The Netherlands
Evgeni Tsivtsivadze

About this paper

Cite this paper

Prabhakara, S., Acharya, R. (2010). SIMCOMP: A Hybrid Soft Clustering of Metagenome Reads. In: Dijkstra, T.M.H., Tsivtsivadze, E., Marchiori, E., Heskes, T. (eds) Pattern Recognition in Bioinformatics. PRIB 2010. Lecture Notes in Computer Science, vol 6282. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16001-1_10

DOI: https://doi.org/10.1007/978-3-642-16001-1_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16000-4
Online ISBN: 978-3-642-16001-1
eBook Packages: Computer Science, Computer Science (R0)
