Abstract
Advances in next-generation sequencing technologies allow researchers to sequence millions of microbial organisms in parallel, directly from environmental samples. This “shotgun” sequencing yields many short DNA fragments from different organisms, which form the basis of the field of metagenomics. Although large databases of known microbial DNA allow us to classify some fragments, these databases represent only around 1% of all existing species. For this reason, it is important to use unsupervised methods to group fragments that share the same taxonomic level. In this paper we focus on the binning step in metagenomics, performed in an unsupervised way. We propose a consensus clustering method based on an iterative clustering process that uses different sequence lengths in the databases and a mixture of distances to find the consensus clustering. The final clustering performance is evaluated according to the purity of the clusters. The proposed method outperforms both simple methods and iterative methods.
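The abstract does not spell out how purity is computed, so the following is only a minimal sketch of the standard definition: each cluster is credited with the count of its most frequent true label, and the credited counts are summed and divided by the total number of fragments. The function name `cluster_purity` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, true_labels):
    """Return the fraction of items that carry the majority label of their cluster.

    Assumption: this is the textbook purity measure; the paper may use a variant.
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if len(cluster_ids) == 0:
        raise ValueError("purity is undefined for an empty clustering")

    # Count how often each true label occurs inside each cluster.
    label_counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        label_counts[cluster][label] += 1

    # Each cluster contributes the size of its largest label group.
    majority_total = sum(max(counts.values()) for counts in label_counts.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy example: 8 fragments, 3 clusters, 3 source organisms.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
organisms = ["A", "A", "B", "B", "B", "C", "C", "A"]
purity = cluster_purity(clusters, organisms)
print(f"purity = {purity:.2f}")
```

Purity is easy to inflate by producing many small clusters (one fragment per cluster gives a purity of 1.0), so it is usually read together with the number of clusters.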
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Bonet, I., Escobar, A., Mesa-Múnera, A., Alzate, J.F. (2017). Consensus Clustering for Binning Metagenome Sequences. In: Pichardo-Lagunas, O., Miranda-Jiménez, S. (eds) Advances in Soft Computing. MICAI 2016. Lecture Notes in Computer Science, vol 10062. Springer, Cham. https://doi.org/10.1007/978-3-319-62428-0_23
DOI: https://doi.org/10.1007/978-3-319-62428-0_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-62427-3
Online ISBN: 978-3-319-62428-0
eBook Packages: Computer Science (R0)