Clustering Algorithm Optimization Applied to Metagenomics Using Big Data

Vanegas, Julián; Bonet, Isis

doi:10.1007/978-3-030-02828-2_14

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 884)

Included in the following conference series:

Conference on Information Technologies and Communication of Ecuador

Abstract

In metagenomics, the extraction process splits the amino acid sequences into DNA fragments of variable size. These fragments are then used to determine which known species are present in a sample and what portion of the sequences has not yet been categorized. To improve this identification, clustering algorithms are used as enablers of the identification process for the different species. They group amino acid sequences that share a given degree of similarity into clusters of DNA fragments, so that the fragments can be compared as groups and analyzed faster. One problem with metagenomic databases is their size, which leads to long computation times. Newer platforms make it possible to develop and run algorithms with better time performance. Apache Spark and TensorFlow were used to reduce execution times, since their libraries include native implementations of these clustering algorithms. With these libraries as a base, Iterative k-means was implemented and used as a point of comparison. In the results, Iterative k-means reduced the execution time relative to the traditional implementation. TensorFlow improved execution times in general, with a more significant difference for Iterative k-means, at the cost of requiring much more processing power.
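
The grouping step described in the abstract can be illustrated with a minimal sketch of standard k-means (Lloyd's algorithm) in plain Python. It is not the authors' code, and it does not cover Iterative k-means, which the abstract does not specify. Several choices here are assumptions made only for illustration: fragments are represented by normalized dinucleotide (k = 2) frequencies, similarity is squared Euclidean distance, and centroids are seeded with a deterministic farthest-first rule so the run is reproducible. The function names `kmer_profile`, `farthest_first_seeds`, and `kmeans` are hypothetical.

```python
from itertools import product


def kmer_profile(seq, k=2):
    """Return the normalized k-mer frequency vector of a DNA fragment."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0.0] * len(kmers)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # k-mers with ambiguous bases are skipped
        if j is not None:
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_seeds(points, k):
    """Assumed seeding rule: start at the first point, then repeatedly add
    the point that is farthest from all seeds chosen so far."""
    seeds = [list(points[0])]
    while len(seeds) < k:
        nxt = max(points, key=lambda p: min(squared_distance(p, s) for s in seeds))
        seeds.append(list(nxt))
    return seeds


def kmeans(points, k, iterations=50):
    """Standard Lloyd's k-means; returns (centroids, labels)."""
    centroids = farthest_first_seeds(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Toy input (made up): two AT-rich and two GC-rich fragments.
fragments = ["ATATATAT", "GCGCGCGC", "TATATATA", "CGCGCGCG"]
profiles = [kmer_profile(f, k=2) for f in fragments]
centroids, labels = kmeans(profiles, k=2)
print(labels)
```

On this toy input the script prints `[0, 1, 0, 1]`: the AT-rich fragments share one cluster and the GC-rich fragments the other. For databases of realistic size, the abstract points to the native k-means implementations in Apache Spark and TensorFlow rather than a single-process loop like this one.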

Author information

Authors and Affiliations

EIA University, km 2 + 200 Vía al Aeropuerto José María Córdova, Envigado, Antioquia, Colombia
Julián Vanegas & Isis Bonet

Corresponding author

Correspondence to Isis Bonet.

Editor information

Editors and Affiliations

Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, Noord-Brabant, The Netherlands
Miguel Botto-Tobar
Facultad de Ingeniería, Universidad Nacional de Chimborazo, Riobamba, Ecuador
Lida Barba-Maggi
Department of Software Engineering, Blekinge Tekniska Högskola, Karlskrona, Blekinge Län, Sweden
Javier González-Huerta
Facultad de Ingeniería, Universidad Nacional de Chimborazo, Riobamba, Ecuador
Patricio Villacrés-Cevallos
Escuela Superior Politécnica de Chimborazo, Riobamba, Ecuador
Omar S. Gómez
Facultad de Ingeniería, Universidad Nacional de Chimborazo, Riobamba, Ecuador
María I. Uvidia-Fassler

About this paper

Cite this paper

Vanegas, J., Bonet, I. (2019). Clustering Algorithm Optimization Applied to Metagenomics Using Big Data. In: Botto-Tobar, M., Barba-Maggi, L., González-Huerta, J., Villacrés-Cevallos, P., Gómez, O.S., Uvidia-Fassler, M. (eds) Information and Communication Technologies of Ecuador (TIC.EC). TICEC 2018. Advances in Intelligent Systems and Computing, vol 884. Springer, Cham. https://doi.org/10.1007/978-3-030-02828-2_14

DOI: https://doi.org/10.1007/978-3-030-02828-2_14
Published: 18 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02827-5
Online ISBN: 978-3-030-02828-2
eBook Packages: Intelligent Technologies and Robotics (R0)
