Clustering Algorithm Optimization Applied to Metagenomics Using Big Data

  • Conference paper
Information and Communication Technologies of Ecuador (TIC.EC) (TICEC 2018)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 884)

Abstract

In metagenomics, the extraction process splits the amino acid sequences into DNA fragments of variable size. These fragments are then used to determine which known species are present in a sample and what portion of the sequences has not been previously categorized. To improve this identification, clustering algorithms are used as enablers: they group sequences that share a certain degree of similarity into clusters of DNA fragments, so that the fragments can be compared as groups and analyzed faster. One problem with metagenomic databases is their size, which leads to long computation times. Newer platforms make it possible to develop and run algorithms with better time performance. Apache Spark and TensorFlow were used here to reduce execution times, since both include native implementations of clustering algorithms in their libraries. With these libraries as a base, Iterative k-means was implemented and used as a point of comparison. In the results, Iterative k-means reduces the execution time with respect to the traditional implementation. TensorFlow improved execution times in general, with a more significant difference for Iterative k-means, at the cost of requiring much more processing power.
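The grouping step described in the abstract can be illustrated with a minimal sketch. It assumes each fragment is represented by its dinucleotide frequency profile and clusters those profiles with plain Lloyd's k-means. The feature choice, the helper names (`kmer_profile`, `farthest_point_init`, `kmeans`), the deterministic seeding and the toy fragments are all assumptions made for illustration; this is neither the paper's Spark or TensorFlow implementation nor its Iterative k-means.

```python
from itertools import product


def kmer_profile(seq, k=2):
    """Normalized frequency of every DNA k-mer in seq (assumed feature choice)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases such as N
            counts[window] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    return [counts[m] / total for m in kmers]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, n_clusters):
    """Deterministic seeding (an assumption, chosen for reproducibility):
    start from the first point, then repeatedly add the point that is
    farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < n_clusters:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, n_clusters, max_iter=50):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centroids = farthest_point_init(points, n_clusters)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(n_clusters), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments no longer change
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members;
        # an empty cluster keeps its previous centroid.
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy fragments (hypothetical): three AT-rich and three GC-rich reads.
fragments = [
    "ATATATATTAAT", "TTATAATATATA", "AATTATATATTA",
    "GCGCGGCCGCGC", "CGCGCCGGCGCG", "GGCCGCGCGCGC",
]
profiles = [kmer_profile(f) for f in fragments]
labels = kmeans(profiles, n_clusters=2)
print(labels)
```

On a real metagenomic dataset the assignment and update steps would be delegated to a library implementation, such as the Spark or TensorFlow k-means the abstract mentions, because both steps scan every fragment on every iteration.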



Author information

Correspondence to Isis Bonet.

Copyright information

© 2019 Springer Nature Switzerland AG

Cite this paper

Vanegas, J., Bonet, I. (2019). Clustering Algorithm Optimization Applied to Metagenomics Using Big Data. In: Botto-Tobar, M., Barba-Maggi, L., González-Huerta, J., Villacrés-Cevallos, P., S. Gómez, O., Uvidia-Fassler, M. (eds) Information and Communication Technologies of Ecuador (TIC.EC). TICEC 2018. Advances in Intelligent Systems and Computing, vol 884. Springer, Cham. https://doi.org/10.1007/978-3-030-02828-2_14
