On Clustering Validation in Metagenomics Sequence Binning

  • Paulo Oliveira
  • Kleber Padovani
  • Ronnie Alves
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11347)


In clustering, one of the most challenging aspects is validation, whose objective is to evaluate how good a clustering solution is. Sequence binning is a clustering task in metagenomic data analysis: the challenge is essentially to group together sequences that belong to the same genome. As a clustering problem, it requires the proper use of validation criteria for the discovered partitions. In sequence binning, precision, recall, and the F-measure index (external validation) are normally used as benchmarks. In practice, however, the (sub)optimal number of clusters is unknown, so these metrics might be biased towards an overestimated “ground truth”. When reference information about the genomes is not available, how can the quality of the bins resulting from a clustering solution be evaluated? To answer this question, we empirically study both quantitative aspects (internal indexes) and qualitative aspects (biological soundness) of evaluating clustering solutions for the sequence binning problem. Our experimental study indicates that the number of clusters estimated by binning algorithms does not have much impact on the quality of the bins, as measured by the biological soundness of the discovered clusters. Sub-optimal bins of high quality (greater than 90%) were identified in both rich and poor clustering partitions. Qualitative validation is essential for the proper evaluation of a sequence binning solution that generates bins of sub-optimal quality. Internal indexes should be used only in conjunction with qualitative criteria, as a trade-off between the number of partitions and the biological soundness of their respective bins.
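The external validation mentioned above can be illustrated with a minimal sketch. It assumes one common formulation of precision and recall from the binning literature, where each bin is credited with its dominant genome and each genome with the bin that holds most of its sequences. The exact definitions used in the paper are not shown here, and the function name `binning_scores` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def binning_scores(true_genomes, bin_labels):
    """Return (precision, recall, f_measure) for one binning solution.

    true_genomes[i] is the reference genome of sequence i;
    bin_labels[i] is the bin assigned to sequence i.
    Assumed formulation, not necessarily the one used in the paper.
    """
    if len(true_genomes) != len(bin_labels):
        raise ValueError("label lists must have the same length")
    if not true_genomes:
        raise ValueError("at least one sequence is required")

    # Contingency counts: per bin, how many sequences come from each genome,
    # and per genome, how many of its sequences land in each bin.
    per_bin = defaultdict(Counter)
    per_genome = defaultdict(Counter)
    for genome, bin_id in zip(true_genomes, bin_labels):
        per_bin[bin_id][genome] += 1
        per_genome[genome][bin_id] += 1

    total = len(true_genomes)
    # Precision: each bin is credited with its dominant genome.
    precision = sum(max(c.values()) for c in per_bin.values()) / total
    # Recall: each genome is credited with the bin holding most of it.
    recall = sum(max(c.values()) for c in per_genome.values()) / total
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy data (hypothetical): two genomes, A and B, four sequences each.
truth = ["A"] * 4 + ["B"] * 4
exact = [0, 0, 0, 0, 1, 1, 1, 1]   # two bins, one per genome
split = [0, 0, 1, 1, 2, 2, 3, 3]   # four bins, each genome split in two

print(binning_scores(truth, exact))
print(binning_scores(truth, split))
```

In this toy case the split solution keeps a precision of 1.0 while its recall drops to 0.5, which shows how strongly these scores depend on the number of bins and on the reference labels. Both inputs require known genome labels, which is exactly the information that is missing in a real binning analysis.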


Keywords: Validation, Clustering, Unsupervised, Binning, Metagenomics



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Computer Science Graduate Program, Federal University of Pará, Belém, Brazil
  2. Instituto Tecnológico Vale, Belém, Brazil
