On Clustering Validation in Metagenomics Sequence Binning
In clustering, one of the most challenging aspects is validation, whose objective is to evaluate how good a clustering solution is. Sequence binning is a clustering task in metagenomic data analysis: the challenge is essentially to group together sequences that belong to the same genome. As a clustering problem, it requires proper use of validation criteria for the discovered partitions. In sequence binning, precision, recall, and the F-measure index (external validation) are normally used as benchmarks. In practice, however, the (sub-)optimal number of clusters is unknown, so these metrics might be biased toward an overestimated “ground truth”. In sequence binning analysis, where reference information about the genomes is not available, how can the quality of the bins resulting from a clustering solution be evaluated? To answer this question, we empirically study both quantitative aspects (internal indexes) and qualitative aspects (biological soundness) when evaluating clustering solutions for the sequence binning problem. Our experimental study indicates that the number of clusters estimated by binning algorithms does not strongly affect the quality of the bins, as measured by the biological soundness of the discovered clusters. Sub-optimal bins with quality greater than 90% were identified in both rich and poor clustering partitions. Qualitative validation is essential for the proper evaluation of a sequence binning solution that generates bins of sub-optimal quality. Internal indexes should be used only in conjunction with qualitative ones, as a trade-off between the number of partitions and the biological soundness of the respective bins.
Keywords: Validation, Clustering, Unsupervised, Binning, Metagenomics
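To make the external validation metrics named in the abstract concrete, the following is a minimal sketch of one common way to compute precision, recall, and F-measure for a binning solution when reference genome labels are available. The majority-label definitions used here, the function name `binning_scores`, and the input dictionaries are illustrative assumptions; the paper may define these metrics differently.

```python
from collections import Counter, defaultdict


def binning_scores(true_genome, predicted_bin):
    """Return (precision, recall, f_measure) for a binning solution.

    Assumed (hypothetical) inputs:
      true_genome:   dict, sequence id -> reference genome label
      predicted_bin: dict, sequence id -> bin label (unbinned sequences omitted)
    """
    # Contingency counts: genomes seen in each bin, and bins seen for each genome.
    per_bin = defaultdict(Counter)
    per_genome = defaultdict(Counter)
    for seq_id, bin_label in predicted_bin.items():
        genome = true_genome[seq_id]
        per_bin[bin_label][genome] += 1
        per_genome[genome][bin_label] += 1

    n_binned = len(predicted_bin)
    n_total = len(true_genome)
    if n_binned == 0 or n_total == 0:
        return 0.0, 0.0, 0.0

    # Precision: fraction of binned sequences that match their bin's majority genome.
    precision = sum(max(c.values()) for c in per_bin.values()) / n_binned
    # Recall: fraction of all sequences recovered in the best bin for their genome.
    recall = sum(max(c.values()) for c in per_genome.values()) / n_total
    # F-measure: harmonic mean of precision and recall.
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Toy example: two genomes, two bins, one sequence left unbinned.
truth = {"s1": "A", "s2": "A", "s3": "A", "s4": "B", "s5": "B", "s6": "B"}
bins = {"s1": 1, "s2": 1, "s3": 2, "s4": 2, "s5": 2}
precision, recall, f_measure = binning_scores(truth, bins)
```

Note that these scores depend entirely on the reference labels in `truth`; when no such reference exists, which is the setting the abstract addresses, they cannot be computed and other forms of validation are needed.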