Clustering stability-based Evolutionary K-Means
Evolutionary K-Means (EKM), which combines K-Means with a genetic algorithm, solves the initialization problem of K-Means by selecting parameters automatically through the evolution of partitions. Current EKM algorithms usually adopt the silhouette index as the cluster validity index, and they are effective at clustering well-separated clusters; however, their performance on noisy data is often disappointing. Clustering stability-based approaches, on the other hand, are more robust to noise, yet they require an intelligent starting point to discover challenging clusters. It is therefore natural to combine EKM with clustering stability-based analysis. In this paper, we present a novel EKM algorithm that uses clustering stability to evaluate partitions. We first introduce two weighted aggregated consensus matrices, the positive aggregated consensus matrix (PA) and the negative aggregated consensus matrix (NA), to store the clustering tendency of each pair of instances: PA stores the tendency of two instances to share the same label, and NA stores their tendency to receive different labels. Based on these matrices, clusters and partitions can be evaluated from the viewpoint of clustering stability. We then propose CSEKM, a clustering stability-based EKM algorithm that evolves partitions and the aggregated matrices simultaneously. To evaluate its performance, we compare CSEKM with an EKM algorithm, two consensus clustering algorithms, a clustering stability-based algorithm, and a multi-index-based clustering approach. Experimental results on a series of artificial datasets, two simulated datasets, and eight UCI datasets suggest that CSEKM is more robust to noise.
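The idea of pairwise consensus matrices described above can be illustrated with a minimal sketch. Note this is an unweighted simplification for illustration only: the paper's PA and NA are weighted aggregated matrices, and the function names (`aggregate_consensus`, `stability_score`) and the particular scoring ratio are assumptions, not the authors' actual formulation.

```python
import numpy as np

def aggregate_consensus(partitions, n):
    # PA[i, j]: how often instances i and j share a label across partitions
    # NA[i, j]: how often they receive different labels
    # (unweighted illustration of the paper's weighted PA/NA matrices)
    pa = np.zeros((n, n), dtype=float)
    na = np.zeros((n, n), dtype=float)
    for labels in partitions:
        same = np.equal.outer(labels, labels)
        pa += same
        na += ~same
    return pa, na

def stability_score(labels, pa, na):
    # One plausible stability criterion: reward within-cluster pairs that
    # frequently co-cluster (high PA) and between-cluster pairs that
    # frequently separate (high NA), normalized by total evidence.
    labels = np.asarray(labels)
    same = np.equal.outer(labels, labels)
    mask = ~np.eye(len(labels), dtype=bool)  # ignore self-pairs
    agree = np.where(same, pa, na)
    return agree[mask].sum() / (pa + na)[mask].sum()

# Three base partitions of four instances
parts = [np.array([0, 0, 1, 1]),
         np.array([1, 1, 0, 0]),
         np.array([0, 0, 0, 1])]
pa, na = aggregate_consensus(parts, 4)
print(stability_score(np.array([0, 0, 1, 1]), pa, na))  # ≈ 0.833
```

A partition whose within-cluster pairs have consistently co-clustered (and whose between-cluster pairs have consistently separated) scores close to 1, which is the sense in which such a score could serve as a stability-based fitness in an evolutionary loop.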
Keywords: Clustering · K-Means algorithm · Genetic algorithm · Clustering stability · Consensus clustering
This study was funded by the National Natural Science Foundation of China (Grant No. 60805042) and the Fujian Natural Science Foundation (Grant No. 2018J01794).
Compliance with ethical standards
Conflict of interest
The authors declare that they have no conflict of interest.
This article does not contain any studies with human participants or animals performed by any of the authors.