Advertisement

Soft Computing

, Volume 23, Issue 1, pp 305–321 | Cite as

Clustering stability-based Evolutionary K-Means

  • Zhenfeng He
  • Chunyan Yu
Methodologies and Application

Abstract

Evolutionary K-Means (EKM), which combines K-Means and genetic algorithm, solves K-Means’ initiation problem by selecting parameters automatically through the evolution of partitions. Currently, EKM algorithms usually choose silhouette index as cluster validity index, and they are effective in clustering well-separated clusters. However, their performance of clustering noisy data is often disappointing. On the other hand, clustering stability-based approaches are more robust to noise; yet, they should start intelligently to find some challenging clusters. It is necessary to join EKM with clustering stability-based analysis. In this paper, we present a novel EKM algorithm that uses clustering stability to evaluate partitions. We firstly introduce two weighted aggregated consensus matrices, positive aggregated consensus matrix (PA) and negative aggregated consensus matrix (NA), to store clustering tendency for each pair of instances. Specifically, PA stores the tendency of sharing the same label and NA stores that of having different labels. Based upon the matrices, clusters and partitions can be evaluated from the view of clustering stability. Then, we propose a clustering stability-based EKM algorithm CSEKM that evolves partitions and the aggregated matrices simultaneously. To evaluate the algorithm’s performance, we compare it with an EKM algorithm, two consensus clustering algorithms, a clustering stability-based algorithm and a multi-index-based clustering approach. Experimental results on a series of artificial datasets, two simulated datasets and eight UCI datasets suggest CSEKM is more robust to noise.

Keywords

Clustering K-Means algorithm Genetic algorithm Clustering stability Consensus clustering 

Notes

Acknowledgements

This study was funded by National Nature Science Foundation of China (Grant No. 60805042), and Fujian Natural Science Foundation (Grant No. 2018J01794).

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

References

  1. Aggarwal CC, Reddy CK (2014) Data clustering: algorithms and applications. CRC Press, Boca RatonCrossRefzbMATHGoogle Scholar
  2. Alves V, Campello RJGB, Hruschka ER (2006) Towards a fast evolutionary algorithm for clustering. In: Proceedings of IEEE congress on evolutionary computation (CEC 2006), pp 1776–1783Google Scholar
  3. Arbelaitz O, Gurrutxaga I, Muguerza J, Perez JM, Perona I (2013) An extensive comparative study of cluster validity indices. Pattern Recogn 46:243–256CrossRefGoogle Scholar
  4. Arthur D, Vassilvitskii (2007) S K-means++: the advantages of careful seeding. In: Proceedings of the 18th annual ACM-SIAM symposium on discrete algorithms (SODA), pp 1027–1035Google Scholar
  5. Bache K, Lichman M (2013) UCI machine learning repository. University of California, School of Information and Computer Science, Irvine, CA. http://archive.ics.uci.edu/ml
  6. Bandyopadhyay S, Maulik U (2002) An evolutionary technique based on K-Means algorithm for optimal clustering in \(R^N\). Inf Sci 146:221–237CrossRefzbMATHGoogle Scholar
  7. Ben-David S, von Luxburg U, Páal D (2006) A sober look at clustering stability. In: Proceedings of the 19th annual conference on learning theory (COLT 2006), pp 5–19Google Scholar
  8. Bezdek JC, Boggavarapu S, Hall LO, Bensaid A (1994) Genetic algorithm guided clustering. In: Proceedings of the first IEEE conference on evolutionary computation, pp 34–39Google Scholar
  9. Brunsch T, Roglin H (2013) A bad instance for k-means++. Theoret Comput Sci 505:19–26MathSciNetCrossRefzbMATHGoogle Scholar
  10. Bubeck S, Meilă M, Luxburg U (2012) How the initialization affects the stability of the K-Means algorithm. ESAIM Prob Stat 16:436–452MathSciNetCrossRefzbMATHGoogle Scholar
  11. Cano JR, Cordon O, Herrera F, Sanchez F (2002) A greedy randomized adaptive search procedure applied to the clustering problem as an initialization process using K-Means as a local search procedure, J Intell Fuzzy Syst 12:235–242zbMATHGoogle Scholar
  12. Charrad M, Ghazzali N, Boiteau V, Niknafs A (2014) NbClust: an R package for determining the relevant number of clusters in a data set. J Stat Softw 61(6):1–36CrossRefGoogle Scholar
  13. Chen S, Chao Y, Wang H, Fu H (2006) A prototypes-embedded genetic K-Means algorithm. In: Proceedings of the 18th international conference on pattern recognition (ICPR), pp 724–727Google Scholar
  14. Chiu TY, Hsu TC, Wang JS (2010) AP-based consensus clustering for gene expression time series. In: Proceedings of the 20th international conference on pattern recognition (ICPR), pp 2512–2515Google Scholar
  15. Chiui TY, Hsu TC, Yen CC, Wang JS (2015) Interpolation based consensus clustering for gene expression time series. BMC Bioinform 16:117CrossRefGoogle Scholar
  16. Craenendonck TV, Blockeel H (2015) Using internal validity measures to compare clustering algorithms. ICML 2015 AutoML Workshop, https://lirias.kuleuven.be/bitstream/123456789/504712/1/automl_camera.pdf
  17. de Amorima RC (2015) Recovering the number of clusters in data sets with noise features using feature rescaling factors. Inf Sci 324:126–145MathSciNetCrossRefGoogle Scholar
  18. Erisoglu M, Calis N, Sakallioglu S (2011) A new algorithm for initial cluster centers in K-Means algorithm. Pattern Recogn Lett 32:1701–1705CrossRefGoogle Scholar
  19. Famili AF, Liu G, Liu Z (2004) Evaluation and optimization of clustering in gene expression data analysis. Bioinformatics 20(10):1535–1545CrossRefGoogle Scholar
  20. Fang Y, Wang J (2012) Selection of the number of clusters via the bootstrap method. Comput Stat Data Anal 56(3):468–477MathSciNetCrossRefzbMATHGoogle Scholar
  21. Hall LO, Özyurt IB, Bezdek JC (1999) Clustering with a genetically optimized approach. IEEE Trans Evol Comput 3(2):103–112CrossRefGoogle Scholar
  22. Handl J, Knowles J (2007) An evolutionary approach to multiobjective clustering. IEEE Trans Evol Comput 11(1):56–76CrossRefGoogle Scholar
  23. He Z (2016) Evolutionary K-Means with pair-wise constraints. Soft Comput 20(1):287–301MathSciNetCrossRefGoogle Scholar
  24. Hennig C (2007) Cluster-wise assessment of cluster stability. Comput Stat Data Anal 52(1):258–271MathSciNetCrossRefzbMATHGoogle Scholar
  25. Hruschka ER, Campello RJGB, de Castro LN (2006) Evolving clusters in gene-expression data. Inf Sci 176:1898–1927MathSciNetCrossRefGoogle Scholar
  26. Hruschka ER, Campello RJGB, Freitas AA, Carvalho ACPLF (2009) A survey of evolutionary algorithms for clustering. IEEE Trans Syst Man Cybern Part C Appl Rev 39(2):133–155CrossRefGoogle Scholar
  27. Jain AK (2010) Data clustering: 50 years beyond K-means. Pattern Recogn Lett 31(8):651–666CrossRefGoogle Scholar
  28. Krishna K, Murty MN (1999) Genetic K-Means algorithm. IEEE Trans Syst Man Cybern B Cybern 29(3):433–439CrossRefGoogle Scholar
  29. Liu Y, Li Z, Xiong H, Gao X, Wu J (2010) Understanding of internal clustering validation measures. In: Proceedings on 10th IEEE international conference on data mining (ICDM 2010), pp 911–916Google Scholar
  30. Moller U (2009) Resampling methods for unsupervised learning from sample data. In: Mellouk A, Chebira A (eds) Machine learning. InTech, Cape Town, SA, pp 289–304 http://cdn.intechweb.org/pdfs/6069.pdf
  31. Monti S, Tamayo P, Mesirov J, Golub T (2003) Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn 52:91118CrossRefzbMATHGoogle Scholar
  32. Naldi MC, Campello RJGB, Hruschka ER, Carvalho ACPLF (2011) Efficiency issues of evolutionary K-Means. Appl Soft Comput 11:1938–1952CrossRefGoogle Scholar
  33. R Core Team (2015) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. https://www.R-project.org/
  34. Rahman MA, Islam MZ, Bossomaier T, DenClust (2014) A density based seed selection approach for K-Means. In: Proceedings of 13th international conference on artificial intelligence and soft computing (ICSISC), Part II, Lecture notes in computer science, vol 8468, pp 784–795Google Scholar
  35. Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65CrossRefzbMATHGoogle Scholar
  36. Schmidt TSB, Matias Rodrigues JF, von Mering C (2015) Limits to robustness and reproducibility in the demarcation of operational taxonomic units. Environ Microbiol 17(5):1689–1706CrossRefGoogle Scholar
  37. Senbabaoglu Y, Michailidis G, Li JZ (2014) Critical limitations of consensus clustering in class discovery. Sci Rep 4:6207CrossRefGoogle Scholar
  38. Shamir O, Tishby N (2010) Stability and model selection in K-Means clustering. Mach Learn 80(2–3):213–243MathSciNetCrossRefGoogle Scholar
  39. Vendramin L, Campello RJGB, Hruschka ER (2010) Relative clustering validity criteria: a comparative overview. Stat Anal Data Min 3(4):243–256MathSciNetGoogle Scholar
  40. Vinh NX, Epps J (2009) A novel approach for automatic number of clusters detection in microarray data based on consensus clustering. In: Proceedings of the 9th international conference on bioinformatics and bioengineering (BIBE), pp 84–91Google Scholar
  41. Vinh NX, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary?. In: Proceedings of the 26th annual international conference on machine learning (ICML 2009), pp 1073–1080Google Scholar
  42. von Luxburg U (2009) Clustering stability: an overview. Found Trends Mach Learn 2(3):235–274CrossRefzbMATHGoogle Scholar
  43. Wang X, Qiu W, Zamar RH (2007) CLUES: a non-parametric clustering method based on local shrinking. Comput Stat Data Anal 52(1):286–298MathSciNetCrossRefzbMATHGoogle Scholar
  44. Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678CrossRefGoogle Scholar
  45. Yu Z, Wong H, Wang H (2007) Graph based consensus clustering for class discovery from gene expression data. Bioinformatics 23(21):2888–2896CrossRefGoogle Scholar

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. 1.College of Mathematics and Computer ScienceFuzhou UniversityFuzhouChina

Personalised recommendations