Estimating the Optimal Number of Clusters Via Internal Validity Index

Abstract

Estimating the optimal number of clusters (NC) is a pivotal problem in cluster analysis. This paper designs a novel internal clustering validity index from the viewpoint of sample geometry, termed the between-within cluster (BWC) index, and proposes a method for estimating the optimal NC. The BWC index improves on the well-known Silhouette index. It validates the clustering results produced by a given clustering algorithm (e.g., affinity propagation or hierarchical clustering) and estimates the optimal NC for many kinds of data sets, including synthetic data sets, benchmark data sets, UCI data sets, gene expression data sets, and images. Theoretical analysis and experimental studies demonstrate the effectiveness and high efficiency of the new index and method.
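The general workflow described in the abstract (run a clustering algorithm for several candidate values of NC, score each partition with an internal validity index, and keep the best-scoring NC) can be illustrated with a minimal sketch. The BWC formula itself is not given in this abstract, so the sketch uses the Silhouette index that BWC is said to improve, paired with a naive average-linkage hierarchical clustering. The function names `silhouette`, `average_linkage`, and `estimate_nc`, as well as the toy data, are assumptions made for illustration and are not taken from the paper.

```python
import math
from itertools import combinations


def silhouette(points, labels):
    """Mean Silhouette width of a hard partition (stand-in for BWC)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("Silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute s(i) = 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        m = max(a, b)
        if m > 0:
            total += (b - a) / m
    return total / len(points)


def average_linkage(points, k):
    """Naive agglomerative clustering: merge until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        def linkage(pair):
            a, b = pair
            dists = [math.dist(points[i], points[j])
                     for i in clusters[a] for j in clusters[b]]
            return sum(dists) / len(dists)
        # combinations yields a < b, so deleting index b keeps index a valid
        a, b = min(combinations(range(len(clusters)), 2), key=linkage)
        clusters[a].extend(clusters[b])
        del clusters[b]
    labels = [0] * len(points)
    for lab, members in enumerate(clusters):
        for i in members:
            labels[i] = lab
    return labels


def estimate_nc(points, k_max):
    """Score every candidate NC in 2..k_max and return the best one."""
    scores = {k: silhouette(points, average_linkage(points, k))
              for k in range(2, k_max + 1)}
    return max(scores, key=scores.get), scores


# Toy data (hypothetical): three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 10), (5, 11), (6, 10), (6, 11)]
best_k, scores = estimate_nc(points, k_max=5)
print("estimated NC:", best_k)
```

On this toy data the index peaks at three clusters. Any other internal validity index with the same "higher is better" convention could be substituted for `silhouette` without changing `estimate_nc`.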

Acknowledgements

The authors would like to thank the anonymous reviewers for their insightful comments and valuable suggestions. This work was supported in part by the Fundamental Research Funds for the Central Universities under Grant JUSRP11235 and in part by the National Natural Science Foundation of China under Grant Nos. 61673193 and 61833007.

Author information

Corresponding author

Correspondence to Shibing Zhou.

Cite this article

Zhou, S., Liu, F. & Song, W. Estimating the Optimal Number of Clusters Via Internal Validity Index. Neural Process Lett (2021). https://doi.org/10.1007/s11063-021-10427-8

Keywords

  • Clustering validity index
  • Number of clusters
  • Affinity propagation
  • Hierarchical clustering