Seed selection algorithm through K-means on optimal number of clusters

  • Kuntal Chowdhury
  • Debasis Chaudhuri
  • Arup Kumar Pal
  • Ashok Samal


Clustering is one of the important unsupervised learning techniques in data mining for grouping similar features. The growing point of a cluster is known as a seed, and selecting an appropriate seed is a key criterion for any seed-based clustering technique. The performance of seed-based algorithms depends on the initial cluster-center selection and on the optimal number of clusters in an unknown data set; cluster quality and the optimal number of clusters are therefore central issues in cluster analysis. In this paper, the proposed seed point selection algorithm is applied to 3-band image data and 2D discrete data. The algorithm selects seed points by maximizing the joint probability of pixel intensities subject to a distance restriction criterion. The optimal number of clusters is decided on the basis of a combination of seven different cluster validity indices. We also compare the results of the proposed seed selection algorithm, run with K-means clustering on the optimal number of clusters, against other classical seed selection algorithms applied through K-means clustering, in terms of seed generation time (SGT), cluster building time (CBT), segmentation entropy, and the number of iterations (NOTKmeans). We also analyze the CPU time and number of iterations of the proposed seed selection method against other clustering algorithms.
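The general idea of distance-restricted seed selection followed by K-means can be illustrated with a minimal sketch. This is not the paper's joint-probability formulation; as a stand-in it ranks candidate points by local density and accepts a candidate as a seed only if it lies at least a minimum distance from every seed already chosen. All function names and parameters below are illustrative assumptions.

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def select_seeds(points, k, min_dist):
    """Pick k seeds: rank points by local density (neighbours within
    min_dist, a crude stand-in for a joint-probability criterion), then
    accept a candidate only if it is at least min_dist from every
    previously accepted seed (the distance restriction)."""
    density = [sum(1 for q in points if euclidean(p, q) < min_dist)
               for p in points]
    order = sorted(range(len(points)), key=lambda i: -density[i])
    seeds = []
    for i in order:
        if all(euclidean(points[i], s) >= min_dist for s in seeds):
            seeds.append(points[i])
        if len(seeds) == k:
            break
    return seeds

def kmeans(points, seeds, iters=50):
    """Standard Lloyd iteration started from the selected seeds."""
    centers = [list(s) for s in seeds]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)),
                    key=lambda c: euclidean(p, centers[c]))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):
            if cl:  # keep the old center if a cluster empties out
                centers[j] = [sum(x) / len(cl) for x in zip(*cl)]
    return centers
```

Because the seeds are forced apart by the distance restriction, well-separated groups each receive their own initial center, which is the property a good seed selector is meant to guarantee before K-means runs.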


Clustering · Cluster building time · Cluster validity indices · Joint probability · K-means · Seed point · Seed generation time · Segmentation entropy




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Kuntal Chowdhury
    • 1
  • Debasis Chaudhuri
    • 2
  • Arup Kumar Pal
    • 1
  • Ashok Samal
    • 3
  1. Department of Computer Science and Engineering, Indian Institute of Technology (Indian School of Mines) [IIT(ISM)], Dhanbad, India
  2. DRDO Integration Centre, Panagarh, West Bengal, India
  3. Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, USA
