An Efficient Approach for Selection of Initial Cluster Centroids for k-means

  • Conference paper
Data Science and Analytics (REDSET 2019)

Abstract

The choice of initial centroids has a major impact on the performance and accuracy of the k-means algorithm when grouping data objects into clusters. In basic k-means, a purely arbitrary choice of initial centroids leads to different clusters in every run and consequently affects the algorithm's performance and accuracy. To date, researchers have made several attempts to improve on this, but scope for improvement still exists. This paper therefore proposes a new approach to initializing centroids for k-means, based on choosing well-separated data objects as initial cluster centroids instead of selecting them purely arbitrarily. As a consequence, the chosen centroids have a higher probability of lying close to the final cluster centroids. The proposed algorithm is empirically assessed on six well-known datasets. The results confirm that the proposed approach is considerably better than purely arbitrary selection of centroids.
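
The abstract does not spell out the selection procedure, so the following is only a minimal sketch of the general idea of picking well-separated data objects as seeds. It uses the generic farthest-point (maximin) heuristic, which is an assumption made for illustration and not necessarily the algorithm proposed in the paper. The function name `farthest_point_seeds`, the random choice of the first seed, and the toy data are all hypothetical.

```python
import math
import random


def farthest_point_seeds(points, k, seed=0):
    """Pick k well-separated points with the maximin heuristic (illustrative only)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    # Assumption: the first seed is drawn at random; the abstract does not say how.
    first = rng.randrange(len(points))
    seeds = [points[first]]
    # Distance from every point to its nearest seed chosen so far.
    nearest = [math.dist(p, points[first]) for p in points]
    while len(seeds) < k:
        # The next seed is the point farthest from all seeds chosen so far.
        idx = max(range(len(points)), key=nearest.__getitem__)
        seeds.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx])) for d, p in zip(nearest, points)]
    return seeds


# Toy data: three visually separated groups of two points each.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0), (9.8, 0.3)]
seeds = farthest_point_seeds(points, k=3)
```

On data like this, each seed comes from a different group whichever point is drawn first, whereas purely arbitrary selection can place two seeds in the same group. The returned seeds would then be passed to a standard k-means loop as its starting centroids.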

Author information

Corresponding author

Correspondence to Manoj Kr. Gupta.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Gupta, M.K., Chandra, P. (2020). An Efficient Approach for Selection of Initial Cluster Centroids for k-means. In: Batra, U., Roy, N., Panda, B. (eds) Data Science and Analytics. REDSET 2019. Communications in Computer and Information Science, vol 1229. Springer, Singapore. https://doi.org/10.1007/978-981-15-5827-6_1

  • DOI: https://doi.org/10.1007/978-981-15-5827-6_1

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-5826-9

  • Online ISBN: 978-981-15-5827-6

  • eBook Packages: Computer Science, Computer Science (R0)
