Partitioning and hierarchical based clustering: a comparative empirical assessment on internal and external indices, accuracy, and time

Abstract

Clustering is an unsupervised data mining technique in which exploration is carried out with little knowledge of the data classes. Its aim is to uncover hidden information in the data for effective decision-making. Although many clustering algorithms have been implemented to date, clustering remains an active topic of data mining research. Researchers attempt to explore, compare, evaluate, and improve the available clustering algorithms for specialized situations and contexts. The purpose of these efforts is to refine the algorithms and propose improved versions after statistical evaluation with different metrics. The present research empirically analyzes partitioning-based and hierarchical-based clustering algorithms through extensive experiments. The effectiveness of both algorithms has been measured through external and internal validity indices and Pearson’s correlation distance function, using anatomized experiments. The evaluation parameters considered are, for internal indices, the Silhouette index, the Davies-Bouldin validity index, and the Calinski-Harabasz index; and, for external indices, the Jaccard index, the Rand index, entropy, and normalized mutual information. The other evaluation parameters are accuracy and execution time. Based on the experiments, it may be concluded that the K-means algorithm produces more promising results than the hierarchical algorithm, except in accuracy.
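
To make two of the external indices and the distance function named in the abstract concrete, the following is a minimal sketch that uses only the Python standard library. It is not the authors' implementation. The function names (`pair_counts`, `rand_index`, `jaccard_index`, `pearson_distance`), the toy labels, and the definition of Pearson's correlation distance as 1 - r are assumptions made for illustration. The Rand and Jaccard indices are computed in their usual pair-counting form, comparing a reference labeling with a clustering result.

```python
from itertools import combinations
from math import sqrt


def pair_counts(truth, pred):
    """Classify every pair of points by whether the two labelings agree.

    Returns (ss, sd, ds, dd):
      ss: same class in truth, same cluster in pred
      sd: same class in truth, different cluster in pred
      ds: different class in truth, same cluster in pred
      dd: different in both
    """
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_truth and same_pred:
            ss += 1
        elif same_truth:
            sd += 1
        elif same_pred:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd


def rand_index(truth, pred):
    """Fraction of point pairs on which the two labelings agree."""
    ss, sd, ds, dd = pair_counts(truth, pred)
    return (ss + dd) / (ss + sd + ds + dd)


def jaccard_index(truth, pred):
    """Like the Rand index, but ignoring pairs separated in both labelings."""
    ss, sd, ds, _ = pair_counts(truth, pred)
    denom = ss + sd + ds
    # Assumed convention: if no pair is ever grouped together, return 1.0.
    return ss / denom if denom else 1.0


def pearson_distance(x, y):
    """Pearson correlation distance, assumed here to be 1 - r.

    Raises ZeroDivisionError if either vector has zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)


# Hypothetical toy data: six points, one of them placed in the wrong cluster.
truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]
print(rand_index(truth, pred), jaccard_index(truth, pred))
print(pearson_distance([1, 2, 3], [2, 4, 6]))
```

For the toy labels above, 10 of the 15 point pairs agree, so the Rand index is 10/15, while the Jaccard index is 4/9 because it discards the 6 pairs that are separated in both labelings. Under the assumed 1 - r definition, the distance ranges from 0 for perfectly correlated vectors to 2 for perfectly anti-correlated ones.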

Author information

Corresponding author

Correspondence to Syed Imtiyaz Hassan.

About this article

Cite this article

Hassan, S.I., Samad, A., Ahmad, O. et al. Partitioning and hierarchical based clustering: a comparative empirical assessment on internal and external indices, accuracy, and time. Int. j. inf. tecnol. 12, 1377–1384 (2020). https://doi.org/10.1007/s41870-019-00406-7

Keywords

  • Data mining
  • Data science
  • Machine learning
  • Clustering algorithm
  • K-means
  • Hierarchical algorithm
  • Validation indices
  • Pearson’s correlation distance