Partitioning and hierarchical based clustering: a comparative empirical assessment on internal and external indices, accuracy, and time

  • Syed Imtiyaz Hassan
  • Afreen Samad
  • Omair Ahmad
  • Afshar Alam
Original Research


Clustering is an unsupervised data mining technique in which data are explored with little prior knowledge of their classes. Its aim is to uncover hidden information in the data to support effective decision-making. Although many clustering algorithms have been implemented to date, clustering remains an active research topic in data mining. Researchers continue to explore, compare, evaluate, and improve the available clustering algorithms for specialized situations and contexts. The purpose of these efforts is to refine existing algorithms and propose improved versions after statistical evaluation with different metrics. The present research empirically analyzes partitioning-based and hierarchical clustering algorithms through extensive experiments. The effectiveness of both algorithms is measured through external and internal validity indices, with Pearson’s correlation distance function, in detailed experiments. The internal indices considered are the Silhouette index, the Davies-Bouldin validity index, and the Calinski-Harabasz index; the external indices are the Jaccard index, the Rand index, entropy, and normalized mutual information. The other evaluation parameters are accuracy and execution time. Based on the experiments, it may be concluded that the K-means algorithm produces more promising results than the hierarchical algorithm, except in accuracy.
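As a rough illustration of the kind of measures named above, the following is a minimal sketch of Pearson's correlation distance and of two pair-counting external indices (Rand and Jaccard). It is not the authors' implementation. The function names `pearson_distance`, `pair_counts`, `rand_index`, and `jaccard_index` and the toy label vectors are assumptions made for this example, and the sketch assumes non-constant input vectors and at least one pair of points sharing a class or a cluster.

```python
from itertools import combinations
from math import sqrt


def pearson_distance(x, y):
    """Pearson correlation distance, 1 - r, for two equal-length vectors.

    Assumes neither vector is constant (otherwise the variance is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / sqrt(var_x * var_y)


def pair_counts(true_labels, pred_labels):
    """Classify every pair of points by class/cluster agreement.

    Returns (same_both, same_true_only, same_pred_only, diff_both).
    """
    same_both = same_true_only = same_pred_only = diff_both = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            same_both += 1
        elif same_true:
            same_true_only += 1
        elif same_pred:
            same_pred_only += 1
        else:
            diff_both += 1
    return same_both, same_true_only, same_pred_only, diff_both


def rand_index(true_labels, pred_labels):
    """Fraction of point pairs on which the two labelings agree."""
    a, b, c, d = pair_counts(true_labels, pred_labels)
    return (a + d) / (a + b + c + d)


def jaccard_index(true_labels, pred_labels):
    """Like the Rand index, but ignoring pairs separated in both labelings."""
    a, b, c, _ = pair_counts(true_labels, pred_labels)
    return (a) / (a + b + c)


# Hypothetical toy data: known classes versus one clustering result.
true_classes = [0, 0, 0, 1, 1, 1]
clusters = [0, 0, 1, 1, 1, 1]
print("Rand index:", rand_index(true_classes, clusters))
print("Jaccard index:", jaccard_index(true_classes, clusters))
print("Pearson distance:", pearson_distance([1, 2, 3], [3, 2, 1]))
```

Both external indices compare a clustering against known class labels, so they only apply when ground truth is available. The internal indices listed above need only the data and the cluster assignments.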


Data mining, Data science, Machine learning, Clustering algorithm, K-means, Hierarchical algorithm, Validation indices, Pearson’s correlation distance



Copyright information

© Bharati Vidyapeeth's Institute of Computer Applications and Management 2019

Authors and Affiliations

  • Syed Imtiyaz Hassan (1, corresponding author)
  • Afreen Samad (1)
  • Omair Ahmad (1)
  • Afshar Alam (1)
  1. Department of Computer Science and Engineering, School of Engineering Sciences and Technology, Jamia Hamdard (Deemed to be University), New Delhi, India
