Abstract
Clustering is an unsupervised data mining technique in which data are explored with little prior knowledge of their classes. Its aim is to uncover hidden information in the data to support effective decision-making. Although many clustering algorithms have been implemented to date, clustering remains an active research topic in data mining. Researchers continue to explore, compare, evaluate, and improve the available clustering algorithms for specialized situations and contexts. The purpose of these efforts is to refine the algorithms and propose improved versions after statistical evaluation with different metrics. The present research empirically analyzes partitioning-based and hierarchical clustering algorithms through extensive experiments. The effectiveness of both algorithms has been measured with internal and external validity indices, using Pearson's correlation distance function, in detailed experiments. The internal indices considered are the Silhouette index, the Davies-Bouldin validity index, and the Calinski-Harabasz index; the external indices are the Jaccard index, the Rand index, entropy, and normalized mutual information. The other evaluation parameters are accuracy and execution time. Based on the experiments, it may be concluded that the K-means algorithm produces more promising results than the hierarchical algorithm, except in accuracy.
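As an illustration of the external indices named in the abstract, the following is a minimal sketch of the Rand and Jaccard indices computed by pair counting. It is not the authors' implementation; the function names (`pair_counts`, `rand_index`, `jaccard_index`) and the toy labels are assumptions made for this example only.

```python
from itertools import combinations


def pair_counts(truth, pred):
    """Classify every pair of points by agreement between the
    reference classes (truth) and the cluster labels (pred).

    Returns (ss, sd, ds, dd):
      ss: same class,      same cluster
      sd: same class,      different cluster
      ds: different class, same cluster
      dd: different class, different cluster
    """
    if len(truth) != len(pred):
        raise ValueError("label lists must have the same length")
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = pred[i] == pred[j]
        if same_class and same_cluster:
            ss += 1
        elif same_class:
            sd += 1
        elif same_cluster:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd


def rand_index(truth, pred):
    """Fraction of pairs on which the two labelings agree."""
    ss, sd, ds, dd = pair_counts(truth, pred)
    total = ss + sd + ds + dd
    return (ss + dd) / total if total else 1.0


def jaccard_index(truth, pred):
    """Like the Rand index, but ignores pairs separated in both labelings."""
    ss, sd, ds, _ = pair_counts(truth, pred)
    denom = ss + sd + ds
    return ss / denom if denom else 1.0


# Hypothetical toy labels: four points, two reference classes, three clusters.
truth = [0, 0, 1, 1]
pred = [0, 0, 1, 2]
print(rand_index(truth, pred))     # 5 of 6 pairs agree
print(jaccard_index(truth, pred))  # 1 of 2 counted pairs agree
```

Both indices depend only on whether pairs of points are grouped together, so they are unaffected by how the cluster labels are numbered. The pairwise loop is quadratic in the number of points, which is acceptable for a sketch but would likely be replaced by a contingency-table computation on larger data sets.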
Cite this article
Hassan, S.I., Samad, A., Ahmad, O. et al. Partitioning and hierarchical based clustering: a comparative empirical assessment on internal and external indices, accuracy, and time. Int. j. inf. tecnol. 12, 1377–1384 (2020). https://doi.org/10.1007/s41870-019-00406-7
DOI: https://doi.org/10.1007/s41870-019-00406-7