Abstract
In several contexts and domains, hierarchical agglomerative clustering (HAC) offers best-quality results, but at the price of a high complexity that limits the size of the datasets that can be handled. In some contexts in particular, computing distances between objects is the most expensive task. In this paper we propose a pruning heuristic aimed at improving performance in these cases; it is well integrated in all the phases of the HAC process and can be applied to two HAC variants: single-linkage and complete-linkage. After describing the method, we provide some theoretical evidence of its pruning power, followed by an empirical study of its effectiveness over different data domains, with a special focus on dimensionality issues.
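To make the cost model concrete, the following is a minimal sketch of a naive HAC baseline for the two variants named above, instrumented to count calls to the distance function. It is not the pruning heuristic proposed in the paper, which the abstract does not detail. The names `naive_hac`, `dist`, and `linkage` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def naive_hac(points, dist, linkage="single"):
    """Naive HAC baseline (illustrative only, no pruning).

    Returns (merges, calls): the list of (merged member indices, merge distance)
    and the number of times the possibly expensive `dist` was evaluated.
    """
    if linkage not in ("single", "complete"):
        raise ValueError("linkage must be 'single' or 'complete'")

    calls = 0
    cache = {}

    def d(i, j):
        # Evaluate each object pair at most once and count the evaluations.
        nonlocal calls
        key = (i, j) if i < j else (j, i)
        if key not in cache:
            calls += 1
            cache[key] = dist(points[i], points[j])
        return cache[key]

    # Single-linkage uses the closest pair of members, complete-linkage the farthest.
    agg = min if linkage == "single" else max

    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            cd = agg(d(i, j) for i in clusters[a] for j in clusters[b])
            if best is None or cd < best[0]:
                best = (cd, a, b)
        cd, a, b = best
        merged = clusters[a] + clusters[b]
        merges.append((sorted(merged), cd))
        clusters[a] = merged
        del clusters[b]  # safe because a < b
    return merges, calls


if __name__ == "__main__":
    pts = [0.0, 1.0, 3.0, 10.0]
    merges, calls = naive_hac(pts, lambda x, y: abs(x - y), "single")
    print(merges, calls)
```

Even with caching, this baseline evaluates every one of the n(n-1)/2 object pairs, which is the cost a pruning heuristic aims to reduce when each distance is expensive.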
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nanni, M. (2005). Speeding-Up Hierarchical Agglomerative Clustering in Presence of Expensive Metrics. In: Ho, T.B., Cheung, D., Liu, H. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2005. Lecture Notes in Computer Science, vol 3518. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11430919_45
DOI: https://doi.org/10.1007/11430919_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26076-9
Online ISBN: 978-3-540-31935-1
eBook Packages: Computer Science, Computer Science (R0)