Distance Based Fast Hierarchical Clustering Method for Large Datasets

Patra, Bidyut Kr.; Hubballi, Neminath; Biswas, Santosh; Nandi, Sukumar

doi:10.1007/978-3-642-13529-3_7

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6086)

Included in the following conference series:

International Conference on Rough Sets and Current Trends in Computing

Abstract

Average-link (AL) is a distance-based hierarchical clustering method that is not sensitive to noisy patterns. However, like all hierarchical clustering methods, AL needs to scan the dataset many times. AL has time and space complexity of O(n²), where n is the size of the dataset. These costs prohibit the use of AL for large datasets. In this paper, we propose a distance-based hierarchical clustering method, termed l-AL, which speeds up the classical AL method in any metric (vector or non-vector) space. In this scheme, the leaders clustering method is first applied to the dataset to derive a set of leaders, and AL clustering is subsequently applied to the leaders. To speed up the leaders clustering method, a reduction in the number of distance computations is also proposed. Experimental results confirm that the l-AL method is considerably faster than the classical AL method while keeping clustering results on par with it.
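The two-stage scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the leader threshold `tau`, the merge cut-off `h`, the Euclidean distance, and the unweighted treatment of leaders are all assumptions made for illustration. The reduction in distance computations mentioned in the abstract is not shown, because the abstract does not describe it.

```python
import math


def euclidean(a, b):
    """Euclidean distance; any metric could be substituted (assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def leaders(points, tau, dist=euclidean):
    """Single-scan leaders clustering.

    A point joins the first leader within distance tau; otherwise it
    becomes a new leader. Returns (leader indices, follower index lists).
    """
    leader_idx, followers = [], []
    for i, p in enumerate(points):
        for k, l in enumerate(leader_idx):
            if dist(p, points[l]) <= tau:
                followers[k].append(i)
                break
        else:
            leader_idx.append(i)
            followers.append([i])  # a leader is counted among its own followers
    return leader_idx, followers


def average_link(points, items, h, dist=euclidean):
    """Naive agglomerative average-link over the given point indices.

    Repeatedly merges the pair of clusters with the smallest average
    pairwise distance and stops when that distance exceeds h.
    """
    clusters = [[i] for i in items]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > h:
            break
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters


def l_al(points, tau, h, dist=euclidean):
    """Leaders first, then average-link on the leaders only (hypothetical helper)."""
    leader_idx, followers = leaders(points, tau, dist)
    owned = dict(zip(leader_idx, followers))
    leader_clusters = average_link(points, leader_idx, h, dist)
    # Replace each leader by the points it represents.
    return [sorted(i for l in c for i in owned[l]) for c in leader_clusters]


pts = [(0, 0), (0.1, 0), (0.2, 0.1), (5, 5), (5.1, 5), (5, 5.2)]
print(l_al(pts, tau=0.5, h=2.0))
```

Because the quadratic average-link step runs only on the leaders, its cost depends on the number of leaders rather than on n; a larger `tau` yields fewer leaders and a coarser approximation of the result on the full dataset.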

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Technology, Guwahati, Assam, 781039, India
Bidyut Kr. Patra, Neminath Hubballi, Santosh Biswas & Sukumar Nandi

Editor information

Editors and Affiliations

Institute of Mathematics, Warsaw University, Banacha 2, 02-097, Warsaw, Poland
Marcin Szczuka
ICS, Warsaw University of Technology
Marzena Kryszkiewicz
Department of Applied Computer Science, University of Winnipeg, R3B 2E9, Winnipeg, Manitoba, Canada
Sheela Ramanna
Dept. of Computer Science, The University of Wales, Aberystwyth, UK
Richard Jensen
Harbin Institute of Technology, PO Box 458, 150006, Harbin, China
Qinghua Hu

About this paper

Cite this paper

Patra, B.K., Hubballi, N., Biswas, S., Nandi, S. (2010). Distance Based Fast Hierarchical Clustering Method for Large Datasets. In: Szczuka, M., Kryszkiewicz, M., Ramanna, S., Jensen, R., Hu, Q. (eds) Rough Sets and Current Trends in Computing. RSCTC 2010. Lecture Notes in Computer Science, vol 6086. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13529-3_7

DOI: https://doi.org/10.1007/978-3-642-13529-3_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13528-6
Online ISBN: 978-3-642-13529-3
eBook Packages: Computer Science, Computer Science (R0)
