Efficient Parallel Hierarchical Clustering

Dash, Manoranjan; Petrutiu, Simona; Scheuermann, Peter

doi:10.1007/978-3-540-27866-5_47

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3149)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

Hierarchical agglomerative clustering (HAC) is a common clustering method that outputs a dendrogram showing all N levels of agglomeration, where N is the number of objects in the data set. High time and memory complexity are major bottlenecks in applying it to real-world problems. Parallel algorithms have been proposed in the literature to overcome these limitations, but, as this paper shows, existing parallel HAC algorithms are inefficient because they partition the data ineffectively. We first show that HAC follows a rule: most agglomerations have very small dissimilarity, and only a small portion towards the end have large dissimilarity. Partially overlapping partitioning (POP) exploits this principle to obtain efficient yet accurate HAC algorithms, reducing the total number of dissimilarities by a factor close to the number of cells in the partition. We present pPOP, the parallel version of POP, implemented on a shared-memory multiprocessor architecture. Extensive theoretical analysis and experimental results show that pPOP gives close to linear speedup and significantly outperforms existing parallel algorithms in both CPU time and memory requirements.
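The two quantitative claims in the abstract can be illustrated with a small sketch. This is not the authors' POP or pPOP implementation, which the abstract does not describe in detail. It assumes a naive single-link HAC on one-dimensional points, and a hypothetical even split into cells that ignores the overlap regions POP uses. The function names `hac_merge_heights`, `pair_count`, and `partitioned_pair_count` are illustrative only.

```python
from itertools import combinations


def hac_merge_heights(points):
    """Naive single-link HAC on 1-D points.

    Returns the dissimilarity at which each merge happens, in merge order.
    Illustrative only: every step rescans all cluster pairs.
    """
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        best = None  # (dissimilarity, i, j) of the closest pair so far
        for i, j in combinations(range(len(clusters)), 2):
            # Single link: distance between the closest members.
            d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i (i < j)
        del clusters[j]
        heights.append(d)
    return heights


def pair_count(n):
    """Number of pairwise dissimilarities among n objects."""
    return n * (n - 1) // 2


def partitioned_pair_count(n, cells):
    """Pairwise dissimilarities when n objects are split evenly into cells.

    Assumption: overlap between neighbouring cells is ignored.
    """
    base, extra = divmod(n, cells)
    sizes = [base + 1] * extra + [base] * (cells - extra)
    return sum(pair_count(s) for s in sizes)


if __name__ == "__main__":
    # Three well-separated groups: most merges are cheap, the last few are not.
    print(hac_merge_heights([0, 1, 2, 50, 51, 52, 100, 101]))
    # Reduction factor for 1000 objects in 10 cells.
    print(pair_count(1000) / partitioned_pair_count(1000, 10))
```

On the toy data, five of the seven merges happen at dissimilarity 1 and only the last two at 48, which mirrors the rule the abstract states. For 1000 objects in 10 cells, the pair count drops from 499500 to 49500, a factor slightly above 10, in line with the claim that the reduction is close to the number of cells.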

Author information

Authors and Affiliations

Department of Information Systems, School of Computer Engineering, Nanyang Technological University, Singapore, 639798
Manoranjan Dash
Department of Electrical & Computer Engineering, Northwestern University, Evanston, IL, 60208, USA
Simona Petrutiu & Peter Scheuermann

Editor information

Editors and Affiliations

No Affiliations,
Marco Danelutto
Computer Science Department, University of Pisa, Largo B. Pontecorvo 3, 56127, Pisa, Italy
Marco Vanneschi
Information Science and Technologies Institute (ISTI) The Italian National Research Council (CNR), Area della Ricerca, Via Giuseppe Moruzzi, 1, I-56126, Pisa, Italy
Domenico Laforenza

Cite this paper

Dash, M., Petrutiu, S., Scheuermann, P. (2004). Efficient Parallel Hierarchical Clustering. In: Danelutto, M., Vanneschi, M., Laforenza, D. (eds) Euro-Par 2004 Parallel Processing. Euro-Par 2004. Lecture Notes in Computer Science, vol 3149. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-27866-5_47

DOI: https://doi.org/10.1007/978-3-540-27866-5_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22924-7
Online ISBN: 978-3-540-27866-5
eBook Packages: Springer Book Archive
