A Framework for Parallelizing Hierarchical Clustering Methods

  • Conference paper
Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11906)

Abstract

Hierarchical clustering is a fundamental tool in data mining, machine learning and statistics. Popular hierarchical clustering algorithms include top-down divisive approaches such as bisecting k-means, k-median, and k-center and bottom-up agglomerative approaches such as single-linkage, average-linkage, and centroid-linkage. Unfortunately, only a few scalable hierarchical clustering algorithms are known, mostly based on the single-linkage algorithm. So, as datasets increase in size every day, there is a pressing need to scale other popular methods.

We introduce efficient distributed algorithms for bisecting k-means, k-median, and k-center, as well as for centroid-linkage. In particular, we first formalize a notion of closeness for a hierarchical clustering algorithm, and then use this notion to design new scalable distributed methods with strong worst-case bounds on the running time and the quality of the solutions. Finally, we show experimentally that the introduced algorithms are efficient and close to their sequential variants in practice.
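For orientation, the following is a minimal sequential sketch of bisecting k-means, one of the top-down methods named in the abstract. It is not the distributed algorithm of the paper, only an illustration of the kind of baseline the abstract refers to as a sequential variant. The names `Node`, `two_means`, `bisecting_kmeans`, and `leaves` are hypothetical, and the 2-clustering step assumes a plain Lloyd's heuristic with random initial centers.

```python
import random
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    """A cluster in the hierarchy; a leaf holds exactly one point."""
    points: List[Point]
    children: List["Node"] = field(default_factory=list)


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty set of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def two_means(points: List[Point], rng: random.Random,
              iters: int = 20) -> Tuple[List[Point], List[Point]]:
    """Split points into two non-empty parts with Lloyd's heuristic (k = 2).

    Assumption: random initial centers and a fixed iteration budget.
    """
    c0, c1 = rng.sample(points, 2)
    left: List[Point] = []
    right: List[Point] = []
    for _ in range(iters):
        left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        if not left or not right:
            break
        c0, c1 = centroid(left), centroid(right)
    if not left or not right:
        # Degenerate case (e.g. duplicate points): fall back to an arbitrary split.
        half = len(points) // 2
        left, right = points[:half], points[half:]
    return left, right


def bisecting_kmeans(points: Sequence[Point], seed: int = 0) -> Node:
    """Build the hierarchy top-down by recursively splitting each cluster in two."""
    rng = random.Random(seed)

    def build(pts: List[Point]) -> Node:
        node = Node(pts)
        if len(pts) > 1:
            left, right = two_means(pts, rng)
            node.children = [build(left), build(right)]
        return node

    return build(list(points))


def leaves(node: Node) -> List[Point]:
    """Collect the points stored at the leaves of the hierarchy."""
    if not node.children:
        return list(node.points)
    return [p for child in node.children for p in leaves(child)]


# Two well-separated pairs of points; the top-level split should separate them.
pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
tree = bisecting_kmeans(pts)
print([sorted(child.points) for child in tree.children])
```

The recursion makes one split per internal node, so a chain of unbalanced splits yields a hierarchy of linear depth. This sketch does nothing to avoid that.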

Notes

  1. This model is widely used to capture the class of algorithms that scale in frameworks such as Spark and MapReduce.

  2. We can remove this assumption by adding a small perturbation to every point.

  3. Note that the guarantee is on each single choice made by the algorithm, not on all the choices together.

  4. In prior work, Yaroslavtsev and Vadapalli [36] give an algorithm for single-linkage clustering with constant-dimensional Euclidean input that fits within our framework.

  5. Consider an example where the optimal 2-clustering separates only 1 point at a time.

  6. By the generalized triangle inequality this is true for \(p=1,2\), and it is also true for \(p=\infty\). So this is true for the cost of k-center, k-means and k-median.

  7. It is possible to construct worst-case instances where the minimum distance \(\delta\) can decrease between iterations of the while loop.

  8. In order to guarantee this second invariant, our algorithm must be allowed to make merges at distance \(O(\log^2(n)\,\delta)\).
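Note 6 invokes the generalized triangle inequality without stating it. A standard form, assumed here since the note does not spell it out, is the following: for points \(x, y, z\) in a metric space with distance \(d\) and any \(p \ge 1\),

\[
d(x,z)^p \;\le\; \bigl(d(x,y) + d(y,z)\bigr)^p \;\le\; 2^{p-1}\bigl(d(x,y)^p + d(y,z)^p\bigr).
\]

The first step is the ordinary triangle inequality. The second follows from convexity of \(t \mapsto t^p\), which gives \(\bigl(\tfrac{a+b}{2}\bigr)^p \le \tfrac{a^p + b^p}{2}\) for \(a, b \ge 0\). For \(p=1\) this is the triangle inequality itself, and for \(p=2\) it reads \(d(x,z)^2 \le 2\bigl(d(x,y)^2 + d(y,z)^2\bigr)\). For \(p=\infty\), where the cost is a maximum over distances, the ordinary triangle inequality applies directly.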

References

  1. Andoni, A., Nikolov, A., Onak, K., Yaroslavtsev, G.: Parallel algorithms for geometric graph problems. In: Symposium on Theory of Computing (STOC) (2014)

  2. Bader, D.A., Cong, G.: Fast shared-memory algorithms for computing the minimum spanning forest of sparse graphs. J. Parallel Distrib. Comput. 66(11), 1366–1378 (2006)

  3. Balcan, M., Ehrlich, S., Liang, Y.: Distributed k-means and k-median clustering on general communication topologies. In: NIPS, pp. 1995–2003 (2013)

  4. Bateni, M., Behnezhad, S., Derakhshan, M., Hajiaghayi, M., Lattanzi, S., Mirrokni, V.: On distributed hierarchical clustering. In: NIPS 2017 (2017)

  5. Bateni, M., Bhaskara, A., Lattanzi, S., Mirrokni, V.S.: Distributed balanced clustering via mapping coresets. In: NIPS 2014 (2014)

  6. Charikar, M., Chatziafratis, V.: Approximate hierarchical clustering via sparsest cut and spreading metrics. In: SODA 2017 (2017)

  7. Charikar, M., Chekuri, C., Feder, T., Motwani, R.: Incremental clustering and dynamic information retrieval. SICOMP 33(6), 1417–1440 (2004)

  8. Charikar, M., Guha, S., Tardos, É., Shmoys, D.B.: A constant-factor approximation algorithm for the k-median problem. J. Comput. Syst. Sci. 65(1), 129–149 (2002)

  9. Dasgupta, S.: A cost function for similarity-based hierarchical clustering. In: STOC 2016, pp. 118–127 (2016)

  10. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: SoCG 2004 (2004)

  11. Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.: Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 95(25), 14863–14868 (1998)

  12. Ene, A., Im, S., Moseley, B.: Fast clustering using MapReduce. In: KDD, pp. 681–689 (2011)

  13. Gower, J.C., Ross, G.J.S.: Minimum spanning trees and single linkage cluster analysis. J. Roy. Stat. Soc. Ser. C (Appl. Stat.) 18(1), 54–64 (1969)

  14. Olson, C.F.: Parallel algorithms for hierarchical clustering. Parallel Comput. 21(8), 1313–1325 (1995)

  15. Hastie, T., Tibshirani, R., Friedman, J.: Unsupervised learning. In: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. SSS, pp. 485–585. Springer, New York (2009). https://doi.org/10.1007/978-0-387-84858-7_14

  16. Heller, K.A., Ghahramani, Z.: Bayesian hierarchical clustering. In: ICML 2005, pp. 297–304 (2005)

  17. Hochbaum, D.S., Shmoys, D.B.: A unified approach to approximation algorithms for bottleneck problems. J. ACM 33(3), 533–550 (1986)

  18. Im, S., Moseley, B., Sun, X.: Efficient massively parallel methods for dynamic programming. In: STOC 2017 (2017)

  19. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: STOC 1998, pp. 604–613 (1998)

  20. Jain, A.K.: Data clustering: 50 years beyond k-means. Pattern Recogn. Lett. 31(8), 651–666 (2010)

  21. Jin, C., Chen, Z., Hendrix, W., Agrawal, A., Choudhary, A.N.: Incremental, distributed single-linkage hierarchical clustering algorithm using MapReduce. In: HPC 2015, pp. 83–92 (2015)

  22. Jin, C., Liu, R., Chen, Z., Hendrix, W., Agrawal, A., Choudhary, A.N.: A scalable hierarchical clustering algorithm using Spark. In: BigDataService 2015, pp. 418–426 (2015)

  23. Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: A local search approximation algorithm for k-means clustering. Comput. Geom. 28(2–3), 89–112 (2004)

  24. Karloff, H.J., Suri, S., Vassilvitskii, S.: A model of computation for MapReduce. In: SODA 2010, pp. 938–948 (2010)

  25. Krishnamurthy, A., Balakrishnan, S., Xu, M., Singh, A.: Efficient active algorithms for hierarchical clustering. In: ICML 2012 (2012)

  26. Lattanzi, S., Moseley, B., Suri, S., Vassilvitskii, S.: Filtering: a method for solving graph problems in MapReduce. In: SPAA 2011 (Co-located with FCRC 2011), pp. 85–94 (2011)

  27. Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml

  28. Mao, Q., Zheng, W., Wang, L., Cai, Y., Mai, V., Sun, Y.: Parallel hierarchical clustering in linearithmic time for large-scale sequence analysis. In: 2015 IEEE International Conference on Data Mining, pp. 310–319, November 2015

  29. Murtagh, F., Contreras, P.: Algorithms for hierarchical clustering: an overview. Wiley Interdisc. Rev.: Data Min. Discov. 2(1), 86–97 (2012)

  30. Qin, L., Yu, J.X., Chang, L., Cheng, H., Zhang, C., Lin, X.: Scalable big graph processing in MapReduce. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. SIGMOD 2014, pp. 827–838 (2014)

  31. Rajasekaran, S.: Efficient parallel hierarchical clustering algorithms. IEEE Trans. Parallel Distrib. Syst. 16(6), 497–502 (2005)

  32. Roy, A., Pokutta, S.: Hierarchical clustering via spreading metrics. In: NIPS 2016, pp. 2316–2324 (2016)

  33. Spark (2014). https://spark.apache.org/docs/2.1.1/mllib-clustering.html

  34. Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. In: KDD Workshop on Text Mining (2000)

  35. Wang, J., Moseley, B.: Approximation bounds for hierarchical clustering: average-linkage, bisecting k-means, and local search. In: NIPS (2017)

  36. Yaroslavtsev, G., Vadapalli, A.: Massively parallel algorithms and hardness for single-linkage clustering under lp distances. In: ICML 2018, pp. 5596–5605 (2018)

Author information

Corresponding author

Correspondence to Thomas Lavastida.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Lattanzi, S., Lavastida, T., Lu, K., Moseley, B. (2020). A Framework for Parallelizing Hierarchical Clustering Methods. In: Brefeld, U., Fromont, E., Hotho, A., Knobbe, A., Maathuis, M., Robardet, C. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Lecture Notes in Computer Science, vol 11906. Springer, Cham. https://doi.org/10.1007/978-3-030-46150-8_5

  • DOI: https://doi.org/10.1007/978-3-030-46150-8_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-46149-2

  • Online ISBN: 978-3-030-46150-8

  • eBook Packages: Computer Science, Computer Science (R0)
