Abstract
Hierarchical clustering is a fundamental tool in data mining, machine learning, and statistics. Popular hierarchical clustering algorithms include top-down divisive approaches such as bisecting k-means, k-median, and k-center, and bottom-up agglomerative approaches such as single-linkage, average-linkage, and centroid-linkage. Unfortunately, only a few scalable hierarchical clustering algorithms are known, and they are mostly based on single-linkage. As datasets grow larger every day, there is a pressing need to scale the other popular methods as well.
We introduce efficient distributed algorithms for bisecting k-means, k-median, and k-center, as well as for centroid-linkage. In particular, we first formalize a notion of closeness for a hierarchical clustering algorithm, and then we use this notion to design new scalable distributed methods with strong worst-case bounds on the running time and the quality of the solutions. Finally, we show experimentally that the introduced algorithms are efficient and close to their sequential variants in practice.
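For context on the first family of methods above, here is a minimal sketch of classic sequential bisecting k-means (cf. Steinbach et al., cited in the references), the top-down procedure that the paper's distributed framework scales. The function name, the largest-cluster splitting rule, and the use of scikit-learn's KMeans as the 2-clustering subroutine are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(points: np.ndarray, k: int) -> list:
    """Top-down divisive clustering: repeatedly bisect one cluster in two."""
    clusters = [np.arange(len(points))]  # index sets; start with everything
    while len(clusters) < k:
        # Pick a splittable cluster; here, the largest one with > 1 point.
        splittable = [i for i, c in enumerate(clusters) if len(c) > 1]
        if not splittable:
            break
        idx = max(splittable, key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        # Bisect it with a 2-means subroutine.
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(points[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters  # recording each split yields the hierarchy (dendrogram)
```

Swapping the 2-means subroutine for a 2-median or 2-center solver yields the other divisive variants the abstract mentions.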
Notes
1. This model is widely used to capture the class of algorithms that scale in frameworks such as Spark and MapReduce.
2. We can remove this assumption by adding a small perturbation to every point; see the sketch after this list.
3. Note that the guarantee is on each individual choice made by the algorithm, not on all the choices together.
4. In prior work, Yaroslavtsev and Vadapalli [36] give an algorithm for single-linkage clustering with constant-dimensional Euclidean input that fits within our framework.
5. Consider an example where the optimal 2-clustering separates only one point at a time.
6. By the generalized triangle inequality, this holds for \(p=1,2\), and it also holds for \(p=\infty\); hence it holds for the costs of k-median, k-means, and k-center. One standard form of the inequality is spelled out after this list.
7. It is possible to construct worst-case instances where the minimum distance \(\delta\) can decrease between iterations of the while loop.
8. In order to guarantee this second invariant, our algorithm must be allowed to make merges at distance \(O(\log^2(n)\,\delta)\).
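As referenced in note 2, a minimal sketch of the perturbation trick: adding an independent tiny offset to every coordinate makes all pairwise distances distinct with probability 1 while changing the geometry negligibly. The scale parameter eps and the helper name are illustrative assumptions, not values from the paper.

```python
import numpy as np

def perturb(points: np.ndarray, eps: float = 1e-9, seed: int = 0) -> np.ndarray:
    """Add a tiny uniform offset to each coordinate to break distance ties."""
    rng = np.random.default_rng(seed)
    return points + rng.uniform(-eps, eps, size=points.shape)
```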
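And, as referenced in note 6, one standard form of the generalized triangle inequality; the constant \(2^{p-1}\) is the textbook bound and may differ from the exact constants used in the paper.

```latex
% For any points x, y, z in a metric space and p >= 1,
% convexity of t -> t^p gives
\[
  d(x,z)^p \;\le\; \bigl(d(x,y) + d(y,z)\bigr)^p
           \;\le\; 2^{p-1}\bigl(d(x,y)^p + d(y,z)^p\bigr).
\]
% p = 1 is the usual triangle inequality (k-median cost), p = 2 holds
% with constant 2 (k-means cost), and letting p -> infinity yields
% d(x,z) <= 2 max{ d(x,y), d(y,z) } (k-center cost).
```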
References
Andoni, A., Nikolov, A., Onak, K., Yaroslavtsev, G.: Parallel algorithms for geometric graph problems. In: Symposium on Theory of Computing (STOC) (2014)
Bader, D.A., Cong, G.: Fast shared-memory algorithms for computing the minimum spanning forest of sparse graphs. J. Parallel Distrib. Comput. 66(11), 1366–1378 (2006)
Balcan, M., Ehrlich, S., Liang, Y.: Distributed k-means and k-median clustering on general communication topologies. In: NIPS, pp. 1995–2003 (2013)
Bateni, M., Behnezhad, S., Derakhshan, M., Hajiaghayi, M., Lattanzi, S., Mirrokni, V.: On distributed hierarchical clustering. In: NIPS 2017 (2017)
Bateni, M., Bhaskara, A., Lattanzi, S., Mirrokni, V.S.: Distributed balanced clustering via mapping coresets. In: NIPS 2014 (2014)
Charikar, M., Chatziafratis, V.: Approximate hierarchical clustering via sparsest cut and spreading metrics. In: SODA 2017 (2017)
Charikar, M., Chekuri, C., Feder, T., Motwani, R.: Incremental clustering and dynamic information retrieval. SICOMP 33(6), 1417–1440 (2004)
Charikar, M., Guha, S., Tardos, É., Shmoys, D.B.: A constant-factor approximation algorithm for the k-median problem. J. Comput. Syst. Sci. 65(1), 129–149 (2002)
Dasgupta, S.: A cost function for similarity-based hierarchical clustering. In: STOC 2016, pp. 118–127 (2016)
Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: SoCG 2004 (2004)
Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.: Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 95(25), 14863–14868 (1998)
Ene, A., Im, S., Moseley, B.: Fast clustering using MapReduce. In: KDD, pp. 681–689 (2011)
Gower, J.C., Ross, G.J.S.: Minimum spanning trees and single linkage cluster analysis. J. Roy. Stat. Soc. Ser. C (Appl. Stat.) 18(1), 54–64 (1969)
Olson, C.F.: Parallel algorithms for hierarchical clustering. Parallel Comput. 21(8), 1313–1325 (1995)
Hastie, T., Tibshirani, R., Friedman, J.: Unsupervised learning. In: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. SSS, pp. 485–585. Springer, New York (2009). https://doi.org/10.1007/978-0-387-84858-7_14
Heller, K.A., Ghahramani, Z.: Bayesian hierarchical clustering. In: ICML 2005, pp. 297–304 (2005)
Hochbaum, D.S., Shmoys, D.B.: A unified approach to approximation algorithms for bottleneck problems. J. ACM 33(3), 533–550 (1986)
Im, S., Moseley, B., Sun, X.: Efficient massively parallel methods for dynamic programming. In: STOC 2017 (2017)
Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: STOC 1998, pp. 604–613 (1998)
Jain, A.K.: Data clustering: 50 years beyond k-means. Pattern Recogn. Lett. 31(8), 651–666 (2010)
Jin, C., Chen, Z., Hendrix, W., Agrawal, A., Choudhary, A.N.: Incremental, distributed single-linkage hierarchical clustering algorithm using MapReduce. In: HPC 2015, pp. 83–92 (2015)
Jin, C., Liu, R., Chen, Z., Hendrix, W., Agrawal, A., Choudhary, A.N.: A scalable hierarchical clustering algorithm using spark. In: BigDataService 2015, pp. 418–426 (2015)
Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: A local search approximation algorithm for k-means clustering. Comput. Geom. 28(2–3), 89–112 (2004)
Karloff, H.J., Suri, S., Vassilvitskii, S.: A model of computation for MapReduce. In: SODA 2010, pp. 938–948 (2010)
Krishnamurthy, A., Balakrishnan, S., Xu, M., Singh, A.: Efficient active algorithms for hierarchical clustering. In: ICML 2012 (2012)
Lattanzi, S., Moseley, B., Suri, S., Vassilvitskii, S.: Filtering: a method for solving graph problems in MapReduce. In: SPAA 2011 (Co-located with FCRC 2011), pp. 85–94 (2011)
Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml
Mao, Q., Zheng, W., Wang, L., Cai, Y., Mai, V., Sun, Y.: Parallel hierarchical clustering in linearithmic time for large-scale sequence analysis. In: 2015 IEEE International Conference on Data Mining, pp. 310–319, November 2015
Murtagh, F., Contreras, P.: Algorithms for hierarchical clustering: an overview. Wiley Interdisc. Rev.: Data Min. Knowl. Discov. 2(1), 86–97 (2012)
Qin, L., Yu, J.X., Chang, L., Cheng, H., Zhang, C., Lin, X.: Scalable big graph processing in MapReduce. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. SIGMOD 2014, pp. 827–838 (2014)
Rajasekaran, S.: Efficient parallel hierarchical clustering algorithms. IEEE Trans. Parallel Distrib. Syst. 16(6), 497–502 (2005)
Roy, A., Pokutta, S.: Hierarchical clustering via spreading metrics. In: NIPS 2016, pp. 2316–2324 (2016)
Spark (2014). https://spark.apache.org/docs/2.1.1/mllib-clustering.html
Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. In: KDD Workshop on Text Mining (2000)
Wang, J., Moseley, B.: Approximation bounds for hierarchical clustering: average-linkage, bisecting k-means, and local search. In: NIPS (2017)
Yaroslavtsev, G., Vadapalli, A.: Massively parallel algorithms and hardness for single-linkage clustering under \(\ell_p\) distances. In: ICML 2018, pp. 5596–5605 (2018)
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Lattanzi, S., Lavastida, T., Lu, K., Moseley, B. (2020). A Framework for Parallelizing Hierarchical Clustering Methods. In: Brefeld, U., Fromont, E., Hotho, A., Knobbe, A., Maathuis, M., Robardet, C. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Lecture Notes in Computer Science, vol 11906. Springer, Cham. https://doi.org/10.1007/978-3-030-46150-8_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46149-2
Online ISBN: 978-3-030-46150-8