A Distributed Shared Nearest Neighbors Clustering Algorithm

  • Juan Zamora
  • Héctor Allende-Cid
  • Marcelo Mendoza
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10657)

Abstract

Current data processing tasks require efficient approaches capable of dealing with large databases. A promising strategy consists of distributing the data across several computers, each of which solves part of the problem. These partial answers are then integrated to obtain a final solution. We introduce the Distributed Shared Nearest Neighbor clustering algorithm (D-SNN), which works with disjoint partitions of the data and produces a global clustering solution whose performance is competitive with centralized approaches. Our algorithm is suited to large-scale problems (e.g., text clustering) where the data cannot be handled by a single machine due to memory constraints. Experimental results over five data sets show that our proposal is competitive in terms of standard clustering quality measures.
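
As an illustration of the building block named in the title, the following minimal sketch computes shared nearest neighbor (SNN) similarity inside a single data partition. It follows the classical definition, in which two points are similar only if each appears in the other's k-nearest-neighbor list, and their similarity is the number of neighbors they share. The function names, the brute-force neighbor search, the toy points, and the choice of k are all assumptions made for this example. The abstract does not describe how D-SNN integrates the partial results, so the sketch stops at the per-partition step and should not be read as the authors' implementation.

```python
from math import dist


def knn_lists(points, k):
    """Return, for each point index, the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        # Brute-force search: rank every other point by Euclidean distance to p.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        neighbors.append(set(others[:k]))
    return neighbors


def snn_similarity(neighbors, i, j):
    """Number of shared neighbors, or 0 if i and j are not mutual k-nearest neighbors."""
    if i not in neighbors[j] or j not in neighbors[i]:
        return 0
    return len(neighbors[i] & neighbors[j])


# Hypothetical partition held by one machine: four nearby points and one far away.
partition = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
nn = knn_lists(partition, k=3)

print(snn_similarity(nn, 0, 1))  # points 0 and 1 share neighbors 2 and 3
print(snn_similarity(nn, 0, 4))  # point 4 is not among point 0's neighbors
```

In a distributed setting, each machine could run this step on its own disjoint partition, so that no machine needs the full data set in memory.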

Keywords

Clustering, Distributed algorithm, Shared Nearest Neighbors

Notes

Acknowledgments

Juan Zamora is supported by a postdoctoral project from Pontificia Universidad Católica de Valparaíso. Héctor Allende-Cid is supported by project FONDECYT initiation into research 11150248. Marcelo Mendoza was supported by project Basal FB0821.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Juan Zamora (1)
  • Héctor Allende-Cid (1)
  • Marcelo Mendoza (2, 3)
  1. Pontificia Universidad Católica de Valparaíso, Valparaíso, Chile
  2. Universidad Técnica Federico Santa María, Santiago, Chile
  3. Centro Científico y Tecnológico de Valparaíso, Valparaíso, Chile