Performance Evaluation of a Distributed Clustering Approach for Spatial Datasets

  • Malika Bendechache
  • Nhien-An Le-Khac
  • M-Tahar Kechadi
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 845)

Abstract

The analysis of big data requires powerful, scalable, and accurate data analytics techniques, which traditional data mining and machine learning methods, taken as a whole, do not provide. New data analytics frameworks are therefore needed to deal with big data challenges such as the volume, velocity, veracity, and variety of the data. Distributed data mining is a promising approach for big datasets, because such data are usually produced at distributed locations, and processing them at their local sites significantly reduces response times, communication, and similar costs. In this paper, we study the performance of a distributed clustering approach called Dynamic Distributed Clustering (DDC). DDC generates clusters remotely, at the sites where the data reside, and then aggregates them using an efficient aggregation algorithm. The technique is developed for spatial datasets. We evaluated DDC using two types of communication (synchronous and asynchronous) and tested it under various load distributions. The experimental results show that the approach achieves super-linear speed-up, scales up very well, and can take advantage of recent programming models, such as the MapReduce model, because its results are not affected by the type of communication.
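The two-phase pattern described above (cluster locally at each site, then aggregate the local clusters) can be illustrated with a minimal sketch. This is not the authors' DDC algorithm, whose local clustering and aggregation steps are not specified in this abstract. Everything below is an assumption made for illustration: the function names `local_clusters` and `aggregate`, the distance threshold `eps`, the use of 1-D points in place of real spatial data, and the toy values in `nodes`.

```python
def local_clusters(points, eps):
    """Phase 1 (per node): group 1-D points so that consecutive
    sorted points no more than eps apart share a cluster."""
    clusters = []
    for p in sorted(points):
        if clusters and p - clusters[-1][-1] <= eps:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters


def aggregate(all_clusters, eps):
    """Phase 2 (global): merge local clusters whose value ranges
    overlap or lie within eps of each other."""
    intervals = sorted((min(c), max(c), c) for c in all_clusters)
    merged = []
    for lo, hi, members in intervals:
        if merged and lo - merged[-1][1] <= eps:
            merged[-1][1] = max(merged[-1][1], hi)
            merged[-1][2].extend(members)
        else:
            merged.append([lo, hi, list(members)])
    return [sorted(m[2]) for m in merged]


# Hypothetical toy data: two nodes, each holding part of the dataset.
nodes = [[1.0, 1.2, 5.0], [1.4, 5.1, 9.0]]
eps = 0.5

# Each node clusters only its own data; only the clusters are exchanged.
local = [c for node in nodes for c in local_clusters(node, eps)]
merged_clusters = aggregate(local, eps)
print(merged_clusters)
```

In this simplified 1-D setting the aggregated result matches what clustering all points at a single site would give, which is the property a distributed scheme of this kind aims for. Real spatial data would need a 2-D clustering method and a richer cluster representation than a value range.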

Keywords

  • Distributed data mining
  • Distributed computing
  • Synchronous communication
  • Asynchronous communication
  • Spatial data mining
  • Super-speedup

Notes

Acknowledgement

This research was conducted at the Insight Centre for Data Analytics, which is supported by Science Foundation Ireland under Grant Number SFI/12/RC/2289.

Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  • Malika Bendechache
    • 1
  • Nhien-An Le-Khac
    • 2
  • M-Tahar Kechadi
    • 1
  1. Insight Centre for Data Analytics, University College Dublin, Belfield, Dublin 04, Ireland
  2. University College Dublin, Belfield, Dublin 04, Ireland
