DBDC: Density Based Distributed Clustering
Clustering has become an increasingly important task in modern application domains such as marketing and purchasing assistance, multimedia, and molecular biology, among many others. In most of these areas, the data are originally collected at different sites. To extract information from these data, they are typically merged at a central site and then clustered. In this paper, we propose a different approach. We cluster the data locally and extract suitable representatives from these clusters. These representatives are sent to a global server site, where we restore the complete clustering based on the local representatives. This approach is very efficient because the local clusterings can be carried out quickly and independently of one another. Furthermore, the transmission cost is low, as the number of transmitted representatives is much smaller than the cardinality of the complete data set. Based on this small number of representatives, the global clustering can be done very efficiently. For both the local and the global clustering, we use a density-based clustering algorithm. The combination of the local and the global clustering forms our new DBDC (Density Based Distributed Clustering) algorithm. Furthermore, we discuss the complex problem of finding a suitable quality measure for evaluating distributed clusterings. We introduce two quality criteria, which are compared to each other and which allow us to evaluate the quality of our DBDC algorithm. In our experimental evaluation, we show that we do not have to sacrifice clustering quality in order to gain an efficiency advantage when using our distributed clustering approach.
Keywords: Cluster Algorithm, Central Cluster, Local Model, Local Cluster, Local Representative
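The local-then-global scheme outlined in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how representatives are chosen, so the sketch assumes, purely for illustration, that each local cluster is represented by its centroid. The function names (`local_model`, `global_model`, `relabel`), the parameter `eps_global`, and the toy data are likewise hypothetical. Both stages reuse the same plain quadratic-time DBSCAN.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Plain O(n^2) DBSCAN; returns one label per point, NOISE for outliers."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: claimed, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_pts:     # core point: keep expanding
                queue.extend(nj)
        cluster += 1
    return labels

def centroid(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def local_model(points, eps, min_pts):
    """Cluster one site's data; return (labels, {local id: representative}).
    Assumption: the representative is the cluster centroid."""
    labels = dbscan(points, eps, min_pts)
    reps = {}
    for cid in set(labels) - {NOISE}:
        members = [p for p, l in zip(points, labels) if l == cid]
        reps[cid] = centroid(members)
    return labels, reps

def global_model(site_reps, eps_global):
    """Server side: cluster all representatives and map
    (site index, local cluster id) -> global cluster id."""
    keys = [(s, cid) for s, reps in enumerate(site_reps) for cid in reps]
    rep_points = [site_reps[s][cid] for s, cid in keys]
    # min_pts=1 makes every representative a core point, so none is noise.
    return dict(zip(keys, dbscan(rep_points, eps_global, min_pts=1)))

def relabel(site, labels, mapping):
    """Client side: translate local cluster ids into global ones."""
    return [NOISE if l == NOISE else mapping[(site, l)] for l in labels]

# Toy data: the first four points of each site lie close together across sites.
site_a = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
site_b = [(2, 0), (2, 1), (3, 0), (3, 1), (20, 20), (20, 21), (21, 20), (21, 21)]

labels_a, reps_a = local_model(site_a, eps=1.5, min_pts=3)
labels_b, reps_b = local_model(site_b, eps=1.5, min_pts=3)
mapping = global_model([reps_a, reps_b], eps_global=3.0)
global_a = relabel(0, labels_a, mapping)
global_b = relabel(1, labels_b, mapping)
```

In this toy run only four representatives are sent to the server instead of sixteen points, and the two local clusters near the origin receive the same global label. A centroid is a poor summary of an arbitrarily shaped density-based cluster, so it should be read as a placeholder for whatever representative scheme is actually used.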