Collective, Hierarchical Clustering from Distributed, Heterogeneous Data

  • Erik L. Johnson
  • Hillol Kargupta
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1759)


This paper presents the Collective Hierarchical Clustering (CHC) algorithm for analyzing distributed, heterogeneous data. The algorithm first generates local cluster models and then combines them into a global cluster model of the data. It runs in O(|S|n²) time, with an O(|S|n) space requirement and an O(n) communication requirement, where n is the number of elements in the data set and |S| is the number of data sites. This is a significant improvement over naive methods, which incur O(n²) communication costs when the entire distance matrix is transmitted and O(nm) communication costs when the data are centralized, where m is the total number of features. A specific implementation based on single-link clustering is presented, together with results comparing its performance with that of a centralized clustering algorithm and an analysis of the algorithm's complexity in terms of overall computation time and communication requirements.
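
The communication figures above can be made concrete with a small sketch. It is not the CHC algorithm itself: it only builds the kind of single-link model a site could compute locally and counts how many numbers each strategy would have to transmit. The function names, the use of Prim's minimum spanning tree to obtain the single-link merges, and the encoding of each merge as three numbers (two indices and a distance) are assumptions made for illustration, not details taken from the paper.

```python
import math


def single_link_merges(points):
    """Return the n-1 single-link merges as (i, j, distance) tuples.

    Hypothetical helper: uses Prim's minimum spanning tree, whose edges
    sorted by length give the single-link merge order. Runs in O(n^2) time.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # shortest known distance to the tree
    best_from = [-1] * n         # tree vertex realising that distance
    in_tree[0] = True
    for j in range(1, n):
        best_dist[j] = math.dist(points[0], points[j])
        best_from[j] = 0
    edges = []
    for _ in range(n - 1):
        # Attach the vertex outside the tree that is closest to it.
        k = min((j for j in range(n) if not in_tree[j]),
                key=lambda j: best_dist[j])
        edges.append((best_from[k], k, best_dist[k]))
        in_tree[k] = True
        for j in range(n):
            if not in_tree[j]:
                d = math.dist(points[k], points[j])
                if d < best_dist[j]:
                    best_dist[j] = d
                    best_from[j] = k
    return sorted(edges, key=lambda e: e[2])


def communication_costs(n, m):
    """Count transmitted numbers per strategy (assumed encodings)."""
    return {
        "local_model": 3 * (n - 1),           # O(n): one triple per merge
        "distance_matrix": n * (n - 1) // 2,  # O(n^2): upper triangle
        "raw_data": n * m,                    # O(nm): every feature value
    }


# Four elements described by a single local feature at one site.
merges = single_link_merges([(0.0,), (1.0,), (5.0,), (6.0,)])
costs = communication_costs(n=1000, m=50)
```

Under these assumptions, a site holding 1000 elements would send 2997 numbers for its local model, against 499500 for the full distance matrix and 50000 for the raw data with 50 features.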


Hierarchical Cluster, Time Complexity, Leaf Node, Global Model, Local Model
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2000

Authors and Affiliations

  • Erik L. Johnson (1)
  • Hillol Kargupta (1)
  1. School of Electrical Engineering and Computer Science, Washington State University, USA
