Homogeneous Vs. Heterogeneous Distributed Data Clustering: A Taxonomy

Kashef, Rasha; Warraich, Marium

doi:10.1007/978-3-030-32587-9_4

Part of the book series: Studies in Big Data (SBD, volume 65)

Abstract

Recent advances in computer architecture and networking make it possible to parallelize the data clustering process. By dividing the problem into smaller partitions, tackling each one in parallel, and then combining the partial solutions, parallel algorithms can cluster large amounts of data much more efficiently. In some scenarios, the dataset is inherently distributed over multiple nodes, which makes centralized clustering infeasible or even impossible and has created a need for clustering in distributed environments. Distributed clustering addresses two problems: the infeasibility of collecting data at a central node, due to technical or privacy limitations, and the intractability of traditional clustering algorithms on large datasets. In this paper, we provide a novel taxonomy of distributed data clustering algorithms and offer insight into their distributed modeling strategies. The taxonomy classifies distributed clustering processes as either homogeneous or heterogeneous. Various distributed performance and quality measures are also addressed.
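The divide, solve in parallel, and combine pattern described in the abstract can be illustrated with a minimal sketch. It is not the chapter's method: it assumes that every node holds a different subset of records over the same features, that each node runs plain k-means locally, and that a coordinator merges the local centroids with a weighted k-means. The names `kmeans`, `distributed_kmeans`, and `sq_dist` are hypothetical helpers introduced only for this illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, weights=None, iters=20, seed=0):
    """Weighted Lloyd's k-means. Returns (centroids, total weight per centroid)."""
    rng = random.Random(seed)
    weights = weights or [1.0] * len(points)
    centroids = rng.sample(points, k)  # initialise from k distinct input points
    dim = len(points[0])
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        mass = [0.0] * k
        for p, w in zip(points, weights):
            # Assign each point to its nearest current centroid.
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            mass[j] += w
            for d, x in enumerate(p):
                sums[j][d] += w * x
        # Move each centroid to the weighted mean of its points;
        # an empty cluster keeps its previous centroid.
        centroids = [
            [s / mass[j] for s in sums[j]] if mass[j] > 0 else centroids[j]
            for j in range(k)
        ]
    return centroids, mass


def distributed_kmeans(partitions, k):
    """Cluster each partition in parallel, then merge the partial solutions."""
    # Step 1 and 2: every "node" (here a thread) clusters only its own partition.
    with ThreadPoolExecutor() as pool:
        local = list(pool.map(lambda part: kmeans(part, k), partitions))
    # Step 3: only centroids and their weights reach the coordinator,
    # never the raw records (assumption made for this sketch).
    summaries = [c for cents, _ in local for c in cents]
    counts = [m for _, masses in local for m in masses]
    global_centroids, _ = kmeans(summaries, k, weights=counts)
    return global_centroids


if __name__ == "__main__":
    rng = random.Random(1)
    parts = [
        [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
        + [(rng.gauss(8, 0.5), rng.gauss(8, 0.5)) for _ in range(50)]
        for _ in range(4)
    ]
    print(distributed_kmeans(parts, k=2))
```

Because only centroids and counts leave each node in this sketch, it also hints at how a distributed scheme can sidestep collecting raw data centrally, at the cost of approximating the result a centralized run would produce.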

Author information

Authors and Affiliations

Electrical, Computer, and Biomedical Engineering Department, Ryerson University, London, ON, Canada
Rasha Kashef
Department of Management Science, Ivey Business School, London, ON, Canada
Marium Warraich

Corresponding author

Correspondence to Rasha Kashef.

Editor information

Editors and Affiliations

Department of Computer Science, University of Calgary, Calgary, AB, Canada; Department of Computer Engineering, Istanbul Medipol University, Istanbul, Turkey
Reda Alhajj
Department of Electrical and Computer Engineering, University of Calgary, Calgary, AB, Canada
Mohammad Moshirpour
Department of Electrical and Computer Engineering, University of Calgary, Calgary, AB, Canada
Behrouz Far

About this chapter

Cite this chapter

Kashef, R., Warraich, M. (2020). Homogeneous Vs. Heterogeneous Distributed Data Clustering: A Taxonomy. In: Alhajj, R., Moshirpour, M., Far, B. (eds) Data Management and Analysis. Studies in Big Data, vol 65. Springer, Cham. https://doi.org/10.1007/978-3-030-32587-9_4

DOI: https://doi.org/10.1007/978-3-030-32587-9_4
Published: 21 December 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32586-2
Online ISBN: 978-3-030-32587-9
eBook Packages: Engineering, Engineering (R0)
