A New Spark Based K-Means Clustering with Data Removing Strategy

Rziga, Kenza; Ben HajKacem, Mohamed Aymen; Essoussi, Nadia

doi:10.1007/978-3-030-30874-2_23

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 358)

Included in the following conference series:

International Conference on Digital Economy

Abstract

Clustering is an important machine learning technique that organizes data into groups of similar data points, called clusters. Conventional clustering methods, however, are not suitable for large-scale data: their high computational cost means they require an unrealistic amount of time to build the grouping. In this work we propose a new Spark-based K-means clustering method with a data removing strategy, referred to as SKMDRS. The data removing strategy reduces computation time by removing, at each iteration, the data points that are unlikely to change cluster in subsequent iterations. In addition, the clustering process is distributed through the Spark framework to enhance scalability. The experiments conducted show the efficiency of the proposed method compared with existing ones.
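The data removing idea can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the abstract does not state the removal criterion, so the sketch assumes one (a point is dropped from the active set once its second-nearest centroid is more than a hypothetical `margin` farther away than its nearest one). The function names, the `margin` parameter, and the toy data are all illustrative, and the Spark distribution step is left out.

```python
import math


def nearest_two(point, centroids):
    """Return (index of nearest centroid, nearest distance, second-nearest distance)."""
    dists = sorted((math.dist(point, c), j) for j, c in enumerate(centroids))
    (d1, j1), (d2, _) = dists[0], dists[1]
    return j1, d1, d2


def kmeans_with_removal(points, init_centroids, margin=2.0, max_iter=20):
    """K-means that stops re-examining points judged stable.

    ASSUMPTION: `margin` and the rule `d2 - d1 > margin` are hypothetical;
    they are not taken from the paper. Requires at least two centroids.
    """
    centroids = [list(c) for c in init_centroids]
    k, dim = len(centroids), len(points[0])
    labels = [-1] * len(points)
    active = list(range(len(points)))  # indices still re-evaluated each iteration
    active_sizes = []                  # number of points examined per iteration

    for _ in range(max_iter):
        active_sizes.append(len(active))
        changed = False
        still_active = []
        for i in active:
            j, d1, d2 = nearest_two(points[i], centroids)
            if labels[i] != j:
                labels[i] = j
                changed = True
            if d2 - d1 <= margin:      # still ambiguous: keep examining it
                still_active.append(i)
        active = still_active

        # Removed points keep their label and still contribute to the centroids.
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for p, j in zip(points, labels):
            counts[j] += 1
            for d in range(dim):
                sums[j][d] += p[d]
        for j in range(k):
            if counts[j]:
                centroids[j] = [s / counts[j] for s in sums[j]]

        if not changed or not active:
            break
    return centroids, labels, active_sizes


# Toy data: two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, labels, active_sizes = kmeans_with_removal(points, [(0, 0), (1, 1)])
print(labels, centroids, active_sizes)
```

In this sketch the saving comes from the shrinking `active` list: each iteration computes distances only for points that are still ambiguous. In a distributed setting the per-point assignment loop is the part that would naturally run in parallel over data partitions, with per-cluster sums and counts combined afterwards; how SKMDRS actually does this is not described in the abstract.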

Author information

Authors and Affiliations

Université de Tunis, Institut Supérieur de Gestion de Tunis, LARODEC, 41 Avenue de la Liberté, Cité Bouchoucha, 2000, Le Bardo, Tunisia
Kenza Rziga, Mohamed Aymen Ben HajKacem & Nadia Essoussi

Corresponding authors

Correspondence to Kenza Rziga, Mohamed Aymen Ben HajKacem or Nadia Essoussi.

Editor information

Editors and Affiliations

ESEN, University of Manouba, Manouba, Tunisia
Rim Jallouli
ESEN, University of Manouba, Manouba, Tunisia
Mohamed Anis Bach Tobji
Université de Sherbrooke, Sherbrooke, QC, Canada
Deny Bélisle
Université Laval, Quebec, QC, Canada
Sehl Mellouli
International University of Beirut, Mazraa, Lebanon
Farid Abdallah
American University of Beirut, Beirut, Lebanon
Ibrahim Osman

About this paper

Cite this paper

Rziga, K., Ben HajKacem, M.A., Essoussi, N. (2019). A New Spark Based K-Means Clustering with Data Removing Strategy. In: Jallouli, R., Bach Tobji, M., Bélisle, D., Mellouli, S., Abdallah, F., Osman, I. (eds) Digital Economy. Emerging Technologies and Business Innovation. ICDEc 2019. Lecture Notes in Business Information Processing, vol 358. Springer, Cham. https://doi.org/10.1007/978-3-030-30874-2_23

DOI: https://doi.org/10.1007/978-3-030-30874-2_23
Published: 21 September 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30873-5
Online ISBN: 978-3-030-30874-2
eBook Packages: Computer Science, Computer Science (R0)
