Abstract
Clustering is an important technique in machine learning that organizes data into groups of similar data points, called clusters. Conventional clustering methods, however, are not suitable for large-scale data: their high computational cost means they require unrealistic time to build the grouping. In this work we propose a new Spark based K-means Clustering with Data Removing Strategy, referred to as SKMDRS. The proposed method relies on a data removing strategy that reduces computational time by removing, at each iteration, data points that are unlikely to change cluster in subsequent iterations. In addition, the clustering process is distributed through the Spark framework to enhance scalability. The conducted experiments show the efficiency of the proposed method compared to existing ones.
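The general idea of a data removing strategy can be illustrated with a minimal single-machine sketch in Python. The abstract does not state the removal criterion used by SKMDRS, so the rule below is an assumption made only for illustration: a point is removed once the gap between its distances to the second-nearest and nearest centroids exceeds a hypothetical `margin` parameter. The function name `kmeans_with_removal` and its parameters are likewise hypothetical, and the Spark distribution step is not shown.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_with_removal(points, init_centroids, margin, max_iter=20):
    """K-means that stops re-examining points judged stable.

    Assumed removal rule (not taken from the paper): a point is removed
    when (distance to second-nearest centroid) - (distance to nearest
    centroid) > margin. Requires at least two centroids.
    """
    k, dim = len(init_centroids), len(points[0])
    centroids = [list(c) for c in init_centroids]
    labels = [None] * len(points)
    active = set(range(len(points)))  # points still re-assigned each pass
    # Removed points keep contributing to their cluster through these totals.
    frozen_sum = [[0.0] * dim for _ in range(k)]
    frozen_cnt = [0] * k

    for _ in range(max_iter):
        sums = [row[:] for row in frozen_sum]
        counts = frozen_cnt[:]
        to_remove, changed = [], False

        # Assignment step over active points only. In a distributed setting
        # this loop is the part that would run per data partition.
        for i in active:
            p = points[i]
            ranked = sorted((dist(p, c), j) for j, c in enumerate(centroids))
            (d1, j1), (d2, _) = ranked[0], ranked[1]
            if labels[i] != j1:
                changed = True
            labels[i] = j1
            for t in range(dim):
                sums[j1][t] += p[t]
            counts[j1] += 1
            if d2 - d1 > margin:
                to_remove.append(i)

        # Freeze removed points: fold them into the permanent totals.
        for i in to_remove:
            active.discard(i)
            for t in range(dim):
                frozen_sum[labels[i]][t] += points[i][t]
            frozen_cnt[labels[i]] += 1

        # Update step: mean of frozen plus active members of each cluster.
        centroids = [
            [s / counts[j] for s in sums[j]] if counts[j] else centroids[j]
            for j in range(k)
        ]
        if not changed or not active:
            break

    return centroids, labels, len(active)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 6)]
    cents, labs, n_active = kmeans_with_removal(data, [(0, 0), (10, 10)], margin=5)
    print(labs, n_active, cents)
```

In this toy run the six points lying close to a centroid are removed after the first pass, so only the borderline point `(5, 6)` is re-examined afterwards. A larger `margin` removes fewer points and stays closer to standard K-means, while a smaller one saves more distance computations at the risk of freezing a point in the wrong cluster.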
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Rziga, K., Ben HajKacem, M.A., Essoussi, N. (2019). A New Spark Based K-Means Clustering with Data Removing Strategy. In: Jallouli, R., Bach Tobji, M., Bélisle, D., Mellouli, S., Abdallah, F., Osman, I. (eds) Digital Economy. Emerging Technologies and Business Innovation. ICDEc 2019. Lecture Notes in Business Information Processing, vol 358. Springer, Cham. https://doi.org/10.1007/978-3-030-30874-2_23
DOI: https://doi.org/10.1007/978-3-030-30874-2_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30873-5
Online ISBN: 978-3-030-30874-2
eBook Packages: Computer Science (R0)