Abstract
Data clustering is indispensable in today's era of data deluge. k-Means is a popular partition-based clustering technique. However, as data grow in size and complexity, it is no longer suitable, and there is an urgent need to shift towards parallel algorithms. We present a MapReduce-based k-Means clustering algorithm that is scalable and fault tolerant. Its major advantage is that it determines the number of clusters dynamically, unlike k-Means, where the final number of clusters has to be specified in advance. MapReduce jobs are iteration sensitive, because repeated reads from and writes to the file system increase both cost and computation time. The proposed algorithm is not iterative: it reads the data from the file system once and writes the output back once. We show that the proposed algorithm performs better than an Improved MapReduce-based k-Means clustering algorithm.
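To make the iteration cost mentioned above concrete, the following is a minimal sketch of the conventional iterative MapReduce formulation of k-Means, simulated in plain Python. It is not the algorithm proposed in this paper, whose details the abstract does not give. The function names (`map_phase`, `reduce_phase`, `kmeans_round`) and the toy data are assumptions made for illustration only. Each call to `kmeans_round` stands for one MapReduce job, that is, one full read of the data and one write of the updated centroids.

```python
from collections import defaultdict
from math import dist


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every point."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield nearest, p


def reduce_phase(pairs):
    """Reduce: average the points grouped under each centroid index."""
    groups = defaultdict(list)
    for key, p in pairs:
        groups[key].append(p)
    return {k: tuple(sum(c) / len(ps) for c in zip(*ps)) for k, ps in groups.items()}


def kmeans_round(points, centroids):
    """One full pass over the data, standing in for one MapReduce job."""
    updated = reduce_phase(map_phase(points, centroids))
    # Keep the old centroid if no point was assigned to it.
    return [updated.get(i, c) for i, c in enumerate(centroids)]


# Hypothetical toy data; k is fixed up front, as in standard k-Means.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

rounds = 0
while True:
    new_centroids = kmeans_round(points, centroids)
    rounds += 1  # every round would be a separate job on a real cluster
    if new_centroids == centroids:
        break
    centroids = new_centroids
```

In this sketch the loop needs a second pass just to confirm convergence. On a real cluster each such pass would re-read the input from the distributed file system, which is the overhead a single-pass design aims to avoid.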
Acknowledgments
This research work is supported by the Council of Scientific and Industrial Research (CSIR), New Delhi, India. The authors are grateful to CSIR for the financial assistance provided to carry out this work.
Copyright information
© 2017 Springer Science+Business Media Singapore
About this paper
Cite this paper
Sinha, A., Jana, P.K. (2017). A Novel MapReduce Based k-Means Clustering. In: Mandal, J., Satapathy, S., Sanyal, M., Bhateja, V. (eds) Proceedings of the First International Conference on Intelligent Computing and Communication. Advances in Intelligent Systems and Computing, vol 458. Springer, Singapore. https://doi.org/10.1007/978-981-10-2035-3_26
DOI: https://doi.org/10.1007/978-981-10-2035-3_26
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-2034-6
Online ISBN: 978-981-10-2035-3
eBook Packages: Engineering, Engineering (R0)