A Novel MapReduce Based k-Means Clustering

Ankita Sinha and Prasanta K. Jana

doi:10.1007/978-981-10-2035-3_26

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 458)

Abstract

Data clustering is indispensable in today’s era of data deluge. k-Means is a popular partition-based clustering technique. However, as data grow in size and complexity, it is no longer suitable, and there is an urgent need to shift towards parallel algorithms. We present a MapReduce-based k-Means clustering algorithm that is scalable and fault tolerant. Its major advantage is that it determines the number of clusters dynamically, unlike k-Means, where the final number of clusters has to be specified in advance. MapReduce jobs are sensitive to iteration, because repeated reads from and writes to the file system increase both cost and computation time. The proposed algorithm is not iterative: it reads the data from the file system once and writes the output back once. We show that the proposed algorithm performs better than an Improved MapReduce-based k-Means clustering algorithm.
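To make the iteration cost mentioned in the abstract concrete, the following minimal sketch simulates the conventional iterative MapReduce formulation of k-Means in plain Python. It is not the authors' non-iterative algorithm, which the abstract does not describe in detail. The function names (`map_point`, `reduce_cluster`, `kmeans_iteration`, `kmeans`) and the in-memory shuffle are illustrative assumptions; on a real cluster each pass of the driver loop would be a separate job that reads the data set from and writes centroids back to the distributed file system.

```python
from collections import defaultdict
from math import dist


def map_point(point, centroids):
    """Map step: emit (index of the nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def reduce_cluster(values):
    """Reduce step: average all points that were assigned to one centroid."""
    total = None
    count = 0
    for point, n in values:
        total = list(point) if total is None else [a + b for a, b in zip(total, point)]
        count += n
    return tuple(x / count for x in total)


def kmeans_iteration(points, centroids):
    """One simulated MapReduce job: map, shuffle by key, reduce."""
    groups = defaultdict(list)  # stands in for the shuffle phase (assumption)
    for p in points:
        key, value = map_point(p, centroids)
        groups[key].append(value)
    # A centroid that attracted no points keeps its old position.
    return [reduce_cluster(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]


def kmeans(points, centroids, max_jobs=20, tol=1e-9):
    """Driver: every loop pass stands for one full job (one read, one write)."""
    jobs = 0
    for _ in range(max_jobs):
        new_centroids = kmeans_iteration(points, centroids)
        jobs += 1
        shift = max(dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, jobs


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    final_centroids, job_count = kmeans(data, [(0, 0), (10, 10)])
    print(final_centroids, job_count)
```

The returned job count is the quantity the abstract is concerned with: in this conventional scheme the number of clusters is fixed by the initial centroids, and every additional job implies another full read of the input and another write of the output.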

Acknowledgments

This research is supported by the Council of Scientific and Industrial Research (CSIR), New Delhi, India. The authors are grateful to CSIR for the financial assistance provided to carry out this work.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian School of Mines, Dhanbad, India
Ankita Sinha & Prasanta K. Jana

Corresponding author

Correspondence to Ankita Sinha.

Editor information

Editors and Affiliations

Department of CSE, University of Kalyani, Kalyani, West Bengal, India
Jyotsna Kumar Mandal
Department of Computer Science & Engineering, Anil Neerukonda Institute of Technology and Sciences, Vishakapatnam, Andhra Pradesh, India
Suresh Chandra Satapathy
J K Mandal, Dept of CSE, University of Kalyani, Kalyani, West Bengal, India
Manas Kumar Sanyal
Department of ECE, Sri Ramswaroop Memorial College of Engineering and Management Lucknow, Lucknow, Uttar Pradesh, India
Vikrant Bhateja

Cite this paper

Sinha, A., Jana, P.K. (2017). A Novel MapReduce Based k-Means Clustering. In: Mandal, J., Satapathy, S., Sanyal, M., Bhateja, V. (eds) Proceedings of the First International Conference on Intelligent Computing and Communication. Advances in Intelligent Systems and Computing, vol 458. Springer, Singapore. https://doi.org/10.1007/978-981-10-2035-3_26

DOI: https://doi.org/10.1007/978-981-10-2035-3_26
Published: 23 November 2016
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-2034-6
Online ISBN: 978-981-10-2035-3
eBook Packages: Engineering, Engineering (R0)
