Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 458)

Abstract

Data clustering is indispensable in today's era of data deluge. k-Means is a popular partition-based clustering technique. However, as data grow in size and complexity, it is no longer suitable, and there is an urgent need to shift towards parallel algorithms. We present a MapReduce-based k-Means clustering algorithm that is scalable and fault tolerant. Its major advantage is that it determines the number of clusters dynamically, unlike k-Means, where the final number of clusters has to be specified in advance. MapReduce jobs are sensitive to iteration, because repeated reads from and writes to the file system increase both cost and computation time. The proposed algorithm is not iterative: it reads the data from the file system once and writes the output back once. We show that the proposed algorithm performs better than an Improved MapReduce-based k-Means clustering algorithm.
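For orientation, the sketch below shows one iteration of the conventional MapReduce formulation of k-Means, which is the kind of baseline the abstract contrasts against, not the authors' proposed algorithm (this preview does not describe it). It runs in memory with plain Python; the function names `map_phase`, `reduce_phase`, and `kmeans_iteration` and the sample data are illustrative assumptions. In a real deployment every call to `kmeans_iteration` would be a separate job that reads the data from and writes centroids back to the file system, which is the per-iteration cost the abstract refers to.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every input point."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield nearest, p


def reduce_phase(pairs, centroids):
    """Reduce: average the points grouped under each centroid index."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    # A centroid that received no points is left unchanged.
    new_centroids = list(centroids)
    for idx, members in groups.items():
        new_centroids[idx] = tuple(sum(c) / len(members) for c in zip(*members))
    return new_centroids


def kmeans_iteration(points, centroids):
    """One map + reduce pass; conventional k-Means repeats this until stable."""
    return reduce_phase(map_phase(points, centroids), centroids)


# Hypothetical toy data: two well-separated groups and two initial centroids.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
updated = kmeans_iteration(points, centroids)
```

Note that the number of clusters here is fixed by the length of `centroids`, and convergence needs repeated passes; these are the two limitations the abstract says the proposed single-pass method avoids.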

Acknowledgments

This research work is supported by the Council of Scientific and Industrial Research (CSIR), New Delhi, India. The authors are grateful to CSIR for the financial assistance provided to carry out the research work.

Author information

Corresponding author

Correspondence to Ankita Sinha.

Copyright information

© 2017 Springer Science+Business Media Singapore

About this paper

Cite this paper

Sinha, A., Jana, P.K. (2017). A Novel MapReduce Based k-Means Clustering. In: Mandal, J., Satapathy, S., Sanyal, M., Bhateja, V. (eds) Proceedings of the First International Conference on Intelligent Computing and Communication. Advances in Intelligent Systems and Computing, vol 458. Springer, Singapore. https://doi.org/10.1007/978-981-10-2035-3_26

  • DOI: https://doi.org/10.1007/978-981-10-2035-3_26

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-2034-6

  • Online ISBN: 978-981-10-2035-3

  • eBook Packages: Engineering, Engineering (R0)
