Cloud Based K-Means Clustering Running as a MapReduce Job for Big Data Healthcare Analytics Using Apache Mahout

Rallapalli, Sreekanth; Gondkar, R. R.; Madhava Rao, Golajapu Venu

doi:10.1007/978-81-322-2755-7_14

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 433)

Abstract

Growth in data volume and the need for analytics have driven innovation in big data. To speed up query responses, models such as NoSQL have emerged. Virtualized platforms that run Hadoop on commodity hardware help small and midsized companies use cloud environments, which reduces the cost of data processing and analytics. Because health care generates a large volume and variety of data, parallel algorithms are required that can support petabytes of data using Hadoop and MapReduce parallel processing. K-means clustering is one method that can be implemented as a parallel algorithm. To build an accurate system, large data sets need to be considered, but memory requirements increase with large data sets and algorithms become slow. Mahout's scalable algorithms work better with huge data sets and improve the performance of the system. Mahout is open source and can be used to solve problems arising with huge data sets. This paper proposes cloud based K-means clustering running as a MapReduce job. We use health care data on the cloud for clustering. We then compare the results across various measures to determine the best measure for finding the number of vectors in a given cluster.
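
As a rough illustration of the idea in the abstract, the following is a minimal single-process sketch of one way K-means can be split into a map step (assign each vector to its nearest centroid) and a reduce step (average the vectors in each cluster), with the distance measure passed in so that cluster sizes can be compared across measures. It is not the authors' implementation and does not use Hadoop or Mahout. The function names, the sample points, and the choice of Euclidean and Manhattan distance are assumptions made for illustration; the abstract does not name the measures that were compared.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def map_phase(points, centroids, distance):
    """Map step: emit (cluster_id, point) for the nearest centroid."""
    for p in points:
        cid = min(range(len(centroids)), key=lambda i: distance(p, centroids[i]))
        yield cid, p


def reduce_phase(pairs, old_centroids):
    """Reduce step: group points by cluster id and average each group."""
    groups = defaultdict(list)
    for cid, p in pairs:
        groups[cid].append(p)
    new_centroids = []
    for i, old in enumerate(old_centroids):
        members = groups.get(i)
        if members:
            # Column-wise mean of the member vectors.
            new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            # Keep an empty cluster's centroid where it was.
            new_centroids.append(old)
    sizes = [len(groups.get(i, [])) for i in range(len(old_centroids))]
    return new_centroids, sizes


def kmeans(points, centroids, distance, max_iter=10, tol=1e-9):
    """Repeat map and reduce until the centroids stop moving."""
    sizes = [0] * len(centroids)
    for _ in range(max_iter):
        pairs = map_phase(points, centroids, distance)
        new_centroids, sizes = reduce_phase(pairs, centroids)
        shift = max(euclidean(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, sizes


if __name__ == "__main__":
    # Hypothetical two-dimensional sample data, not taken from the paper.
    sample = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0), (9.0, 10.0)]
    start = [(1.0, 1.0), (10.0, 10.0)]
    for name, measure in (("euclidean", euclidean), ("manhattan", manhattan)):
        _, cluster_sizes = kmeans(sample, start, measure)
        print(name, cluster_sizes)
```

In a real Hadoop job the map and reduce steps would run on separate nodes and each iteration would be a separate job; here both run in one process so the data flow is easy to follow. Passing the distance function as a parameter is what makes it possible to rerun the same clustering under different measures and compare the number of vectors per cluster.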

Author information

Authors and Affiliations

R&D Centre, Bharathiar University, Coimbatore, India
Sreekanth Rallapalli
AIT, Bangalore, India
R. R. Gondkar
Botho University, Gaborone, Botswana
Golajapu Venu Madhava Rao

Corresponding author

Correspondence to Sreekanth Rallapalli.

Editor information

Editors and Affiliations

Department of CSE, Anil Neerukonda Ins. of Tech. & Sci., Visakhapatnam, India
Suresh Chandra Satapathy
Kalyani University, Nadia, West Bengal, India
Jyotsna Kumar Mandal
University of Hyderabad, Hyderabad, Andhra Pradesh, India
Siba K. Udgata
Dept. of ECE, Shri Ramswaroop Mem. Group of Prof. Clg, Lucknow, Uttar Pradesh, India
Vikrant Bhateja

About this paper

Cite this paper

Rallapalli, S., Gondkar, R.R., Madhava Rao, G.V. (2016). Cloud Based K-Means Clustering Running as a MapReduce Job for Big Data Healthcare Analytics Using Apache Mahout. In: Satapathy, S., Mandal, J., Udgata, S., Bhateja, V. (eds) Information Systems Design and Intelligent Applications. Advances in Intelligent Systems and Computing, vol 433. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2755-7_14

DOI: https://doi.org/10.1007/978-81-322-2755-7_14
Published: 06 February 2016
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-2753-3
Online ISBN: 978-81-322-2755-7
eBook Packages: Engineering, Engineering (R0)
