A hybrid MapReduce-based k-means clustering using genetic algorithm for distributed datasets

Sinha, Ankita; Jana, Prasanta K.

doi:10.1007/s11227-017-2182-8


Published: 11 November 2017

Volume 74, pages 1562–1579, (2018)

The Journal of Supercomputing

759 Accesses
31 Citations

Abstract

Clustering a large volume of data in a distributed environment is a challenging problem. The data stored across multiple machines are huge, and the solution space is large. Genetic algorithms deal effectively with large solution spaces and can provide better solutions. In this paper, we propose a novel clustering algorithm for distributed datasets that combines a genetic algorithm (GA) using the Mahalanobis distance with the k-means clustering algorithm. The proposed algorithm has two phases. In phase 1, the GA is applied in parallel to data chunks located on different machines. The Mahalanobis distance, which takes the covariance between data points into account and thus provides a better representation of the initial data, is used as the fitness value in the GA. In phase 2, k-means with k-means++ initialization is applied to the intermediate output to obtain the final result. The proposed algorithm is implemented on the Hadoop framework, which is inherently designed to deal with distributed datasets in a fault-tolerant manner. Extensive experiments were conducted on multiple real-life and synthetic datasets to measure the performance of the proposed algorithm. The results were compared with those of the MapReduce-based algorithms mrk-means, parallel k-means and scaling GA.
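To make the role of the Mahalanobis distance concrete, the following is a minimal sketch in plain Python, restricted to two-dimensional points so that the covariance matrix can be inverted in closed form. It is not the authors' implementation: the function names, the 2-D restriction, and the particular fitness definition (the sum of each point's Mahalanobis distance to its nearest candidate center within one data chunk) are assumptions made purely for illustration.

```python
import math


def covariance_2d(points):
    """Return the sample covariance entries (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    return sxx, sxy, syy


def mahalanobis_2d(x, center, cov):
    """Mahalanobis distance sqrt((x - c)^T S^{-1} (x - c)) for 2-D inputs."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx = x[0] - center[0]
    dy = x[1] - center[1]
    # Inverse of [[sxx, sxy], [sxy, syy]] is (1/det) * [[syy, -sxy], [-sxy, sxx]].
    quad = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(quad)


def fitness(centers, chunk, cov):
    """Hypothetical GA fitness for one chunk: lower means tighter clusters.

    Sums, over all points in the chunk, the Mahalanobis distance from the
    point to its nearest candidate center.
    """
    return sum(min(mahalanobis_2d(p, c, cov) for c in centers) for p in chunk)
```

With an identity covariance the measure reduces to the Euclidean distance. With positively correlated features, a point displaced along the direction of correlation is treated as closer than a point displaced by the same Euclidean amount across it, which is the sense in which the distance accounts for covariance.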



Acknowledgements

We sincerely thank the Council of Scientific and Industrial Research (CSIR), New Delhi, India, for supporting this work (File No. 09\085(0111)2014-EMR-1). We are grateful to CSIR, India, for the financial support.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, IIT (ISM), Dhanbad, Dhanbad, India
Ankita Sinha & Prasanta K. Jana


Corresponding author

Correspondence to Ankita Sinha.


About this article

Cite this article

Sinha, A., Jana, P.K. A hybrid MapReduce-based k-means clustering using genetic algorithm for distributed datasets. J Supercomput 74, 1562–1579 (2018). https://doi.org/10.1007/s11227-017-2182-8

Issue Date: April 2018
