The Journal of Supercomputing, Volume 74, Issue 4, pp 1562–1579

A hybrid MapReduce-based k-means clustering using genetic algorithm for distributed datasets

  • Ankita Sinha
  • Prasanta K. Jana


Clustering a large volume of data in a distributed environment is challenging: the data stored across multiple machines are huge, and the solution space is large. Genetic algorithms deal effectively with large solution spaces and can provide better solutions. In this paper, we propose a novel clustering algorithm for distributed datasets that combines a genetic algorithm (GA) using Mahalanobis distance with the k-means clustering algorithm. The proposed algorithm has two phases. In phase 1, GA is applied in parallel on data chunks located across different machines. Mahalanobis distance is used as the fitness value in GA; it takes the covariance between data points into account and thus provides a better representation of the initial data. In phase 2, k-means with k-means++ initialization is applied to the intermediate output to obtain the final result. The proposed algorithm is implemented on the Hadoop framework, which is inherently designed to handle distributed datasets in a fault-tolerant manner. Extensive experiments were conducted on multiple real-life and synthetic datasets to measure the performance of the proposed algorithm. Results were compared with the MapReduce-based algorithms mrk-means, parallel k-means and scaling GA.
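To make the role of covariance in the distance measure concrete, the following is a minimal sketch of the standard Mahalanobis distance, d(x, y) = sqrt((x − y)ᵀ S⁻¹ (x − y)), where S is the covariance matrix of the data. The abstract does not say how the fitness value is derived from this distance, so the sketch covers only the distance itself. The function names (`mean_vector`, `covariance_matrix`, `solve`, `mahalanobis`) are hypothetical, and the use of the sample covariance is an assumption.

```python
import math


def mean_vector(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[j] for p in points) / n for j in range(dim)]


def covariance_matrix(points):
    """Sample covariance matrix (divides by n - 1); an assumed choice."""
    n = len(points)
    dim = len(points[0])
    mu = mean_vector(points)
    cov = [[0.0] * dim for _ in range(dim)]
    for p in points:
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += (p[i] - mu[i]) * (p[j] - mu[j])
    return [[c / (n - 1) for c in row] for row in cov]


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def mahalanobis(x, y, cov):
    """Mahalanobis distance between x and y under covariance matrix cov."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    # Solving cov @ z = diff avoids forming the inverse explicitly.
    z = solve(cov, diff)
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))
```

With an identity covariance matrix the result reduces to the Euclidean distance, for example `mahalanobis([0, 0], [3, 4], [[1, 0], [0, 1]])`. A larger variance along one axis shrinks distances measured along that axis, which is the sense in which the measure accounts for the spread of the data.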


Keywords: Mahalanobis distance; Apache Hadoop; k-means++ initialization; Genetic algorithm



We sincerely thank the Council of Scientific and Industrial Research (CSIR), New Delhi, India, for financially supporting this work (File No. 09\085(0111)2014-EMR-1).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2017

Authors and Affiliations

  1. Department of Computer Science and Engineering, IIT (ISM) Dhanbad, Dhanbad, India
