Genetic Algorithm Based Parallel K-Means Data Clustering Algorithm Using MapReduce Programming Paradigm on Hadoop Environment (GAPKCA)

  • Sayer Alshammari
  • Maslina Binti Zolkepli (email author)
  • Rusli Bin Abdullah
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 978)


Data clustering algorithms have received considerable attention in many application areas, such as data mining, document retrieval, image processing and pattern classification. This paper proposes a hybrid data clustering algorithm that combines a genetic algorithm (GA) with the parallel K-Means clustering algorithm (PKCA), a popular variant of K-Means clustering. The objective is to use the search process of the GA to generate new data clusters and to apply parallel K-Means to further speed up the search and improve its quality during cluster formation. The proposed approach is implemented using the popular MapReduce programming model on the Hadoop framework. Experiments were conducted with multiple synthetic datasets to evaluate the performance of the proposed algorithm. The results show that the proposed algorithm sped up the document clustering process by 0.54 s on average and outperformed PKCA. Data analysts in marketing and finance, in telecommunication and transport companies, and researchers in academia can use this algorithm to make sense of their large volumes of data.
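The hybrid idea described above can be illustrated with a minimal, single-process Python sketch. It is not the authors' implementation and does not run on Hadoop: the function names (`kmeans_map`, `kmeans_reduce`, `ga_kmeans`), the fitness measure (sum of squared errors), the one-point crossover, the random-point mutation and all parameter defaults are assumptions chosen for illustration. Each chromosome is a list of k centroids; one K-Means iteration is written as a map step (assign each point to its nearest centroid) and a reduce step (average the points of each cluster), and the GA evolves a population of such chromosomes.

```python
import random
from collections import defaultdict


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to `point` (first index wins ties)."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def kmeans_map(points, centroids):
    """Map phase (illustrative): emit (cluster_index, point) pairs."""
    return [(nearest(p, centroids), p) for p in points]


def kmeans_reduce(pairs, centroids):
    """Reduce phase (illustrative): average each cluster into a new centroid."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centroids = []
    for i, old in enumerate(centroids):
        members = groups.get(i)
        if not members:
            # An empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids


def sse(points, centroids):
    """Assumed fitness (lower is better): sum of squared distances to the nearest centroid."""
    return sum(sq_dist(p, centroids[nearest(p, centroids)]) for p in points)


def ga_kmeans(points, k, pop_size=8, generations=10, mutation_rate=0.1, seed=0):
    """Hypothetical GA + K-Means hybrid; requires pop_size >= 4 and len(points) >= k."""
    rng = random.Random(seed)
    # Initial population: each chromosome is k distinct data points used as centroids.
    population = [rng.sample(points, k) for _ in range(pop_size)]
    for _ in range(generations):
        # Local refinement: one K-Means map/reduce iteration per chromosome.
        population = [kmeans_reduce(kmeans_map(points, c), c) for c in population]
        # Selection: keep the better half of the population.
        population.sort(key=lambda c: sse(points, c))
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            # One-point crossover on the list of centroids.
            cut = rng.randrange(1, k) if k > 1 else 0
            child = list(a[:cut]) + list(b[cut:])
            # Mutation: replace one centroid with a random data point.
            if rng.random() < mutation_rate:
                child[rng.randrange(k)] = rng.choice(points)
            children.append(child)
        population = survivors + children
    return min(population, key=lambda c: sse(points, c))
```

Calling `ga_kmeans(points, k=2)` on a list of coordinate tuples returns the best list of centroids found. In a MapReduce deployment, the map and reduce functions would run on separate data partitions rather than in one process; how the paper distributes the GA operators across mappers and reducers is not stated in this abstract.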


Keywords: K-Means; Parallel K-Means; Clustering algorithm; MapReduce; Hadoop; Genetic algorithm



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Sayer Alshammari (1)
  • Maslina Binti Zolkepli (1), email author
  • Rusli Bin Abdullah
  1. Universiti Putra Malaysia, Selangor, Malaysia
