Genetic Algorithm Based Parallel K-Means Data Clustering Algorithm Using MapReduce Programming Paradigm on Hadoop Environment (GAPKCA)

Alshammari, Sayer; Zolkepli, Maslina Binti; Abdullah, Rusli Bin

doi:10.1007/978-3-030-36056-6_10

Conference paper
First Online: 05 December 2019

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 978)

Abstract

Data clustering algorithms have received considerable attention in many application areas, such as data mining, document retrieval, image processing and pattern classification. This paper proposes a hybrid data clustering algorithm that combines a genetic algorithm (GA) with a popular variant of the K-Means clustering algorithm, the parallel K-Means clustering algorithm (PKCA). The proposed algorithm uses the search process of the GA to generate new data clusters and applies parallel K-Means to further speed up, and improve the quality of, the search process during cluster formation. The approach is implemented using the MapReduce programming model on the Hadoop framework. Experiments were conducted with multiple synthetic datasets to evaluate the performance of the proposed algorithm. Results show that the proposed algorithm sped up the document clustering process by 0.54 s on average and outperformed PKCA. Data analysts in marketing, finance, telecommunication and transport companies, as well as researchers in academia, can use this algorithm to make sense of their large volumes of data.
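The abstract does not give implementation details, so the following is only a minimal single-process sketch of the general idea, not the authors' GAPKCA code and not a Hadoop job. It assumes that each GA chromosome is a list of k centroids, that fitness is the sum of squared errors (SSE), and that one K-Means step is expressed as a map phase (assign each point to its nearest centroid) followed by a reduce phase (average the points per centroid). All function names and parameters (`map_assign`, `reduce_means`, `ga_kmeans`, `pop_size`, `mutation_rate`, and so on) are hypothetical.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(points, centroids):
    """Map phase: emit (index of nearest centroid, point) for every point."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
        yield nearest, p


def reduce_means(pairs, centroids):
    """Reduce phase: average the points per key; an empty cluster keeps its old centroid."""
    groups = defaultdict(list)
    for key, p in pairs:
        groups[key].append(p)
    return [
        tuple(sum(col) / len(groups[j]) for col in zip(*groups[j])) if groups[j] else centroids[j]
        for j in range(len(centroids))
    ]


def kmeans_step(points, centroids):
    """One K-Means iteration written as map followed by reduce."""
    return reduce_means(map_assign(points, centroids), centroids)


def sse(points, centroids):
    """Fitness (assumed): sum of squared distances to the nearest centroid; lower is better."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def ga_kmeans(points, k, pop_size=8, generations=10, mutation_rate=0.1, seed=0):
    """GA over centroid sets, with one K-Means step as local refinement per generation."""
    rng = random.Random(seed)
    # Each chromosome is a list of k centroids sampled from the data.
    population = [rng.sample(points, k) for _ in range(pop_size)]
    for _ in range(generations):
        # Refine every chromosome, then keep the better half (elitist selection).
        population = [kmeans_step(points, c) for c in population]
        population.sort(key=lambda c: sse(points, c))
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            # One-point crossover on the centroid list.
            cut = rng.randrange(1, k) if k > 1 else 0
            child = list(a[:cut]) + list(b[cut:])
            # Mutation: replace one centroid with a random data point.
            if rng.random() < mutation_rate:
                child[rng.randrange(k)] = rng.choice(points)
            children.append(child)
        population = survivors + children
    return min(population, key=lambda c: sse(points, c))


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
    print(sorted(ga_kmeans(data, k=2)))
```

In a real MapReduce deployment the map and reduce functions would run on separate workers over data partitions; here they are ordinary Python functions so the control flow can be followed in one file.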

Author information

Authors and Affiliations

Universiti Putra Malaysia, 43400 UPM, Selangor, Malaysia
Sayer Alshammari & Maslina Binti Zolkepli

Authors

Sayer Alshammari
Maslina Binti Zolkepli
Rusli Bin Abdullah

Corresponding author

Correspondence to Maslina Binti Zolkepli.

Editor information

Editors and Affiliations

Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Batu Pahat, Johor, Malaysia
Rozaida Ghazali
Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Batu Pahat, Johor, Malaysia
Nazri Mohd Nawi
Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Batu Pahat, Johor, Malaysia
Mustafa Mat Deris
School of Information Technology, Deakin University, Geelong Waurn Ponds Campus, VIC, Australia
Jemal H. Abawajy

About this paper

Cite this paper

Alshammari, S., Zolkepli, M.B., Abdullah, R.B. (2020). Genetic Algorithm Based Parallel K-Means Data Clustering Algorithm Using MapReduce Programming Paradigm on Hadoop Environment (GAPKCA). In: Ghazali, R., Nawi, N., Deris, M., Abawajy, J. (eds) Recent Advances on Soft Computing and Data Mining. SCDM 2020. Advances in Intelligent Systems and Computing, vol 978. Springer, Cham. https://doi.org/10.1007/978-3-030-36056-6_10

DOI: https://doi.org/10.1007/978-3-030-36056-6_10
Published: 05 December 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-36055-9
Online ISBN: 978-3-030-36056-6
eBook Packages: Intelligent Technologies and Robotics (R0)
