Benchmarking framework for class imbalance problem using novel sampling approach for big data

  • Khyati Ahlawat
  • Anuradha Chug
  • Amit Prakash Singh
Original Article


Traditional machine learning techniques need to be strengthened to handle the scale of big data and still learn systematically. An unbalanced distribution of classes in big data, commonly known as imbalanced big data, makes the learning problem considerably harder. Conventional methods are being progressively modified, at both the data level and the algorithmic level, to handle learning from imbalanced datasets in the context of big data. The current study applies a cluster-heads-based, data-level sampling solution that inherits the advantages of the K-Means and Fuzzy C-Means clustering approaches. The proposed approach is evaluated with three classifiers, namely Support Vector Machines, Decision Tree and k-Nearest Neighbor, and compared with the conventional SMOTE algorithm. The experiments show promising results, with increments of 8.09% in accuracy and 35.71% in AUC for all imbalanced datasets. This work provides a baseline comparison of data-level solutions for imbalanced classification in the big data scenario and proposes an efficient clustering-based solution for the same.
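The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of one way a cluster-heads-based, data-level sampler could work, not the authors' method. It assumes the majority class is clustered with plain K-Means and replaced by its centroids (the "cluster heads"), with as many heads as there are minority samples. The Fuzzy C-Means component is omitted, and the names `kmeans_heads` and `balance_with_cluster_heads` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_heads(points, k, iterations=20, seed=0):
    """Plain Lloyd's K-Means; returns k centroids ("cluster heads")."""
    rng = random.Random(seed)
    # Initialise the heads from k distinct data points.
    heads = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest head.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, heads[i]))
            clusters[nearest].append(p)
        # Update step: move each head to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous head
                heads[i] = [sum(col) / len(members) for col in zip(*members)]
    return heads


def balance_with_cluster_heads(majority, minority, seed=0):
    """Assumed scheme: replace the majority class with as many
    cluster heads as there are minority samples."""
    k = min(len(minority), len(majority))
    heads = kmeans_heads(majority, k, seed=seed)
    samples = heads + [list(p) for p in minority]
    labels = [0] * len(heads) + [1] * len(minority)  # 0 = majority, 1 = minority
    return samples, labels


# Toy data: 12 majority points on a grid against 3 minority points.
majority = [[float(i % 4), float(i // 4)] for i in range(12)]
minority = [[10.0, 10.0], [11.0, 10.0], [10.0, 11.0]]
samples, labels = balance_with_cluster_heads(majority, minority)
print(len(samples), labels.count(0), labels.count(1))  # prints: 6 3 3
```

The balanced `samples` and `labels` could then be passed to any classifier. Using centroids rather than randomly chosen majority points is a design choice of this sketch: each retained point summarises a region of the majority class instead of a single observation.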


Class imbalance; SMOTE; Sampling; Big data; Machine learning




Copyright information

© The Society for Reliability Engineering, Quality and Operations Management (SREQOM), India and The Division of Operation and Maintenance, Lulea University of Technology, Sweden 2019

Authors and Affiliations

  1. University School of Information, Communication and Technology, Guru Gobind Singh Indraprastha University, Dwarka, India
