
BCGAN: A CGAN-based over-sampling model using the boundary class for data balancing

The Journal of Supercomputing

Abstract

A class imbalance problem arises when one class in a dataset (the majority class) greatly outnumbers the other (the minority class). This problem is critical in machine learning because it biases model training toward the majority class. One popular remedy is a sampling technique that balances the class distribution, either by under-sampling the majority class or by over-sampling the minority class. So far, diverse over-sampling techniques have suffered from overfitting and noisy data generation. In this paper, we propose an over-sampling scheme based on a borderline class and a conditional generative adversarial network (CGAN). More specifically, we define a borderline class from the minority class data that lie near the majority class. Then, we generate data for the borderline class using the CGAN to balance the class distribution. To demonstrate the performance of the proposed scheme, we conducted various experiments on diverse imbalanced datasets and report representative results.
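One way to realize the borderline-class idea sketched in the abstract — minority samples that lie near the majority class — is a Borderline-SMOTE-style neighbourhood rule: a minority sample is "borderline" when at least half, but not all, of its k nearest neighbours belong to the majority class. The snippet below is a hypothetical sketch of that rule; the function name `borderline_minority` and the half-majority threshold are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def borderline_minority(X_min, X_maj, k=5):
    """Flag minority samples whose k nearest neighbours (over both classes)
    are mostly, but not entirely, majority samples.

    Hypothetical sketch of a borderline-class selection rule in the spirit
    of Borderline-SMOTE; not the paper's exact definition.
    """
    X_all = np.vstack([X_min, X_maj])
    labels = np.array([0] * len(X_min) + [1] * len(X_maj))  # 1 = majority
    flags = []
    for x in X_min:
        d = np.linalg.norm(X_all - x, axis=1)
        nn = np.argsort(d)[1:k + 1]          # skip the sample itself (distance 0)
        maj = int(labels[nn].sum())          # majority neighbours among the k
        # borderline ("danger") zone: at least half, but not all, are majority
        flags.append(k / 2 <= maj < k)
    return np.array(flags)
```

Samples whose neighbours are all majority points would be treated as noise and excluded, mirroring the "danger" set of Borderline-SMOTE; only the flagged samples would then be handed to the CGAN as the borderline class.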



Acknowledgements

This research was supported in part by Energy Cloud R&D Program (Grant Number: 2019M3F2A1073184) through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT and in part by Government-wide R&D Fund project for infectious disease research (GFID), Republic of Korea (Grant Number: HG19C0682).

Author information


Corresponding author

Correspondence to Eenjun Hwang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Correlation matrices of datasets

For better understanding of the datasets used, we present the correlation matrices of each dataset. Figure 8 describes the obtained correlation matrices with heatmap.

Fig. 8

Correlation matrices of all datasets: a Breast, b WDBC, c Wine, d Letter, e Surgery, f Yeast, g CMC, h Card, i Email, j Tel, k Bio, l Pay (color figure online)

In the figure, the color of each cell indicates the correlation between the feature on the x-axis and the feature on the y-axis. The closer the color is to dark blue, the stronger the correlation.
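Heatmaps of this kind can be reproduced from a feature table with a standard Pearson correlation matrix. The snippet below is a minimal sketch on synthetic data (the array shape, seed, and engineered correlated pair are illustrative, not taken from the paper's datasets); passing `corr` to, e.g., `matplotlib.pyplot.imshow` would render a heatmap like those in Fig. 8.

```python
import numpy as np

# Synthetic feature table: 100 samples, 4 features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)  # make features 0 and 1 correlate

# Pearson correlation matrix; rowvar=False treats columns as features
corr = np.corrcoef(X, rowvar=False)  # shape (4, 4), diagonal = 1.0
```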

Appendix 2: Detailed results of Sect. 4.4

We present more detailed results of the experiment in Sect. 4.4. Tables 5 and 6 show both the averages and standard deviations of the obtained AUC values. In each cell, the left value is the average AUC and the right value is the standard deviation. We omit the standard deviations for the “Base” case because no over-sampling was performed there.

Table 5 Performance comparison of over-sampling methods and classification models with six datasets
Table 6 Performance comparison of over-sampling methods and classification models with the others
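Each cell of Tables 5 and 6 therefore aggregates AUC scores over repeated runs into an "average ± standard deviation" pair. A minimal sketch of how such a cell could be formed, using made-up AUC values (the five scores below are illustrative, not results from the paper):

```python
import numpy as np

# Hypothetical AUC scores from five repeated runs of one method/dataset pair
auc_runs = np.array([0.91, 0.93, 0.90, 0.94, 0.92])

# Sample standard deviation (ddof=1), since the runs are a sample of repetitions
cell = f"{auc_runs.mean():.3f} ± {auc_runs.std(ddof=1):.3f}"
```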


Cite this article

Son, M., Jung, S., Jung, S. et al. BCGAN: A CGAN-based over-sampling model using the boundary class for data balancing. J Supercomput 77, 10463–10487 (2021). https://doi.org/10.1007/s11227-021-03688-6
