Skip to main content

Adaptive Multi-objective Swarm Crossover Optimization for Imbalanced Data Classification

  • Conference paper
  • First Online:
Book cover Advanced Data Mining and Applications (ADMA 2016)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 10086))

Included in the following conference series:

Abstract

Training a classifier with imbalanced dataset where there are more data from the majority class than the minority class is a known problem in data mining research community. The resultant classifier would become under-fitted in recognizing test instances of minority class and over-fitted with overwhelming mediocre samples from the majority class. Many existing techniques have been tried, ranging from artificially boosting the amount of the minority class training samples such as SMOTE, downsizing the volume of the majority class samples, to modifying the classification induction algorithm in favour of the minority class. However, finding the optimal ratio between the samples from the two majority/minority class for building a classifier that has the best accuracy is tricky, due to the non-linear relationships between the attributes and the class labels. Merely rebalancing the sample sizes of the two classes to exact portions will often not produce the best result. Brute-force attempt to search for the perfect combination of majority/minority class samples for the best classification result is NP-hard. In this paper, a unified preprocessing approach is proposed, using stochastic swarm heuristics to cooperatively optimize the mixtures from the two classes by progressively rebuilding the training dataset is proposed. Our novel approach is shown to outperform the existing popular methods.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Sun, A., Ee-Peng, L., Liu, Y.: On strategies for imbalanced text classification using SVM: a comparative study. Decis. Support Syst. 48(1), 191–201 (2009)

    Article  Google Scholar 

  2. Cao, H., Li, X.L., Woon, D.Y.K., Ng, S.K.: Integrated oversampling for imbalanced time series classification. IEEE Trans. Knowl. Data Eng. 25(12), 2809–2822 (2013)

    Article  Google Scholar 

  3. Chan, P.K., Stolfo, S.J.: Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In: KDD, vol. 1998 (1998)

    Google Scholar 

  4. Kubat, M., Holte, R.C., Matwin, S.: Machine learning for the detection of oil spills in satellite radar images. Mach. Learn. 30(2-3), 195–215 (1998)

    Article  Google Scholar 

  5. Choe, W., Ersoy, O.K., Bina, M.: Neural network schemes for detecting rare events in human genomic DNA. Bioinformatics 16(12), 1062–1072 (2000)

    Article  Google Scholar 

  6. Mazurowski, M.A., et al.: Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance. Neural Netw. 21(2), 427–436 (2008)

    Article  Google Scholar 

  7. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)

    Article  Google Scholar 

  8. Tang, Y., Zhang, Y.Q., Chawla, N.V., Krasser, S.: SVMs modeling for highly imbalanced classification. IEEE Trans. Syst. Man Cybern. Part B Cybern. 39(1), 281–288 (2009)

    Article  Google Scholar 

  9. Guo, H., Viktor, H.L.: Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. ACM SIGKDD Explor. Newslett. 6(1), 30–39 (2004)

    Article  Google Scholar 

  10. Li, J., Fong, S., Mohammed, S., et al.: Improving the classification performance of biological imbalanced datasets by swarm optimization algorithms. J. Supercomput. 72, 3708 (2016). doi:10.1007/s11227-015-1541-6

    Google Scholar 

  11. Chawla, N.V.: C4. 5 and imbalanced data sets: investigating the effect of sampling method, probabilistic estimate, and decision tree structure. In: Proceedings of the ICML, vol. 3 (2003)

    Google Scholar 

  12. Stone, E.A.: Predictor performance with stratified data and imbalanced classes. Nat. Methods 11(8), 782–783 (2014)

    Article  Google Scholar 

  13. Chen, Y.-W., Lin, C.-J.: Combining SVMs with various feature selection strategies. In: Guyon, I., Nikravesh, M., Gunn, S., Zadeh, L.A. (eds.) Feature Extraction: Foundations and Applications. Studies in Fuzziness and Soft Computing, pp. 315–324. Springer, Heidelberg (2006)

    Chapter  Google Scholar 

  14. Wallace, B.C., et al.: Class imbalance, redux. In: 2011 IEEE 11th International Conference on Data Mining (ICDM). IEEE (2011)

    Google Scholar 

  15. Liu, A., Ghosh, J., Martin, C.E.: Generative oversampling for mining imbalanced datasets. In: DMIN (2007)

    Google Scholar 

  16. Batuwita, R., Palade, V: Efficient resampling methods for training support vector machines with imbalanced datasets. In: The 2010 International Joint Conference on Neural Networks (IJCNN). IEEE (2010)

    Google Scholar 

  17. Drummond, C., Holte, R.C.: C4. 5 class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Workshop on learning from imbalanced datasets II, vol. 11 (2003)

    Google Scholar 

  18. Kubat, M., Matwin, S: Addressing the curse of imbalanced training sets: one-sided selection. In: ICML, vol. 97 (1997)

    Google Scholar 

  19. Chawla, N.V., Bowyer, K.W.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 341–378 (2002)

    MATH  Google Scholar 

  20. Galar, M., et al.: A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42(4), 463–484 (2012)

    Article  Google Scholar 

  21. Thai-Nghe, N., Gantner, Z., Schmidt-Thieme, L: Cost-sensitive learning methods for imbalanced data. In: The 2010 International Joint Conference on Neural Networks (IJCNN). IEEE (2010)

    Google Scholar 

  22. Zhu, X.: Lazy bagging for classifying imbalanced data. In: IEEE ICDM 2007, pp. 763–768 (2007)

    Google Scholar 

  23. Sun, Y., Kamel, M.S., Wang, Y.: Boosting for learning multiple classes with imbalanced class distribution. In: IEEE ICDM 2006, pp. 592–602 (2006)

    Google Scholar 

  24. del Río, S., et al.: On the use of MapReduce for imbalanced big data using random forest. Inf. Sci. 285, 112–137 (2014)

    Article  Google Scholar 

  25. Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.W.: SMOTEBoost: improving prediction of the minority class in boosting. In: Lavrač, N., Gamberger, D., Todorovski, L., Blockeel, H. (eds.) PKDD 2003. LNCS (LNAI), vol. 2838, pp. 107–119. Springer, Heidelberg (2003). doi:10.1007/978-3-540-39804-2_12

    Chapter  Google Scholar 

  26. Fan, W., et al.: AdaCost: misclassification cost-sensitive boosting. In: ICML, vol. 99 (1999)

    Google Scholar 

  27. Sun, Y., et al.: Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit. 40(12), 3358–3378 (2007)

    Article  MATH  Google Scholar 

  28. Zadrozny, B., Langford, J., Abe, N.: Cost-sensitive learning by cost-proportionate example weighting. In: IEEE ICDM 2003, pp. 435–442 (2003)

    Google Scholar 

  29. Kennedy, J., et al.: Swarm Intelligence. Morgan Kaufmann, San Francisco (2001)

    Google Scholar 

  30. Poli, R., Kennedy, J., Blackwell, T.: Particle swarm optimization. Swarm Intell. 1(1), 33–57 (2007)

    Article  Google Scholar 

  31. Li, J., et al.: Adaptive swarm balancing algorithms for rare-event prediction in imbalanced healthcare data. Comput. Med. Imaging Graph (2016). http://dx.doi.org/10.1016/j.compmedimag.2016.05.001

  32. Van den Bergh, F., Engelbrecht, A.P.: A cooperative approach to particle swarm optimization. IEEE Trans. Evol. Comput. 8(3), 225–239 (2004)

    Article  Google Scholar 

  33. Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33, 159–174 (1977)

    Article  MATH  Google Scholar 

  34. Viera, A.J., Garrett, J.M.: Understanding interobserver agreement: the kappa statistic. Fam. Med. 37(5), 360–363 (2005)

    Google Scholar 

  35. Fonseca, C.M., Fleming, P.J.: Multiobjective optimization and multiple constraint handling with evolutionary algorithms. I: a unified formulation. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 28(1), 26–37 (1998)

    Article  Google Scholar 

  36. García, S., Herrera, F.: Evolutionary undersampling for classification with imbalanced datasets: proposals and taxonomy. Evol. Comput. 17(3), 275–306 (2009)

    Article  Google Scholar 

  37. Cieslak, D.A., Chawla, N.V.: Learning decision trees for unbalanced data. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008. LNCS (LNAI), vol. 5211, pp. 241–256. Springer, Heidelberg (2008). doi:10.1007/978-3-540-87479-9_34

    Chapter  Google Scholar 

  38. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: ICML, vol. 96 (1996)

    Google Scholar 

  39. Alcalá, J., et al.: Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Multiple Valued Logic Soft Comput. 17(255-287), 11 (2010)

    Google Scholar 

Download references

Acknowledgement

The authors are thankful for the financial support from the Research Grant Temporal Data Stream Mining by Using Incrementally Optimized Very Fast Decision Forest (iOVFDF), Grant no. MYRG2015-00128-FST, offered by the University of Macau, FST, and RDAO.

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Jinyan Li or Simon Fong .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Li, J., Fong, S., Yuan, M., Wong, R.K. (2016). Adaptive Multi-objective Swarm Crossover Optimization for Imbalanced Data Classification. In: Li, J., Li, X., Wang, S., Li, J., Sheng, Q. (eds) Advanced Data Mining and Applications. ADMA 2016. Lecture Notes in Computer Science(), vol 10086. Springer, Cham. https://doi.org/10.1007/978-3-319-49586-6_25

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-49586-6_25

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-49585-9

  • Online ISBN: 978-3-319-49586-6

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics