Abstract
Training a classifier on an imbalanced dataset, in which the majority class contributes far more data than the minority class, is a well-known problem in the data mining research community. The resulting classifier tends to be under-fitted for recognizing test instances of the minority class and over-fitted to the overwhelming number of mediocre samples from the majority class. Many techniques have been tried, ranging from artificially increasing the number of minority class training samples (for example, SMOTE) and reducing the number of majority class samples, to modifying the classification induction algorithm in favour of the minority class. However, finding the optimal ratio between majority and minority class samples for building the most accurate classifier is difficult because of the non-linear relationships between the attributes and the class labels. Merely rebalancing the two classes to equal proportions will often not produce the best result. A brute-force search for the perfect combination of majority and minority class samples that gives the best classification result is NP-hard. In this paper, a unified preprocessing approach is proposed that uses stochastic swarm heuristics to cooperatively optimize the mixture of samples from the two classes by progressively rebuilding the training dataset. Our novel approach is shown to outperform the existing popular methods.
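The idea of searching for a class mixture, rather than fixing an equal ratio, can be illustrated with a minimal sketch. This is not the authors' algorithm: a plain random search stands in for the swarm heuristics, the oversampler is only a SMOTE-style interpolation, and every name here (`smote_like`, `rebalance`, `random_search`, `score`) is hypothetical. It assumes at least two minority samples and no more minority than majority samples.

```python
import random


def smote_like(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


def rebalance(majority, minority, n_new, n_keep, rng):
    """Keep n_keep random majority points and add n_new synthetic minority points."""
    kept = rng.sample(majority, n_keep)
    return kept, minority + smote_like(minority, n_new, rng=rng)


def random_search(majority, minority, score, trials=50, seed=0):
    """Stochastic search over (n_new, n_keep); a stand-in for a swarm optimizer.

    score(majority_subset, minority_set) is a user-supplied function that
    returns a number to maximize. Returns (best_score, n_new, n_keep).
    """
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        n_new = rng.randint(0, 2 * len(majority))
        n_keep = rng.randint(len(minority), len(majority))
        maj, mino = rebalance(majority, minority, n_new, n_keep, rng)
        s = score(maj, mino)
        if best is None or s > best[0]:
            best = (s, n_new, n_keep)
    return best
```

In practice `score` would train a classifier on the rebuilt dataset and return a validation metric, so the search is driven by classification quality rather than by how close the class sizes are to each other.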
Acknowledgement
The authors are thankful for the financial support from the research grant "Temporal Data Stream Mining by Using Incrementally Optimized Very Fast Decision Forest (iOVFDF)", Grant no. MYRG2015-00128-FST, offered by the University of Macau, FST, and RDAO.
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Li, J., Fong, S., Yuan, M., Wong, R.K. (2016). Adaptive Multi-objective Swarm Crossover Optimization for Imbalanced Data Classification. In: Li, J., Li, X., Wang, S., Li, J., Sheng, Q. (eds) Advanced Data Mining and Applications. ADMA 2016. Lecture Notes in Computer Science, vol 10086. Springer, Cham. https://doi.org/10.1007/978-3-319-49586-6_25
DOI: https://doi.org/10.1007/978-3-319-49586-6_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49585-9
Online ISBN: 978-3-319-49586-6
eBook Packages: Computer Science (R0)