Abstract
This paper presents a novel approach to resampling imbalanced datasets, aimed at improving classifier performance. The method uses two self-organizing maps to identify the clusters of majority and minority data. The cluster centroids are used to select the samples that are most suitable for under-sampling or over-sampling, while the optimal resampling rates are determined by a genetic algorithm that maximizes classifier performance. The algorithm is tested on several datasets drawn from both the UCI repository and real industrial applications, and is compared with other widely used resampling methods.
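As a rough illustration of the centroid-guided selection described above, the following sketch trains a very small one-dimensional self-organizing map on the majority class and then under-samples each resulting cluster. It is not the authors' implementation: the abstract does not state the selection criterion, so keeping the samples nearest to each centroid is an assumption made here, and the names `train_som`, `undersample`, and `rate` are hypothetical. The resampling rate is fixed by hand, whereas the paper tunes it with a genetic algorithm.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_som(data, n_units, epochs=50, lr0=0.5, seed=0):
    """Train a tiny 1-D SOM and return its unit weight vectors (centroids)."""
    rng = random.Random(seed)
    # Initialise each unit on a randomly chosen training sample.
    units = [list(rng.choice(data)) for _ in range(n_units)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs           # decays towards 0 over training
        lr = lr0 * frac                       # learning rate for this epoch
        radius = max(1, round((n_units / 2) * frac))
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            bmu = min(range(n_units), key=lambda i: sq_dist(units[i], x))
            lo, hi = max(0, bmu - radius), min(n_units, bmu + radius + 1)
            for i in range(lo, hi):
                # Triangular neighbourhood: full update at the best-matching
                # unit, weaker updates for units further along the map.
                h = 1.0 - abs(i - bmu) / (radius + 1)
                units[i] = [w + lr * h * (xi - w) for w, xi in zip(units[i], x)]
    return units


def undersample(majority, units, rate):
    """Drop roughly a fraction `rate` of every cluster.

    Assumption (not from the paper): within each cluster, the samples
    closest to the centroid are the ones that are kept.
    """
    clusters = {}
    for x in majority:
        bmu = min(range(len(units)), key=lambda i: sq_dist(units[i], x))
        clusters.setdefault(bmu, []).append(x)
    kept = []
    for bmu, members in clusters.items():
        members.sort(key=lambda x: sq_dist(units[bmu], x))
        n_keep = max(1, round(len(members) * (1.0 - rate)))
        kept.extend(members[:n_keep])
    return kept


# Synthetic majority class: 200 two-dimensional Gaussian points.
rng = random.Random(1)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
units = train_som(majority, n_units=4)
reduced = undersample(majority, units, rate=0.5)
print(len(majority), "->", len(reduced))
```

In a fuller version, `rate` (and an analogous over-sampling rate for the minority class) would be the quantities searched by the genetic algorithm, with classifier performance on held-out data as the fitness value.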
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Vannucci, M., Colla, V. (2019). Imbalanced Datasets Resampling Through Self Organizing Maps and Genetic Algorithms. In: Macintyre, J., Iliadis, L., Maglogiannis, I., Jayne, C. (eds) Engineering Applications of Neural Networks. EANN 2019. Communications in Computer and Information Science, vol 1000. Springer, Cham. https://doi.org/10.1007/978-3-030-20257-6_34
DOI: https://doi.org/10.1007/978-3-030-20257-6_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20256-9
Online ISBN: 978-3-030-20257-6
eBook Packages: Computer Science (R0)