Imbalanced Datasets Resampling Through Self Organizing Maps and Genetic Algorithms

Vannucci, Marco; Colla, Valentina

doi:10.1007/978-3-030-20257-6_34

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1000)

Included in the following conference series:

International Conference on Engineering Applications of Neural Networks

Abstract

The paper presents a novel approach to resampling imbalanced datasets, aimed at improving classifier performance. The method uses two self-organizing maps to determine the clusters of majority and minority data. Cluster centroids are used to select the samples whose under-sampling or over-sampling is most convenient, while the optimal resampling rates are determined by a genetic algorithm that maximizes classifier performance. The algorithm is tested on several datasets from both the UCI repository and real industrial applications, and is compared with other widely used resampling methods.
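
The pipeline outlined in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the map topology, the sample-selection rule, the classifier, or the genetic operators, so everything below is an assumption made for illustration. It assumes a tiny 1-D SOM written in NumPy, a rule that treats samples nearest to their own class centroid as the first candidates for removal (majority) or duplication (minority), a k-nearest-neighbour classifier scored by balanced accuracy, and a simple elitist genetic algorithm over two genes (the under-sampling and over-sampling rates). All function names and the synthetic data are hypothetical.

```python
import numpy as np

def train_som(X, n_units=4, epochs=20, lr0=0.5, seed=0):
    """Train a tiny 1-D self-organizing map; returns the unit weights (centroids)."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), n_units, replace=len(X) < n_units)].astype(float)
    grid = np.arange(n_units)
    for e in range(epochs):
        frac = e / epochs
        lr = lr0 * (1.0 - frac)                          # decaying learning rate
        sigma = max(n_units / 2.0 * (1.0 - frac), 0.5)   # shrinking neighbourhood
        for x in X[rng.permutation(len(X))]:
            bmu = np.argmin(((W - x) ** 2).sum(axis=1))  # best matching unit
            h = np.exp(-((grid - bmu) ** 2) / (2.0 * sigma ** 2))
            W += lr * h[:, None] * (x - W)
    return W

def dist_to_nearest(X, W):
    """Distance from each sample to its nearest SOM centroid."""
    d = np.sqrt(((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2))
    return d.min(axis=1)

def resample(X_maj, X_min, W_maj, W_min, under, over):
    """Drop a fraction `under` of majority samples, duplicate a fraction `over`
    of minority samples. Assumed rule: samples closest to their own class
    centroid are removed first (majority) or duplicated first (minority)."""
    n_drop = int(under * len(X_maj))
    keep = np.argsort(dist_to_nearest(X_maj, W_maj))[n_drop:]
    n_dup = int(over * len(X_min))
    dup = np.argsort(dist_to_nearest(X_min, W_min))[:n_dup]
    X = np.vstack([X_maj[keep], X_min, X_min[dup]])
    y = np.concatenate([np.zeros(len(keep)), np.ones(len(X_min) + n_dup)])
    return X, y

def knn_predict(X_tr, y_tr, X_te, k=5):
    """Plain k-NN majority vote (stand-in classifier, an assumption)."""
    d = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=2)
    idx = np.argsort(d, axis=1)[:, : min(k, len(X_tr))]
    return (y_tr[idx].mean(axis=1) > 0.5).astype(int)

def balanced_accuracy(y_true, y_pred):
    tpr = (y_pred[y_true == 1] == 1).mean()
    tnr = (y_pred[y_true == 0] == 0).mean()
    return 0.5 * (tpr + tnr)

def ga_rates(fitness, pop_size=12, generations=10, seed=0):
    """Elitist GA over (under, over) rates, both constrained to [0, 0.9]."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(0.0, 0.9, size=(pop_size, 2))
    for _ in range(generations):
        scores = np.array([fitness(u, o) for u, o in pop])
        elite = pop[np.argsort(scores)[::-1][: pop_size // 2]]   # selection
        n_child = pop_size - len(elite)
        pa = elite[rng.integers(len(elite), size=n_child)]
        pb = elite[rng.integers(len(elite), size=n_child)]
        children = np.where(rng.random(pa.shape) < 0.5, pa, pb)  # uniform crossover
        children = np.clip(children + rng.normal(0, 0.05, children.shape), 0.0, 0.9)
        pop = np.vstack([elite, children])                       # mutated offspring
    scores = np.array([fitness(u, o) for u, o in pop])
    best = int(np.argmax(scores))
    return pop[best], scores[best]

# Synthetic imbalanced data (hypothetical): 200 majority vs. 20 minority samples.
rng = np.random.default_rng(42)
X_maj = rng.normal(0.0, 1.0, size=(200, 2))
X_min = rng.normal(2.0, 1.0, size=(20, 2))
X_val = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
                   rng.normal(2.0, 1.0, size=(10, 2))])
y_val = np.concatenate([np.zeros(100), np.ones(10)]).astype(int)

# One SOM per class, as in the abstract.
W_maj, W_min = train_som(X_maj), train_som(X_min)

def fitness(under, over):
    X, y = resample(X_maj, X_min, W_maj, W_min, under, over)
    return balanced_accuracy(y_val, knn_predict(X, y, X_val))

(best_under, best_over), best_score = ga_rates(fitness)
print(f"under={best_under:.2f} over={best_over:.2f} score={best_score:.3f}")
```

In this sketch the fitness function retrains the stand-in classifier for every candidate pair of rates, so the genetic algorithm directly optimizes validation performance; a real study would swap in its own classifier, metric, and selection rule.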

Author information

Authors and Affiliations

TeCIP Institute, Scuola Superiore Sant’Anna, Pisa, Italy
Marco Vannucci & Valentina Colla

Corresponding author

Correspondence to Marco Vannucci.

Editor information

Editors and Affiliations

David Goldman Informatics Centre, University of Sunderland, Sunderland, UK
John Macintyre
Democritus University of Thrace, Xanthi, Greece
Lazaros Iliadis
University of Piraeus, Piraeus, Greece
Ilias Maglogiannis
Oxford Brookes University, Oxford, UK
Chrisina Jayne

About this paper

Cite this paper

Vannucci, M., Colla, V. (2019). Imbalanced Datasets Resampling Through Self Organizing Maps and Genetic Algorithms. In: Macintyre, J., Iliadis, L., Maglogiannis, I., Jayne, C. (eds) Engineering Applications of Neural Networks. EANN 2019. Communications in Computer and Information Science, vol 1000. Springer, Cham. https://doi.org/10.1007/978-3-030-20257-6_34

DOI: https://doi.org/10.1007/978-3-030-20257-6_34
Published: 15 May 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20256-9
Online ISBN: 978-3-030-20257-6
eBook Packages: Computer Science, Computer Science (R0)