Abstract
Learning from datasets with an imbalanced class distribution is an important problem in Pattern Recognition. This paper introduces a novel data balancing algorithm based on compact set clustering of the majority class. The proposed algorithm can handle mixed and incomplete data, and works with arbitrary dissimilarity functions. Numerical experiments on repository databases show that the proposed method performs well in terms of area under the ROC curve and imbalance ratio.
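The abstract does not spell out the algorithm, so the following is only a minimal sketch of the ingredients it names, not the authors' method. It assumes that the imbalance ratio is the majority-class count divided by the minority-class count, that a HEOM-style function stands in for the arbitrary dissimilarity (missing values encoded as `None`), and that compact sets are the connected components of the graph linking each object to its least dissimilar neighbor(s). The function names `imbalance_ratio`, `heom`, and `compact_sets` are hypothetical, and the instance selection step is omitted because the abstract gives no detail on it.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def heom(a, b, ranges):
    """HEOM-style dissimilarity for mixed, incomplete data (an assumed stand-in).

    ranges[i] is None for a categorical attribute, otherwise the
    (max - min) range of numeric attribute i. Missing values are None.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None:
            d = 1.0                          # missing value: maximal difference
        elif r is None:
            d = 0.0 if x == y else 1.0       # categorical: overlap metric
        else:
            d = abs(x - y) / r if r else 0.0  # numeric: range-normalized
        total += d * d
    return total ** 0.5


def compact_sets(objects, dissim):
    """Group objects into connected components of the nearest-neighbor graph."""
    n = len(objects)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        dists = [(dissim(objects[i], objects[j]), j) for j in range(n) if j != i]
        if not dists:
            continue
        best = min(d for d, _ in dists)
        for d, j in dists:
            if d == best:                    # link to every tied nearest neighbor
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical majority-class objects: (numeric attribute, categorical attribute).
majority = [(1.0, "a"), (1.2, "a"), (9.0, "b"), (9.5, None)]
ranges = [8.5, None]
clusters = compact_sets(majority, lambda a, b: heom(a, b, ranges))
```

Under these assumptions, a balancing method could keep a few representatives per compact set of the majority class until the imbalance ratio reaches a target value; how the paper actually chooses them is not stated here.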
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Villuendas-Rey, Y., García-Lorenzo, M.M. (2013). Mixed Data Balancing through Compact Sets Based Instance Selection. In: Ruiz-Shulcloper, J., Sanniti di Baja, G. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2013. Lecture Notes in Computer Science, vol 8258. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41822-8_32
DOI: https://doi.org/10.1007/978-3-642-41822-8_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41821-1
Online ISBN: 978-3-642-41822-8
eBook Packages: Computer Science, Computer Science (R0)