Mixed Data Balancing through Compact Sets Based Instance Selection

  • Yenny Villuendas-Rey
  • María Matilde García-Lorenzo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8258)


Learning from datasets with an imbalanced class distribution is an important problem in Pattern Recognition. This paper introduces a novel data balancing algorithm based on compact set clustering of the majority class. The proposed algorithm can handle mixed and incomplete data, as well as arbitrary dissimilarity functions. Numerical experiments on repository databases show that the proposed method performs well in terms of area under the ROC curve and imbalance ratio.
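The abstract does not spell out the algorithm, so the following is only a minimal sketch of the ideas it names, not the authors' method. It assumes that the imbalance ratio is the majority-class count divided by the minority-class count, that compact sets are the connected components of the graph linking each object to its least dissimilar neighbor(s), and that a simple HEOM-style function stands in for the arbitrary dissimilarity. The names `dissimilarity`, `compact_sets`, and `medoids`, and the choice of keeping one medoid per compact set, are hypothetical illustrations.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def dissimilarity(a, b):
    """HEOM-style dissimilarity for mixed, incomplete records (an assumption).

    Numeric attributes are assumed to be pre-scaled to [0, 1];
    None marks a missing value and counts as maximally dissimilar.
    """
    total = 0.0
    for x, y in zip(a, b):
        if x is None or y is None:
            d = 1.0
        elif isinstance(x, (int, float)) and isinstance(y, (int, float)):
            d = abs(x - y)
        else:
            d = 0.0 if x == y else 1.0
        total += d * d
    return total ** 0.5


def compact_sets(records, dist=dissimilarity):
    """Group record indices into connected components of the nearest-neighbor graph."""
    n = len(records)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        dists = [(dist(records[i], records[j]), j) for j in range(n) if j != i]
        if not dists:
            continue
        best = min(d for d, _ in dists)
        for d, j in dists:
            if d == best:  # link i to every least dissimilar neighbor
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def medoids(records, groups, dist=dissimilarity):
    """Hypothetical selection rule: keep the most central record of each group."""
    return [
        min(g, key=lambda i: sum(dist(records[i], records[j]) for j in g))
        for g in groups
    ]


# Toy majority class: one numeric and one categorical attribute, one missing value.
majority = [(0.0, "a"), (0.1, "a"), (0.9, "b"), (1.0, "b"), (0.95, None)]
minority_size = 2

groups = compact_sets(majority)
kept = medoids(majority, groups)
before = imbalance_ratio(["neg"] * len(majority) + ["pos"] * minority_size)
after = imbalance_ratio(["neg"] * len(kept) + ["pos"] * minority_size)
print(groups, kept, before, after)
```

Because `compact_sets` takes the dissimilarity as a parameter, any other function over pairs of records can be swapped in, which is the property the abstract emphasizes.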


imbalanced data; mixed data; supervised classification



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Yenny Villuendas-Rey (1)
  • María Matilde García-Lorenzo (2)

  1. Department of Computer Science, University of Ciego de Ávila, Cuba
  2. Department of Computer Science, Universidad Central Marta Abreu of Las Villas, Cuba
