Managing Imbalanced Data Sets in Multi-label Problems: A Case Study with the SMOTE Algorithm

  • Andrés Felipe Giraldo-Forero
  • Jorge Alberto Jaramillo-Garzón
  • José Francisco Ruiz-Muñoz
  • César Germán Castellanos-Domínguez
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8258)


Multi-label learning has become an increasingly active area in the machine learning community, since a wide variety of real-world problems are naturally multi-labeled. However, the number of samples per class is often highly uneven, which poses an additional challenge for the learning algorithm. SMOTE is an oversampling technique that has been successfully applied to balance single-label data sets, but it has not been used in multi-label frameworks so far. In this work, several strategies for generating synthetic samples are proposed and compared, with the aim of balancing the data sets used to train multi-label algorithms. Results show that a correct selection of seed samples for oversampling improves the classification performance of multi-label algorithms. The uniform generation oversampling strategy provides an efficient methodology for a wide range of real-world problems.
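To make the idea of seed samples and synthetic samples concrete, the following is a minimal sketch of the basic SMOTE interpolation step applied to one label of a multi-label data set. It is not the authors' implementation, and it does not reproduce any of the specific strategies compared in the paper. The function name `smote_for_label`, the choice of seeds (every instance that carries the target label), and the label vector given to each synthetic sample (only the target label set) are all illustrative assumptions.

```python
import numpy as np


def smote_for_label(X, Y, label, n_synthetic, k=5, seed=0):
    """Basic SMOTE interpolation for a single label (illustrative sketch).

    X: (n_samples, n_features) feature matrix.
    Y: (n_samples, n_labels) binary label-indicator matrix.
    """
    rng = np.random.default_rng(seed)

    # Assumption: the seed samples are all instances carrying the target label.
    seeds = X[Y[:, label] == 1]
    if len(seeds) < 2:
        raise ValueError("need at least two seed samples to interpolate")
    k = min(k, len(seeds) - 1)

    # Pairwise Euclidean distances among seeds; a seed is not its own neighbor.
    dist = np.linalg.norm(seeds[:, None, :] - seeds[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Pick a random seed and one of its k nearest neighbors for each new sample.
    base = rng.integers(0, len(seeds), size=n_synthetic)
    chosen = neighbors[base, rng.integers(0, k, size=n_synthetic)]

    # Place each synthetic sample at a random point on the segment between them.
    gap = rng.random((n_synthetic, 1))
    X_new = seeds[base] + gap * (seeds[chosen] - seeds[base])

    # Assumption: synthetic samples carry only the target label.
    Y_new = np.zeros((n_synthetic, Y.shape[1]), dtype=Y.dtype)
    Y_new[:, label] = 1
    return X_new, Y_new
```

In a multi-label setting, an instance that carries the minority label may also carry majority labels, so which instances serve as seeds changes where the synthetic samples fall. The sketch fixes one simple choice only to show the mechanics.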


Seed Sample; Minority Class; Synthetic Sample; Imbalanced Data; Uniform Generation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Andrés Felipe Giraldo-Forero (1)
  • Jorge Alberto Jaramillo-Garzón (1, 2)
  • José Francisco Ruiz-Muñoz (1)
  • César Germán Castellanos-Domínguez (1)
  1. Signal Processing and Recognition Group, Universidad Nacional de Colombia, Manizales, Colombia
  2. Grupo de Máquinas Inteligentes y Reconocimiento de Patrones - MIRP, Instituto Tecnológico Metropolitano, Medellín, Colombia
