
Robustness of learning techniques in handling class noise in imbalanced datasets

  • D. Anyfantis
  • M. Karagiannopoulos
  • S. Kotsiantis
  • P. Pintelas
Part of the IFIP (The International Federation for Information Processing) book series (IFIPAICT, volume 247)

Abstract

Many real-world datasets exhibit skewed class distributions, in which almost all instances belong to one class and far fewer to a smaller but more interesting class. A classifier induced from an imbalanced dataset has a low error rate for the majority class and an undesirably high error rate for the minority class. Many research efforts have addressed class noise, but none of them was designed for imbalanced datasets. This paper surveys the methodologies that have been proposed for handling imbalanced datasets and examines their robustness to class noise.
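
As a rough illustration of the two notions above, the following sketch (not taken from the paper) shows how a low overall error can hide a high minority-class error, and one simple way class noise could be simulated. The 95/5 class split, the always-predict-majority classifier, the uniform label-flipping noise model, and the helper names `flip_labels` and `per_class_error` are all assumptions made for this example.

```python
import random


def flip_labels(labels, noise_rate, seed=0):
    """Return a copy of binary labels, flipping each one with probability noise_rate."""
    rng = random.Random(seed)
    return [1 - y if rng.random() < noise_rate else y for y in labels]


def per_class_error(y_true, y_pred):
    """Return {class: error rate}, computed separately for each class in y_true."""
    errors = {}
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        wrong = sum(1 for i in idx if y_pred[i] != c)
        errors[c] = wrong / len(idx)
    return errors


# Hypothetical imbalanced dataset: 95 majority (0) and 5 minority (1) instances.
y_true = [0] * 95 + [1] * 5

# A trivial classifier that always predicts the majority class.
y_pred = [0] * len(y_true)

overall_error = sum(p != t for p, t in zip(y_pred, y_true)) / len(y_true)
class_error = per_class_error(y_true, y_pred)
print(overall_error)  # 0.05
print(class_error)    # {0: 0.0, 1: 1.0}

# Assumed noise model: each training label is flipped with 10% probability.
noisy_labels = flip_labels(y_true, noise_rate=0.10, seed=0)
```

Here the overall error of 5% looks acceptable even though every minority instance is misclassified. Under the assumed noise model, flipping even a few majority labels can add mislabeled "minority" instances comparable in number to the genuine ones, which is one reason noise handling and imbalance handling may interact.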

Keywords

  • Minority Class
  • Decision Tree Algorithm
  • Misclassification Cost
  • Imbalanced Dataset
  • Class Imbalance Problem
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© International Federation for Information Processing 2007

Authors and Affiliations

  • D. Anyfantis (1)
  • M. Karagiannopoulos (1)
  • S. Kotsiantis (1)
  • P. Pintelas (1)
  1. Educational Software Development Laboratory, Department of Mathematics, University of Patras, Greece
