Robustness of learning techniques in handling class noise in imbalanced datasets

Anyfantis, D.; Karagiannopoulos, M.; Kotsiantis, S.; Pintelas, P.

doi:10.1007/978-0-387-74161-1_3

Conference paper

Part of the book series: IFIP The International Federation for Information Processing (IFIPAICT, volume 247)

Abstract

Many real-world datasets exhibit skewed class distributions, in which most instances belong to one class and far fewer to a smaller but more interesting class. A classifier induced from an imbalanced dataset typically has a low error rate for the majority class and an undesirably high error rate for the minority class. Many research efforts have addressed class noise, but none of them was designed for imbalanced datasets. This paper surveys the methodologies that have been proposed for handling imbalanced datasets and examines their robustness to class noise.
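
The two notions in the abstract, per-class error on a skewed dataset and class noise, can be illustrated with a minimal Python sketch. It is not the experimental protocol of the paper, which is not described here. The toy 95/5 class split, the always-predict-majority classifier, and the helper names `per_class_error` and `flip_labels` are assumptions made for illustration, and class noise is modelled simply as random label flipping.

```python
import random
from collections import Counter

def per_class_error(y_true, y_pred):
    """Return {class: fraction of that class's instances that were misclassified}."""
    totals = Counter(y_true)
    wrong = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    return {c: wrong[c] / totals[c] for c in totals}

def flip_labels(labels, noise_rate, classes, seed=0):
    """Inject class noise (assumed model): with probability noise_rate, replace
    a label by a different class chosen uniformly at random."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            noisy.append(rng.choice([c for c in classes if c != y]))
        else:
            noisy.append(y)
    return noisy

# Hypothetical imbalanced dataset: 95 majority ("neg") and 5 minority ("pos") labels.
y = ["neg"] * 95 + ["pos"] * 5

# A degenerate classifier that always predicts the majority class.
majority = Counter(y).most_common(1)[0][0]
y_pred = [majority] * len(y)

errors = per_class_error(y, y_pred)
overall = sum(t != p for t, p in zip(y, y_pred)) / len(y)
print("per-class error:", errors)
print("overall error:", overall)

# Corrupt 10% of the labels (in expectation) and inspect the new class counts.
noisy_y = flip_labels(y, noise_rate=0.10, classes=["neg", "pos"])
print("class counts after noise:", Counter(noisy_y))
```

On this toy data the majority-class predictor reaches an overall error of 5% while misclassifying every minority instance, which is why overall accuracy alone is a poor measure on imbalanced data. Under the flipping model, the same noise rate also corrupts far more majority labels in absolute terms, so the few mislabeled "pos" instances can rival the true minority class in number.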

Author information

Authors and Affiliations

Educational Software Development Laboratory, Department of Mathematics, University of Patras, Greece
D. Anyfantis, M. Karagiannopoulos, S. Kotsiantis & P. Pintelas

Editor information

Editors and Affiliations

Athens Information Technology, Greece
Christos Boukis, Aristodemos Pnevmatikakis & Lazaros Polymenakos

Cite this paper

Anyfantis, D., Karagiannopoulos, M., Kotsiantis, S., Pintelas, P. (2007). Robustness of learning techniques in handling class noise in imbalanced datasets. In: Boukis, C., Pnevmatikakis, A., Polymenakos, L. (eds) Artificial Intelligence and Innovations 2007: from Theory to Applications. AIAI 2007. IFIP The International Federation for Information Processing, vol 247. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-74161-1_3

DOI: https://doi.org/10.1007/978-0-387-74161-1_3
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-74160-4
Online ISBN: 978-0-387-74161-1
eBook Packages: Computer Science, Computer Science (R0)
