Dealing with Data Difficulty Factors While Learning from Imbalanced Data

Challenges in Computational Statistics and Data Mining

Part of the book series: Studies in Computational Intelligence (SCI, volume 605)

Abstract

Learning from imbalanced data is still one of the challenging tasks in machine learning and data mining. We discuss the following data difficulty factors that deteriorate classification performance: decomposition of the minority class into rare sub-concepts, overlapping of classes, and distinguishing different types of examples. New experimental studies showing the influence of these factors on classifiers are presented. The paper also includes a critical discussion of methods for identifying these factors in real-world data. Finally, open research issues are stated.

Notes

  1. Reuters data is available at http://www.daviddlewis.com/resources/testcollections/reuters21578/.

  2. OHSUMED is available at http://ir.ohsu.edu/ohsumed/ohsumed.html.

Acknowledgments

The research was funded by the Polish National Science Center, grant no. DEC-2013/11/B/ST6/00963. Close co-operation with Krystyna Napierala in research on types of examples is also acknowledged.

Author information

Correspondence to Jerzy Stefanowski.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Stefanowski, J. (2016). Dealing with Data Difficulty Factors While Learning from Imbalanced Data. In: Matwin, S., Mielniczuk, J. (eds) Challenges in Computational Statistics and Data Mining. Studies in Computational Intelligence, vol 605. Springer, Cham. https://doi.org/10.1007/978-3-319-18781-5_17

  • DOI: https://doi.org/10.1007/978-3-319-18781-5_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-18780-8

  • Online ISBN: 978-3-319-18781-5

  • eBook Packages: Engineering, Engineering (R0)
