A new rule-based knowledge extraction approach for imbalanced datasets

  • Aouatef Mahani
  • Ahmed Riadh Baba-Ali
Regular Paper


Classification consists of extracting a classifier from large datasets. A dataset is imbalanced if one class contains more instances than the others; the instances of that class are the majority instances, and the remaining ones are the minority instances. Classical learning algorithms are biased toward majority instances. Classification applied to imbalanced datasets is called partial classification, and its approaches are generally based on either sampling methods or algorithmic methods. In this paper, we propose a new hybrid approach using a three-phase rule-based extraction process. In the first phase, a first classifier is extracted; it contains classification rules representing only majority instances. We then delete the majority instances that are correctly classified by these rules to produce a balanced dataset. The deleted majority instances are replaced by the extracted classification rules, which prevents any information loss. In the second phase, our algorithm is applied to the resulting balanced dataset to produce a second classifier, whose rules represent both majority and minority instances. Finally, we add the rules of the first classifier to the second classifier to obtain the final classifier, which is then post-processed. Our approach has been tested on several imbalanced binary datasets, and the results show its efficiency compared with the results of other approaches.
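The three phases described above can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify the rule learner or the post-processing step, so the sketch swaps in a toy greedy learner of single-threshold rules and omits post-processing. All names (`covers`, `learn_pure_rules`, `three_phase`, `predict`), the `min_cover` parameter, and the choice to place the first classifier's rules ahead of the second's are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

Instance = Tuple[Tuple[float, ...], int]  # (feature vector, class label)
Rule = Tuple[int, str, float, int]        # (feature index, "<=" or ">", threshold, label)


def covers(rule: Rule, x: Sequence[float]) -> bool:
    """Return True if the rule's condition holds for feature vector x."""
    idx, op, thr, _ = rule
    return x[idx] <= thr if op == "<=" else x[idx] > thr


def learn_pure_rules(data: List[Instance], target: int, min_cover: int = 1) -> List[Rule]:
    """Toy stand-in learner (assumption): greedily pick single-threshold rules
    that cover only `target` instances and at least `min_cover` of them."""
    rules: List[Rule] = []
    remaining = list(data)
    while remaining:
        best, best_count = None, 0
        for idx in range(len(remaining[0][0])):
            for thr in sorted({x[idx] for x, _ in remaining}):
                for op in ("<=", ">"):
                    rule = (idx, op, thr, target)
                    labels = [y for x, y in remaining if covers(rule, x)]
                    pure = bool(labels) and all(y == target for y in labels)
                    if pure and len(labels) >= min_cover and len(labels) > best_count:
                        best, best_count = rule, len(labels)
        if best is None:
            break
        rules.append(best)
        # Drop the instances the chosen rule explains, then search again.
        remaining = [(x, y) for x, y in remaining if not covers(best, x)]
    return rules


def three_phase(data: List[Instance], majority: int, minority: int, min_cover: int = 3):
    # Phase 1: rules that represent majority instances only.
    first = learn_pure_rules(data, majority, min_cover)
    # Delete majority instances covered by those rules; the rules stand in for them.
    reduced = [(x, y) for x, y in data
               if y != majority or not any(covers(r, x) for r in first)]
    # Phase 2: rules for both classes, learned on the reduced dataset.
    second = learn_pure_rules(reduced, minority) + learn_pure_rules(reduced, majority)
    # Phase 3: merge; first-classifier rules are tried first (assumption).
    return first, reduced, first + second


def predict(rules: List[Rule], x: Sequence[float], default: int) -> int:
    """Return the label of the first matching rule, or `default` if none matches."""
    for rule in rules:
        if covers(rule, x):
            return rule[3]
    return default


# Hypothetical one-feature dataset: class 0 is the majority, class 1 the minority.
data = [((float(v),), 0) for v in (1, 2, 3, 4, 5, 6, 7, 8, 11, 13)]
data += [((float(v),), 1) for v in (9, 10, 12)]
first, reduced, final = three_phase(data, majority=0, minority=1)
print(first)
print(len(data), "->", len(reduced), "instances after deletion")
print(predict(final, (9.5,), default=1))
```

In this toy run the rule order matters: a second-phase minority rule learned on the reduced dataset can also match deleted majority instances, so checking the first classifier's rules first is what lets them stand in for the deleted data.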


Classification · Class imbalance problem · Data mining · Genetic algorithms · Imbalanced datasets sampling



We thank the anonymous reviewers for their helpful comments and suggestions. We also thank Ladjel Bellatreche, Sadjia Ben Khider, Malika Boussoualim, Nacera Boussoualim and all members of the LRPE laboratory for their support in improving the paper.



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  1. Computer Sciences Department, University of Sciences and Technology Houari Boumediene, El Alia, Algeria
