A new rule-based knowledge extraction approach for imbalanced datasets
Abstract
Classification consists of extracting a classifier from large datasets. A dataset is imbalanced if one class contains more instances than the others; the instances of that class are the majority instances, and the remaining ones are the minority instances. Classical learning algorithms are biased toward majority instances. Classification applied to imbalanced datasets is called partial classification, and its approaches are generally based on sampling methods or algorithmic methods. In this paper, we propose a new hybrid approach that uses a three-phase, rule-based extraction process. In the first phase, a first classifier is extracted; it contains classification rules that represent only majority instances. We then delete the majority instances that these rules classify correctly, which produces a balanced dataset. The deleted majority instances are replaced by the extracted classification rules, which prevents information loss. In the second phase, our algorithm is applied to the balanced dataset to produce a second classifier, whose rules represent both majority and minority instances. Finally, we add the rules of the first classifier to the second classifier to obtain the final classifier, which is then post-processed. Our approach has been tested on several imbalanced binary datasets, and the results obtained show its efficiency compared with other approaches.
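The three phases described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the rule learner `learn_rules` is a toy stand-in (one rule per attribute value) for the actual extraction method, and the names `Rule`, `learn_rules`, `three_phase`, the confidence thresholds, the choice to stop deleting once the classes are equal in size, and the duplicate-removal post-processing are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Rule(NamedTuple):
    feature: int  # index of the attribute tested
    value: str    # attribute value required for the rule to fire
    label: str    # class predicted when the rule fires

    def covers(self, x):
        return x[self.feature] == self.value


def learn_rules(data, min_conf, only_label=None):
    """Toy learner (assumption): one rule per (feature, value) pair whose
    dominant class reaches min_conf; optionally keep one class only."""
    counts = defaultdict(Counter)
    for x, y in data:
        for i, v in enumerate(x):
            counts[(i, v)][y] += 1
    rules = []
    for (i, v), c in sorted(counts.items()):
        label, hits = c.most_common(1)[0]
        if hits / sum(c.values()) >= min_conf and only_label in (None, label):
            rules.append(Rule(i, v, label))
    return rules


def three_phase(data, majority, minority):
    # Phase 1: rules that represent majority instances only.
    first = learn_rules(data, min_conf=1.0, only_label=majority)

    # Delete majority instances covered by those rules until the classes
    # are the same size (stopping criterion is an assumption).
    n_min = sum(1 for _, y in data if y == minority)
    n_maj = len(data) - n_min
    balanced = []
    for x, y in data:
        if y == majority and n_maj > n_min and any(r.covers(x) for r in first):
            n_maj -= 1  # the instance is now represented by a rule
            continue
        balanced.append((x, y))

    # Phase 2: rules for both classes, learned on the balanced data.
    second = learn_rules(balanced, min_conf=0.75)

    # Phase 3: merge both rule sets; post-processing here only removes
    # duplicate rules (assumption).
    final = list(dict.fromkeys(first + second))
    return first, balanced, final


# Hypothetical binary dataset: six majority and two minority instances.
data = [
    (("sunny", "no"), "maj"), (("sunny", "yes"), "maj"),
    (("sunny", "no"), "maj"), (("sunny", "yes"), "maj"),
    (("rain", "no"), "maj"), (("rain", "no"), "maj"),
    (("rain", "yes"), "min"), (("rain", "yes"), "min"),
]
first, balanced, final = three_phase(data, "maj", "min")
for rule in final:
    print(rule)
```

On this toy data, the first phase yields two majority-only rules, four covered majority instances are deleted, and the second phase contributes a rule for the minority class. The sketch does not resolve conflicts between overlapping rules in the merged classifier.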
Keywords
Classification, Class imbalance problem, Data mining, Genetic algorithms, Imbalanced datasets, Sampling
Acknowledgements
We thank the anonymous reviewers for their helpful comments and suggestions. We also thank Ladjel Bellatreche, Sadjia Ben Khider, Malika Boussoualim, Nacera Boussoualim and all members of the LRPE laboratory for their support in improving the paper.