Improving Identification of Difficult Small Classes by Balancing Class Distribution

  • Jorma Laurikkala
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2101)


We studied three methods to improve identification of difficult small classes by balancing an imbalanced class distribution with data reduction. The new method, the neighborhood cleaning rule (NCL), outperformed simple random selection and one-sided selection in experiments with ten data sets. All reduction methods improved identification of small classes (by 20–30%), but the differences between the methods were insignificant. However, significant differences in the accuracies, true-positive rates and true-negative rates obtained with the 3-nearest neighbor method and C4.5 from the reduced data favored NCL. The results suggest that NCL is a useful method for improving the modeling of difficult small classes, and for building classifiers to identify these classes from real-world data.
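To make the idea of balancing by data reduction concrete, here is a minimal sketch of the simplest of the three approaches, random selection from the larger classes. It is an illustration under assumptions, not the paper's procedure: the abstract does not say how many examples were retained, so reducing every class to the size of the smallest one is an assumption, and `random_undersample` and `class_counts` are hypothetical helper names.

```python
import random
from collections import defaultdict


def random_undersample(examples, labels, seed=0):
    """Randomly drop examples from larger classes.

    Assumption: every class is cut down to the size of the smallest
    class. The source abstract does not state the retention ratio.
    """
    rng = random.Random(seed)

    # Group the examples by class label.
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)

    # The smallest class sets the target size for all classes.
    target = min(len(items) for items in by_class.values())

    reduced = []
    for y, items in by_class.items():
        # Keep the smallest class whole; sample without replacement otherwise.
        kept = items if len(items) == target else rng.sample(items, target)
        reduced.extend((x, y) for x in kept)
    return reduced


def class_counts(pairs):
    """Count how many (example, label) pairs fall in each class."""
    counts = defaultdict(int)
    for _, y in pairs:
        counts[y] += 1
    return dict(counts)


# Toy data: nine examples in a large class, three in a small one.
data = [[i] for i in range(12)]
labels = ["large"] * 9 + ["small"] * 3

balanced = random_undersample(data, labels)
print(class_counts(balanced))  # {'large': 3, 'small': 3}
```

Random selection ignores where the discarded examples lie in feature space. NCL and one-sided selection instead choose which examples to remove based on their neighbors, which this sketch does not attempt.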





Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Jorma Laurikkala (1)
  1. Department of Computer and Information Sciences, University of Tampere, Finland
