Smart Under-Sampling for the Detection of Rare Patterns in Unbalanced Datasets

  • Marco Vannucci
  • Valentina Colla
Conference paper
Part of the Smart Innovation, Systems and Technologies book series (SIST, volume 56)


A novel resampling approach is presented that improves the performance of classifiers on unbalanced datasets. The method selects the frequent samples whose elimination from the training dataset is most beneficial and automatically determines the optimal unbalance rate. The results achieved on test datasets demonstrate the efficiency of the method, which yields an appreciable increase in the detection rate of rare patterns and an improvement in classification performance.
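To make the two ingredients named in the abstract concrete (removing frequent samples, and choosing an unbalance rate), here is a minimal sketch. It is not the authors' method: the abstract does not say how the samples to eliminate are selected, so this sketch substitutes plain random removal, and it picks the rate by a simple search over candidates. The names `undersample`, `pick_rate`, `target_rate`, and `score_fn` are hypothetical, and `score_fn` stands in for whatever "train a classifier and evaluate it" routine is available.

```python
import random


def undersample(samples, labels, target_rate, seed=0):
    """Randomly drop frequent-class samples (label 0) until the rare class
    (label 1) makes up about `target_rate` of the returned data.

    Assumption: random removal is a stand-in for the paper's selection
    criterion, which the abstract does not describe.
    """
    if not 0 < target_rate <= 1:
        raise ValueError("target_rate must be in (0, 1]")
    rng = random.Random(seed)
    rare = [i for i, y in enumerate(labels) if y == 1]
    frequent = [i for i, y in enumerate(labels) if y == 0]
    # Solve rare / (rare + keep) = target_rate for keep, capped at what exists.
    keep = min(len(frequent), round(len(rare) * (1 - target_rate) / target_rate))
    kept = rng.sample(frequent, keep)
    idx = sorted(rare + kept)  # preserve the original sample order
    return [samples[i] for i in idx], [labels[i] for i in idx]


def pick_rate(samples, labels, candidate_rates, score_fn, seed=0):
    """Return the candidate rate whose resampled dataset scores highest.

    `score_fn(samples, labels)` is a hypothetical callback that would
    train a classifier on the resampled data and return a validation score.
    """
    return max(
        candidate_rates,
        key=lambda r: score_fn(*undersample(samples, labels, r, seed)),
    )
```

For example, with 5 rare and 95 frequent samples, `undersample(X, y, 0.25)` keeps all 5 rare samples and 15 randomly chosen frequent ones. A smarter selection rule would replace the `rng.sample` call with a ranking of the frequent samples by how much their removal helps.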


Keywords: False Alarm, Training Dataset, Frequent Pattern, Minority Class, Class Imbalance



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. TeCIP Institute, Scuola Superiore Sant’Anna, Pisa, Italy
