Predicting the Influence of Additional Training Data on Classification Performance for Imbalanced Data

  • Stephen Kockentiedt
  • Klaus Tönnies
  • Erhardt Gierke
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8753)


It is desirable to predict the influence of additional training data on classification performance because the generation of samples is often costly. Current methods can only predict performance as measured by accuracy, which is not suitable if one class is much rarer than another. We propose an approach that can also predict other measures, such as G-mean and F-measure, which are commonly used for imbalanced data. Using a highly imbalanced real-world dataset of scanning electron microscopy images of nanoparticles, we show that our method leads to more correct decisions about whether to generate additional training samples.
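For readers unfamiliar with the performance measures named in the abstract, the following is a minimal sketch of how G-mean and F-measure are computed from a binary confusion matrix. It uses the standard definitions from the imbalanced-learning literature; the function and variable names are illustrative, not taken from the paper.

```python
def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (recall on the positive class)
    and specificity (recall on the negative class)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity * specificity) ** 0.5

def f_measure(tp, fn, fp, beta=1.0):
    """F-measure: weighted harmonic mean of precision and recall,
    with beta controlling the weight on recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative imbalanced case: 10 positives vs. 990 negatives.
# Accuracy here is (8 + 950) / 1000 = 0.958 and looks good, while
# G-mean and F-measure expose the weak positive-class performance.
print(round(g_mean(tp=8, fn=2, tn=950, fp=40), 3))    # 0.876
print(round(f_measure(tp=8, fn=2, fp=40), 3))         # 0.276
```

Unlike accuracy, both measures drop sharply when the rare class is poorly classified, which is why the paper predicts them rather than accuracy alone.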





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Stephen Kockentiedt (1, 2), corresponding author
  • Klaus Tönnies (1)
  • Erhardt Gierke (2)
  1. Department of Simulation and Graphics, Faculty of Computer Science, University of Magdeburg, Magdeburg, Germany
  2. Federal Institute for Occupational Safety and Health, Berlin, Germany
