Evaluating and Comparing Classifiers: Review, Some Recommendations and Limitations

Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 578)


Evaluating the performance of a supervised classification method, that is, its predictive ability on independent data, is of central importance in machine learning. It is also almost unthinkable to carry out any research without comparing a newly proposed classifier with already existing ones. This paper reviews the most important aspects of the classifier evaluation process, including the choice of evaluation metrics (scores) and the statistical comparison of classifiers. A critical view of the reviewed methods is presented, together with recommendations and limitations. The article provides a quick guide to the complexity of the classifier evaluation process and warns the reader against common bad habits.


Keywords: Supervised classification · Classifier evaluation · Performance metrics · Statistical classifier comparison
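The statistical comparison of classifiers over multiple data sets, one of the topics this paper reviews, commonly starts from the Friedman test. As an illustrative sketch (the classifier names, accuracy values, and significance workflow below are made up for the example, not taken from the paper), the Friedman statistic can be computed from average ranks as follows:

```python
def average_ranks(scores):
    """Rank classifiers on each data set (rank 1 = best score),
    averaging tied positions, and return each classifier's mean
    rank over all data sets."""
    k, n_datasets = len(scores), len(scores[0])
    totals = [0.0] * k
    for d in range(n_datasets):
        column = sorted(((scores[c][d], c) for c in range(k)),
                        key=lambda t: -t[0])  # higher score first
        i = 0
        while i < k:
            j = i
            while j + 1 < k and column[j + 1][0] == column[i][0]:
                j += 1  # extend over ties
            avg_rank = (i + 1 + j + 1) / 2.0  # mean of tied positions
            for m in range(i, j + 1):
                totals[column[m][1]] += avg_rank
            i = j + 1
    return [t / n_datasets for t in totals]


def friedman_statistic(scores):
    """chi^2_F = 12N/(k(k+1)) * (sum_j R_j^2 - k(k+1)^2/4),
    with k classifiers, N data sets, and mean ranks R_j."""
    k, n = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    return 12.0 * n / (k * (k + 1)) * (
        sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4.0)


# accuracies of three classifiers on six data sets (invented numbers)
acc = [
    [0.81, 0.77, 0.90, 0.65, 0.72, 0.84],
    [0.79, 0.75, 0.88, 0.66, 0.70, 0.83],
    [0.70, 0.69, 0.80, 0.60, 0.64, 0.75],
]
print(average_ranks(acc))       # mean rank per classifier
print(friedman_statistic(acc))  # compare against chi^2 with k-1 dof
```

If the statistic exceeds the critical value of the chi-squared distribution with k − 1 degrees of freedom, at least one classifier differs and a post-hoc procedure (e.g. Nemenyi or Holm) is applied pairwise.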



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. Institute of Computer Science, Silesian Technical University, Gliwice, Poland
