The Prevalence of Errors in Machine Learning Experiments

  • Conference paper
Intelligent Data Engineering and Automated Learning – IDEAL 2019 (IDEAL 2019)

Abstract

Context: Conducting experiments is central to machine learning research for benchmarking, evaluating and comparing learning algorithms. Consequently, it is important that we conduct reliable, trustworthy experiments.

Objective: We investigate the incidence of errors in a sample of machine learning experiments in the domain of software defect prediction. Our focus is on simple arithmetical and statistical errors.

Method: We analyse 49 papers describing 2456 individual experimental results from a previously undertaken systematic review comparing supervised and unsupervised defect prediction classifiers. We extract the confusion matrices and test for relevant constraints, e.g., that the marginal probabilities sum to one. We also check for multiple statistical significance testing errors.
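To make the style of check concrete, here is a minimal Python sketch of the kind of consistency test involved. The helper name `check_confusion_matrix` and the rounding tolerance of half a unit in the last reported decimal place are our own illustrative assumptions, not the authors' actual tooling.

```python
import math

def check_confusion_matrix(tp, fn, fp, tn, n=None, reported=None, decimals=3):
    """Sanity-check a reported 2x2 confusion matrix (illustrative only)."""
    problems = []

    # Constraint: cell counts must be non-negative.
    if any(c < 0 for c in (tp, fn, fp, tn)):
        problems.append("negative cell count")

    # Constraint: the four cells must sum to the stated sample size.
    if n is not None and tp + fn + fp + tn != n:
        problems.append(f"cells sum to {tp + fn + fp + tn}, not n={n}")

    # Recompute standard statistics from the matrix (see Note 1 below).
    recomputed = {}
    if tp + fp:
        recomputed["precision"] = tp / (tp + fp)
    if tp + fn:
        recomputed["recall"] = tp / (tp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom:
        recomputed["mcc"] = (tp * tn - fp * fn) / denom

    # Constraint: reported statistics must match the recomputed values,
    # allowing half a unit in the last reported decimal place for rounding.
    tol = 0.5 * 10 ** -decimals
    for name, value in (reported or {}).items():
        if name in recomputed and abs(recomputed[name] - value) > tol:
            problems.append(
                f"{name}: reported {value}, recomputed {recomputed[name]:.4f}")
    return problems

# A matrix whose quoted recall cannot be produced by its own cells:
# recall should be 40 / (40 + 10) = 0.8, not the reported 0.9.
print(check_confusion_matrix(40, 10, 5, 45, n=100,
                             reported={"recall": 0.9, "mcc": 0.704}))
```

A reported value that fails such a check cannot be explained by rounding alone, which is what makes this style of audit possible from published tables.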

Results: We find that a total of 22 out of 49 papers contain demonstrable errors. Of these, 7 were statistical and 16 related to confusion matrix inconsistency (one paper contained both classes of error).

Conclusions: Whilst some errors may be of a relatively trivial nature, e.g., transcription errors, their presence does not engender confidence. We strongly urge researchers to follow open science principles so that errors can be more easily detected and corrected, and thus, as a community, reduce this worryingly high error rate in our computational experiments.


Notes

  1. A confusion matrix is a \(2 \times 2\) contingency table whose cells are the true positive (TP), false negative (FN), false positive (FP) and true negative (TN) counts. Most classification performance statistics, e.g. precision, recall and the Matthews correlation coefficient (MCC), can be defined from this matrix.

  2. Our data may be retrieved from Figshare http://tiny.cc/vvvqbz.

  3. Of the 13 papers using NHST, 12 set \(\alpha =0.05\) and, unusually, one study interprets \(0.05< p <0.1\), with \(p=0.077\), as being 'significant'. Standard corrections for multiple testing are sketched below.
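Since the audit also covers uncorrected multiple comparisons, the following minimal Python sketch shows two standard adjustments, the Bonferroni correction and the Benjamini–Hochberg step-up procedure for controlling the false discovery rate. The p-values are invented for illustration and do not come from the audited papers.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i only if p_i <= alpha / m (controls family-wise error)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up FDR control: find the largest k with p_(k) <= (k / m) * alpha
    and reject the hypotheses with the k smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

# Invented p-values from comparing one classifier against five others.
# Note that p = 0.077 (cf. Note 3) is not significant even uncorrected.
ps = [0.004, 0.012, 0.026, 0.077, 0.620]
print(bonferroni(ps))          # [True, False, False, False, False]
print(benjamini_hochberg(ps))  # [True, True, True, False, False]
```

Omitting any such adjustment when making many pairwise comparisons inflates the chance of spurious 'significant' findings, which is one of the statistical errors counted in the Results.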


Author information

Correspondence to Martin Shepperd.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Shepperd, M., et al. (2019). The Prevalence of Errors in Machine Learning Experiments. In: Yin, H., Camacho, D., Tino, P., Tallón-Ballesteros, A., Menezes, R., Allmendinger, R. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2019. IDEAL 2019. Lecture Notes in Computer Science, vol. 11871. Springer, Cham. https://doi.org/10.1007/978-3-030-33607-3_12

  • DOI: https://doi.org/10.1007/978-3-030-33607-3_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-33606-6

  • Online ISBN: 978-3-030-33607-3

  • eBook Packages: Computer Science, Computer Science (R0)
