Abstract
Context: Conducting experiments is central to machine learning research, in order to benchmark, evaluate and compare learning algorithms. Consequently, it is important that we conduct reliable, trustworthy experiments.
Objective: We investigate the incidence of errors in a sample of machine learning experiments in the domain of software defect prediction. Our focus is on simple arithmetical and statistical errors.
Method: We analyse 49 papers describing 2456 individual experimental results, drawn from a previously undertaken systematic review comparing supervised and unsupervised defect prediction classifiers. We extract the reported confusion matrices and test whether they satisfy the relevant constraints, e.g., that the marginal probabilities must sum to one. We also check for multiple statistical significance testing errors, i.e., failure to correct for multiple comparisons (a sketch of such checks follows the abstract).
Results: We find that a total of 22 out of the 49 papers contain demonstrable errors. Of these, 7 were statistical and 16 related to confusion-matrix inconsistency (one paper contained both classes of error).
Conclusions: Whilst some errors may be relatively trivial in nature, e.g., transcription errors, their presence does not engender confidence. We strongly urge researchers to follow open science principles so that errors can be more easily detected and corrected, thus helping us, as a community, to reduce this worryingly high error rate in our computational experiments.
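To make the consistency checks described in the Method concrete, here is a minimal Python sketch of the kind of test that can be applied to each reported result. This is not the authors' actual tooling; the function name, the reported-value arguments and the rounding tolerance are illustrative assumptions.

```python
# Minimal sketch of a confusion-matrix consistency check.
# NOTE: illustrative only, not the authors' analysis code; the
# tolerance and argument names are assumptions.

def check_confusion_matrix(tp, fn, fp, tn, n_reported, recall_reported, tol=0.005):
    """Return a list of constraint violations for one reported result."""
    problems = []

    # All four cells must be non-negative counts.
    if min(tp, fn, fp, tn) < 0:
        problems.append("negative cell count")

    # The cells must sum to the reported number of instances;
    # equivalently, the marginal probabilities must sum to one.
    if tp + fn + fp + tn != n_reported:
        problems.append(f"cells sum to {tp + fn + fp + tn}, not {n_reported}")

    # Any reported statistic should be reproducible from the cells,
    # up to rounding of the published figure.
    if tp + fn > 0:
        recall = tp / (tp + fn)
        if abs(recall - recall_reported) > tol:
            problems.append(f"recomputed recall {recall:.3f} != reported {recall_reported}")

    return problems

# Example: cells that do not sum to the reported dataset size.
print(check_confusion_matrix(tp=40, fn=10, fp=5, tn=50,
                             n_reported=100, recall_reported=0.80))
```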
Notes
1. A confusion matrix is a \(2 \times 2\) contingency table where the cells represent true positives (TP), false negatives (FN), false positives (FP) and true negatives (TN) respectively. Most classification performance statistics, e.g., precision, recall and the Matthews correlation coefficient (MCC), can be defined from this matrix (see the first sketch following these notes).
2. Our data may be retrieved from Figshare: http://tiny.cc/vvvqbz.
3. Of the 13 papers using NHST, 12 set \(\alpha = 0.05\) and, unusually, one study interprets \(0.05 < p < 0.1\), with \(p = 0.077\), as being 'significant' (illustrated in the second sketch below).
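As a concrete companion to Note 1, the sketch below computes precision, recall and MCC directly from the four cells of a \(2 \times 2\) confusion matrix. The definitions are the standard ones; the function name is our own illustrative choice.

```python
import math

def metrics_from_confusion_matrix(tp, fn, fp, tn):
    """Standard classification statistics defined from a 2x2 confusion matrix."""
    precision = tp / (tp + fp)  # fraction of positive predictions that are correct
    recall = tp / (tp + fn)     # fraction of actual positives recovered
    # Matthews correlation coefficient (MCC), ranging from -1 to +1.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "mcc": mcc}

print(metrics_from_confusion_matrix(tp=40, fn=10, fp=5, tn=45))
```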
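On the multiple statistical significance testing errors: running many uncorrected tests at \(\alpha = 0.05\) makes at least one false positive highly likely, which is why corrections such as Bonferroni's are needed. A minimal sketch with invented p-values (including the \(p = 0.077\) case from Note 3, which is not significant even at the uncorrected level):

```python
# Family-wise error rate without correction, and a Bonferroni adjustment.
# The p-values below are invented purely for illustration.
alpha, m = 0.05, 20  # 20 uncorrected tests at the 0.05 level

# Probability of at least one false positive across m independent null tests.
fwer = 1 - (1 - alpha) ** m
print(f"FWER over {m} uncorrected tests: {fwer:.2f}")  # ~0.64

p_values = [0.003, 0.012, 0.040, 0.077]
corrected_alpha = alpha / len(p_values)  # Bonferroni: divide alpha by the number of tests
for p in p_values:
    verdict = "significant" if p < corrected_alpha else "not significant"
    print(f"p = {p}: {verdict} at corrected level {corrected_alpha}")
```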