The p-value Case, a Review of the Debate: Issues and Plausible Remedies

Conference paper
Part of the Springer Proceedings in Mathematics & Statistics book series (PROMS, volume 227)

Abstract

We review the recent debate on the lack of reliability of scientific results and its connections to the statistical methodologies at the core of the discovery paradigm. Null hypothesis significance testing (NHST), in particular, has often been related to, if not blamed for, the present situation. We argue that the relation is a loose one: although NHST, when properly used, cannot be regarded as a cause, some common misuses may mask or even encourage the bad practices that lead to this lack of reliability. We discuss various proposals that have been put forward to deal with these issues.

Keywords

Null hypothesis significance testing · p-value · Reproducibility

Notes

Acknowledgements

This work was supported by the University of Trieste within the FRA project “Politiche strutturali e riforme. Analisi degli indicatori e valutazione degli effetti” (“Structural policies and reforms. Analysis of indicators and evaluation of effects”).


Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. DEAMS, University of Trieste, Trieste, Italy
