Artificial Intelligence in Medicine: Validation and Study Design

  • Luke Oakden-Rayner
  • Lyle John Palmer


There has been a vast expansion in the volume of artificial intelligence (AI) research in biomedicine over the last several years. Simultaneously, we have begun to see the first medical AI systems move rapidly from research into clinical practice. Evaluating AI systems for clinical tasks can be quite different from evaluating them for other applications of AI. In medicine, the stakes are often higher, in terms of both risks and rewards. In this chapter, we explore key concepts underpinning the design, performance, and validation of medical AI experiments. We also discuss several unresolved challenges the field currently faces.


Keywords: Study design, Medical AI, Deep learning, Machine learning



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Luke Oakden-Rayner (1, 2)
  • Lyle John Palmer (1, 2)

  1. School of Public Health, The University of Adelaide, Adelaide, Australia
  2. Australian Institute of Machine Learning, Adelaide, Australia
