Artificial Intelligence in Medicine: Validation and Study Design

Abstract

Artificial intelligence (AI) research in biomedicine has expanded greatly over the last several years. At the same time, the first medical AI systems have begun to move rapidly from research into clinical practice. Evaluating AI systems for clinical tasks can differ substantially from evaluating other applications of AI. In medicine, the stakes are often higher, in both risks and rewards. In this chapter, we explore key concepts underpinning the design, performance, and validation of medical AI experiments. We also discuss several unresolved challenges that the field currently faces.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Oakden-Rayner, L., Palmer, L.J. (2019). Artificial Intelligence in Medicine: Validation and Study Design. In: Ranschaert, E., Morozov, S., Algra, P. (eds) Artificial Intelligence in Medical Imaging. Springer, Cham. https://doi.org/10.1007/978-3-319-94878-2_8

  • DOI: https://doi.org/10.1007/978-3-319-94878-2_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-94877-5

  • Online ISBN: 978-3-319-94878-2

  • eBook Packages: Medicine, Medicine (R0)
