Abstract
The volume of artificial intelligence (AI) research in biomedicine has expanded vastly over the last several years. At the same time, the first medical AI systems have begun to translate rapidly from research into clinical practice. Evaluating AI systems for clinical tasks can be quite different from evaluating them for other applications of AI. In medicine, the stakes, both risks and rewards, are often higher. In this chapter, we explore key concepts underpinning the design, performance, and validation of medical AI experiments. We also discuss several unresolved challenges the field currently faces.
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Oakden-Rayner, L., Palmer, L.J. (2019). Artificial Intelligence in Medicine: Validation and Study Design. In: Ranschaert, E., Morozov, S., Algra, P. (eds) Artificial Intelligence in Medical Imaging. Springer, Cham. https://doi.org/10.1007/978-3-319-94878-2_8
DOI: https://doi.org/10.1007/978-3-319-94878-2_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-94877-5
Online ISBN: 978-3-319-94878-2
eBook Packages: Medicine, Medicine (R0)