Evaluating Medical Expert Systems: What To Test, And How?

  • Jeremy Wyatt
  • David Spiegelhalter
Part of the Lecture Notes in Medical Informatics book series (LNMED, volume 47)


Few medical expert systems have been rigorously evaluated, yet some believe that these systems have great potential to improve health care. For this and many other reasons, objective evaluation is necessary. We discuss the evaluation of medical expert systems in two stages: laboratory and field testing. In the former, the perspectives of both prospective users and experts are valuable. In the latter, the study must be designed to answer, in an unbiased manner, the questions: “Will the system be used in practice?” and “When it is used, how does it affect the structure, process and outcome of health care?”. We conclude with some proposals for encouraging the objective evaluation of medical expert systems.
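In the laboratory-testing stage the abstract describes, a system's advice is typically compared case-by-case against expert judgement. One common way to quantify such agreement while correcting for chance is Cohen's kappa; the sketch below is purely illustrative — the paper does not prescribe this (or any) statistic, and the diagnostic labels and case data are invented for the example.

```python
def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two sets of categorical ratings."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    # Observed agreement: fraction of cases where the two raters concur.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement by chance, from the raters' marginal frequencies.
    p_e = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n)
        for c in categories
    )
    return (p_o - p_e) / (1 - p_e)

# Hypothetical data: system vs. expert diagnoses for ten cases.
system = ["MI", "MI", "angina", "other", "MI",
          "angina", "other", "MI", "angina", "other"]
expert = ["MI", "MI", "angina", "MI", "MI",
          "angina", "other", "other", "angina", "other"]
print(round(cohens_kappa(system, expert), 2))  # prints 0.7
```

A kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance; with a single expert as the comparator this measures only consistency with that expert's "silver standard", not diagnostic truth — one reason the authors argue that laboratory results alone cannot settle whether a system improves care.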







Copyright information

© Springer-Verlag Berlin Heidelberg 1991

Authors and Affiliations

  • Jeremy Wyatt (1)
  • David Spiegelhalter (2)
  1. The National Heart & Lung Institute, London and IBM Scientific Centre, UK
  2. MRC Biostatistics Unit, Cambridge, UK
