Herding QATs: Quality Assessment Tools for Evidence in Medicine

A chapter in Classification, Disease and Evidence, part of the book series History, Philosophy and Theory of the Life Sciences (HPTL, volume 7).

Abstract

Medical scientists employ ‘quality assessment tools’ (QATs) to measure the quality of evidence from clinical studies, especially randomized controlled trials (RCTs). These tools are designed to take into account various methodological details of clinical studies, including randomization, blinding, and other features deemed relevant to minimizing bias and error. Dozens of QATs are now available, and they differ widely from one another. Moreover, second-order empirical studies show that QATs have low inter-rater reliability and low inter-tool reliability. This is an instance of a more general problem I call the underdetermination of evidential significance: disagreements about the strength of a particular piece of evidence can be due to different—but in principle equally good—weightings of the fine-grained methodological features which constitute QATs.

Notes

  1. I discuss evidence hierarchies in more detail below. Such evidence hierarchies are commonly employed in evidence-based medicine. Examples include those of the Oxford Centre for Evidence-Based Medicine, the Scottish Intercollegiate Guidelines Network (SIGN), and The Grading of Recommendations Assessment, Development and Evaluation (GRADE) Working Group. These evidence hierarchies have recently received much criticism. See, for example, Bluhm (2005), Upshur (2005), Borgerson (2008), and La Caze (2011), and for a specific critique of placing meta-analysis at the top of such hierarchies, see Stegenga (2011). In footnote 5 below I cite several recent criticisms of the assumption that RCTs necessarily belong near the top of such hierarchies.

  2. The general norm is usually called the principle of total evidence, associated with Carnap (1947). See also Good (1967). Howick (2011) invokes the principle of total evidence for systematic reviews of evidence related to medical hypotheses. A presently unpublished paper by Bert Leuridan contains a good discussion of the principle of total evidence as it applies to medicine.

  3. See, for example, Hacking (1981), Thagard (1998), Bechtel (2002), and Weber (2009).

  4. One need only consider the prominence of randomization in QATs to see that QATs have, in fact, been indirectly criticized by the recent literature challenging the assumed ‘gold standard’ status of RCTs (see footnote 5). In the present paper I do not attempt a thorough normative evaluation of any particular QAT; considering the role of randomization alone suggests how large a task such an evaluation would be. For a systematic survey of the most prominent QATs, see West et al. (2002).

  5. The view that RCTs are the ‘gold standard’ of evidence has recently been subjected to much philosophical criticism. See, for example, Worrall (2002, 2007), Cartwright (2007), and Cartwright (2010); for an assessment of the arguments for and against the gold standard status of RCTs, see Howick (2011). Observational studies also have QATs, such as QATSO (Quality Assessment Checklist for Observational Studies) and NOQAT (Newcastle-Ottawa Quality Assessment Scale – Case Control Studies).

  6. A note about terminology: sometimes the term ‘trial’ in the medical literature refers specifically to an experimental design (such as a randomized controlled trial) while the term ‘study’ refers to an observational design (such as a case control study), but this use is inconsistent. I will use both terms freely to refer to any method of generating evidence in biomedical research, including both experimental and observational designs.

  7. There are several commonly employed measures of effect size, including mean difference (for continuous variables) and odds ratio, risk ratio, or risk difference (for dichotomous variables). The weighting factor is sometimes determined by the QAT score, but a common method of determining the weight of a trial is simply based on the size of the trial (Egger et al. 1997), often by using the inverse of the variance of a trial’s effect estimate as that trial’s weight (since this inverse variance increases with trial size). A minimal code sketch of this weighting scheme follows these notes.

  8. See, for example, Moher et al. (1998), Balk et al. (2002), and Hempel et al. (2011).

  9. The parallel argument for the principle of total evidence is based on a concern to avoid ‘defeating’ evidence. Defeating evidence has the following property. Suppose some hypothesis H is confirmed by some piece of evidence (e_c). Then some other piece of evidence (e_d) is defeating if p(H | e_c & e_d) < p(H | e_c). This could arise, for instance, because e_d provides strong reason to believe that e_c is, in fact, spurious. A toy numerical sketch of defeat follows these notes.

  10. The latter notion—H passing a severe test T with x_0—occurs when “1) x_0 agrees with H, (for a suitable notion of ‘agreement’) and 2) with very high probability, test T would have produced a result that accords less well with H than does x_0, if H were false or incorrect” (Mayo and Spanos 2011).

  11. For simplicity I will describe Cohen’s Kappa, which measures the agreement of two reviewers who classify items into discrete categories, and is computed as follows:

      κ = [p(a) − p(e)] / [1 − p(e)]

      where p(a) is the probability of agreement (based on the observed frequency of agreement) and p(e) is the probability of chance agreement (also calculated using observed frequency data). Kappa was first introduced as a statistical measure by Cohen (1960). For more than two reviewers, a measure called Fleiss’ Kappa can be used. I give an example of a calculation of κ below; a runnable sketch of the computation also follows these notes.

  12. I owe Jonah Schupbach thanks for noting that a κ measure can not only seem inappropriately low, as in the above cases of poor inter-rater reliability, but can seem inappropriately high as well. If a κ measure approaches 1, this might suggest agreement which is ‘too good to be true’. Returning to my toy example, if Beth and Sara had a very high κ measure, then one might wonder whether they colluded in their grading. Thus when using a κ statistic to assess inter-rater reliability, we should hope for a κ measure above some minimal threshold (below which indicates too much disagreement) but below some maximal threshold (above which indicates too much agreement). What exactly these thresholds should be is beyond the scope of this paper (and is, I suppose, context sensitive).

  13. For this latter reason I refrain from describing or illustrating the particular statistical analyses employed in tests of the inter-tool reliability of QATs, as I did in section “Inter-rater reliability” for tests of the inter-rater reliability of QATs. Nearly every published test of inter-tool reliability uses a different statistic to measure agreement of quality assessments between tools. Analyses employed include Kendall’s rank correlation coefficient (τ), Kendall’s coefficient of concordance (W), and Spearman’s rank correlation coefficient (ρ).

  14. This latter consideration is somewhat controversial, both because it has been contradicted by other empirical studies, and because it assumes that the correct estimate of the efficacy of medical interventions is given by what are purported to be higher quality studies.

  15. The relata in such purported causal relations are, of course, the medical intervention under investigation and the change in value of one or more parameters of a group of subjects.

  16. There is a tendency among medical scientists to suppose that the relative importance of various methodological features is merely an empirical matter. One need not entirely sympathize with such methodological naturalism to agree with the point expressed by Cho and Bero here: we lack reasons to prefer one weighting of methodological features over another, regardless of whether one thinks of these reasons as empirical or principled.

  17. See also Olivo et al. (2007) for an empirical critique of QATs.
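
The following sketches illustrate three computations mentioned in these notes. First, footnote 7's weighting scheme: a minimal sketch of inverse-variance weighting in a fixed-effect pooling of mean differences. All trial figures are invented for illustration; they are not drawn from the chapter or from any actual meta-analysis.

```python
# Minimal sketch of inverse-variance weighting in a fixed-effect
# meta-analysis of mean differences. All numbers are invented.
trials = [
    # (mean difference, variance of that estimate)
    (0.40, 0.04),  # large trial: small variance, hence large weight
    (0.10, 0.25),  # small trial: large variance, hence small weight
    (0.55, 0.09),
]

weights = [1.0 / var for _, var in trials]  # weight = inverse variance
pooled = sum(w * d for (d, _), w in zip(trials, weights)) / sum(weights)
print(f"pooled mean difference = {pooled:.3f}")  # ~0.412, dominated by trial 1
```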
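
Second, footnote 9's defeat condition, exhibited with a toy joint distribution. Every probability below is invented solely to make p(H | e_c & e_d) < p(H | e_c) come out true; this is not the chapter's own example.

```python
# Toy joint distribution over (H, e_c, e_d); keys are truth values,
# values are probabilities summing to 1. Read e_c as a positive trial
# result and e_d as evidence that the trial was flawed. All invented.
p = {
    (True,  True,  True):  0.02,
    (True,  True,  False): 0.30,
    (True,  False, True):  0.01,
    (True,  False, False): 0.07,
    (False, True,  True):  0.15,
    (False, True,  False): 0.05,
    (False, False, True):  0.10,
    (False, False, False): 0.30,
}

def prob_H(fix):
    """p(H | fix), where fix assigns truth values to 'ec' and/or 'ed'."""
    sel = [(h, pr) for (h, ec, ed), pr in p.items()
           if all({"ec": ec, "ed": ed}[k] == v for k, v in fix.items())]
    return sum(pr for h, pr in sel if h) / sum(pr for _, pr in sel)

print(round(prob_H({}), 2))                        # 0.4   prior p(H)
print(round(prob_H({"ec": True}), 2))              # 0.62  e_c confirms H
print(round(prob_H({"ec": True, "ed": True}), 2))  # 0.12  e_d defeats e_c
```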
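
Third, footnote 11's κ formula as a runnable function of two raters' classifications. The rating lists are invented stand-ins echoing the Beth-and-Sara toy example; only the two-rater (Cohen) case is implemented.

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters classifying the same items into
    discrete categories: (p_a - p_e) / (1 - p_e)."""
    n = len(ratings_a)
    # p(a): observed relative frequency of agreement
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # p(e): chance agreement from each rater's marginal frequencies
    categories = set(ratings_a) | set(ratings_b)
    p_e = sum((ratings_a.count(c) / n) * (ratings_b.count(c) / n)
              for c in categories)
    return (p_a - p_e) / (1 - p_e)

# Invented data: two reviewers score six trials 'high' or 'low'.
beth = ["high", "high", "low", "low", "high", "low"]
sara = ["high", "low",  "low", "low", "high", "high"]
print(round(cohens_kappa(beth, sara), 2))  # 0.33: modest agreement
```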

References

  • Balk EM, Bonis PA, Moskowitz H, Schmid CH, Ioannidis JP, Wang C, Lau J (2002) Correlation of quality measures with estimates of treatment effect in meta-analyses of randomized controlled trials. JAMA 287(22):2973–2982

  • Bechtel W (2002) Aligning multiple research techniques in cognitive neuroscience: why is it important? Philos Sci 69:S48–S58

  • Bluhm R (2005) From hierarchy to network: a richer view of evidence for evidence-based medicine. Perspect Biol Med 48(4):535–547

  • Borgerson K (2008) Valuing and evaluating evidence in medicine. PhD dissertation, University of Toronto

  • Carnap R (1947) On the application of inductive logic. Philos Phenomenol Res 8:133–148

  • Cartwright N (2007) Are RCTs the gold standard? Biosocieties 2:11–20

  • Cartwright N (2010) The long road from ‘it works somewhere’ to ‘it will work for us’. Philosophy of Science Association, Presidential Address

  • Chalmers TC, Smith H, Blackburn B et al (1981) A method for assessing the quality of a randomized control trial. Control Clin Trials 2:31–49

  • Cho MK, Bero LA (1994) Instruments for assessing the quality of drug studies published in the medical literature. JAMA 272:101–104

  • Clark HD, Wells GA, Huët C, McAlister FA, Salmi LR, Fergusson D, Laupacis A (1999) Assessing the quality of randomized trials: reliability of the Jadad scale. Control Clin Trials 20:448–452

  • Cohen J (1960) A coefficient of agreement for nominal scales. Educ Psychol Meas 20(1):37–46

  • Egger M, Smith GD, Phillips AN (1997) Meta-analysis: principles and procedures. Br Med J 315:1533–1537

  • Good IJ (1967) On the principle of total evidence. Br J Philos Sci 17(4):319–321

  • Hacking I (1981) Do we see through a microscope? Pac Philos Quart 63:305–322

  • Hartling L, Ospina M, Liang Y, Dryden D, Hooten N, Seida J, Klassen T (2009) Risk of bias versus quality assessment of randomised controlled trials: cross sectional study. Br Med J 339:b4012

  • Hartling L, Bond K, Vandermeer B, Seida J, Dryden DM, Rowe BH (2011) Applying the risk of bias tool in a systematic review of combination long-acting beta-agonists and inhaled corticosteroids for persistent asthma. PLoS One 6(2):e17242

  • Hempel S, Suttorp MJ, Miles JNV, Wang Z, Maglione M, Morton S, Johnsen B, Valentine D, Shekelle PG (2011) Empirical evidence of associations between trial quality and effect sizes. Methods research report, AHRQ publication no. 11-EHC045-EF. Available at: http://effectivehealthcare.ahrq.gov

  • Herbison P, Hay-Smith J, Gillespie WJ (2006) Adjustment of meta-analyses on the basis of quality scores should be abandoned. J Clin Epidemiol 59:1249–1256

  • Howick J (2011) The philosophy of evidence-based medicine. Wiley-Blackwell, Chichester/Hoboken

  • Jadad AR, Moore RA, Carroll D et al (1996) Assessing the quality of reports of randomized clinical trials: is blinding necessary? Control Clin Trials 17:1–12

  • Jüni P, Witschi A, Bloch R, Egger M (1999) The hazards of scoring the quality of clinical trials for meta-analysis. JAMA 282(11):1054–1060

  • Jüni P, Altman DG, Egger M (2001) Assessing the quality of randomised controlled trials. In: Egger M, Smith GD, Altman DG (eds) Systematic reviews in health care: meta-analysis in context. BMJ Publishing Group, London

  • La Caze A (2011) The role of basic science in evidence-based medicine. Biol Philos 26(1):81–98

  • Linde K, Clausius N, Ramirez G et al (1997) Are the clinical effects of homoeopathy placebo effects? Lancet 350:834–843

  • Mayo D (1996) Error and the growth of experimental knowledge. University of Chicago Press, Chicago

  • Mayo D, Spanos A (2011) The error statistical philosophy. In: Mayo D, Spanos A (eds) Error and inference: recent exchanges on experimental reasoning, reliability, and the objectivity and rationality of science. Cambridge University Press, New York

  • Moher D, Jadad AR, Nichol G, Penman M, Tugwell P, Walsh S (1995) Assessing the quality of randomized controlled trials: an annotated bibliography of scales and checklists. Control Clin Trials 16:62–73

  • Moher D, Jadad AR, Tugwell P (1996) Assessing the quality of randomized controlled trials. Current issues and future directions. Int J Technol Assess Health Care 12(2):195–208

  • Moher D, Pham B, Jones A, Cook DJ, Jadad AR, Moher M, Tugwell P, Klassen TP (1998) Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet 352(9128):609–613

  • Olivo SA, Macedo LG, Gadotti IC, Fuentes J, Stanton T, Magee DJ (2007) Scales to assess the quality of randomized controlled trials: a systematic review. Phys Ther 88(2):156–175

  • Reisch JS, Tyson JE, Mize SG (1989) Aid to the evaluation of therapeutic studies. Pediatrics 84:815–827

  • Spitzer WO, Lawrence V, Dales R et al (1990) Links between passive smoking and disease: a best-evidence synthesis. A report of the working group on passive smoking. Clin Invest Med 13:17–42

  • Stegenga J (2011) Is meta-analysis the platinum standard of evidence? Stud Hist Philos Biol Biomed Sci 42:497–507

  • Stegenga J (forthcoming) Down with the hierarchies

  • Thagard P (1998) Ulcers and bacteria I: discovery and acceptance. Stud Hist Philos Biol Biomed Sci 29:107–136

  • Upshur R (2005) Looking for rules in a world of exceptions: reflections on evidence-based practice. Perspect Biol Med 48(4):477–489

  • Weber M (2009) The crux of crucial experiments: Duhem’s problems and inference to the best explanation. Br J Philos Sci 60:19–49

  • West S, King V, Carey TS, Lohr KN, McKoy N, Sutton SF, Lux L (2002) Systems to rate the strength of scientific evidence. Evidence report/technology assessment number 47, AHRQ publication no. 02-E016

  • Worrall J (2002) What evidence in evidence-based medicine? Philos Sci 69:S316–S330

  • Worrall J (2007) Why there’s no cause to randomize. Br J Philos Sci 58:451–488

Acknowledgements

This paper has benefited from discussion with Nancy Cartwright, Eran Tal, Jonah Schupbach, and audiences at the University of Utah, University of Toronto, and the Canadian Society for the History and Philosophy of Science. I owe the title to Frédéric Bouchard. Medical scientists Ken Bond and David Moher provided detailed written commentary. All remaining errors are mine alone. I am grateful for financial support from the Social Sciences and Humanities Research Council of Canada.

Author information

Correspondence to Jacob Stegenga.

Copyright information

© 2015 Springer Science+Business Media Dordrecht

Cite this chapter

Stegenga, J. (2015). Herding QATs: Quality Assessment Tools for Evidence in Medicine. In: Huneman, P., Lambert, G., Silberstein, M. (eds) Classification, Disease and Evidence. History, Philosophy and Theory of the Life Sciences, vol 7. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-8887-8_10
