Abstract
Medical scientists employ ‘quality assessment tools’ (QATs) to measure the quality of evidence from clinical studies, especially randomized controlled trials (RCTs). These tools are designed to take into account various methodological details of clinical studies, including randomization, blinding, and other features of studies deemed relevant to minimizing bias and error. There are now dozens available. The various QATs on offer differ widely from each other, and second-order empirical studies show that QATs have low inter-rater reliability and low inter-tool reliability. This is an instance of a more general problem I call the underdetermination of evidential significance. Disagreements about the strength of a particular piece of evidence can be due to different—but in principle equally good—weightings of the fine-grained methodological features which constitute QATs.
Notes
- 1.
I discuss evidence hierarchies in more detail below. Such evidence hierarchies are commonly employed in evidence-based medicine. Examples include those of the Oxford Centre for Evidence-Based Medicine, the Scottish Intercollegiate Guidelines Network (SIGN), and The Grading of Recommendations Assessment, Development and Evaluation (GRADE) Working Group. These evidence hierarchies have recently received much criticism. See, for example, Bluhm (2005), Upshur (2005), Borgerson (2008), and La Caze (2011), and for a specific critique of placing meta-analysis at the top of such hierarchies, see Stegenga (2011). In footnote 5 below I cite several recent criticisms of the assumption that RCTs ought to be necessarily near the top of such hierarchies.
- 2.
The general norm is usually called the principle of total evidence, associated with Carnap (1947). See also Good (1967). Howick (2011) invokes the principle of total evidence for systematic reviews of evidence related to medical hypotheses. A presently unpublished paper by Bert Leuridan contains a good discussion of the principle of total evidence as it applies to medicine.
- 3.
- 4.
Although one only needs to consider the prominence of randomization in QATs to see that QATs have, in fact, been indirectly criticized by the recent literature criticizing the assumed ‘gold standard’ status of RCTs (see footnote 5). In the present paper I do not attempt a thorough normative evaluation of any particular QAT. Considering the role of randomization suggests what a large task a thorough normative evaluation of a particular QAT would be. But for a systematic survey of the most prominent QATs, see West et al. (2002).
- 5.
The view that RCTs are the ‘gold standard’ of evidence has recently been subjected to much philosophical criticism. See, for example, Worrall (2002, 2007), Cartwright (2007), and Cartwright (2010); for an assessment of the arguments for and against the gold standard status of RCTs, see Howick (2011). Observational studies also have QATs, such as QATSO (Quality Assessment Checklist for Observational Studies) and NOQAT (Newcastle-Ottawa Quality Assessment Scale – Case Control Studies).
- 6.
A note about terminology: sometimes the term ‘trial’ in the medical literature refers specifically to an experimental design (such as a randomized controlled trial) while the term ‘study’ refers to an observational design (such as a case control study), but this use is inconsistent. I will use both terms freely to refer to any method of generating evidence in biomedical research, including both experimental and observational designs.
- 7.
There are several commonly employed measures of effect size, including mean difference (for continuous variables), or odds ratio, risk ratio, or risk difference (for dichotomous variables). The weighting factor is sometimes determined by the QAT score, but a common method of determining the weight of a trial is simply based on the size of the trial (Egger et al. 1997), often by using the inverse variability of the data from a trial to measure that trial’s weight (because inverse variability is correlated with trial size).
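The weighting scheme described here can be sketched in a few lines. The following is an illustrative sketch only (the effect estimates and variances are hypothetical, not drawn from any trial discussed in this paper), showing how inverse-variance weights give larger trials more influence over the pooled estimate:

```python
def pooled_effect(effects, variances):
    """Combine per-trial effect estimates with inverse-variance weights.

    Larger trials tend to have smaller variance, so they receive more
    weight -- the correlation with trial size noted in the text.
    """
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return pooled, weights

# Three hypothetical trials: mean differences and their variances.
# The second trial has the smallest variance (largest trial) and so
# dominates the pooled estimate.
effects = [0.30, 0.10, 0.25]
variances = [0.04, 0.01, 0.09]
pooled, weights = pooled_effect(effects, variances)
```

Note that this is the fixed-effect form of the calculation; weighting by QAT score instead would simply replace `1.0 / v` with the trial's quality score.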
- 8.
- 9.
The parallel argument for the principle of total evidence is based on a concern to avoid ‘defeating’ evidence. Defeating evidence has the following property. Suppose some hypothesis H is confirmed by some piece of evidence (ec). Then some other piece of evidence (ed) is defeating if p(H|ec & ed) < p(H|ec). This could arise, for instance, because ed provides strong reason to believe that ec is, in fact, spurious.
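A toy Bayesian calculation illustrates the inequality (all numbers are hypothetical, and updating sequentially assumes the two pieces of evidence are conditionally independent given H):

```python
def posterior(prior, lik_if_H, lik_if_not_H):
    """Bayes' theorem for a binary hypothesis H."""
    num = lik_if_H * prior
    return num / (num + lik_if_not_H * (1 - prior))

# ec confirms H: it is three times likelier if H is true.
p_H_given_ec = posterior(0.5, lik_if_H=0.9, lik_if_not_H=0.3)

# ed (say, evidence that the trial producing ec was badly flawed) is
# likelier if H is false; conditioning on it pulls the probability of H
# below what ec alone supported -- so ed is defeating evidence.
p_H_given_ec_and_ed = posterior(p_H_given_ec, lik_if_H=0.2, lik_if_not_H=0.8)
```

With these numbers, p(H|ec) = 0.75 while p(H|ec & ed) = 3/7, satisfying the defeating condition p(H|ec & ed) < p(H|ec).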
- 10.
The latter notion—H passing a severe test T with x0—occurs when “1) x0 agrees with H, (for a suitable notion of ‘agreement’) and 2) with very high probability, test T would have produced a result that accords less well with H than does x0, if H were false or incorrect” (Mayo and Spanos 2011).
- 11.
For simplicity I will describe Cohen’s Kappa, which measures the agreement of two reviewers who classify items into discrete categories, and is computed as follows: κ = [p(a) – p(e)]/[1 – p(e)], where p(a) is the probability of agreement (based on the observed frequency of agreement) and p(e) is the probability of chance agreement (also calculated using observed frequency data). Kappa was first introduced as a statistical measure by Cohen (1960). For more than two reviewers, a measure called Fleiss’ Kappa can be used. I give an example of a calculation of κ below.
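The formula can be implemented directly. The following minimal sketch uses hypothetical gradings in the spirit of the Beth-and-Sara toy example discussed in the text:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters: [p(a) - p(e)] / [1 - p(e)]."""
    n = len(ratings_a)
    # p(a): observed frequency of agreement.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # p(e): chance agreement, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (p_a - p_e) / (1 - p_e)

# Hypothetical example: two reviewers grade six trials 'high' or 'low'.
beth = ['high', 'high', 'low', 'low', 'high', 'low']
sara = ['high', 'low', 'low', 'low', 'high', 'high']
kappa = cohens_kappa(beth, sara)
```

Here the reviewers agree on four of six trials (p(a) = 2/3), but with balanced marginals chance agreement is p(e) = 1/2, so κ = 1/3: raw agreement well above half, yet only modest agreement beyond chance.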
- 12.
I owe Jonah Schupbach thanks for noting that a κ measure can not only seem inappropriately low, as in the above cases of poor inter-rater reliability, but can seem inappropriately high as well. If a κ measure approaches 1, this might suggest agreement which is ‘too good to be true’. Returning to my toy example, if Beth and Sara had a very high κ measure, then one might wonder if they colluded in their grading. Thus when using a κ statistic to assess inter-rater reliability, we should hope for a κ measure above some minimal threshold (below which indicates too much disagreement) but below some maximum threshold (above which indicates too much agreement). What exactly these thresholds should be is beyond the scope of this paper (and is, I suppose, context sensitive).
- 13.
For this latter reason I refrain from describing or illustrating the particular statistical analyses employed in tests of the inter-tool reliability of QATs, as I did in section “Inter-rater reliability” on tests of the inter-rater reliability of QATs. Nearly every published test of inter-rater reliability uses a different statistic to measure agreement of quality assessment between tools. Analyses employed include Kendall’s rank correlation coefficient (τ), Kendall’s coefficient of concordance (W), and Spearman’s rank correlation coefficient (ρ).
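For the simplest of these statistics, Spearman’s ρ can be computed from the differences in ranks. The following sketch assumes no tied scores (the QAT scores for the five trials are hypothetical):

```python
def spearman_rho(x, y):
    """Spearman's rank correlation, no ties: rho = 1 - 6*sum(d^2)/(n*(n^2-1))."""
    rank = lambda v: [sorted(v).index(u) + 1 for u in v]
    rx, ry = rank(x), rank(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    n = len(x)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical quality scores for five trials from two different QATs.
# The tools disagree on the ordering of the top two trials.
qat1 = [4.0, 3.5, 2.0, 5.0, 1.0]
qat2 = [3.8, 4.1, 2.5, 4.9, 1.2]
rho = spearman_rho(qat1, qat2)
```

A ρ near 1 would indicate that the two tools rank trials in nearly the same order, even if their absolute scores differ; the empirical studies discussed in the text report much weaker inter-tool agreement than this toy case.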
- 14.
This latter consideration is somewhat controversial, both because it has been contradicted by other empirical studies, and because it assumes that the correct estimate of the efficacy of medical interventions is given by what are purported to be higher quality studies.
- 15.
The relata in such purported causal relations are, of course, the medical intervention under investigation and the change in value of one or more parameters of a group of subjects.
- 16.
There is a tendency among medical scientists to suppose that the relative importance of various methodological features is merely an empirical matter. One need not entirely sympathize with such methodological naturalism to agree with the point expressed by Cho and Bero here: we lack reasons to prefer one weighting of methodological features over another, regardless of whether one thinks of these reasons as empirical or principled.
- 17.
See also Olivo et al. (2007) for an empirical critique of QATs.
References
Balk EM, Bonis PA, Moskowitz H, Schmid CH, Ioannidis JP, Wang C, Lau J (2002) Correlation of quality measures with estimates of treatment effect in meta-analyses of randomized controlled trials. JAMA 287(22):2973–2982
Bechtel W (2002) Aligning multiple research techniques in cognitive neuroscience: why is it important? Philos Sci 69:S48–S58
Bluhm R (2005) From hierarchy to network: a richer view of evidence for evidence-based medicine. Perspect Biol Med 48(4):535–547
Borgerson K (2008) Valuing and evaluating evidence in medicine. PhD dissertation, University of Toronto
Carnap R (1947) On the application of inductive logic. Philos Phenomenol Res 8:133–148
Cartwright N (2007) Are RCTs the gold standard? Biosocieties 2:11–20
Cartwright N (2010) The long road from ‘it works somewhere’ to ‘it will work for us’. Philosophy of Science Association, Presidential Address
Chalmers TC, Smith H, Blackburn B et al (1981) A method for assessing the quality of a randomized control trial. Control Clin Trials 2:31–49
Cho MK, Bero LA (1994) Instruments for assessing the quality of drug studies published in the medical literature. JAMA 272:101–104
Clark HD, Wells GA, Huët C, McAlister FA, Salmi LR, Fergusson D, Laupacis A (1999) Assessing the quality of randomized trials: reliability of the Jadad scale. Control Clin Trials 20:448–452
Cohen J (1960) A coefficient of agreement for nominal scales. Educ Psychol Meas 20(1):37–46
Egger M, Smith GD, Phillips AN (1997) Meta-analysis: principles and procedures. Br Med J 315:1533–1537
Good IJ (1967) On the principle of total evidence. Br J Philos Sci 17(4):319–321
Hacking I (1981) Do we see through a microscope? Pac Philos Quart 63:305–322
Hartling L, Ospina M, Liang Y, Dryden D, Hooten N, Seida J, Klassen T (2009) Risk of bias versus quality assessment of randomised controlled trials: cross sectional study. Br Med J 339:b4012
Hartling L, Bond K, Vandermeer B, Seida J, Dryden DM, Rowe BH (2011) Applying the risk of bias tool in a systematic review of combination long-acting beta-agonists and inhaled corticosteroids for persistent asthma. PLoS One 6(2):1–6, e17242
Hempel S, Suttorp MJ, Miles JNV, Wang Z, Maglione M, Morton S, Johnsen B, Valentine D, Shekelle PG (2011) Empirical evidence of associations between trial quality and effect sizes. Methods research report, AHRQ publication no. 11-EHC045-EF. Available at: http://effectivehealthcare.ahrq.gov
Herbison P, Hay-Smith J, Gillespie WJ (2006) Adjustment of meta-analyses on the basis of quality scores should be abandoned. J Clin Epidemiol 59:1249–1256
Howick J (2011) The philosophy of evidence-based medicine. Wiley-Blackwell, Chichester/Hoboken
Jadad AR, Moore RA, Carroll D et al (1996) Assessing the quality of reports of randomized clinical trials: is blinding necessary? Control Clin Trials 17:1–12
Jüni P, Witschi A, Bloch R, Egger M (1999) The hazards of scoring the quality of clinical trials for meta-analysis. JAMA 282(11):1054–1060
Jüni P, Altman DG, Egger M (2001) Assessing the quality of randomised controlled trials. In: Egger M, Smith GD, Altman DG (eds) Systematic reviews in health care: meta-analysis in context. BMJ Publishing Group, London
La Caze A (2011) The role of basic science in evidence-based medicine. Biol Philos 26(1):81–98
Linde K, Clausius N, Ramirez G et al (1997) Are the clinical effects of homoeopathy placebo effects? Lancet 350:834–843
Mayo D (1996) Error and the growth of experimental knowledge. University of Chicago Press, Chicago
Mayo D, Spanos A (2011) The error statistical philosophy. In: Mayo D, Spanos A (eds) Error and inference: recent exchanges on experimental reasoning, reliability, and the objectivity and rationality of science. Cambridge University Press, New York
Moher D, Jadad AR, Nichol G, Penman M, Tugwell P, Walsh S (1995) Assessing the quality of randomized controlled trials: an annotated bibliography of scales and checklists. Control Clin Trials 16:62–73
Moher D, Jadad AR, Tugwell P (1996) Assessing the quality of randomized controlled trials. Current issues and future directions. Int J Technol Assess Health Care 12(2):195–208
Moher D, Pham B, Jones A, Cook DJ, Jadad AR, Moher M, Tugwell P, Klassen TP (1998) Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet 352(9128):609–613
Olivo SA, Macedo LG, Gadotti IC, Fuentes J, Stanton T, Magee DJ (2007) Scales to assess the quality of randomized controlled trials: a systematic review. Phys Ther 88(2):156–175
Reisch JS, Tyson JE, Mize SG (1989) Aid to the evaluation of therapeutic studies. Pediatrics 84:815–827
Spitzer WO, Lawrence V, Dales R et al (1990) Links between passive smoking and disease: a best-evidence synthesis. A report of the working group on passive smoking. Clin Invest Med 13:17–42
Stegenga J (2011) Is meta-analysis the platinum standard of evidence? Stud Hist Philos Biol Biomed Sci 42:497–507
Stegenga J (forthcoming) Down with the hierarchies
Thagard P (1998) Ulcers and bacteria I: discovery and acceptance. Stud Hist Philos Biol Biomed Sci 29:107–136
Upshur R (2005) Looking for rules in a world of exceptions: reflections on evidence-based practice. Perspect Biol Med 48(4):477–489
Weber M (2009) The crux of crucial experiments: Duhem’s problems and inference to the best explanation. Br J Philos Sci 60:19–49
West S, King V, Carey TS, Lohr KN, McKoy N, Sutton SF, Lux L (2002) Systems to rate the strength of scientific evidence. Evidence report/technology assessment number 47, AHRQ publication no. 02-E016
Worrall J (2002) What evidence in evidence-based medicine? Philos Sci 69:S316–S330
Worrall J (2007) Why there’s no cause to randomize. Br J Philos Sci 58:451–488
Acknowledgements
This paper has benefited from discussion with Nancy Cartwright, Eran Tal, Jonah Schupbach, and audiences at the University of Utah, University of Toronto, and the Canadian Society for the History and Philosophy of Science. I owe the title to Frédéric Bouchard. Medical scientists Ken Bond and David Moher provided detailed written commentary. All remaining errors are mine alone. I am grateful for financial support from the Social Sciences and Humanities Research Council of Canada.
© 2015 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Stegenga, J. (2015). Herding QATs: Quality Assessment Tools for Evidence in Medicine. In: Huneman, P., Lambert, G., Silberstein, M. (eds) Classification, Disease and Evidence. History, Philosophy and Theory of the Life Sciences, vol 7. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-8887-8_10
Print ISBN: 978-94-017-8886-1
Online ISBN: 978-94-017-8887-8