Abstract
Medical scientists employ ‘quality assessment tools’ (QATs) to measure the quality of evidence from clinical studies, especially randomized controlled trials (RCTs). These tools are designed to take into account various methodological details of clinical studies, including randomization, blinding, and other features of studies deemed relevant to minimizing bias and error. There are now dozens available. The various QATs on offer differ widely from each other, and second-order empirical studies show that QATs have low inter-rater reliability and low inter-tool reliability. This is an instance of a more general problem I call the underdetermination of evidential significance. Disagreements about the strength of a particular piece of evidence can be due to different—but in principle equally good—weightings of the fine-grained methodological features which constitute QATs.
Notes
- 1.
I discuss evidence hierarchies in more detail below. Such evidence hierarchies are commonly employed in evidence-based medicine. Examples include those of the Oxford Centre for Evidence-Based Medicine, the Scottish Intercollegiate Guidelines Network (SIGN), and The Grading of Recommendations Assessment, Development and Evaluation (GRADE) Working Group. These evidence hierarchies have recently received much criticism. See, for example, Bluhm (2005), Upshur (2005), Borgerson (2008), and La Caze (2011), and for a specific critique of placing meta-analysis at the top of such hierarchies, see Stegenga (2011). In footnote 5 below I cite several recent criticisms of the assumption that RCTs ought to be necessarily near the top of such hierarchies.
- 2.
The general norm is usually called the principle of total evidence, associated with Carnap (1947). See also Good (1967). Howick (2011) invokes the principle of total evidence for systematic reviews of evidence related to medical hypotheses. A presently unpublished paper by Bert Leuridan contains a good discussion of the principle of total evidence as it applies to medicine.
- 3.
- 4.
Although one only needs to consider the prominence of randomization in QATs to see that QATs have, in fact, been indirectly criticized by the recent literature criticizing the assumed ‘gold standard’ status of RCTs (see footnote 5). In the present paper I do not attempt a thorough normative evaluation of any particular QAT. Considering the role of randomization suggests what a large task a thorough normative evaluation of a particular QAT would be. But for a systematic survey of the most prominent QATs, see West et al. (2002).
- 5.
The view that RCTs are the ‘gold standard’ of evidence has recently been subjected to much philosophical criticism. See, for example, Worrall (2002, 2007), Cartwright (2007), and Cartwright (2010); for an assessment of the arguments for and against the gold standard status of RCTs, see Howick (2011). Observational studies also have QATs, such as QATSO (Quality Assessment Checklist for Observational Studies) and NOQAT (Newcastle-Ottawa Quality Assessment Scale – Case Control Studies).
- 6.
A note about terminology: sometimes the term ‘trial’ in the medical literature refers specifically to an experimental design (such as a randomized controlled trial) while the term ‘study’ refers to an observational design (such as a case control study), but this use is inconsistent. I will use both terms freely to refer to any method of generating evidence in biomedical research, including both experimental and observational designs.
- 7.
There are several commonly employed measures of effect size, including mean difference (for continuous variables), or odds ratio, risk ratio, or risk difference (for dichotomous variables). The weighting factor is sometimes determined by the QAT score, but a common method of determining the weight of a trial is simply based on the size of the trial (Egger et al. 1997), often by using the inverse variability of the data from a trial to measure that trial’s weight (because inverse variability is correlated with trial size).
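The weighting scheme described here can be sketched in a few lines. The following is an illustrative sketch only (the effect estimates and variances are hypothetical, not drawn from any trial discussed in this paper), showing how inverse-variance weights give larger trials more influence over the pooled estimate:

```python
def pooled_effect(effects, variances):
    """Combine per-trial effect estimates with inverse-variance weights.

    Larger trials tend to have smaller variance, so they receive more
    weight -- the correlation with trial size noted in the text.
    """
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return pooled, weights

# Three hypothetical trials: mean differences and their variances.
# The second trial has the smallest variance (largest trial) and so
# dominates the pooled estimate.
effects = [0.30, 0.10, 0.25]
variances = [0.04, 0.01, 0.09]
pooled, weights = pooled_effect(effects, variances)
```

Note that this is the fixed-effect form of the calculation; weighting by QAT score instead would simply replace `1.0 / v` with the trial's quality score.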
- 8.
- 9.
The parallel argument for the principle of total evidence is based on a concern to avoid ‘defeating’ evidence. Defeating evidence has the following property. Suppose some hypothesis H is confirmed by some piece of evidence (ec). Then some other piece of evidence (ed) is defeating if p(H|ec & ed) < p(H|ec). This could arise, for instance, because ed provides strong reason to believe that ec is, in fact, spurious.
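A toy Bayesian calculation illustrates the inequality (all numbers are hypothetical, and updating sequentially assumes the two pieces of evidence are conditionally independent given H):

```python
def posterior(prior, lik_if_H, lik_if_not_H):
    """Bayes' theorem for a binary hypothesis H."""
    num = lik_if_H * prior
    return num / (num + lik_if_not_H * (1 - prior))

# ec confirms H: it is three times likelier if H is true.
p_H_given_ec = posterior(0.5, lik_if_H=0.9, lik_if_not_H=0.3)

# ed (say, evidence that the trial producing ec was badly flawed) is
# likelier if H is false; conditioning on it pulls the probability of H
# below what ec alone supported -- so ed is defeating evidence.
p_H_given_ec_and_ed = posterior(p_H_given_ec, lik_if_H=0.2, lik_if_not_H=0.8)
```

With these numbers, p(H|ec) = 0.75 while p(H|ec & ed) = 3/7, satisfying the defeating condition p(H|ec & ed) < p(H|ec).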
- 10.
The latter notion—H passing a severe test T with x0—occurs when “1) x0 agrees with H, (for a suitable notion of ‘agreement’) and 2) with very high probability, test T would have produced a result that accords less well with H than does x0, if H were false or incorrect” (Mayo and Spanos 2011).
- 11.
For simplicity I will describe Cohen’s Kappa, which measures the agreement of two reviewers who classify items into discrete categories, and is computed as follows: κ = [p(a) – p(e)]/[1 – p(e)], where p(a) is the probability of agreement (based on the observed frequency of agreement) and p(e) is the probability of chance agreement (also calculated using observed frequency data). Kappa was first introduced as a statistical measure by Cohen (1960). For more than two reviewers, a measure called Fleiss’ Kappa can be used. I give an example of a calculation of κ below.
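The formula can be implemented directly. The following minimal sketch uses hypothetical gradings in the spirit of the Beth-and-Sara toy example discussed in the text:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters: [p(a) - p(e)] / [1 - p(e)]."""
    n = len(ratings_a)
    # p(a): observed frequency of agreement.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # p(e): chance agreement, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (p_a - p_e) / (1 - p_e)

# Hypothetical example: two reviewers grade six trials 'high' or 'low'.
beth = ['high', 'high', 'low', 'low', 'high', 'low']
sara = ['high', 'low', 'low', 'low', 'high', 'high']
kappa = cohens_kappa(beth, sara)
```

Here the reviewers agree on four of six trials (p(a) = 2/3), but with balanced marginals chance agreement is p(e) = 1/2, so κ = 1/3: raw agreement well above half, yet only modest agreement beyond chance.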
- 12.
I owe Jonah Schupbach thanks for noting that a κ measure can not only seem inappropriately low, as in the above cases of poor inter-rater reliability, but can seem inappropriately high as well. If a κ measure approaches 1, this might suggest agreement which is ‘too good to be true’. Returning to my toy example, if Beth and Sara had a very high κ measure, then one might wonder if they colluded in their grading. Thus when using a κ statistic to assess inter-rater reliability, we should hope for a κ measure above some minimal threshold (below which indicates too much disagreement) but below some maximum threshold (above which indicates too much agreement). What exactly these thresholds should be is beyond the scope of this paper (and is, I suppose, context sensitive).
- 13.
For this latter reason I refrain from describing or illustrating the particular statistical analyses employed in tests of the inter-tool reliability of QATs, as I did in section “Inter-rater reliability” on tests of the inter-rater reliability of QATs. Nearly every published test of inter-rater reliability uses a different statistic to measure agreement of quality assessment between tools. Analyses employed include Kendall’s rank correlation coefficient (τ), Kendall’s coefficient of concordance (W), and Spearman’s rank correlation coefficient (ρ).
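For the simplest of these statistics, Spearman’s ρ can be computed from the differences in ranks. The following sketch assumes no tied scores (the QAT scores for the five trials are hypothetical):

```python
def spearman_rho(x, y):
    """Spearman's rank correlation, no ties: rho = 1 - 6*sum(d^2)/(n*(n^2-1))."""
    rank = lambda v: [sorted(v).index(u) + 1 for u in v]
    rx, ry = rank(x), rank(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    n = len(x)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical quality scores for five trials from two different QATs.
# The tools disagree on the ordering of the top two trials.
qat1 = [4.0, 3.5, 2.0, 5.0, 1.0]
qat2 = [3.8, 4.1, 2.5, 4.9, 1.2]
rho = spearman_rho(qat1, qat2)
```

A ρ near 1 would indicate that the two tools rank trials in nearly the same order, even if their absolute scores differ; the empirical studies discussed in the text report much weaker inter-tool agreement than this toy case.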
- 14.
This latter consideration is somewhat controversial, both because it has been contradicted by other empirical studies, and because it assumes that the correct estimate of the efficacy of medical interventions is given by what are purported to be higher quality studies.
- 15.
The relata in such purported causal relations are, of course, the medical intervention under investigation and the change in value of one or more parameters of a group of subjects.
- 16.
There is a tendency among medical scientists to suppose that the relative importance of various methodological features is merely an empirical matter. One need not entirely sympathize with such methodological naturalism to agree with the point expressed by Cho and Bero here: we lack reasons to prefer one weighting of methodological features over another, regardless of whether one thinks of these reasons as empirical or principled.
- 17.
See also Olivo et al. (2007) for an empirical critique of QATs.
References
Balk EM, Bonis PA, Moskowitz H, Schmid CH, Ioannidis JP, Wang C, Lau J (2002) Correlation of quality measures with estimates of treatment effect in meta-analyses of randomized controlled trials. JAMA 287(22):2973–2982
Bechtel W (2002) Aligning multiple research techniques in cognitive neuroscience: why is it important? Philos Sci 69:S48–S58
Bluhm R (2005) From hierarchy to network: a richer view of evidence for evidence-based medicine. Perspect Biol Med 48(4):535–547
Borgerson K (2008) Valuing and evaluating evidence in medicine. PhD dissertation, University of Toronto
Carnap R (1947) On the application of inductive logic. Philos Phenomenol Res 8:133–148
Cartwright N (2007) Are RCTs the gold standard? Biosocieties 2:11–20
Cartwright N (2010) The long road from ‘it works somewhere’ to ‘it will work for us’. Philosophy of Science Association, Presidential Address
Chalmers TC, Smith H, Blackburn B et al (1981) A method for assessing the quality of a randomized control trial. Control Clin Trials 2:31–49
Cho MK, Bero LA (1994) Instruments for assessing the quality of drug studies published in the medical literature. JAMA 272:101–104
Clark HD, Wells GA, Huët C, McAlister FA, Salmi LR, Fergusson D, Laupacis A (1999) Assessing the quality of randomized trials: reliability of the Jadad scale. Control Clin Trials 20:448–452
Cohen J (1960) A coefficient of agreement for nominal scales. Educ Psychol Meas 20(1):37–46
Egger M, Smith GD, Phillips AN (1997) Meta-analysis: principles and procedures. Br Med J 315:1533–1537
Good IJ (1967) On the principle of total evidence. Br J Philos Sci 17(4):319–321
Hacking I (1981) Do we see through a microscope? Pac Philos Quart 63:305–322
Hartling L, Ospina M, Liang Y, Dryden D, Hooten N, Seida J, Klassen T (2009) Risk of bias versus quality assessment of randomised controlled trials: cross sectional study. Br Med J 339:b4012
Hartling L, Bond K, Vandermeer B, Seida J, Dryden DM, Rowe BH (2011) Applying the risk of bias tool in a systematic review of combination long-acting beta-agonists and inhaled corticosteroids for persistent asthma. PLoS One 6(2):1–6, e17242
Hempel S, Suttorp MJ, Miles JNV, Wang Z, Maglione M, Morton S, Johnsen B, Valentine D, Shekelle PG (2011) Empirical evidence of associations between trial quality and effect sizes. Methods research report, AHRQ publication no. 11-EHC045-EF. Available at: http://effectivehealthcare.ahrq.gov
Herbison P, Hay-Smith J, Gillespie WJ (2006) Adjustment of meta-analyses on the basis of quality scores should be abandoned. J Clin Epidemiol 59:1249–1256
Howick J (2011) The philosophy of evidence-based medicine. Wiley-Blackwell, Chichester/Hoboken
Jadad AR, Moore RA, Carroll D et al (1996) Assessing the quality of reports of randomized clinical trials: is blinding necessary? Control Clin Trials 17:1–12
Jüni P, Witschi A, Bloch R, Egger M (1999) The hazards of scoring the quality of clinical trials for meta-analysis. JAMA 282(11):1054–1060
Jüni P, Altman DG, Egger M (2001) Assessing the quality of randomised controlled trials. In: Egger M, Smith GD, Altman DG (eds) Systematic reviews in health care: meta-analysis in context. BMJ Publishing Group, London
La Caze A (2011) The role of basic science in evidence-based medicine. Biol Philos 26(1):81–98
Linde K, Clausius N, Ramirez G et al (1997) Are the clinical effects of homoeopathy placebo effects? Lancet 350:834–843
Mayo D (1996) Error and the growth of experimental knowledge. University of Chicago Press, Chicago
Mayo D, Spanos A (2011) The error statistical philosophy. In: Mayo D, Spanos A (eds) Error and inference: recent exchanges on experimental reasoning, reliability, and the objectivity and rationality of science. Cambridge University Press, New York
Moher D, Jadad AR, Nichol G, Penman M, Tugwell P, Walsh S (1995) Assessing the quality of randomized controlled trials: an annotated bibliography of scales and checklists. Control Clin Trials 16:62–73
Moher D, Jadad AR, Tugwell P (1996) Assessing the quality of randomized controlled trials. Current issues and future directions. Int J Technol Assess Health Care 12(2):195–208
Moher D, Pham B, Jones A, Cook DJ, Jadad AR, Moher M, Tugwell P, Klassen TP (1998) Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet 352(9128):609–613
Olivo SA, Macedo LG, Gadotti IC, Fuentes J, Stanton T, Magee DJ (2007) Scales to assess the quality of randomized controlled trials: a systematic review. Phys Ther 88(2):156–175
Reisch JS, Tyson JE, Mize SG (1989) Aid to the evaluation of therapeutic studies. Pediatrics 84:815–827
Spitzer WO, Lawrence V, Dales R et al (1990) Links between passive smoking and disease: a best-evidence synthesis. A report of the working group on passive smoking. Clin Invest Med 13:17–42
Stegenga J (2011) Is meta-analysis the platinum standard of evidence? Stud Hist Philos Biol Biomed Sci 42:497–507
Stegenga J (forthcoming) Down with the hierarchies
Thagard P (1998) Ulcers and bacteria I: discovery and acceptance. Stud Hist Philos Biol Biomed Sci 29:107–136
Upshur R (2005) Looking for rules in a world of exceptions: reflections on evidence-based practice. Perspect Biol Med 48(4):477–489
Weber M (2009) The crux of crucial experiments: Duhem’s problems and inference to the best explanation. Br J Philos Sci 60:19–49
West S, King V, Carey TS, Lohr KN, McKoy N, Sutton SF, Lux L (2002) Systems to rate the strength of scientific evidence. Evidence report/technology assessment number 47, AHRQ publication no. 02-E016
Worrall J (2002) What evidence in evidence-based medicine? Philos Sci 69:S316–S330
Worrall J (2007) Why there’s no cause to randomize. Br J Philos Sci 58:451–488
Acknowledgements
This paper has benefited from discussion with Nancy Cartwright, Eran Tal, Jonah Schupbach, and audiences at the University of Utah, University of Toronto, and the Canadian Society for the History and Philosophy of Science. I owe the title to Frédéric Bouchard. Medical scientists Ken Bond and David Moher provided detailed written commentary. All remaining errors are mine alone. I am grateful for financial support from the Social Sciences and Humanities Research Council of Canada.
© 2015 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Stegenga, J. (2015). Herding QATs: Quality Assessment Tools for Evidence in Medicine. In: Huneman, P., Lambert, G., Silberstein, M. (eds) Classification, Disease and Evidence. History, Philosophy and Theory of the Life Sciences, vol 7. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-8887-8_10
Print ISBN: 978-94-017-8886-1
Online ISBN: 978-94-017-8887-8