Getting serious about test–retest reliability: a critique of retest research and some recommendations

Polit, Denise F.

doi:10.1007/s11136-014-0632-9

Published: 07 February 2014

Volume 23, pages 1713–1720, (2014)

Quality of Life Research

Denise F. Polit^1,2

Abstract

Purpose

To focus attention on the need for rigorous and carefully designed test–retest reliability assessments for new patient-reported outcomes and to encourage retest researchers to be thoughtful, ambitious, and creative in their retest efforts.

Methods

The paper outlines key challenges that confront retest researchers, calls attention to some limitations in meeting those challenges, and describes some strategies to improve retest research.

Results

Modest retest coefficients are often reported as acceptable, and many important decisions—such as the retest interval—appear not to be evidence-based. Retest assessments are seldom undertaken before a measure has been finalized, which rules out using retest data to select strong, reproducible items.

Conclusions

Strategies for improving retest research include seeking input from patients or experts regarding the stability of the construct to support decisions about the retest interval; analyzing item-level retest data to identify items to revise or discard; establishing a priori standards of acceptability for reliability coefficients; using large, heterogeneous, and representative retest samples; and collecting follow-up data to better understand consistent and inconsistent responses over time.
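
As an illustration of two of these strategies (item-level retest analysis and an a priori standard of acceptability), the sketch below computes an intraclass correlation coefficient for each item and flags items that fall below a threshold fixed in advance. It is a minimal sketch under stated assumptions, not the paper's method: the abstract does not name a coefficient or a cut-off, so the choice of the two-way random-effects, absolute-agreement, single-measurement ICC, the function names `icc_agreement` and `flag_items`, the example data, and the 0.70 threshold are all hypothetical. For ordinal items, other indexes such as weighted kappa may be more suitable.

```python
import math


def icc_agreement(scores):
    """Two-way random-effects, absolute-agreement, single-measurement ICC.

    scores: one row per respondent, one column per occasion,
    e.g. [(time1, time2), ...]. The choice of ICC form is an assumption.
    """
    n = len(scores)      # respondents
    k = len(scores[0])   # occasions (2 for a simple test-retest design)
    if n < 2 or k < 2:
        raise ValueError("need at least 2 respondents and 2 occasions")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)             # between respondents
    msc = ss_cols / (k - 1)             # between occasions
    mse = ss_err / ((n - 1) * (k - 1))  # residual

    denom = msr + (k - 1) * mse + k * (msc - mse) / n
    if denom == 0:
        return math.nan  # no variance at all; the ICC is undefined
    return (msr - mse) / denom


def flag_items(item_scores, threshold):
    """Return sorted names of items whose retest ICC is below the threshold.

    threshold must be chosen before looking at the data (a priori).
    """
    return sorted(
        name for name, scores in item_scores.items()
        if icc_agreement(scores) < threshold
    )


# Hypothetical retest data: (time 1, time 2) responses for five respondents
example_items = {
    "item_a": [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)],
    "item_b": [(1, 4), (2, 1), (3, 5), (4, 2), (5, 3)],
}
A_PRIORI_THRESHOLD = 0.70  # illustrative placeholder, not a recommendation
flagged = flag_items(example_items, A_PRIORI_THRESHOLD)
```

The absolute-agreement form is used here because it penalizes a systematic shift between occasions as well as inconsistent ordering of respondents; a consistency-type ICC would ignore such a shift.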

Author information

Authors and Affiliations

Humanalysis, Inc., Saratoga Springs, NY, USA
Denise F. Polit
Centre for Health Practice Innovation, Griffith University, Brisbane, QLD, Australia
Denise F. Polit

Corresponding author

Correspondence to Denise F. Polit.

About this article

Cite this article

Polit, D.F. Getting serious about test–retest reliability: a critique of retest research and some recommendations. Qual Life Res 23, 1713–1720 (2014). https://doi.org/10.1007/s11136-014-0632-9

Accepted: 20 January 2014
Published: 07 February 2014
Issue Date: August 2014
DOI: https://doi.org/10.1007/s11136-014-0632-9
