
Approaches to the Development and Use of PRO Measures: A New Roadmap

Living with Chronic Disease: Measuring Important Patient-Reported Outcomes

Abstract

Various recent developments are driving an evolution in the way PRO measures are developed. For example, drug regulatory agencies (e.g. the FDA) can now be more substantively involved in, or consulted on, measure development work; the patient's role is now seen as going beyond that of study subject to that of research partner; and the use of modern test theory in developing scoring algorithms, and in measure development generally, is the new 'gold standard'.

The challenge


In the measurement of glucose, the glucose assay and computational algorithm are embodied in the instrument that is used to provide glucose readings…The quality of life measurement process is essentially an assay that defines the construct, uses question and response formats to obtain answers, and has an algorithm that scores those responses to yield quality of life readings. However, because of the subjectivity and indirect measurement issues, it is harder to adopt a standard assay for quality of life.

Testa 2000


References

  • Altman D, Machin D, Bryant T, Gardner M (2013) Statistics with confidence: confidence intervals and statistical guidelines. John Wiley & Sons, Hoboken

  • Andrich D, Sheridan B, Luo G (2012a) Interpreting RUMM 2030 analysis: part I dichotomous data. RUMM Laboratory, Perth

  • Andrich D, Humphry SM, Marais I (2012b) Quantifying local, response dependence between two polytomous items using the Rasch model. Appl Psychol Meas 36(4):309–324

  • Baldwin M, Spong A, Doward L, Gnanasakthy A (2011) Patient-reported outcomes, patient-reported information: from randomized controlled trials to the social web and beyond. Patient 4:11–17

  • Bender JL, Jimenez-Marroquin MC, Jadad AR (2011) Seeking support on Facebook: a content analysis of breast cancer groups. J Med Internet Res 13(1). https://doi.org/10.2196/jmir.1560

  • Bond T (2004) Validity and assessment: a Rasch measurement perspective. Metodol Ciencias Del Comportamiento 5:179–194

  • Bond T, Fox CM (2015) Applying the Rasch model: fundamental measurement in the human sciences. Routledge, London

  • Both H, Essink-Bot M-L, Busschbach J, Nijsten T (2007) Critical review of generic and dermatology-specific health-related quality of life instruments. J Investig Dermatol 127(12):2726–2739

  • Braun V, Clarke V (2006) Using thematic analysis in psychology. Qual Res Psychol 3(2):77–101

  • Brod M, Tesler LE, Christensen TL (2009) Qualitative research and content validity: developing best practices based on science and experience. Qual Life Res 18(9):1263

  • Brown TA (2014) Confirmatory factor analysis for applied research. Guilford Publications, New York

  • Byrne B (2011) Structural equation modeling with Mplus: basic concepts, applications, and programming. Routledge, New York. http://books.google.de/books?id=u58MPwAACAAJ

  • Coons SJ, Gwaltney CJ, Hays RD, Lundy JJ, Sloan JA, Revicki DA, Lenderking WR, Cella D, Basch E (2009) Recommendations on evidence needed to support measurement equivalence between electronic and paper-based patient-reported outcome (PRO) measures: ISPOR ePRO Good Research Practices Task Force report. Value Health 12(4):419–429

  • Coste J, Guillemin F, Pouchot J, Fermanian J (1997) Methodological approaches to shortening composite measurement scales. J Clin Epidemiol 50(3):247–252

  • Costello AB, Osborne JW (2005) Best practices in exploratory factor analysis: four recommendations for getting the most from your analysis. Pract Assess Res Eval 10(7):1–9

  • Dalal AA, Nelson L, Gilligan T, McLeod L, Lewis S, DeMuro-Mercon C (2011) Evaluating patient-reported outcome measurement comparability between paper and alternate versions, using the Lung Function Questionnaire as an example. Value Health 14(5):712–720

  • DeVellis RF (2016) Scale development: theory and applications. Applied social research methods. SAGE Publications, Thousand Oaks. https://books.google.de/books?id=48ACCwAAQBAJ

  • DiBenedetti DB, Coles TM, Sharma T (2013) Recruiting patients with a rare blood disorder and their caregivers through social media. Value Health 16(3):A51

  • Dillman DA (2007) Mail and internet surveys: the tailored design method. Wiley, Hoboken

  • Edelen MO, Thissen D, Teresi JA, Kleinman M, Ocepek-Welikson K (2006) Identification of differential item functioning using item response theory and the likelihood-based model comparison approach: application to the mini-mental state examination. Med Care 44(11):S134–S142

  • Epstein RS (2000) Responsiveness in quality-of-life assessment: nomenclature, determinants, and clinical applications. Med Care 38(9 Suppl):II91

  • Fairclough DL (2010) Design and analysis of quality of life studies in clinical trials. CRC Press, Boca Raton

  • Fayers PM, Machin D (2007) Quality of life: the assessment, analysis and interpretation of patient-reported outcomes, 2nd edn. John Wiley & Sons, West Sussex

  • Food and Drug Administration (2009) Guidance for industry: patient-reported outcome measures: use in medical product development to support labeling claims. Fed Regist 74(235):65132–65133

  • Forsythe LP, Szydlowski V, Murad MH, Ip S, Wang Z, Elraiyah TA, Fleurence R, Hickam DH (2014) A systematic review of approaches for engaging patients for research on rare diseases. J Gen Intern Med 29(3):788–800

  • Frank L, Forsythe L, Ellis L, Schrandt S, Sheridan S, Gerson J, Konopka K, Daugherty S (2015) Conceptual and practical foundations of patient engagement in research at the Patient-Centered Outcomes Research Institute. Qual Life Res 24(5):1033–1041

  • Frost J, Okun S, Vaughan T, Heywood J, Wicks P (2011) Patient-reported outcomes as a source of evidence in off-label prescribing: analysis of data from PatientsLikeMe. J Med Internet Res 13(1):e6

  • Frost MH, Reeve BB, Liepa AM, Stauffer JW, Hays RD, Mayo/FDA Patient-Reported Outcomes Consensus Meeting Group (2007a) What is sufficient evidence for the reliability and validity of patient-reported outcome measures? Value Health 10:S94–S105

  • Frost MH, Bonomi AE, Cappelleri JC, Schünemann HJ, Moynihan TJ, Aaronson NK (2007b) Applying quality-of-life data formally and systematically into clinical practice. Mayo Clin Proc 82(10):1214–1228

  • Gorecki C, Lamping DL, Nixon J, Brown JM, Cano S (2011) Applying mixed methods to pretest the pressure ulcer quality of life (PU-QOL) instrument. Qual Life Res 21(3):441–451

  • Gustafson DL, Woodworth CF (2014) Methodological and ethical issues in research using social media: a metamethod of human papillomavirus vaccine studies. BMC Med Res Methodol 14(1):127

  • Guyatt G, Walter S, Norman G (1987) Measuring change over time: assessing the usefulness of evaluative instruments. J Chronic Dis 40(2):171–178

  • Guyatt GH, Osoba D, Wu AW, Wyrwich KW, Norman GR (2002) Methods to explain the clinical significance of health status measures. Mayo Clin Proc 77(4):371–383

  • Guyatt GH, Feeny DH, Patrick DL (1993) Measuring health-related quality of life. Ann Intern Med 118(8):622–629

  • Hamm MP, Chisholm A, Shulhan J, Milne A, Scott SD, Given LM, Hartling L (2013) Social media use among patients and caregivers: a scoping review. BMJ Open 3(5). https://doi.org/10.1136/bmjopen-2013-002819

  • Haynes SN, Richard D, Kubany ES (1995) Content validity in psychological assessment: a functional approach to concepts and methods. Psychol Assess 7(3):238

  • Hays RD, Hadorn D (1992) Responsiveness to change: an aspect of validity, not a separate dimension. Qual Life Res 1(1):73–75

  • Haywood K, Brett J, Salek S, Marlett N, Penman C, Shklarov S, Norris C, Santana MJ, Staniszewska S (2015) Patient and public engagement in health-related quality of life and patient-reported outcomes research: what is important and why should we care? Findings from the first ISOQOL patient engagement symposium. Qual Life Res 24(5):1069–1076

  • Higginson IJ, Carr AJ (2001) Measuring quality of life: using quality of life measures in the clinical setting. BMJ 322(7297):1297

  • Holgado-Tello FP, Chacón-Moscoso S, Barbero-García I, Vila-Abad E (2010) Polychoric versus Pearson correlations in exploratory and confirmatory factor analysis of ordinal variables. Qual Quant 44(1):153–166

  • Johnson C, Aaronson N, Blazeby JM, Bottomley A, Fayers P, Koller M, Kuliś D et al (2011) The European Organization for Research and Treatment of Cancer: guidelines for developing questionnaire modules, 4th edn. Qual Life Res 2

  • Kamudoni P, Mueller B, Salek MS (2015) The development and validation of a disease-specific quality of life measure in hyperhidrosis: the Hyperhidrosis Quality of Life Index (HidroQOL©). Qual Life Res 24(4):1017–1027

  • Kerr C, Nixon A, Wild D (2010) Assessing and demonstrating data saturation in qualitative inquiry supporting patient-reported outcomes research. Expert Rev Pharmacoecon Outcomes Res 10(3):269–281

  • Kline P (1994) An easy guide to factor analysis. Taylor & Francis Group, London. http://books.google.co.uk/books?id=6PHzhLD-bSoC

  • Lackey NR, Sullivan JJ, Pett MA (2003) Making sense of factor analysis: the use of factor analysis for instrument development in health care research. SAGE Publications Ltd, London. http://books.google.co.uk/books?id=5Jyaa2LQWbQC

  • Leidy NK, Murray LT (2013) Patient-reported outcome (PRO) measures for clinical trials of COPD: the EXACT and E-RS. COPD: J Chron Obstruct Pulmon Dis 10(3):393–398

  • Leidy NK, Revicki DA, Genesté B (1999) Recommendations for evaluating the validity of quality of life claims for labeling and promotion. Value Health 2(2):113–127

  • Liang MH (2000) Longitudinal construct validity: establishment of clinical meaning in patient evaluative instruments. Med Care 38(9):II

  • Linacre JM (1998) Detecting multidimensionality: which residual data-type works best? J Outcome Meas 2:266–283

  • Linacre JM (1999) Investigating rating scale category utility. J Outcome Meas 3(2):103

  • Linacre JM (2009) A user's guide to WINSTEPS/MINISTEPS. Rasch-Model Computer Programs, Chicago. Winsteps.com

  • Lipsey MW (1990) Design sensitivity: statistical power for experimental research. SAGE Publications Ltd, London

  • Liu H, Cella D, Gershon R, Shen J, Morales LS, Riley W et al (2010) Representativeness of the patient-reported outcomes measurement information system internet panel. J Clin Epidemiol 63(11):1169–1178

  • Lohr KN (2002) Assessing health status and quality-of-life instruments: attributes and review criteria. Qual Life Res 11(3):193–205. https://doi.org/10.1023/a:1015291021312

  • Lynn MR (1986) Determination and quantification of content validity. Nurs Res 35(6):382–385

  • Messick S (1988) The once and future issues of validity: assessing the meaning and consequences of measurement. Test Valid 33:45

  • Muehlhausen W, Doll H, Quadri N, Fordham B, O'Donohoe P, Dogar N, Wild DJ (2015) Equivalence of electronic and paper administration of patient-reported outcome measures: a systematic review and meta-analysis of studies conducted between 2007 and 2013. Health Qual Life Outcomes 13(1):167

  • Muthén LK, Muthén BO (2010) Mplus user's guide, v. 6.1. Muthén & Muthén, Los Angeles

  • Norman GR, Sloan JA, Wyrwich KW (2003) Interpretation of changes in health-related quality of life: the remarkable universality of half a standard deviation. Med Care 41(5):582

  • Norquist JM, Girman C, Fehnel S, DeMuro-Mercon C, Santanello N (2011) Choice of recall period for patient-reported outcome (PRO) measures: criteria for consideration. Qual Life Res 21:1013–1021

  • Nunnally JC, Bernstein IH (1994) Psychometric theory. McGraw-Hill, London. http://books.google.de/books?id=r0fuAAAAMAAJ

  • Ozer ZC, Firat MZ, Bektas HA (2009) Confirmatory and exploratory factor analysis of the caregiver quality of life index-cancer with Turkish samples. Qual Life Res 18(7):913–921

  • Pallant JF, Tennant A (2007) An introduction to the Rasch measurement model: an example using the Hospital Anxiety and Depression Scale (HADS). Br J Clin Psychol 46(1):1–18

  • Patrick DL, Burke LB, Gwaltney CJ, Leidy NK, Martin ML, Molsen E, Ring L (2011a) Content validity – establishing and reporting the evidence in newly developed patient-reported outcomes (PRO) instruments for medical product evaluation: ISPOR PRO good research practices task force report: part 2 – assessing respondent understanding. Value Health 14(8):978–988

  • Patrick DL, Burke LB, Gwaltney CJ, Leidy NK, Martin ML, Molsen E, Ring L (2011b) Content validity – establishing and reporting the evidence in newly developed patient-reported outcomes (PRO) instruments for medical product evaluation: ISPOR PRO good research practices task force report: part 1 – eliciting concepts for a new PRO instrument. Value Health 14(8):967–977. https://doi.org/10.1016/j.jval.2011.06.014

  • Patrick DL, Deyo RA (1989) Generic and disease-specific measures in assessing health status and quality of life. Med Care 27:217–232

  • Pope C, Mays N (2008) Qualitative research in health care, 3rd edn. Wiley-Blackwell, Oxford. http://books.google.de/books?id=DMxS6R3s2a4C

  • Prieto L, Alonso J, Lamarca R (2003) Classical test theory versus Rasch analysis for quality of life questionnaire reduction. Health Qual Life Outcomes 1(1):27

  • Prinsen CAC, Lindeboom R, Sprangers MAG, Legierse CM, de Korte J (2010) Health-related quality of life assessment in dermatology: interpretation of Skindex-29 scores using patient-based anchors. J Invest Dermatol 130(5):1318–1322

  • Reeve BB, Hays RD, Chang CH, Perfetto EM (2007) Applying item response theory to enhance health outcomes assessment. Qual Life Res 16:1–3

  • Reeve BB, Mâsse LC (2004) Item response theory modeling for questionnaire evaluation. In: Presser S, Rothgeb JM, Couper MP, Lessler JT, Martin E, Martin J, Singer E (eds) Methods for testing and evaluating survey questionnaires. John Wiley & Sons, Inc, Hoboken, pp 247–273. https://doi.org/10.1002/0471654728.ch13

  • Reise SP, Haviland MG (2005) Item response theory and the measurement of clinical change. J Pers Assess 84(3):228–238

  • Reise SP, Waller NG, Comrey AL (2000) Factor analysis and scale revision. Psychol Assess 12(3):287

  • Revicki D, Hays RD, Cella D, Sloan J (2008) Recommended methods for determining responsiveness and minimally important differences for patient-reported outcomes. J Clin Epidemiol 61(2):102–109

  • Rothman M, Gnanasakthy A, Wicks P, Papadopoulos EJ (2015) Can we use social media to support content validity of patient-reported outcome instruments in medical product development? Value Health 18(1):1–4

  • Rothman ML, Beltran P, Cappelleri JC, Lipscomb J, Teschendorf B, Mayo/FDA Patient-Reported Outcomes Consensus Meeting Group (2007) Patient-reported outcomes: conceptual issues. Value Health 10:S66–S75. https://doi.org/10.1111/j.1524-4733.2007.00269.x

  • Salek MS, Luscombe DK (1992) Health-related quality of life assessment: a review. J Drug Dev 5:137

  • Salek S (1998) Compendium of quality of life instruments, vol 4. Euromed Communications, Haslemere. http://books.google.de/books?id=2_LaAAAAMAAJ

  • Schmitt TA (2011) Current methodological considerations in exploratory and confirmatory factor analysis. J Psychoeduc Assess 29(4):304–321

  • Shea T, Tennant A, Pallant J (2009) Rasch model analysis of the Depression, Anxiety and Stress Scales (DASS). BMC Psychiatry 9(1):21

  • Smith EV (2002) Detecting and evaluating the impact of multidimensionality using item fit statistics and principal component analysis of residuals. J Appl Meas 3(2):205–231

  • Streiner DL, Norman GR (2008) Health measurement scales: a practical guide to their development and use. Oxford University Press, Oxford. http://books.google.co.uk/books?id=UbKijeRqndwC

  • Tennant A, Conaghan PG (2007) The Rasch measurement model in rheumatology: what is it and why use it? When should it be applied, and what should one look for in a Rasch paper? Arthritis Care Res 57(8):1358–1362

  • Tennant A, McKenna SP, Hagell P (2004) Application of Rasch analysis in the development and application of quality of life instruments. Value Health 7(Suppl 1):S22–S26

  • Tennant A, Pallant JF (2006) Unidimensionality matters. Rasch Measur Trans 20(1):1048–1051

  • Teresi JA, Fleishman JA (2007) Differential item functioning and health assessment. Qual Life Res 16:33–42

  • Terwee CB, Bot SDM, De Boer MR, van der Windt DAWM, Knol DL, Dekker J, Bouter LM, De Vet HCW (2007) Quality criteria were proposed for measurement properties of health status questionnaires. J Clin Epidemiol 60(1):34–42

  • Thornicroft G, Slade M (2000) Are routine outcome measures feasible in mental health? Qual Health Care 9(2):84

  • Tourangeau R (1984) Cognitive sciences and survey methods. In: Jabine TB et al (eds) Cognitive aspects of survey methodology: building a bridge between disciplines. National Academy Press, Washington, DC, pp 73–100

  • Tweet MS, Gulati R, Aase LA, Hayes SN (2011) Spontaneous coronary artery dissection: a disease-specific, social networking community-initiated study. Mayo Clin Proc 86(9):845–850

  • Valderas JM, Ferrer M, Mendívil J, Garin O, Rajmil L, Herdman M, Alonso J (2008) Development of EMPRO: a tool for the standardized assessment of patient-reported outcome measures. Value Health 11(4):700–708

  • Ware JE, Kosinski M, Keller SD (1994) SF-36 physical and mental health summary scales: a user's manual. The Health Institute, Boston, MA

  • Whittemore R, Chase SK, Mandle CL (2001) Validity in qualitative research. Qual Health Res 11(4):522–537

  • Williams B, Onsman A, Brown T (2010) Exploratory factor analysis: a five-step guide for novices. J Emerg Primary Health Care 8(3):1–13

  • de Wit MPT, Kvien TK, Gossec L (2015) Patient participation as an integral part of patient-reported outcomes development ensures the representation of the patient voice: a case study from the field of rheumatology. RMD Open 1(1):e000129

  • Wright BD, Masters GN (1982) Rating scale analysis. Mesa Press, Chicago. http://books.google.de/books?id=ZfjFQgAACAAJ

  • Wynd CA, Schmidt B, Schaefer MA (2003) Two quantitative approaches for estimating content validity. West J Nurs Res 25(5):508–518

  • Wyrwich KW, Bullinger M, Aaronson N, Hays RD, Patrick DL, Symonds T (2005) Estimating clinically significant differences in quality of life outcomes. Qual Life Res 14(2):285–295



Appendix: Technical Notes

Practical Considerations in Study Design During PRO Development

Sample Size

Sample size considerations differ between qualitative and quantitative research. In the former, it is not possible to determine the needed sample size prior to data collection; rather, sample adequacy is judged in the course of data collection, which continues until 'saturation' has been reached, i.e. the point at which further data collection (e.g. interviews) yields no new data (Kerr et al. 2010). In quantitative research, on the other hand, sample size depends on the particular statistical analysis to be performed. The required sample size will reflect the intended power of the analysis, the magnitude of the effect size to be observed, the chosen level of significance and the reliability of measurement (Lipsey 1990). Exploratory studies, where the magnitude of the effect size and the reliability are unknown a priori, may present some challenges in this regard. A useful recommendation is to use a sample matrix based on key disease or treatment characteristics for a particular disease, where each subcategory (each cell) should have at least 15 subjects (Johnson et al. 2011). For initial estimates of reliability and validity, at least 200 subjects are recommended (Frost et al. 2007a). If a test-retest correlation of 0.85 is observed with a sample size of 100, the 95% confidence interval is 0.78–0.90, while a sample size of 150 would narrow this to 0.80–0.89 (Johnson et al. 2011); a sketch of this calculation follows.
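The text above quotes these interval widths without showing the computation; a common way to obtain them is the Fisher z-transform of the correlation coefficient. A minimal Python sketch (the helper name fisher_ci is ours, and scipy is assumed to be available):

```python
import numpy as np
from scipy.stats import norm

def fisher_ci(r, n, level=0.95):
    """Confidence interval for a correlation via the Fisher z-transform."""
    z = np.arctanh(r)                     # Fisher z-transform of r
    se = 1.0 / np.sqrt(n - 3)             # approximate standard error of z
    zcrit = norm.ppf(0.5 + level / 2.0)   # e.g. 1.96 for a 95% interval
    return np.tanh(z - zcrit * se), np.tanh(z + zcrit * se)

# Reproduces the figures quoted above: r = 0.85 with n = 100 vs n = 150.
print(fisher_ci(0.85, 100))   # approx (0.78, 0.90)
print(fisher_ci(0.85, 150))   # approx (0.80, 0.89)
```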

Rules of thumb on sample size requirements for correlation analysis and factor analysis vary in their guidance, ranging from 5 to 20 observations per variable, with suggestions both above and below this range (Costello and Osborne 2005). However, the minimum sample size required for accurate recovery of the population factor pattern matrix is influenced by many factors, including the distribution and reliability of the variables, the degree of association among variables, the communalities and the degree to which factors are overidentified (Reise et al. 2000; Schmitt 2011). Thus, power and precision ought to be core considerations in parametric estimation-based factor methods (Schmitt 2011), while in non-parametric approaches a sample size of 100 may be adequate when communalities are high (Reise et al. 2000).

Assessment of the adequacy of sample size for a given statistical test should be made alongside other key considerations relating to the sample, for instance, ensuring that the target population and all important disease characteristics are adequately represented. In any case, appropriate tools should be used to convey the uncertainty surrounding estimates, e.g. presenting results with confidence intervals.

Missing Data

Situations where a question or an entire questionnaire has not been completed are common during data collection in QoL research. The reason behind the missing data influences the choice of tools for dealing with the consequent problems in data analysis; it matters, for example, whether an item was skipped by mistake or because of its irrelevance. There are three main classifications of missing data patterns: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) (Fayers and Machin 2007). MCAR arises where the probability of a missing item (or questionnaire) is independent of previous scores and of unobserved current and future scores. MAR occurs where missingness depends on known covariates and scores of previous items, but not on the unobserved scores. In the third case, the unobserved HRQoL itself influences the missingness. The presence of MAR and MCAR is not worrisome, as their impact on accurate measurement of HRQoL is minimal (Leidy et al. 1999). MNAR causes the greatest concern, as its presence may lead to an over- or underestimation of HRQoL, highlighting the need for transparent approaches to addressing it.

There are no clear guidelines on the number of missing items that warrants excluding a respondent's entire questionnaire from analyses, although Streiner and Norman (2008) have mentioned a ceiling of 5% of items. It is worth noting that where Rasch scoring is applied, a higher number of missing items may be tolerated without much bias in measurement (Fayers and Machin 2007). In some situations (e.g. during instrument development work), data imputation to replace the missing values offers a viable alternative. This can be done in various ways, including carrying the last observed value forward, substituting a simple mean or using regression methods (Fairclough 2010); a sketch of simple person-mean imputation follows below. Other, more sophisticated imputation approaches, such as hot-deck imputation and Markov chain methods, are capable of preserving variability in the data.
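As an illustration of the simple-mean family of approaches just mentioned, the sketch below imputes a respondent's missing items with that respondent's mean over completed items, subject to a missingness ceiling. The function name and the 25% ceiling are illustrative choices, not a prescription from the text:

```python
import numpy as np

def person_mean_impute(scores, max_missing=0.25):
    """Impute missing items with the respondent's mean over completed items.

    scores: 2-D array (respondents x items) with np.nan marking missing items.
    Respondents missing more than `max_missing` of their items are left
    untouched (candidates for exclusion instead). The 25% ceiling is
    illustrative; the text above cites a stricter 5% rule of thumb
    (Streiner and Norman 2008).
    """
    scores = scores.astype(float).copy()
    n_items = scores.shape[1]
    for row in scores:                      # rows are views: edits stick
        miss = np.isnan(row)
        if 0 < miss.sum() <= max_missing * n_items:
            row[miss] = row[~miss].mean()   # person mean over completed items
    return scores

data = np.array([[1, 2, np.nan, 2],                  # one skipped item -> imputed
                 [3, np.nan, np.nan, np.nan]])       # too sparse -> left as is
print(person_mean_impute(data))
```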

Strategies for preventing missing data from arising in the first place should also be considered, e.g. inviting patients to cross-check their questionnaires to make sure all items are completed (assuming paper-and-pencil, in-person administration). Here, electronic PRO administration may have advantages.

Suggestions for Statistical Analysis in PROM Development

Data should initially be explored through descriptive analysis of each variable: measures of central tendency (mean, median) and variability (standard deviation, interquartile range) for continuous variables, and frequency counts for ordinal and categorical variables. Further analyses will involve making inferences based on various hypothesis tests. In order to reject a null hypothesis, the observed probability of a false positive (a type I error), as reflected in the P-value, needs to be less than the required level of significance (α) (Altman et al. 2013). Most studies use a level of significance of 5%. Where several hypotheses are tested simultaneously, a Bonferroni adjustment should be applied to the level of significance, as α/k, where k is the number of tests (Fayers and Machin 2007). The following conventions apply (a sketch using common statistical routines follows the list):

  • Testing for differences between two means should use an independent or paired t-test, depending on whether the two samples are independent or related. The Mann-Whitney and Wilcoxon tests are the respective non-parametric alternatives for situations where the assumptions of the t-tests are not met; these tests are somewhat more conservative.

  • Hypothesis tests involving differences among more than two groups should be carried out using ANOVA. Where the core assumptions of this test are not met, particularly the assumption of homogeneous variances across groups, the Kruskal-Wallis test should be used instead.

  • Testing of hypotheses about associations between continuous variables should be based on Pearson's correlation. Where the data are not continuous, Spearman's rank correlation should be used.

  • Polychoric correlations can be estimated in order to assess multicollinearity among items. This type of correlation produces consistent and robust results with ordinal data. It is based on the assumption that each observed ordinal variable reflects an underlying continuous variable that has been divided into a series of categories (Holgado-Tello et al. 2010). Multicollinearity is indicated by correlation coefficients of 0.8 or greater.

  • Possible influences on the magnitude of observed inter-item correlations should be explored, including the range of score values, the homogeneity of items, the distribution of the data (particularly departures from normality) and the existence of outliers (Fayers and Machin 2007). The normality assumption implies an absolute skewness not exceeding 3 and an absolute kurtosis not greater than 7 (Ozer et al. 2009; Byrne 2011). While skewness affects means, covariances tend to be vulnerable to kurtosis (Byrne 2011).
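The tests in the list above are available in standard statistical software; the sketch below shows them with scipy.stats on mock data (the group data and effect sizes are invented purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a, b, c = (rng.normal(loc, 1.0, 60) for loc in (0.0, 0.4, 0.8))  # mock scores

# Two groups: t-test if its assumptions hold, Mann-Whitney otherwise.
print(stats.ttest_ind(a, b))
print(stats.mannwhitneyu(a, b))

# More than two groups: ANOVA, or Kruskal-Wallis if variances are unequal.
print(stats.f_oneway(a, b, c))
print(stats.kruskal(a, b, c))

# Associations: Pearson for continuous data, Spearman for ordinal data.
print(stats.pearsonr(a, b))
print(stats.spearmanr(a, b))

# Bonferroni adjustment for k simultaneous tests: alpha / k.
alpha, k = 0.05, 6
print(f"Bonferroni-adjusted significance level: {alpha / k:.4f}")
```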

Further statistical analyses carried out during construct validation can use various forms of regression methods and latent variable modelling, including exploratory factor analysis, confirmatory factor analysis and the Rasch model.

Performing Exploratory Factor Analysis

The aim is to identify the smallest number of interpretable factors explaining the covariation among items (Muthén and Muthén 2010). This involves first generating the variance-covariance matrix, followed by estimation of the factors, which entails grouping the items that share the highest covariation. Subsequently, the initial factor solution is rotated to achieve a simpler, more interpretable structure, as the initial solution is not unique (DeVellis 2016).

To perform an EFA on the instrument, a polychoric correlation matrix should first be generated. This takes the ordinality of the data into account more appropriately, and remains robust when data are skewed, in comparison with conventional Pearson's correlation coefficients (Byrne 2011). The initial factor estimation can be carried out using a robust diagonally weighted least squares estimator (such as WLSMV in Mplus), which yields robust test statistics, parameter estimates and standard errors when indicator variables are categorical and normality assumptions are violated (Byrne 2011). Rotation can be performed using the Geomin routine (available in Mplus; comparable oblique rotations exist under other names in other software), which allows correlation among factors. This rotation is particularly suitable for psychosocial domains known to be highly related (Lackey et al. 2003); where the factors are not related, Geomin still performs well, yielding results comparable to orthogonal rotation routines. The choice of the number of factors to extract should be based on parallel analysis and confirmed with statistical goodness-of-fit measures (Schmitt 2011). Kaiser's rule, based on the size of the eigenvalues; the scree plot, a graph of eigenvalues against the number of factors; and parallel analysis, which compares the actual eigenvalues against ones generated from random data, should also be reported. The following criteria can be applied (a parallel-analysis sketch follows the list):

  • Kaiser's rule: factors with eigenvalues greater than 1 are included (Kaiser (1960), in DeVellis 2016).

  • Scree plot: all factors to the left of the 'elbow', the point where the slope changes, are extracted.

  • Parallel analysis: the last factor to be retained must have an eigenvalue greater than the one that would be produced randomly (Williams et al. 2010).
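Parallel analysis is straightforward to implement. The sketch below is a minimal version using Pearson correlations on simulated two-factor data; the polychoric variant argued for above would substitute a polychoric correlation matrix. All names and data here are illustrative:

```python
import numpy as np

def parallel_analysis(data, n_iter=100, seed=0):
    """Horn's parallel analysis: retain factors whose observed eigenvalue
    exceeds the mean eigenvalue obtained from random data of the same shape.
    """
    rng = np.random.default_rng(seed)
    n, p = data.shape
    obs = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
    rand = np.zeros(p)
    for _ in range(n_iter):
        sim = rng.standard_normal((n, p))
        rand += np.linalg.eigvalsh(np.corrcoef(sim, rowvar=False))[::-1]
    rand /= n_iter
    return (obs > rand).sum(), obs, rand

# Mock two-factor data: 300 respondents, 8 items, loadings 0.6-0.8.
rng = np.random.default_rng(1)
f = rng.standard_normal((300, 2))
loadings = np.array([[.8, 0], [.7, 0], [.8, 0], [.6, 0],
                     [0, .8], [0, .7], [0, .8], [0, .6]])
items = f @ loadings.T + 0.5 * rng.standard_normal((300, 8))
n_factors, _, _ = parallel_analysis(items)
print(n_factors)   # expected: 2
```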

An advantage of factor estimation using likelihood-based methods is the possibility of generating goodness-of-fit indices to explore how well hypothesised models fit the data. These can be classified into three groups (a sketch computing several of these indices follows the list):

  • Chi-square-based indices compare a single-factor model against a model with the chosen number of factors (k). For the chi-square goodness-of-fit test, a non-significant chi-square statistic represents good fit (Lackey et al. 2003).

  • Practical fit indices evaluate the proportionate improvement in the model by comparing a hypothesised model against a less restricted baseline model (Byrne 2011). For the comparative fit index (CFI) and the Tucker-Lewis index (TLI), values above 0.90 and 0.95 indicate acceptable and good fit, respectively (Schmitt 2011).

  • Absolute fit indices are based on analysis of the residuals after fitting the model to the data. For the root mean square error of approximation (RMSEA), a value below 0.05 shows good fit, 0.08–0.10 mediocre fit and above 0.10 poor fit (Brown 2014). For the standardized root mean square residual (SRMR), values lower than 0.05 indicate adequate fit, although values below 0.08 are still acceptable. The weighted root mean square residual (WRMR) uses a cut-off value of 0.95 for good fit.
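For readers who want to see how these indices relate to the underlying chi-square statistics, the sketch below computes CFI, TLI and RMSEA from their standard textbook formulas (the numeric inputs are hypothetical; SEM software reports these indices directly):

```python
import numpy as np

def fit_indices(chi2_m, df_m, chi2_b, df_b, n):
    """CFI, TLI and RMSEA from model (m) and baseline (b) chi-square values,
    with n the sample size. Standard formulas (see e.g. Brown 2014)."""
    d_m = max(chi2_m - df_m, 0.0)               # model misfit beyond its df
    d_b = max(chi2_b - df_b, 0.0)               # baseline misfit
    cfi = 1.0 - d_m / max(d_m, d_b, 1e-12)
    tli = ((chi2_b / df_b) - (chi2_m / df_m)) / ((chi2_b / df_b) - 1.0)
    rmsea = np.sqrt(d_m / (df_m * (n - 1)))
    return cfi, tli, rmsea

# Hypothetical output from a factor model fitted to n = 300 respondents.
print(fit_indices(chi2_m=48.2, df_m=19, chi2_b=912.4, df_b=28, n=300))
# -> CFI ~0.97, TLI ~0.95, RMSEA ~0.07
```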

Performing Rasch Analysis

Appropriate fit to the Rasch model ensures that a PRO measure is sufficiently unidimensional and complies with conjoint measurement principles, a precondition for converting the data from the instrument into interval scales (Bond 2004). The intention of Rasch analysis, therefore, is to evaluate whether the data fit the model well enough to warrant such claims.

Demonstrating conformity to the Rasch model may have several advantages for a PRO instrument. First, ordinal scores may be transformed into interval-level logit scores, a property required for the calculation of effect sizes and other statistics in clinical research that is usually taken for granted (Reise and Haviland 2005); a minimal sketch of the model follows below. Second, by conceptualising measurement error as an item-level property, high reliability can be attained even with a shorter questionnaire, making it possible to minimise patient burden without compromising precision (Reeve et al. 2007). In addition, more complex comparisons, such as 'anchoring' or 'equating' between instruments, may be carried out easily.
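For reference, the dichotomous Rasch model expresses the probability of endorsing an item as a logistic function of the difference between person ability and item difficulty, both on the logit scale; a minimal sketch:

```python
import numpy as np

def rasch_prob(theta, b):
    """P(X = 1) = exp(theta - b) / (1 + exp(theta - b)), with person
    ability theta and item difficulty b, both in logits."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# On the logit scale, equal differences imply equal probabilities
# wherever they occur on the scale (the interval property).
print(rasch_prob(1.0, 0.0))    # theta - b = 1 logit -> ~0.73
print(rasch_prob(-2.0, -3.0))  # also 1 logit -> the same ~0.73
```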

Assessing conformity to the Rasch model involves examining the following assumptions and properties (a residual-based sketch of points 2 and 7 follows the list):

  1. Assessing whether the response categories are functioning optimally. The average latent measure across observations in a response category, and the category thresholds, should increase monotonically with the category; each response category should have a distinct peak on the category probability curve, reflecting the region of the latent variable where it is the most probable response (Linacre 1999). Category characteristic curves define the most likely response category for a given person location on the latent variable. The category threshold indicates the location on the latent variable where the probabilities of selecting adjacent categories are equal (Linacre 1999).

  2. Testing item and person fit to the model. This uses the residuals obtained after fitting data to the model, from which a fit residual statistic and an item-trait interaction chi-square statistic are calculated. The fit residual statistic for an item is calculated by summing the squared standardised residuals of all persons' responses to that item (Andrich et al. 2012b). Fit residuals exceeding ±2.5 indicate poor fit (Andrich et al. 2012a). As the Rasch model does not distinguish between items and persons, the fit residual statistic for persons is calculated and interpreted in the same way as the statistic for items (Bond and Fox 2015).

  3. The item-trait interaction test of fit assesses the discrepancy between the actual and model-expected scores of class intervals (which group patients according to ability), reflected visually in discrepancies between the item characteristic curve (ICC) and its empirical counterpart. An item chi-square value is generated by summing the standardised differences across class intervals (Andrich et al. 2012a).

  4. Testing overall model fit. A mean fit residual of 0 with a standard deviation of 1 reflects overall model fit (Shea et al. 2009). The item-trait interaction statistics for all items are summed into a total item-trait interaction statistic; optimal fit is reflected in a non-significant chi-square statistic (P > 0.05). Good fit to the Rasch model implies that the hierarchical ordering of the items remains invariant across the different levels of disease severity assessed by the construct.

  5. Assessing how well the instrument differentiates persons according to disease severity. This is reflected in the Person Separation Index (PSI), the proportion of total person variability explained by the model (Wright and Masters 1982; Bond and Fox 2015). A PSI of 0.8 reflects the capability to reliably distinguish patients into at least two severity groups, e.g. high and low severity.

  6. Assessing targeting of items. The item-person map is examined visually for adequate spread of the items along the breadth of the latent variable; ideally there should be no large gaps between items (Wright and Masters 1982), and the mean location of persons should be close to the item mean location, centred at 0 logits (Gorecki et al. 2011).

  7. Assessing unidimensionality. First, a principal component analysis is carried out on the residuals left after fitting the Rasch model. Unidimensionality is supported if the first component accounts for no more than 30% of the variance in the data and has an eigenvalue of 3 or less (Linacre 1998). A more stringent assessment has been suggested by Smith (2002): items are grouped according to their loading on the first residual component, into high positive and high negative loading sets, and pairs of person estimates generated from the two item sets are compared using a series of t-tests. If the proportion of significant tests (or the lower bound of its confidence interval) exceeds 5%, unidimensionality is ruled out (Tennant and Pallant 2006).

  8. The assumption of local independence can be assessed by examining the correlation matrix of the item residuals. A residual correlation exceeding 0.2–0.3 reflects a violation of this assumption. The magnitude of response dependence is calculated as the shift, along the latent variable, in the range representing a given response choice on the dependent item that is induced by a particular response choice on the independent item (Andrich et al. 2012b).

  9. Assessing invariance across demographic factors. Differential item functioning (DIF) can be assessed for key demographic factors using a two-way ANOVA of the residuals. A significant main effect of the demographic variable at the 0.05 level of significance, with Bonferroni adjustment, indicates uniform DIF; a significant interaction effect (demographic variable × class interval, the class intervals representing ability groups along the latent trait), after Bonferroni adjustment, indicates non-uniform DIF (Andrich et al. 2012a). Identification of DIF requires a pure set of items upon which the scale is anchored (Teresi and Fleishman 2007).

     Any action on DIF requires an understanding of its magnitude and impact. Magnitude is the difference between item difficulty estimates based on all patients and the comparable estimates specific to each demographic group (Linacre 2009). The impact of DIF on person estimation is assessed by comparing person estimates generated from the DIF-free items against estimates based on all items, including those with DIF (Pallant and Tennant 2007); using a t-test, a significant result at the 0.05 level indicates that the DIF has an impact. The item characteristic curves of the two series may also be useful in assessing whether the pairs of person ability estimates agree. Impact can further be explored by assessing whether the test characteristic curves (TCCs) from different demographic groups are comparable, i.e. whether the relationship between the raw score and the underlying latent variable varies across the demographic groups; identical TCCs indicate the absence of an impact of DIF on the total score (Edelen et al. 2006). The criterion for the magnitude of DIF is also relevant for differential scale functioning.
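A simplified, residual-based sketch of the checks in points 2 and 7 (standardised residuals, item fit residuals and a PCA of residuals) is given below. It assumes person and item estimates are already available, and it uses a rough normal approximation for the fit residual; dedicated packages (RUMM 2030, WINSTEPS) apply further corrections. All data here are simulated for illustration:

```python
import numpy as np

def rasch_residual_checks(X, theta, b):
    """Residual-based checks for dichotomous Rasch data.

    X:     persons x items matrix of 0/1 responses.
    theta: person measures in logits (assumed already estimated).
    b:     item difficulties in logits (assumed already estimated).
    """
    P = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))  # expected scores
    Z = (X - P) / np.sqrt(P * (1.0 - P))        # standardised residuals
    # Item fit: summed squared residuals, converted to an approximate z
    # score so that absolute values beyond 2.5 flag misfit (point 2).
    n = X.shape[0]
    item_fit = ((Z ** 2).sum(axis=0) - n) / np.sqrt(2.0 * n)
    # PCA of residuals: a dominant first eigenvalue (e.g. above 3)
    # suggests a secondary dimension (Linacre 1998; point 7).
    eigvals = np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False))[::-1]
    return item_fit, eigvals

# Simulated data consistent with the model, for demonstration only.
rng = np.random.default_rng(0)
theta = rng.normal(0.0, 1.0, 200)            # 200 persons
b = np.linspace(-1.5, 1.5, 10)               # 10 items
P = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
X = (rng.random((200, 10)) < P).astype(float)
fit, eig = rasch_residual_checks(X, theta, b)
print(fit.round(2))    # fit residuals should mostly lie within +/-2.5
print(eig.round(2))    # no dominant residual component expected here
```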


Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

Kamudoni, P., Johns, N., Salek, S. (2018). Approaches to the Development and Use of PRO Measures: A New Roadmap. In: Living with Chronic Disease: Measuring Important Patient-Reported Outcomes. Adis, Singapore. https://doi.org/10.1007/978-981-10-8414-0_2


  • DOI: https://doi.org/10.1007/978-981-10-8414-0_2

  • Publisher Name: Adis, Singapore

  • Print ISBN: 978-981-10-8413-3

  • Online ISBN: 978-981-10-8414-0

