Quality of Life Research, Volume 16, Supplement 1, pp 95–108

Developing tailored instruments: item banking and computerized adaptive assessment

  • Jakob Bue Bjorner
  • Chih-Hung Chang
  • David Thissen
  • Bryce B. Reeve
Original Paper


Item banks and computerized adaptive testing (CAT) have the potential to greatly improve the assessment of health outcomes. This review describes the unique features of item banks and CAT and discusses how to develop item banks. In CAT, a computer selects from an item bank the items that are most relevant for and informative about the particular respondent, thus optimizing test relevance and precision. Item response theory (IRT) provides the foundation for selecting these items and for scoring responses on a common metric. The development of an item bank is a multi-stage process that requires a clear definition of the construct to be measured, good items, a careful psychometric analysis of the items, and a clear specification of the final CAT. The psychometric analysis needs to evaluate the assumptions of the IRT model, such as unidimensionality and local independence; to check that the items function the same way in different subgroups of the population; and to confirm that there is an adequate fit between the data and the chosen item response models. Also, interpretation guidelines need to be established to support the clinical application of the assessment. Although medical research can draw upon expertise from educational testing in the development of item banks and CAT, the medical field also encounters unique opportunities and challenges.
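
The item selection and scoring loop described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes dichotomous items under a two-parameter logistic (2PL) model, maximum-information item selection, expected a posteriori (EAP) scoring with a standard normal prior, and a fixed test length as the stopping rule. The item parameters in `ITEM_BANK` and all function names are hypothetical, and health outcome items are often polytomous, which this sketch does not cover.

```python
import math

# Hypothetical item bank: (slope a, location b) for each dichotomous item.
ITEM_BANK = [(1.2, -1.5), (1.8, -0.5), (2.0, 0.0), (1.5, 0.8), (1.0, 1.6)]

# Quadrature grid for theta from -4.0 to 4.0 in steps of 0.1.
GRID = [k / 10.0 for k in range(-40, 41)]


def prob_endorse(theta, a, b):
    """2PL probability of endorsing an item with slope a and location b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P)."""
    p = prob_endorse(theta, a, b)
    return a * a * p * (1.0 - p)


def select_next_item(theta, bank, administered):
    """Index of the most informative item that has not been administered."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *bank[i]))


def eap_estimate(responses, bank):
    """EAP score: posterior mean of theta under a standard normal prior."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # unnormalized prior density
        for i, x in responses.items():
            p = prob_endorse(theta, *bank[i])
            w *= p if x == 1 else 1.0 - p  # likelihood of the response
        weights.append(w)
    return sum(t * w for t, w in zip(GRID, weights)) / sum(weights)


def run_cat(answer, bank=ITEM_BANK, max_items=3):
    """Administer max_items adaptively; answer(i) returns 0 or 1 for item i."""
    theta, responses = 0.0, {}
    for _ in range(max_items):
        item = select_next_item(theta, bank, responses)
        responses[item] = answer(item)
        theta = eap_estimate(responses, bank)  # rescore after each response
    return theta, list(responses)


if __name__ == "__main__":
    score, order = run_cat(lambda i: 1)  # a respondent who endorses every item
    print(score, order)
```

An operational CAT would typically stop when the standard error of the score falls below a chosen threshold, and would add content balancing and exposure control. The fixed length here only keeps the sketch short.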


Computerized adaptive testing; Health Status Indicators; Questionnaires; Algorithms; Mental health; Factor analysis, Statistical



This paper builds upon presentations by the authors at the conference: Advances in Health Outcomes Measurement: Exploring the Current State and the Future of Item Response Theory, Item Banks, and Computer-Adaptive Testing, Bethesda, MD, June, 2004. This work was supported in part by a grant from the Small Business Innovation Research Program of the National Institute of Neurological Disorders and Stroke, under grant title Computerized Adaptive Assessment of Headache Impact (grant no. 1R43NS047763-01) and in part by the National Institutes of Health through the NIH Roadmap for Medical Research Grant (AG015815), PROMIS Project. The authors would like to thank Howard Wainer of the National Board of Medical Examiners and three anonymous reviewers for comments on a previous version of the paper.



Copyright information

© Springer Science+Business Media B.V. 2007

Authors and Affiliations

  • Jakob Bue Bjorner (1, 2)
  • Chih-Hung Chang (3)
  • David Thissen (4)
  • Bryce B. Reeve (5)

  1. QualityMetric Incorporated, Lincoln, USA
  2. Health Assessment Lab, Waltham, USA
  3. Northwestern University Feinberg School of Medicine, Chicago, USA
  4. The University of North Carolina at Chapel Hill, Chapel Hill, USA
  5. National Cancer Institute, NIH, Bethesda, USA
