Abstract
In item response theory (IRT), it is often necessary to perform restricted recalibration (RR) of the model: A set of (focal) parameters is estimated holding a set of (nuisance) parameters fixed. Typical applications of RR include expanding an existing item bank, linking multiple test forms, and associating constructs measured by separately calibrated tests. In the current work, we provide full statistical theory for RR of IRT models under the framework of pseudo-maximum likelihood estimation. We describe the standard error calculation for the focal parameters, the assessment of overall goodness-of-fit (GOF) of the model, and the identification of misfitting items. We report a simulation study to evaluate the performance of these methods in the scenario of adding a new item to an existing test. Parameter recovery for the focal parameters as well as Type I error and power of the proposed tests are examined. An empirical example is also included, in which we validate the pediatric fatigue short-form scale in the Patient-Reported Outcome Measurement Information System (PROMIS), compute global and local GOF statistics, and update parameters for the misfitting items.
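To make the idea of restricted recalibration concrete, the following is a minimal sketch of the "adding a new item to an existing test" scenario. It is an illustration under stated assumptions, not the authors' implementation: it assumes a unidimensional two-parameter logistic model in slope-intercept form, a standard normal latent variable approximated by Gauss-Hermite quadrature, and simulated data. All function and variable names (`log_irf`, `pattern_loglik`, `neg_pseudo_loglik`, `a_fixed`, `c_fixed`) are hypothetical. For simplicity the nuisance parameters are fixed at their data-generating values, whereas in practice they would be estimates from a previous calibration sample.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss
from scipy.optimize import minimize


def log_irf(theta, a, c):
    """Log-probabilities of y = 1 and y = 0 under a 2PL model (slope a, intercept c)."""
    z = np.outer(theta, np.atleast_1d(a)) + c  # shape: (points, items)
    return -np.logaddexp(0.0, -z), -np.logaddexp(0.0, z)


def pattern_loglik(y, theta, a, c):
    """Log-likelihood of each row of y at each value of theta: (persons, points)."""
    lp1, lp0 = log_irf(theta, a, c)
    return y @ lp1.T + (1.0 - y) @ lp0.T


def neg_pseudo_loglik(eta, y_new, ll_old, nodes, weights):
    """Negative marginal log-likelihood as a function of the focal parameters only."""
    ll_new = pattern_loglik(y_new[:, None], nodes, eta[0], eta[1])
    return -np.sum(np.log(np.exp(ll_old + ll_new) @ weights))


# Simulated data: 9 existing items (nuisance parameters) and 1 new item (focal).
rng = np.random.default_rng(2019)
n, m = 2000, 9
a_fixed = rng.uniform(1.0, 2.0, m)   # assumed known slopes of the old items
c_fixed = rng.uniform(-1.0, 1.0, m)  # assumed known intercepts of the old items
a_true, c_true = 1.5, 0.3            # generating values for the new item
theta = rng.standard_normal(n)
p_old = np.exp(log_irf(theta, a_fixed, c_fixed)[0])
p_new = np.exp(log_irf(theta, a_true, c_true)[0])[:, 0]
y_old = (rng.random((n, m)) < p_old).astype(float)
y_new = (rng.random(n) < p_new).astype(float)

# Quadrature rule for a standard normal latent variable (weights normalized to 1).
nodes, weights = hermegauss(41)
weights = weights / weights.sum()

# The old-item contribution does not depend on the focal parameters,
# so it is evaluated once and reused in every likelihood evaluation.
ll_old = pattern_loglik(y_old, nodes, a_fixed, c_fixed)

fit = minimize(neg_pseudo_loglik, x0=np.array([1.0, 0.0]),
               args=(y_new, ll_old, nodes, weights), method="BFGS")
a_hat, c_hat = fit.x
print(f"new-item slope: {a_hat:.3f}, intercept: {c_hat:.3f}")
```

Only the two focal parameters enter the optimizer; the nine existing items contribute through a likelihood term that is computed once. The sketch does not compute standard errors: a naive inverse Hessian from this fit would ignore the sampling variability of the fixed parameters, which is the issue the pseudo-maximum likelihood theory addresses.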
Notes
We assume that the response patterns have been sorted in an arbitrary but fixed order.
Because the nuisance parameters \(\varvec{\xi }\) were estimated by ML from the previous data \(\mathbf{Y}'\), \({\varvec{\Omega }}_{\varvec{\xi }}\) amounts to the inverse Fisher information matrix with respect to the intercept and slope parameters of the first 9 items.
References
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 57, 289–300.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 395–479). Reading, MA: Addison-Wesley.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443–459.
Bock, R. D., & Lieberman, M. (1970). Fitting a response model for \(n\) dichotomously scored items. Psychometrika, 35(2), 179–197.
Bock, R. D., & Zimowski, M. F. (1997). Multiple group IRT. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 433–448). New York: Springer.
Box, G. E. P. (1954). Some theorems on quadratic forms applied in the study of analysis of variance problems: I. Effect of inequality of variance in the one-way classification. The Annals of Mathematical Statistics, 25(2), 290–302.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64(2), 153–168.
Breithaupt, K., Ariel, A. A., & Hare, D. R. (2010). Assembling an inventory of multistage adaptive testing systems. In W. J. van der Linden & C. A. Glas (Eds.), Elements of adaptive testing (pp. 247–266). New York, NY: Springer.
Browne, M. W. (2000). Cross-validation methods. Journal of Mathematical Psychology, 44(1), 108–132.
Cai, L., & Hansen, M. (2013). Limited-information goodness-of-fit testing of hierarchical item factor models. British Journal of Mathematical and Statistical Psychology, 66(2), 245–276.
Cai, L., Maydeu-Olivares, A., Coffman, D. L., & Thissen, D. (2006). Limited-information goodness-of-fit testing of item response theory models for sparse \(2^{p}\) tables. British Journal of Mathematical and Statistical Psychology, 59(1), 173–194.
Cheng, Y., & Yuan, K.-H. (2010). The impact of fallible item parameter estimates on latent trait recovery. Psychometrika, 75, 280–291.
Cochran, W. G. (1952). The \({\chi }^{2}\) test of goodness of fit. The Annals of Mathematical Statistics, 23(3), 315–345.
Cressie, N., & Read, T. R. (1984). Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society, Series B (Methodological), 46(3), 440–464.
Curran, P. J., & Hussong, A. M. (2009). Integrative data analysis: The simultaneous analysis of multiple data sets. Psychological Methods, 14(2), 81–100.
Drasgow, F., Levine, M. V., Tsien, S., Williams, B., & Mead, A. D. (1995). Fitting polytomous item response theory models to multiple-choice tests. Applied Psychological Measurement, 19(2), 143–166.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Fox, J.-P. (2005). Multilevel IRT using dichotomous and polytomous response data. British Journal of Mathematical and Statistical Psychology, 58(1), 145–172.
Glas, C. A. (1988). The derivation of some tests for the Rasch model from the multinomial distribution. Psychometrika, 53(4), 525–546.
Glas, C. A. (1999). Modification indices for the 2-PL and the nominal response model. Psychometrika, 64(3), 273–294.
Glas, C. A., & Suárez Falcón, J. C. (2003). A comparison of item-fit statistics for the three-parameter logistic model. Applied Psychological Measurement, 27(2), 87–106. https://doi.org/10.1177/0146621602250530.
Gong, G., & Samaniego, F. J. (1981). Pseudo maximum likelihood estimation: Theory and applications. The Annals of Statistics, 9(4), 861–869.
Gunsjö, A. (1994). Faktoranalys av ordinala variabler. Stockholm: Acta Universitatis Upsaliensis.
Haberman, S. J. (2006). Adaptive quadrature for item response models. ETS Research Report Series, 2006(2), 1–10.
Haberman, S. J., & Sinharay, S. (2013). Generalized residuals for general models for contingency tables with application to item response theory. Journal of the American Statistical Association, 108(504), 1435–1444.
Haberman, S. J., Sinharay, S., & Chon, K. H. (2013). Assessing item fit for unidimensional item response theory models using residuals from estimated item response functions. Psychometrika, 78(3), 417–440.
Haley, S. M., Ni, P., Jette, A. M., Tao, W., Moed, R., Meyers, D., et al. (2009). Replenishing a computerized adaptive test of patient-reported daily activity functioning. Quality of Life Research, 18(4), 461–471.
Hofer, S. M., & Piccinin, A. M. (2009). Integrative data analysis through coordination of measurement and analysis protocol across independent longitudinal studies. Psychological Methods, 14(2), 150–164.
Joe, H., & Maydeu-Olivares, A. (2006). On the asymptotic distribution of Pearson’s \(X^{2}\) in cross-validation samples. Psychometrika, 71(3), 587–592.
Joe, H., & Maydeu-Olivares, A. (2010). A general family of limited information goodness-of-fit statistics for multinomial data. Psychometrika, 75(3), 393–419.
Jöreskog, K. G., & Moustaki, I. (2001). Factor analysis of ordinal variables: A comparison of three approaches. Multivariate Behavioral Research, 36(3), 347–387.
Kim, S. (2006). A comparative study of IRT fixed parameter calibration methods. Journal of Educational Measurement, 43(4), 355–381.
Lai, J.-S., Stucky, B. D., Thissen, D., Varni, J. W., DeWitt, E. M., Irwin, D. E., et al. (2013). Development and psychometric properties of the PROMIS® pediatric fatigue item banks. Quality of Life Research, 22(9), 2417–2427. https://doi.org/10.1007/s11136-013-0357-1.
Liu, Y., & Maydeu-Olivares, A. (2014). Identifying the source of misfit in item response theory models. Multivariate Behavioral Research, 49(4), 354–371.
Liu, Y., & Thissen, D. (2012). Identifying local dependence with a score test statistic based on the bifactor logistic model. Applied Psychological Measurement, 36(8), 670–688.
Liu, Y., & Thissen, D. (2014). Comparing score tests and other local dependence diagnostics for the graded response model. British Journal of Mathematical and Statistical Psychology, 67(3), 496–513.
Liu, Y., & Yang, J. S. (2017). Interval estimation of latent variable scores in item response theory. Journal of Educational and Behavioral Statistics. https://doi.org/10.3102/1076998617732764.
Liu, Y., & Yang, J. S. (2018). Bootstrap-calibrated interval estimates for latent variable scores in item response theory. Psychometrika, 83(2), 333–354.
Luecht, R. M. (2006). Operational issues in computer-based testing. In D. Bartram & R. Hambleton (Eds.), Computer-based testing and the internet: Issues and advances (pp. 91–114). New York: Wiley.
Magnus, J., & Neudecker, H. (1999). Matrix differential calculus with applications in statistics and econometrics. New York: Wiley.
Maydeu-Olivares, A., & Joe, H. (2005). Limited- and full-information estimation and goodness-of-fit testing in \(2^{n}\) contingency tables: A unified framework. Journal of the American Statistical Association, 100(471), 1009–1020.
Maydeu-Olivares, A., & Joe, H. (2006). Limited information goodness-of-fit testing in multidimensional contingency tables. Psychometrika, 71(4), 713–732.
Maydeu-Olivares, A., & Joe, H. (2008). An overview of limited information goodness-of-fit testing in multidimensional contingency tables. In K. Shigemasu, A. Okada, T. Imaizumi, & T. Hoshino (Eds.), New trends in psychometrics (pp. 253–262). Tokyo: Universal Academy Press.
Maydeu-Olivares, A., & Joe, H. (2014). Assessing approximate fit in categorical data analysis. Multivariate Behavioral Research, 49(4), 305–328.
Maydeu-Olivares, A., & Liu, Y. (2015). Item diagnostics in multivariate discrete data. Psychological Methods, 20(2), 276–292.
McDonald, R. P. (1981). The dimensionality of tests and items. British Journal of Mathematical and Statistical Psychology, 34(1), 100–117.
Meng, X.-L., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6(4), 831–860.
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58(4), 525–543.
Mosier, C. I. (1951). Symposium: The need and means of cross-validation. I. Problems and designs of cross-validation. Educational and Psychological Measurement, 11(1), 5–11.
Muthén, B. (1978). Contributions to factor analysis of dichotomous variables. Psychometrika, 43(4), 551–560.
Muthén, B. (1983). Latent variable structural equation modeling with categorical data. Journal of Econometrics, 22(1–2), 43–65.
Muthén, B. (1984). A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika, 49(1), 115–132.
Muthén, B. (1993). Goodness of fit with categorical and other nonnormal variables. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 205–234). Newbury Park, CA: Sage.
Muthén, L. K., & Muthén, B. O. (1998–2017). Mplus user’s guide [Computer software manual]. Los Angeles, CA.
Parke, W. R. (1986). Pseudo maximum likelihood estimation: The asymptotic distribution. The Annals of Statistics, 14(1), 355–357.
R Core Team. (2018). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. Retrieved from https://www.R-project.org/
Ranger, J., & Kuhn, J.-T. (2012). Assessing fit of item response models using the information matrix test. Journal of Educational Measurement, 49(3), 247–268.
Rao, C. R. (1973). Linear statistical inference and its applications. New York: Wiley.
Read, T. R. (1984). Closer asymptotic approximations for the distributions of the power divergence goodness-of-fit statistics. Annals of the Institute of Statistical Mathematics, 36(1), 59–69.
Reiser, M. (1996). Analysis of residuals for the multinomial item response model. Psychometrika, 61(3), 509–528.
Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12(4), 1151–1172.
Rupp, A. A. (2013). A systematic review of the methodology for person fit research in item response theory: Lessons about generalizability of inferences from the design of simulation studies. Psychological Test and Assessment Modeling, 55(1), 3–38.
Rupp, A. A., & Zumbo, B. D. (2006). Understanding parameter invariance in unidimensional IRT models. Educational and Psychological Measurement, 66(1), 63–84.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika monograph No. 17. Richmond, VA: Psychometric Society.
Schilling, S., & Bock, R. D. (2005). High-dimensional maximum marginal likelihood item factor analysis by adaptive quadrature. Psychometrika, 70(3), 533–555.
Thissen, D., Liu, Y., Magnus, B., & Quinn, H. (2015). Extending the use of multidimensional IRT calibration as projection: Many-to-one linking and linear computation of projected scores. In Quantitative psychology research (pp. 1–16). Springer.
Thissen, D., & Steinberg, L. (2009). Item response theory. In R. Millsap & A. Maydeu-Olivares (Eds.), The Sage handbook of quantitative methods in psychology (pp. 148–177). London: Sage Publications.
Thissen, D., Steinberg, L., & Kuang, D. (2002). Quick and easy implementation of the Benjamini-Hochberg procedure for controlling the false positive rate in multiple comparisons. Journal of Educational and Behavioral Statistics, 27(1), 77–83.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67–113). Hillsdale, NJ: Lawrence Erlbaum Associates.
Thissen, D., Varni, J. W., Stucky, B. D., Liu, Y., Irwin, D. E., & DeWalt, D. A. (2011). Using the PedsQL™ 3.0 asthma module to obtain scores comparable with those of the PROMIS pediatric asthma impact scale (PAIS). Quality of Life Research, 20(9), 1497–1505.
van der Vaart, A. W. (2000). Asymptotic statistics. New York: Cambridge University Press.
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). New York: Springer. (ISBN 0-387-95457-0).
von Davier, M., & von Davier, A. A. (2007). A unified approach to IRT scale linking and scale transformations. Methodology, 3(3), 115–124.
Wollack, J. A., Cohen, A. S., & Wells, C. S. (2003). A method for maintaining scale stability in the presence of test speededness. Journal of Educational Measurement, 40(4), 307–330.
Yang, J. S., Hansen, M., & Cai, L. (2012). Characterizing sources of uncertainty in item response theory scale scores. Educational and Psychological Measurement, 72(2), 264–290.
Zhao, Y., & Joe, H. (2005). Composite likelihood estimation in multivariate data analysis. Canadian Journal of Statistics, 33(3), 335–356.
Additional information
The authors would like to thank Dr. David Thissen from the Department of Psychology at the University of North Carolina at Chapel Hill for his feedback and suggestions about this work. The participation of Ji Seung Yang was supported by the National Science Foundation under Grant EHR-1534846. The participation of Alberto Maydeu-Olivares was supported by the National Science Foundation under Grant SES-1659936.
Appendix A The Asymptotic Distribution of Residuals
A.1 The Pseudo-Maximum Likelihood Estimator
Under the setup of RR and Conditions A1–A6 of Gong and Samaniego (1981), the following expansion of the n-sample likelihood equation (Eq. 4) is defined for \((\hat{\varvec{\xi }}, \hat{\varvec{\eta }}){}^\top \) in some neighborhood of \((\underline{\varvec{\xi }}, \underline{\varvec{\eta }}){}^\top \):
in which \(\mathbf{g}(\mathbf{y}_i; {\varvec{\xi }}, {\varvec{\eta }}) = \partial \log \pi (\mathbf{y}_{i};{\varvec{\xi }}, {\varvec{\eta }})/\partial {\varvec{\eta }}\), \(\mathbf{H}_{\varvec{\eta }}(\mathbf{y}_i; {\varvec{\xi }}, {\varvec{\eta }}) = \partial ^2\log \pi (\mathbf{y}_{i};{\varvec{\xi }}, {\varvec{\eta }})/\partial {\varvec{\eta }}\partial {\varvec{\eta }}{}^\top \), \(\mathbf{H}_{\varvec{\eta \xi }}(\mathbf{y}_i; {\varvec{\xi }}, {\varvec{\eta }}) = \partial ^2\log \pi (\mathbf{y}_{i};{\varvec{\xi }}, {\varvec{\eta }})/\partial {\varvec{\eta }}\partial {\varvec{\xi }}{}^\top \), and \(\mathbf{R}_n\) denotes the remainder term. As usual, the hat symbol and underline indicate evaluations at the pseudo-ML estimates and the true values of parameters, respectively. If \(\frac{1}{n}\sum _{i=1}^{n}\underline{\mathbf{H}}_{ \varvec{\eta }}(\mathbf{y}_i)\) is invertible, then Eq. 24 can be rewritten as
The assumed regularity conditions guarantee that
Equation 6 is established by combining Eqs. 25 and 26 and applying Slutsky’s lemma.
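As a hedged sketch of the argument in this subsection, and not necessarily the authors' exact displays, the first-order expansion of the likelihood equation and its rearrangement can be written in the notation defined above. The \(1/n\) scaling and the placement of the remainder term are assumptions made for illustration.

```latex
\begin{aligned}
\mathbf{0}
  &= \frac{1}{n}\sum_{i=1}^{n}\hat{\mathbf{g}}(\mathbf{y}_i) \\
  &= \frac{1}{n}\sum_{i=1}^{n}\underline{\mathbf{g}}(\mathbf{y}_i)
   + \Bigl[\frac{1}{n}\sum_{i=1}^{n}\underline{\mathbf{H}}_{\boldsymbol{\eta}}(\mathbf{y}_i)\Bigr]
     (\hat{\boldsymbol{\eta}} - \underline{\boldsymbol{\eta}})
   + \Bigl[\frac{1}{n}\sum_{i=1}^{n}\underline{\mathbf{H}}_{\boldsymbol{\eta\xi}}(\mathbf{y}_i)\Bigr]
     (\hat{\boldsymbol{\xi}} - \underline{\boldsymbol{\xi}})
   + \mathbf{R}_n, \\[4pt]
\hat{\boldsymbol{\eta}} - \underline{\boldsymbol{\eta}}
  &= -\Bigl[\frac{1}{n}\sum_{i=1}^{n}\underline{\mathbf{H}}_{\boldsymbol{\eta}}(\mathbf{y}_i)\Bigr]^{-1}
     \Bigl\{\frac{1}{n}\sum_{i=1}^{n}\underline{\mathbf{g}}(\mathbf{y}_i)
   + \Bigl[\frac{1}{n}\sum_{i=1}^{n}\underline{\mathbf{H}}_{\boldsymbol{\eta\xi}}(\mathbf{y}_i)\Bigr]
     (\hat{\boldsymbol{\xi}} - \underline{\boldsymbol{\xi}})
   + \mathbf{R}_n\Bigr\}.
\end{aligned}
```

Read this way, the estimation error in the focal parameters has two sources: the usual score term from the current sample and a term that carries over the estimation error in the nuisance parameters through the cross-derivative matrix.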
A.2 The Asymptotic Covariance Matrix of Residuals
It is straightforward to show that \(\mathbf{p} - \underline{\varvec{\pi }}\), \(\hat{\varvec{\xi }} - \underline{\varvec{\xi }}\), and \(\hat{\varvec{\eta }} - \underline{\varvec{\eta }}\) are jointly asymptotically normal when the model is correctly specified:
in which \(\underline{\varvec{\Theta }}_{21} = \mathbf {0}\) because \(\hat{\varvec{\xi }}\) and \(\mathbf {Y}'\) are independent, and
By the Delta method,
in which \(\mathbf{I}_{C}\) denotes a \(C\times C\) identity matrix. Combining Eqs. 27 and 29 yields
which simplifies to Eq. 14.
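The delta-method step can be sketched generically as follows. This is an illustration under assumptions rather than a reproduction of the numbered equations: \(\hat{\boldsymbol{\pi}}\) denotes the model-implied probabilities at the estimates, \(\boldsymbol{\Delta}_{\boldsymbol{\xi}}\) and \(\boldsymbol{\Delta}_{\boldsymbol{\eta}}\) are hypothetical names for the Jacobians of \(\boldsymbol{\pi}\) with respect to the nuisance and focal parameters evaluated at the true values, \(\boldsymbol{\Theta}\) stands for the joint asymptotic covariance matrix of the stacked vector, and both estimation errors are assumed to be of order \(n^{-1/2}\).

```latex
\begin{aligned}
\mathbf{p} - \hat{\boldsymbol{\pi}}
  &= (\mathbf{p} - \underline{\boldsymbol{\pi}})
   - \boldsymbol{\Delta}_{\boldsymbol{\xi}}(\hat{\boldsymbol{\xi}} - \underline{\boldsymbol{\xi}})
   - \boldsymbol{\Delta}_{\boldsymbol{\eta}}(\hat{\boldsymbol{\eta}} - \underline{\boldsymbol{\eta}})
   + o_p(n^{-1/2}) \\
  &= \underbrace{\begin{pmatrix}
       \mathbf{I}_{C} & -\boldsymbol{\Delta}_{\boldsymbol{\xi}} & -\boldsymbol{\Delta}_{\boldsymbol{\eta}}
     \end{pmatrix}}_{\mathbf{J}}
     \begin{pmatrix}
       \mathbf{p} - \underline{\boldsymbol{\pi}} \\
       \hat{\boldsymbol{\xi}} - \underline{\boldsymbol{\xi}} \\
       \hat{\boldsymbol{\eta}} - \underline{\boldsymbol{\eta}}
     \end{pmatrix}
   + o_p(n^{-1/2}),
\qquad
\text{asymptotic covariance of the residuals} = \mathbf{J}\,\boldsymbol{\Theta}\,\mathbf{J}^{\top}.
\end{aligned}
```

Because the residual vector is a linear map of the jointly normal stacked vector to first order, its asymptotic covariance follows by pre- and post-multiplying the joint covariance matrix by \(\mathbf{J}\).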
Cite this article
Liu, Y., Yang, J.S. & Maydeu-Olivares, A. Restricted Recalibration of Item Response Theory Models. Psychometrika 84, 529–553 (2019). https://doi.org/10.1007/s11336-019-09667-4
DOI: https://doi.org/10.1007/s11336-019-09667-4