Correction for Item Response Theory Latent Trait Measurement Error in Linear Mixed Effects Models

  • Chun Wang
  • Gongjun Xu
  • Xue Zhang
Original Research


When latent variables are used as outcomes in regression analysis, a common way to address the otherwise ignored measurement error is to take a multilevel perspective on item response theory (IRT) modeling. Although recent computational advances allow efficient and accurate estimation of multilevel IRT models, we argue that a two-stage divide-and-conquer strategy still has unique advantages. Within the two-stage framework, three methods that account for heteroscedastic measurement error in the dependent variable in the stage II analysis are introduced: closed-form marginal maximum likelihood estimation (MLE), the expectation–maximization (EM) algorithm, and the moment estimation method. They are compared with naïve two-stage estimation and one-stage Markov chain Monte Carlo (MCMC) estimation. A simulation study compares the five methods in terms of model parameter recovery and standard error estimation. The pros and cons of each method are discussed to provide guidelines for practitioners. Finally, a real data example using the National Education Longitudinal Study of 1988 (NELS:88) illustrates the application of the various methods.
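To make the stage II idea concrete, the following is a minimal sketch, not the authors' estimator. It assumes a simplified single-level regression in which each stage I trait estimate equals the true trait plus normal error with a known, person-specific variance, so that marginally the estimate is normal with mean x'β and variance τ² plus that error variance. The function name `fit_marginal_ml`, the grid search over τ², and all simulated values are illustrative assumptions; the paper itself concerns linear mixed effects models and closed-form, EM, and moment estimators.

```python
import numpy as np

def fit_marginal_ml(X, theta_hat, se2, tau2_grid):
    """Profile-likelihood fit of theta_hat_i ~ N(x_i' beta, tau2 + se2_i).

    For each candidate tau2, beta is the weighted least squares solution
    with weights 1 / (tau2 + se2_i); the tau2 with the highest marginal
    log-likelihood (up to an additive constant) is returned.
    """
    best = None
    for tau2 in tau2_grid:
        v = tau2 + se2                  # total variance per person
        w = 1.0 / v
        XtW = X.T * w                   # apply weights to each observation
        beta = np.linalg.solve(XtW @ X, XtW @ theta_hat)
        resid = theta_hat - X @ beta
        loglik = -0.5 * np.sum(np.log(v) + resid**2 * w)
        if best is None or loglik > best[0]:
            best = (loglik, tau2, beta)
    return best[1], best[2]

# Simulated stage I output (all values below are hypothetical).
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([0.0, 0.5])
tau2_true = 1.0
theta = X @ beta_true + rng.normal(scale=np.sqrt(tau2_true), size=n)
se2 = rng.uniform(0.1, 1.0, size=n)     # heteroscedastic error variances
theta_hat = theta + rng.normal(scale=np.sqrt(se2), size=n)

# Naive stage II: treat theta_hat as if it were the true trait.
beta_naive, *_ = np.linalg.lstsq(X, theta_hat, rcond=None)
tau2_naive = np.sum((theta_hat - X @ beta_naive) ** 2) / (n - X.shape[1])

# Stage II with the known error variances folded into the likelihood.
tau2_ml, beta_ml = fit_marginal_ml(
    X, theta_hat, se2, np.linspace(0.0, 3.0, 601)
)
print("naive residual variance:", tau2_naive)
print("corrected residual variance:", tau2_ml)
print("corrected coefficients:", beta_ml)
```

In this simplified setting the naive residual variance absorbs the measurement error and therefore tends to exceed the true τ², while the regression coefficients are less affected. With random effects, as in the models the abstract describes, the variance components are where the correction is expected to matter most.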


item response theory; measurement error; marginal maximum likelihood estimation; expectation–maximization estimation; two-stage estimation



This project is supported by IES R305D160010 and NSF SES-1659328.



Copyright information

© The Psychometric Society 2019

Authors and Affiliations

  1. Measurement and Statistics, College of Education, University of Washington, Seattle, USA
  2. University of Michigan, Ann Arbor, USA
  3. Northeast Normal University, Changchun, China
