Volume 84, Issue 1, pp 186–211

Penalized Best Linear Prediction of True Test Scores

  • Lili Yao
  • Shelby J. Haberman
  • Mo Zhang


In best linear prediction (BLP), a true test score is predicted from observed item scores and from ancillary test data. If the use of BLP rather than a more direct estimate of a true score has disparate impact for different demographic groups, then a fairness issue arises. To improve population invariance while preserving much of the efficiency of BLP, a modified approach, penalized best linear prediction (PBLP), is proposed that weights both the mean square error of prediction and a quadratic measure of subgroup biases. The proposed methodology is applied to three high-stakes writing assessments.
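The trade-off described in the abstract can be sketched in code. The following is a minimal illustration, not the authors' implementation: all data, names, and the penalty weight are invented for the sketch. The objective, mean squared prediction error plus a multiple of the sum of squared subgroup mean biases, is quadratic in the weights, so the minimizer solves a modified set of normal equations.

```python
import numpy as np

# Hypothetical data: two demographic subgroups, one predictor
# correlated with group membership (the source of disparate impact).
rng = np.random.default_rng(0)
n = 400
groups = rng.integers(0, 2, size=n)          # subgroup indicator (0 or 1)
x1 = rng.normal(size=n) + groups             # predictor correlated with group
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])    # intercept plus two predictors
y = x1 + 0.5 * x2 + 0.8 * groups + rng.normal(scale=0.3, size=n)

def penalized_blp(X, y, groups, lam):
    """Minimize MSE(beta) + lam * sum over groups of
    (mean residual in group)**2 via modified normal equations."""
    A = X.T @ X / len(y)
    b = X.T @ y / len(y)
    for g in np.unique(groups):
        m = X[groups == g].mean(axis=0)      # subgroup mean of predictors
        A += lam * np.outer(m, m)
        b += lam * m * y[groups == g].mean()
    return np.linalg.solve(A, b)

def subgroup_bias(X, y, groups, beta):
    """Mean prediction residual within each subgroup."""
    resid = y - X @ beta
    return np.array([resid[groups == g].mean() for g in np.unique(groups)])

beta0 = penalized_blp(X, y, groups, lam=0.0)    # ordinary least-squares BLP
beta1 = penalized_blp(X, y, groups, lam=100.0)  # heavily penalized version
```

Increasing the penalty weight shrinks the subgroup biases at the cost of a small increase in overall mean square error, mirroring the trade-off between efficiency and population invariance discussed in the paper.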


Keywords: true test score, PBLP, subgroup biases



Lili Yao was supported in part by the National Natural Science Foundation of China (61863012, 61263010) and in part by the Research Project of the Science and Technology Department of Jiangxi Province, China (20181BBE50020, 20161BBE50082, 20161BAB202067).



Copyright information

© The Psychometric Society 2018

Authors and Affiliations

  1. Educational Testing Service, Princeton, USA
  2. Edusoft, Jerusalem, Israel
