Abstract
In best linear prediction (BLP), a true test score is predicted from observed item scores and ancillary test data. If using BLP rather than a more direct estimate of a true score has a disparate impact on different demographic groups, then a fairness issue arises. To improve population invariance while preserving much of the efficiency of BLP, a modified approach, penalized best linear prediction, is proposed that weights both the mean squared error of prediction and a quadratic measure of subgroup biases. The proposed methodology is applied to three high-stakes writing assessments.
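The trade-off described above can be illustrated with a minimal sketch. It is not the authors' estimator: it assumes a single criterion combining mean squared error with `lam` times a proportion-weighted sum of squared subgroup mean errors, and it treats an observed variable `y` as a stand-in for the true score, which in practice is unobserved. The function names, the penalty weight `lam`, the choice of subgroup weights, and the simulated data are all hypothetical.

```python
import numpy as np

def penalized_blp(X, y, groups, lam):
    """Minimize MSE + lam * weighted sum of squared subgroup mean errors.

    Assumption: y is a proxy for the true score; weights are subgroup proportions.
    """
    n = X.shape[0]
    Z = np.column_stack([np.ones(n), X])   # intercept plus predictors
    A = Z.T @ Z / n                        # quadratic part of the MSE term
    c = Z.T @ y / n                        # linear part of the MSE term
    for g in np.unique(groups):
        mask = groups == g
        w = mask.mean()                    # subgroup proportion (assumed weight)
        zbar = Z[mask].mean(axis=0)        # subgroup mean of the predictors
        ybar = y[mask].mean()              # subgroup mean of the proxy score
        A += lam * w * np.outer(zbar, zbar)
        c += lam * w * ybar * zbar
    # The criterion is quadratic in the coefficients, so one linear solve suffices.
    return np.linalg.solve(A, c)

def mse_and_bias_penalty(X, y, groups, theta):
    """Return the mean squared error and the quadratic subgroup-bias measure."""
    Z = np.column_stack([np.ones(X.shape[0]), X])
    resid = y - Z @ theta
    mse = np.mean(resid ** 2)
    penalty = sum(
        (groups == g).mean() * resid[groups == g].mean() ** 2
        for g in np.unique(groups)
    )
    return mse, penalty

# Simulated data (hypothetical): group membership shifts both a predictor and the score.
rng = np.random.default_rng(0)
n = 2000
groups = rng.integers(0, 2, size=n)
x1 = rng.normal(size=n) + 0.5 * groups
x2 = rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + 0.5 * x2 + 0.4 * groups + rng.normal(scale=0.5, size=n)

theta_blp = penalized_blp(X, y, groups, lam=0.0)    # ordinary linear prediction
theta_pen = penalized_blp(X, y, groups, lam=10.0)   # penalized version

mse_blp, pen_blp = mse_and_bias_penalty(X, y, groups, theta_blp)
mse_pen, pen_pen = mse_and_bias_penalty(X, y, groups, theta_pen)
print("lam=0 :", mse_blp, pen_blp)
print("lam=10:", mse_pen, pen_pen)
```

With `lam=0` the sketch reduces to ordinary least squares. Raising `lam` cannot increase the subgroup-bias measure and cannot decrease the mean squared error, which is the efficiency-for-invariance exchange the abstract refers to.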
Acknowledgements
Lili Yao was partially supported by the National Natural Science Foundation of China (61863012, 61263010) and partially by the Research Project of Science and Technology Department of Jiangxi Province, China (20181BBE50020, 20161BBE50082, 20161BAB202067).
Cite this article
Yao, L., Haberman, S.J. & Zhang, M. Penalized Best Linear Prediction of True Test Scores. Psychometrika 84, 186–211 (2019). https://doi.org/10.1007/s11336-018-9636-7
DOI: https://doi.org/10.1007/s11336-018-9636-7