Abstract
This study compared several rater agreement indices using data simulated within a generalizability theory framework. Variance components for the simulations were informed by previous generalizability studies of data from large-scale writing assessments. Rater agreement indices, including percent agreement, weighted and unweighted kappa, polychoric, Pearson, Spearman, and intraclass correlations, and Gwet's AC1 and AC2, were compared with one another and with generalizability coefficients. Results showed that some indices performed similarly, while others produced values ranging from below 0.4 to over 0.8. The study also investigated how the underlying score distributions, the number of score categories, rater/prompt variability, and rater/prompt assignment affected these indices.
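For readers who want to see how several of these indices are defined, the sketch below (hypothetical Python, not the authors' code; the example scores are made up) computes percent agreement, unweighted Cohen's kappa, quadratically weighted kappa, and Gwet's AC1 for two raters scoring on a four-category scale.

# Minimal sketch (assumed code, not from the paper): percent agreement,
# Cohen's kappa, quadratically weighted kappa, and Gwet's AC1 for two raters.

def agreement_indices(r1, r2, categories=(1, 2, 3, 4)):
    n = len(r1)
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}

    # p[i][j]: proportion of essays scored category i by rater 1 and j by rater 2.
    p = [[0.0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        p[idx[a]][idx[b]] += 1.0 / n

    row = [sum(p[i]) for i in range(k)]                       # rater 1 marginals
    col = [sum(p[i][j] for i in range(k)) for j in range(k)]  # rater 2 marginals

    po = sum(p[i][i] for i in range(k))          # observed (percent) agreement
    pe = sum(row[i] * col[i] for i in range(k))  # chance agreement for kappa
    kappa = (po - pe) / (1 - pe)

    # Quadratic weights credit near-agreement: w = 1 - ((i - j) / (k - 1))**2.
    w = [[1 - ((i - j) / (k - 1)) ** 2 for j in range(k)] for i in range(k)]
    po_w = sum(w[i][j] * p[i][j] for i in range(k) for j in range(k))
    pe_w = sum(w[i][j] * row[i] * col[j] for i in range(k) for j in range(k))
    kappa_w = (po_w - pe_w) / (1 - pe_w)

    # Gwet's AC1 replaces kappa's chance term with one based on average
    # category prevalence, which keeps it stable when marginals are skewed.
    pi = [(row[i] + col[i]) / 2 for i in range(k)]
    pe_g = sum(q * (1 - q) for q in pi) / (k - 1)
    ac1 = (po - pe_g) / (1 - pe_g)

    return {"percent": po, "kappa": kappa, "weighted_kappa": kappa_w, "AC1": ac1}

# Example: ten essays scored 1-4 by two raters.
r1 = [1, 2, 2, 3, 3, 3, 4, 4, 2, 1]
r2 = [1, 2, 3, 3, 3, 2, 4, 3, 2, 1]
print(agreement_indices(r1, r2))

A generalizability coefficient, by contrast, is built from estimated variance components rather than from the rater cross-tabulation; for a persons × raters random-effects design with n_r raters, it takes the form Eρ² = σ²_p / (σ²_p + σ²_pr,e / n_r) (Brennan, 2001), which helps explain why it can diverge from the agreement indices above.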
References
Altman, D. G. (1991). Practical statistics for medical research. London: Chapman & Hall/CRC.
Banerjee, M., Capozzoli, M., McSweeney, L., & Sinha, D. (1999). Beyond kappa: A review of interrater agreement measures. The Canadian Journal of Statistics, 27(1), 3–23.
Breland, H. M., Bridgeman, B., & Fowles, M. E. (1999). Writing assessment in admission to higher education: Review and framework. (College Board Report No. 99-3; GRE Board Research Report No. 96-12R; ETS RR No. 99-3).
Brennan, R. L. (2001). Generalizability theory. New York: Springer-Verlag.
Brenner, H., & Kliebsch, U. (1996). Dependence of weighted kappa coefficients on the number of categories. Epidemiology, 7, 199–202.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220.
Fleiss, J. L. (1981). Statistical methods for rates and proportions. New York: Wiley.
Fleiss, J. L., & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33, 613–619.
Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61, 29–48.
Gwet, K. L. (2010). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (2nd ed.). Gaithersburg, MD: Advanced Analytics, LLC.
Gwet, K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (4th ed.). Gaithersburg, MD: Advanced Analytics, LLC.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
Lee, Y.-W., & Kantor, R. (2005). Dependability of new ESL writing test scores: Evaluating prototype tasks and alternative rating schemes (TOEFL Report MS-31, RR-05-14). http://onlinelibrary.wiley.com/doi/10.1002/j.2333-8504.2005.tb01991.x/epdf.
Quarfoot, D., & Levine, R. A. (2016). How robust are multi-rater inter-rater reliability indices to changes in frequency distribution? The American Statistician, 70(4), 373–384.
Reardon, S. F., & Ho, A. D. (2014). Practical issues in estimating achievement gaps from coarsened data. https://cepa.stanford.edu/sites/default/files/reardon%20ho%20practical%20gap%20estimation%2025feb2014.pdf.
Schuster, C. (2004). A note on the interpretation of weighted kappa and its relations to other rater agreement statistics for metric scales. Educational and Psychological Measurement, 64(2), 243–253.
Vanbelle, S. (2016). A new interpretation of the weighted kappa coefficients. Psychometrika, 81(2), 399–410.
Warrens, M. J. (2012). Some paradoxical results for the quadratically weighted kappa. Psychometrika, 77, 315–323.
Yang, J., & Chinchilli, V. M. (2011). Fixed-effects modeling of Cohen’s weighted kappa for bivariate multinomial data. Computational Statistics & Data Analysis, 55, 1061–1070.
Zhao, X., Liu, J. S., & Deng, K. (2013). Assumptions behind intercoder reliability indices. Annals of the International Communication Association, 36(1), 419–480.
Appendix: Detailed Results for FPFR
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
Cite this paper
Li, D., Yi, Q., Andrews, B. (2018). An Evaluation of Rater Agreement Indices Using Generalizability Theory. In: Wiberg, M., Culpepper, S., Janssen, R., González, J., Molenaar, D. (eds) Quantitative Psychology. IMPS 2017. Springer Proceedings in Mathematics & Statistics, vol 233. Springer, Cham. https://doi.org/10.1007/978-3-319-77249-3_7
DOI: https://doi.org/10.1007/978-3-319-77249-3_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77248-6
Online ISBN: 978-3-319-77249-3
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)