Abstract
This study compared several rater agreement indices using data simulated within a generalizability theory framework. Variance components for the simulations were informed by previous generalizability studies of data from large-scale writing assessments. Rater agreement indices, including percent agreement, weighted and unweighted kappa, polychoric, Pearson, Spearman, and intraclass correlations, and Gwet's AC1 and AC2, were compared with one another and with generalizability coefficients. Results showed that some indices performed similarly, while others produced values ranging from below 0.4 to over 0.8. The study also investigated how the underlying score distributions, the number of score categories, rater/prompt variability, and rater/prompt assignment affected these indices.
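For readers who want to see how several of these indices are defined, the sketch below (hypothetical Python, not the authors' code; the example scores are made up) computes percent agreement, unweighted Cohen's kappa, quadratically weighted kappa, and Gwet's AC1 for two raters scoring on a four-category scale.

# Minimal sketch (assumed code, not from the paper): percent agreement,
# Cohen's kappa, quadratically weighted kappa, and Gwet's AC1 for two raters.

def agreement_indices(r1, r2, categories=(1, 2, 3, 4)):
    n = len(r1)
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}

    # p[i][j]: proportion of essays scored category i by rater 1 and j by rater 2.
    p = [[0.0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        p[idx[a]][idx[b]] += 1.0 / n

    row = [sum(p[i]) for i in range(k)]                       # rater 1 marginals
    col = [sum(p[i][j] for i in range(k)) for j in range(k)]  # rater 2 marginals

    po = sum(p[i][i] for i in range(k))          # observed (percent) agreement
    pe = sum(row[i] * col[i] for i in range(k))  # chance agreement for kappa
    kappa = (po - pe) / (1 - pe)

    # Quadratic weights credit near-agreement: w = 1 - ((i - j) / (k - 1))**2.
    w = [[1 - ((i - j) / (k - 1)) ** 2 for j in range(k)] for i in range(k)]
    po_w = sum(w[i][j] * p[i][j] for i in range(k) for j in range(k))
    pe_w = sum(w[i][j] * row[i] * col[j] for i in range(k) for j in range(k))
    kappa_w = (po_w - pe_w) / (1 - pe_w)

    # Gwet's AC1 replaces kappa's chance term with one based on average
    # category prevalence, which keeps it stable when marginals are skewed.
    pi = [(row[i] + col[i]) / 2 for i in range(k)]
    pe_g = sum(q * (1 - q) for q in pi) / (k - 1)
    ac1 = (po - pe_g) / (1 - pe_g)

    return {"percent": po, "kappa": kappa, "weighted_kappa": kappa_w, "AC1": ac1}

# Example: ten essays scored 1-4 by two raters.
r1 = [1, 2, 2, 3, 3, 3, 4, 4, 2, 1]
r2 = [1, 2, 3, 3, 3, 2, 4, 3, 2, 1]
print(agreement_indices(r1, r2))

A generalizability coefficient, by contrast, is built from estimated variance components rather than from the rater cross-tabulation; for a persons × raters random-effects design with n_r raters, it takes the form Eρ² = σ²_p / (σ²_p + σ²_pr,e / n_r) (Brennan, 2001), which helps explain why it can diverge from the agreement indices above.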
References
Altman, D. G. (1991). Practical statistics for medical research. London: Chapman & Hall/CRC.
Banerjee, M., Capozzoli, M., McSweeney, L., & Sinha, D. (1999). Beyond kappa: A review of interrater agreement measures. The Canadian Journal of Statistics, 27(1), 3–23.
Breland, H. M., Bridgeman, B., & Fowles, M. E. (1999). Writing assessment in admission to higher education: Review and framework. (College Board Report No. 99-3; GRE Board Research Report No. 96-12R; ETS RR No. 99-3).
Brennan, R. L. (2001). Generalizability theory. New York: Springer-Verlag.
Brenner, H., & Kliebsch, U. (1996). Dependence of weighted kappa coefficients on the number of categories. Epidemiology, 7, 199–202.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220.
Fleiss, J. L. (1981). Statistical methods for rates and proportions. New York: Wiley.
Fleiss, J. L., & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33, 613–619.
Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61, 29–48.
Gwet, K. L. (2010). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (2nd ed.). Gaithersburg, MD: Advanced Analytics, LLC.
Gwet, K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (4th ed.). Gaithersburg, MD: Advanced Analytics, LLC.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
Lee, Y.-W., & Kantor, R. (2005). Dependability of new ESL writing test scores: Evaluating prototype tasks and alternative rating schemes (TOEFL Report MS-31, RR-05-14). http://onlinelibrary.wiley.com/doi/10.1002/j.2333-8504.2005.tb01991.x/epdf.
Quarfoot, D., & Levine, R. A. (2016). How robust are multi-rater inter-rater reliability indices to changes in frequency distribution? The American Statistician, 70(4), 373–384.
Reardon, S. F., & Ho, A. D. (2014). Practical issues in estimating achievement gaps from coarsened data. https://cepa.stanford.edu/sites/default/files/reardon%20ho%20practical%20gap%20estimation%2025feb2014.pdf.
Schuster, C. (2004). A note on the interpretation of weighted kappa and its relations to other rater agreement statistics for metric scales. Educational and Psychological Measurement, 64(2), 243–253.
Vanbelle, S. (2016). A new interpretation of the weighted kappa coefficients. Psychometrika, 81(2), 399–410.
Warrens, M. J. (2012). Some paradoxical results for the quadratically weighted kappa. Psychometrika, 77, 315–323.
Yang, J., & Chinchilli, V. M. (2011). Fixed-effects modeling of Cohen’s weighted kappa for bivariate multinomial data. Computational Statistics & Data Analysis, 55, 1061–1070.
Zhao, X., Liu, J. S., & Deng, K. (2013). Assumptions behind intercoder reliability indices. Annals of the International Communication Association, 36(1), 419–480.
Appendix: Detailed Results for FPFR
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
Cite this paper
Li, D., Yi, Q., Andrews, B. (2018). An Evaluation of Rater Agreement Indices Using Generalizability Theory. In: Wiberg, M., Culpepper, S., Janssen, R., González, J., Molenaar, D. (eds) Quantitative Psychology. IMPS 2017. Springer Proceedings in Mathematics & Statistics, vol 233. Springer, Cham. https://doi.org/10.1007/978-3-319-77249-3_7
DOI: https://doi.org/10.1007/978-3-319-77249-3_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77248-6
Online ISBN: 978-3-319-77249-3
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)