
An Evaluation of Rater Agreement Indices Using Generalizability Theory

Conference paper in Quantitative Psychology (IMPS 2017), part of the book series Springer Proceedings in Mathematics & Statistics (PROMS, volume 233).

Abstract

This study compared several rater agreement indices using data simulated within a generalizability theory framework. Information from previous generalizability studies conducted with data from large-scale writing assessments was used to inform the variance components in the simulations. Rater agreement indices, including percent agreement, weighted and unweighted kappa, polychoric, Pearson, Spearman, and intraclass correlations, and Gwet’s AC1 and AC2, were compared with each other and with the generalizability coefficients. Results showed that some indices performed similarly, while others had values ranging from below 0.4 to over 0.8. The impact of the underlying score distributions, the number of score categories, rater/prompt variability, and rater/prompt assignment on these indices was also investigated.
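For readers unfamiliar with the indices compared above, the sketch below illustrates two of them, percent agreement and Cohen's (weighted and unweighted) kappa, for two raters scoring the same essays. It is a minimal Python illustration based on the standard formulas (Cohen, 1960, 1968), not the simulation code used in the study; the rater scores shown are hypothetical.

    import numpy as np

    def percent_agreement(r1, r2):
        # Proportion of essays on which the two raters give identical scores
        r1, r2 = np.asarray(r1), np.asarray(r2)
        return float(np.mean(r1 == r2))

    def cohens_kappa(r1, r2, categories, weighted=False):
        # Cohen's kappa for two raters; quadratic weights if weighted=True (Cohen, 1968)
        k = len(categories)
        idx = {c: i for i, c in enumerate(categories)}
        obs = np.zeros((k, k))
        for a, b in zip(r1, r2):
            obs[idx[a], idx[b]] += 1                             # observed joint score table
        obs /= obs.sum()
        exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))         # expected table under chance
        i, j = np.indices((k, k))
        w = (i - j) ** 2 / (k - 1) ** 2 if weighted else (i != j).astype(float)
        return 1 - (w * obs).sum() / (w * exp).sum()             # 1 - (observed/expected disagreement)

    # Hypothetical scores from two raters on a 1-4 rubric
    rater1 = [1, 2, 2, 3, 3, 4, 4, 2]
    rater2 = [1, 2, 3, 3, 2, 4, 3, 2]
    print(percent_agreement(rater1, rater2))                          # 0.625
    print(cohens_kappa(rater1, rater2, [1, 2, 3, 4]))                 # unweighted kappa
    print(cohens_kappa(rater1, rater2, [1, 2, 3, 4], weighted=True))  # quadratic-weighted kappa

The quadratic weights make disagreements that are further apart on the score scale count more heavily, which is why weighted and unweighted kappa can diverge as the number of score categories grows.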


References

  • Altman, D. G. (1991). Practical statistics for medical research. London: Chapman & Hall/CRC.
  • Banerjee, M., Capozzoli, M., McSweeney, L., & Sinha, D. (1999). Beyond kappa: A review of interrater agreement measures. The Canadian Journal of Statistics, 27(1), 3–23.
  • Breland, H. M., Bridgeman, B., & Fowles, M. E. (1999). Writing assessment in admission to higher education: Review and framework (College Board Report No. 99-3; GRE Board Research Report No. 96-12R; ETS RR No. 99-3).
  • Brennan, R. L. (2001). Generalizability theory. Springer.
  • Brenner, H., & Kliebsch, U. (1996). Dependence of weighted kappa coefficients on the number of categories. Epidemiology, 7, 199–202.
  • Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
  • Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220.
  • Fleiss, J. L. (1981). Statistical methods for rates and proportions. New York: Wiley.
  • Fleiss, J. L., & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33, 613–619.
  • Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61, 29–48.
  • Gwet, K. L. (2010). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (2nd ed.). Gaithersburg, MD: Advanced Analytics, LLC.
  • Gwet, K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (4th ed.). Gaithersburg, MD: Advanced Analytics, LLC.
  • Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
  • Lee, Y.-W., & Kantor, R. (2005). Dependability of new ESL writing test scores: Evaluating prototype tasks and alternative rating schemes (TOEFL Report MS-31, RR-05-14). http://onlinelibrary.wiley.com/doi/10.1002/j.2333-8504.2005.tb01991.x/epdf
  • Quarfoot, D., & Levine, R. A. (2016). How robust are multi-rater inter-rater reliability indices to changes in frequency distribution? The American Statistician, 70(4), 373–384.
  • Reardon, S. F., & Ho, A. D. (2014). Practical issues in estimating achievement gaps from coarsened data. https://cepa.stanford.edu/sites/default/files/reardon%20ho%20practical%20gap%20estimation%2025feb2014.pdf
  • Schuster, C. (2004). A note on the interpretation of weighted kappa and its relations to other rater agreement statistics for metric scales. Educational and Psychological Measurement, 64(2), 243–253.
  • Vanbelle, S. (2016). A new interpretation of the weighted kappa coefficients. Psychometrika, 81(2), 399–410.
  • Warrens, M. J. (2012). Some paradoxical results for the quadratically weighted kappa. Psychometrika, 77, 315–323.
  • Yang, J., & Chinchilli, V. M. (2011). Fixed-effects modeling of Cohen’s weighted kappa for bivariate multinomial data. Computational Statistics & Data Analysis, 55, 1061–1070.
  • Zhao, X., Liu, J. S., & Deng, K. (2013). Assumptions behind intercoder reliability indices. Annals of the International Communication Association, 36(1), 419–480.


Author information

Corresponding author

Correspondence to Dongmei Li.


Appendix: Detailed Results for FPFR

Table A1 Results for FPFR on Data 1
Table A2 Results for FPFR on Data 2


Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper


Cite this paper

Li, D., Yi, Q., Andrews, B. (2018). An Evaluation of Rater Agreement Indices Using Generalizability Theory. In: Wiberg, M., Culpepper, S., Janssen, R., González, J., Molenaar, D. (eds) Quantitative Psychology. IMPS 2017. Springer Proceedings in Mathematics & Statistics, vol 233. Springer, Cham. https://doi.org/10.1007/978-3-319-77249-3_7

