
Inter-annotator Agreement


Abstract

This chapter touches upon several issues in the calculation and assessment of inter-annotator agreement. It gives an introduction to the theory behind agreement coefficients and examples of their application to linguistic annotation tasks. Specific examples explore variation in annotator performance due to heterogeneous data, complex labels, item difficulty, and annotator differences, showing how global agreement coefficients may mask these sources of variation, and how detailed agreement studies can give insight into both the annotation process and the nature of the underlying data. The chapter also reviews recent work on using machine learning to exploit the variation among annotators and learn detailed models from which accurate labels can be inferred. I therefore advocate an approach where agreement studies are not used merely as a means to accept or reject a particular annotation scheme, but as a tool for exploring patterns in the data that are being annotated.
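To make the chance correction behind such coefficients concrete, here is a minimal sketch (in Python; the data, label names, and function name are invented for illustration and are not taken from the chapter) that computes observed agreement, expected chance agreement, and Cohen's kappa for two annotators labelling the same items:

    from collections import Counter

    def cohen_kappa(labels_a, labels_b):
        """Cohen's kappa for two annotators assigning nominal labels to the same items."""
        n = len(labels_a)
        # Observed agreement: proportion of items given identical labels.
        a_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
        # Expected agreement: chance of agreeing on each label, estimated
        # from each annotator's own label distribution.
        dist_a, dist_b = Counter(labels_a), Counter(labels_b)
        a_e = sum((dist_a[lab] / n) * (dist_b[lab] / n) for lab in dist_a)
        # Chance-corrected agreement.
        return (a_o - a_e) / (1 - a_e)

    # Hypothetical toy data: two annotators tagging ten utterances
    # as statements ("stat") or information requests ("ireq").
    ann1 = ["stat", "stat", "ireq", "stat", "ireq", "stat", "stat", "ireq", "stat", "stat"]
    ann2 = ["stat", "ireq", "ireq", "stat", "ireq", "stat", "stat", "stat", "stat", "stat"]
    print(round(cohen_kappa(ann1, ann2), 3))  # 0.474: observed 0.8 vs. expected 0.62

Estimating the chance distribution from the annotators' pooled labels instead of each annotator's own distribution gives Scott's pi, and multi-annotator and weighted generalizations lead to coefficients such as Fleiss's kappa and Krippendorff's alpha; as the chapter stresses, any such single figure can mask variation across labels, items, and annotators.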



Acknowledgements

I thank the editors of this volume and two anonymous reviewers for valuable feedback and comments on an earlier draft. Any remaining errors and omissions are my own.

The work depicted here was sponsored by the U.S. Army. Statements and opinions expressed do not necessarily reflect the position or the policy of the United States Government, and no official endorsement should be inferred.

Author information


Correspondence to Ron Artstein.



Copyright information

© 2017 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Artstein, R. (2017). Inter-annotator Agreement. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_11


  • DOI: https://doi.org/10.1007/978-94-024-0881-2_11


  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-024-0879-9

  • Online ISBN: 978-94-024-0881-2

