Abstract
This chapter addresses several issues in the calculation and assessment of inter-annotator agreement. It introduces the theory behind agreement coefficients and gives examples of their application to linguistic annotation tasks. Specific examples explore variation in annotator performance due to heterogeneous data, complex labels, item difficulty, and annotator differences, showing how global agreement coefficients may mask these sources of variation, and how detailed agreement studies can give insight into both the annotation process and the nature of the underlying data. The chapter also reviews recent work on using machine learning to exploit the variation among annotators and learn detailed models from which accurate labels can be inferred. It therefore advocates an approach in which agreement studies are used not merely as a means to accept or reject a particular annotation scheme, but as a tool for exploring patterns in the data being annotated.
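For concreteness, the chance-corrected agreement computed by coefficients such as Cohen's kappa (Cohen 1960) can be sketched as follows. This is an illustrative implementation for the two-annotator, nominal-label case, not code from the chapter itself:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen 1960).

    kappa = (A_o - A_e) / (1 - A_e), where A_o is the observed
    agreement and A_e is the agreement expected by chance, based on
    each annotator's individual label distribution.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: proportion of items given identical labels.
    a_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Expected agreement: for each label, the product of the two
    # annotators' marginal probabilities of using it, summed over labels.
    dist_a = Counter(labels_a)
    dist_b = Counter(labels_b)
    a_e = sum(dist_a[k] * dist_b[k] for k in dist_a) / n ** 2
    return (a_o - a_e) / (1 - a_e)

# Two annotators label four items; they agree on three of them.
print(cohens_kappa(["yes", "yes", "no", "no"],
                   ["yes", "no", "no", "no"]))  # 0.5
```

Here observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5: the annotators cover half the distance between chance and perfect agreement. Coefficients such as Scott's pi or Krippendorff's alpha differ mainly in how the expected-agreement term is estimated.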
Acknowledgements
I thank the editors of this volume and two anonymous reviewers for valuable feedback and comments on an earlier draft. Any remaining errors and omissions are my own.
The work depicted here was sponsored by the U.S. Army. Statements and opinions expressed do not necessarily reflect the position or the policy of the United States Government, and no official endorsement should be inferred.
Copyright information
© 2017 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Artstein, R. (2017). Inter-annotator Agreement. In: Ide, N., Pustejovsky, J. (eds) Handbook of Linguistic Annotation. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_11
DOI: https://doi.org/10.1007/978-94-024-0881-2_11
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-024-0879-9
Online ISBN: 978-94-024-0881-2
eBook Packages: Social Sciences (R0)