Abstract

This chapter sets out the requirements an agreement index should satisfy, reviews the history of such indices, and considers research situations more complex than the basic case in which only two raters assign codes. On this basis a preference for a specific index is expressed. Three issues are highly relevant to achieving high agreement: the units that are coded, the raters who perform the coding task, and the categories that are used. Many aspects of these three issues are discussed so that future researchers are aware of them.
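
The preferred index itself is not named in this preview. As a minimal, illustrative sketch of what a chance-corrected agreement index for nominal data looks like, the code below computes Cohen's (1960) kappa for two raters, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e the agreement expected by chance from the raters' marginal distributions. The function name and the data are hypothetical and do not represent the chapter's recommended index.

```python
# Sketch of a chance-corrected agreement index: Cohen's (1960) kappa for two
# raters assigning nominal codes to the same units. Illustrative only.
from collections import Counter

def cohen_kappa(codes_a, codes_b):
    """Return kappa = (p_o - p_e) / (1 - p_e) for two raters' nominal codes."""
    assert len(codes_a) == len(codes_b), "both raters must code every unit"
    n = len(codes_a)
    # Observed proportion of agreement.
    p_o = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each rater's marginal distribution.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    categories = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two raters, ten units, three nominal categories.
rater_1 = ["a", "a", "b", "b", "c", "a", "b", "c", "c", "a"]
rater_2 = ["a", "a", "b", "c", "c", "a", "b", "b", "c", "a"]
print(round(cohen_kappa(rater_1, rater_2), 3))  # 0.697
```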

Author information

Corresponding author

Correspondence to Roel Popping.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Popping, R. (2019). Interrater Agreement. In: Introduction to Interrater Agreement for Nominal Data. Springer, Cham. https://doi.org/10.1007/978-3-030-11671-2_3
