Abstract
This chapter sets out the requirements an agreement index should satisfy, reviews the history of such indices, and considers research situations more complex than the basic one in which only two raters assign codes. On this basis a preference for a specific index is expressed. Three issues are especially relevant for achieving high agreement: the units that are coded, the raters who perform the coding task, and the categories that are used. Many aspects related to these three issues are discussed so that future researchers are aware of them.
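To make concrete what a chance-corrected agreement index computes, the sketch below implements Cohen's kappa for the basic situation the abstract refers to: two raters assigning nominal codes to the same units. This is an illustration only, with made-up data; the chapter weighs several such indices against the stated requirements, and kappa is not necessarily the index it ends up preferring.

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two raters assigning nominal codes to the same units."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both raters must code the same units")
    n = len(codes_a)

    # Observed agreement: the proportion of units coded identically.
    p_o = sum(a == b for a, b in zip(codes_a, codes_b)) / n

    # Expected (chance) agreement from the raters' marginal distributions.
    marg_a, marg_b = Counter(codes_a), Counter(codes_b)
    p_e = sum(marg_a[c] * marg_b[c] for c in marg_a) / (n * n)

    # kappa = (p_o - p_e) / (1 - p_e); undefined when chance agreement is 1.
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two raters code ten units into categories "x", "y", "z".
rater_1 = ["x", "x", "y", "z", "y", "x", "z", "y", "x", "y"]
rater_2 = ["x", "x", "y", "z", "x", "x", "z", "y", "y", "y"]
print(round(cohens_kappa(rater_1, rater_2), 3))  # 0.688: p_o = 0.8, p_e = 0.36
```

Correcting for chance is what distinguishes an index like this from the raw proportion of agreement: here the raters agree on 80% of the units, but because 36% agreement would be expected by chance alone, kappa reports a lower value.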
Cite this chapter
Popping, R. (2019). Interrater Agreement. In: Introduction to Interrater Agreement for Nominal Data. Springer, Cham. https://doi.org/10.1007/978-3-030-11671-2_3