Abstract
This chapter sets out the requirements an agreement index should satisfy, reviews the history of such indices, and considers research situations more complex than the basic one in which only two raters assign codes. On this basis a preference for a specific index is expressed. Three issues are especially relevant for achieving high agreement: the units that are coded, the raters who perform the coding task, and the categories that are used. Many aspects related to these three issues are discussed so that future researchers are aware of them.
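To make concrete what a chance-corrected agreement index computes, the sketch below implements Cohen's kappa for the basic situation the abstract refers to: two raters assigning nominal codes to the same units. This is an illustration only, with made-up data; the chapter weighs several such indices against the stated requirements, and kappa is not necessarily the index it ends up preferring.

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two raters assigning nominal codes to the same units."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both raters must code the same units")
    n = len(codes_a)

    # Observed agreement: the proportion of units coded identically.
    p_o = sum(a == b for a, b in zip(codes_a, codes_b)) / n

    # Expected (chance) agreement from the raters' marginal distributions.
    marg_a, marg_b = Counter(codes_a), Counter(codes_b)
    p_e = sum(marg_a[c] * marg_b[c] for c in marg_a) / (n * n)

    # kappa = (p_o - p_e) / (1 - p_e); undefined when chance agreement is 1.
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two raters code ten units into categories "x", "y", "z".
rater_1 = ["x", "x", "y", "z", "y", "x", "z", "y", "x", "y"]
rater_2 = ["x", "x", "y", "z", "x", "x", "z", "y", "y", "y"]
print(round(cohens_kappa(rater_1, rater_2), 3))  # 0.688: p_o = 0.8, p_e = 0.36
```

Correcting for chance is what distinguishes an index like this from the raw proportion of agreement: here the raters agree on 80% of the units, but because 36% agreement would be expected by chance alone, kappa reports a lower value.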
Cite this chapter
Popping, R. (2019). Interrater Agreement. In: Introduction to Interrater Agreement for Nominal Data. Springer, Cham. https://doi.org/10.1007/978-3-030-11671-2_3