Generalized Agreement Statistics over Fixed Group of Experts

  • Mohak Shah
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6913)


Generalizations of chance-corrected statistics for measuring inter-expert agreement on class-label assignments have traditionally relied on a marginalization argument over a variable group of experts. The same argument has also yielded agreement measures for evaluating the class predictions of an isolated classifier against the multiple labels assigned by the group of experts. We show that these measures are not necessarily suitable for the more typical scenario of a fixed group of experts. We then propose novel, more meaningful, and less variable generalizations, both for quantifying inter-expert agreement over the fixed group and for assessing a classifier's output against it in a multi-expert, multi-class setting, by taking expert-specific biases and correlations into account.
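For concreteness, the classical chance-corrected statistic for a fixed number of raters per item, Fleiss' kappa (Fleiss, 1971), can be sketched as below. This is the kind of baseline measure the paper generalizes, not the paper's proposed statistic; the function name and layout are illustrative.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for multiple raters and nominal classes.

    counts[i][j] = number of experts assigning item i to class j.
    Every row must sum to the same number of raters n.
    """
    N = len(counts)          # number of items
    n = sum(counts[0])       # raters per item (fixed across items)
    k = len(counts[0])       # number of classes

    # Per-item observed agreement: fraction of agreeing rater pairs,
    # averaged over all items.
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts
    ) / N

    # Marginal class proportions pooled over all items and raters.
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)   # agreement expected by chance alone

    # Chance-corrected agreement.
    return (P_bar - P_e) / (1 - P_e)
```

For example, three items each labeled identically by three experts give kappa = 1 (perfect agreement), while near-chance label assignments drive kappa toward 0 or below.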


Keywords: Agreement statistics · Classifier evaluation · Multiple experts



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Mohak Shah
  1. Accenture Technology Labs, Chicago, USA
