Abstract
Generalizations of chance-corrected statistics for measuring inter-expert agreement on class-label assignments to data instances have traditionally relied on a marginalization argument over a variable group of experts. The same argument has also produced agreement measures for evaluating the class predictions of an isolated classifier against the (multiple) labels assigned by the group of experts. We show that these measures are not necessarily suitable for the more typical scenario of a fixed group of experts. We then propose novel, more meaningful, and less variable generalizations that quantify both the inter-expert agreement over the fixed group and a classifier's agreement with it in a multi-expert, multi-class setting, taking into account expert-specific biases and correlations.
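For background, the sketch below illustrates one standard chance-corrected agreement statistic of the kind the abstract refers to: Fleiss' kappa for a group of experts, whose chance term pools all experts' labels and thus ignores expert-specific marginal distributions. It is not the statistic proposed in the paper; the function name and toy data are illustrative only.

```python
# Minimal sketch (illustrative, not the paper's proposed measure): Fleiss' kappa,
# a chance-corrected agreement statistic for N items each labelled by n experts.
# counts[i][j] = number of experts assigning item i to class j.

def fleiss_kappa(counts):
    N = len(counts)          # number of items
    n = sum(counts[0])       # experts per item (assumed equal across items)
    k = len(counts[0])       # number of classes

    # Observed agreement: average proportion of agreeing expert pairs per item.
    p_obs = sum((sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts) / N

    # Expected chance agreement from the pooled class proportions.
    p_class = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    p_exp = sum(p * p for p in p_class)

    return (p_obs - p_exp) / (1.0 - p_exp)

# Toy data: 4 items, 3 experts, 2 classes.
print(fleiss_kappa([[3, 0], [2, 1], [0, 3], [1, 2]]))   # ~0.333
```

Note that the chance term p_exp above is computed from marginals pooled over all experts; the generalizations proposed in the paper instead account for expert-specific biases and correlations within the fixed group.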
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
Cite this paper
Shah, M. (2011). Generalized Agreement Statistics over Fixed Group of Experts. In: Gunopulos, D., Hofmann, T., Malerba, D., Vazirgiannis, M. (eds.) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2011. Lecture Notes in Computer Science, vol. 6913. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23808-6_13
DOI: https://doi.org/10.1007/978-3-642-23808-6_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23807-9
Online ISBN: 978-3-642-23808-6