Measures of Agreement: Reliability, Classification Accuracy, and Classification Consistency

  • Sandip Sinharay
  • Matthew S. Johnson
Part of the Methodology of Educational Measurement and Assessment book series (MEMA)


Gierl, Cui, and Zhou (J Educ Meas 46:293–313, 2009), Cui, Gierl, and Chang (J Educ Meas 49:19–38, 2012), Templin and Bradshaw (J Classif 30:251–275, 2013), Wang, Song, Chen, Meng, and Ding (J Educ Meas 52:457–476, 2015), Johnson and Sinharay (J Educ Meas 55:635–664, 2018), and Johnson and Sinharay (J Educ Behav Stat, in press) suggested reliability-like measures for the estimates obtained from a diagnostic classification model. Most of these measures quantify the agreement either between the estimated skill and the true skill (classification accuracy) or between the skills estimated from parallel assessments (classification consistency). This paper reviews these measures and demonstrates some of them on a real data example.
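To make the notion of agreement concrete, the following is a minimal sketch that computes two generic agreement measures for a single binary skill: the proportion of matching classifications and Cohen's (1960) kappa. It is not one of the estimators proposed in the works cited above. The function name `agreement_measures` and the toy data are hypothetical. The sketch also assumes that the true skill is known, which holds only in simulation; in practice the cited measures are estimated from a fitted diagnostic classification model.

```python
from typing import Dict, Sequence


def agreement_measures(first: Sequence[int], second: Sequence[int]) -> Dict[str, float]:
    """Proportion agreement and Cohen's kappa for two binary (0/1) classifications.

    Hypothetical helper for illustration only; it assumes both classifications
    are fully observed, which is not the case for true skills in real data.
    """
    if len(first) != len(second) or len(first) == 0:
        raise ValueError("Both classifications must be non-empty and of equal length.")

    n = len(first)
    # Observed proportion of examinees classified identically.
    p_agree = sum(a == b for a, b in zip(first, second)) / n

    # Agreement expected by chance if the two classifications were independent.
    p_first = sum(first) / n
    p_second = sum(second) / n
    p_chance = p_first * p_second + (1 - p_first) * (1 - p_second)

    # Kappa is undefined when chance agreement equals 1 (no variation at all).
    kappa = (p_agree - p_chance) / (1 - p_chance) if p_chance < 1 else float("nan")
    return {"proportion_agreement": p_agree, "kappa": kappa}


# Toy data (assumed): 1 = master, 0 = non-master, for eight examinees.
estimated_skill = [1, 1, 0, 0, 1, 0, 1, 1]
true_skill = [1, 0, 0, 0, 1, 1, 1, 1]

result = agreement_measures(estimated_skill, true_skill)
print(result)
```

Passing an estimated and a true classification gives an accuracy-type measure; passing the classifications from two parallel assessments gives a consistency-type measure with the same arithmetic.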




  1. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
  2. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
  3. Cui, Y., Gierl, M., & Chang, H.-H. (2012). Estimating classification consistency and accuracy for cognitive diagnostic assessment. Journal of Educational Measurement, 49, 19–38.
  4. de la Torre, J., & Lee, Y.-S. (2013). Evaluating the Wald test for item-level comparison of saturated and reduced models in cognitive diagnosis. Journal of Educational Measurement, 50, 355–373.
  5. DiBello, L. V., Stout, W. F., & Roussos, L. A. (1995). Unified cognitive/psychometric diagnostic assessment likelihood-based classification techniques. In P. D. Nichols, S. F. Chipman, & R. L. Brennan (Eds.), Cognitively diagnostic assessment (pp. 361–390). Iowa City, IA: Lawrence Erlbaum.
  6. Gierl, M. J., Cui, Y., & Zhou, J. (2009). Reliability and attribute-based scoring in cognitive diagnostic assessment. Journal of Educational Measurement, 46, 293–313.
  7. Goodman, L. A., & Kruskal, W. H. (1954). Measures of association for cross classifications. Journal of the American Statistical Association, 49, 732–764.
  8. Haberman, S. J. (2005). When can subscores have value? (ETS Research Report No. RR-05-08). Princeton, NJ: ETS.
  9. Haladyna, S. J., & Kramer, G. A. (2004). The validity of subscores for a credentialing test. Evaluation and the Health Professions, 24, 349–368.
  10. Hanson, B. A., & Brennan, R. L. (1990). An investigation of classification consistency indexes estimated under alternative strong true score models. Journal of Educational Measurement, 27, 345–359.
  11. Harris, D. J., & Hanson, B. A. (1991). Methods of examining the usefulness of subscores. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.
  12. Henson, R., Templin, J., & Willse, J. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74, 191–210.
  13. Johnson, M. S., & Sinharay, S. (in press). The reliability of the posterior probability of skill attainment in diagnostic classification models. Journal of Educational and Behavioral Statistics.
  14. Johnson, M. S., & Sinharay, S. (2018). Measures of agreement to assess attribute-level classification accuracy and consistency for cognitive diagnostic assessments. Journal of Educational Measurement, 55, 635–664.
  15. Lee, W.-C., Hanson, B. A., & Brennan, R. L. (2002). Estimating consistency and accuracy indices for multiple classifications. Applied Psychological Measurement, 26, 412–432.
  16. Lee, Y.-S., Park, Y. S., & Taylan, D. (2011). A cognitive diagnostic modeling of attribute mastery in Massachusetts, Minnesota, and the U.S. national sample using the TIMSS 2007. International Journal of Testing, 11, 144–177.
  17. Leighton, J., Gierl, M., & Hunka, S. M. (2004). The attribute hierarchy method for cognitive assessment: A variation on Tatsuoka’s rule-space approach. Journal of Educational Measurement, 41, 205–237.
  18. Linfoot, E. (1957). An informational measure of correlation. Information and Control, 1, 85–89.
  19. Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison Wesley.
  20. Maris, E. (1999). Estimating multiple classification latent class models. Psychometrika, 64, 187–212.
  21. Mislevy, R. J., Almond, R. G., Steinberg, L. S., & Yan, D. (1999). Bayes nets in educational assessment: Where do the numbers come from? In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, Stockholm, Sweden (pp. 437–446).
  22. Oliveri, M. E., & von Davier, M. (2011). Investigation of model fit and score scale comparability in international assessments. Psychological Test and Assessment Modeling, 53, 315–333.
  23. Park, J. Y., Johnson, M. S., & Lee, Y.-S. (2015). Posterior predictive model checks for cognitive diagnostic models. International Journal of Quantitative Research in Education, 2(3/4), 244.
  24. Robitzsch, A., Kiefer, T., George, A. C., & Uenlue, A. (2014). CDM: Cognitive diagnosis modeling [Software manual]. (R package version 4.1).
  25. Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. New York, NY: Guilford.
  26. Sinharay, S., & Haberman, S. J. (2009). How much can we reliably know about what examinees know? Measurement: Interdisciplinary Research and Perspectives, 6, 46–49.
  27. Templin, J., & Bradshaw, L. (2013). Measuring the reliability of diagnostic classification model examinee estimates. Journal of Classification, 30, 251–275.
  28. Templin, J., & Hoffman, L. (2013). Obtaining diagnostic classification model estimates using Mplus. Educational Measurement: Issues and Practice, 32(2), 37–50.
  29. von Davier, M. (2008). A general diagnostic model applied to language testing data. British Journal of Mathematical and Statistical Psychology, 61, 287–307.
  30. von Davier, M. (2014). The log-linear cognitive diagnostic model (LCDM) as a special case of the general diagnostic model (GDM) (ETS Research Report No. RR-14-40). Princeton, NJ: ETS.
  31. von Davier, M., & Haberman, S. J. (2014). Hierarchical diagnostic classification models morphing into unidimensional ‘diagnostic’ classification models—a commentary. Psychometrika, 79, 340–346.
  32. Wang, W., Song, L., Chen, P., Meng, Y., & Ding, S. (2015). Attribute-level and pattern-level classification consistency and accuracy indices for diagnostic assessment. Journal of Educational Measurement, 52, 457–476.
  33. Xu, X., & von Davier, M. (2006). Cognitive diagnosis for NAEP proficiency data (ETS Research Report No. RR-06-08). Princeton, NJ: ETS.
  34. Youden, W. (1950). Index for rating diagnostic tests. Cancer, 3(1), 32–35.
  35. Yule, G. (1912). On the methods of measuring the association between two attributes. Journal of the Royal Statistical Society, 75, 579–652.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Educational Testing Service, Princeton, USA
