Evaluating CTT- and IRT-Based Single-Administration Estimates of Classification Consistency and Accuracy

  • Conference paper
In: New Developments in Quantitative Psychology

Part of the book series: Springer Proceedings in Mathematics & Statistics (PROMS, volume 66)

Abstract

The percentage of examinees who are classified consistently and accurately into proficiency levels is an important measurement property of tests used to classify candidates. Given suspected discrepancies between classical test theory (CTT)- and item response theory (IRT)-based single-administration estimates of decision consistency and accuracy (DC/DA), the two approaches were evaluated for accuracy and robustness under simulated conditions that varied test length, ability distribution, and the degree of local item dependence (LID). The CTT-based Livingston–Lewis method was found to underestimate the DC indices across all conditions and to be more sensitive to short tests and skewed ability distributions. The IRT-based Lee method showed small biases in most conditions, except under a high degree of LID. Violations of local independence had a much greater negative effect on the DA estimates than on the DC estimates for both methods.
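
To make the two indices concrete, the sketch below estimates DC and DA by brute-force simulation: it draws true proportion-correct scores, generates two parallel administrations with binomial error, and compares the resulting pass/fail classifications. The beta true-score distribution, binomial error model, cut score, and all names in the code are illustrative assumptions for exposition only; this is not the paper's Livingston–Lewis or Lee procedure, both of which recover these quantities analytically from a single administration.

```python
import numpy as np

# Brute-force check of decision consistency (DC) and decision accuracy (DA).
# All modeling choices below (beta true scores, binomial errors, the cut
# score) are illustrative assumptions, not the paper's estimation methods.
rng = np.random.default_rng(2013)

n_items = 40        # test length, one of the manipulated factors
cut_score = 24      # hypothetical raw passing score
n_examinees = 100_000

# True proportion-correct scores from a beta distribution (a common strong
# true-score assumption; skewing it mimics the ability-distribution factor).
true_p = rng.beta(8, 4, size=n_examinees)

# Two independent, parallel administrations with binomial error.
x1 = rng.binomial(n_items, true_p)
x2 = rng.binomial(n_items, true_p)

# Pass/fail classifications from observed scores and from true scores.
pass1 = x1 >= cut_score
pass2 = x2 >= cut_score
pass_true = true_p * n_items >= cut_score

dc = np.mean(pass1 == pass2)      # agreement across the two administrations
da = np.mean(pass1 == pass_true)  # agreement of observed with true status

print(f"DC = {dc:.3f}, DA = {da:.3f}")
```

Varying n_items or the beta parameters mimics the test-length and ability-distribution factors in the study; simulating LID would require correlated item errors, which the independent binomial draws here deliberately omit.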


References

  • American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: Author.

  • Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores (pp. 397–472). Reading, MA: Addison-Wesley.

  • Bourque, M. L., Goodman, D., Hambleton, R. K., & Han, N. (2004). Reliability estimates for the ABTE tests in elementary education, professional teaching knowledge, secondary mathematics and English/language arts (Final report). Leesburg, VA: Mid-Atlantic Psychometric Services.

  • Brennan, R. L. (2004). BB-CLASS: A computer program that uses the beta-binomial model for classification consistency and accuracy (Version 1.0, CASMA Research Report No. 9). Iowa City, IA: University of Iowa, Center for Advanced Studies in Measurement and Assessment. Available at http://www.education.uiowa.edu/casma

  • Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.

  • Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.

  • Deng, N. (2011). Evaluating IRT- and CTT-based methods of estimating classification consistency and accuracy indices from single administrations (Unpublished doctoral dissertation). Amherst, MA: University of Massachusetts.

  • Hambleton, R. K., & Novick, M. (1973). Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 10(3), 159–170.

  • Hanson, B. A., & Brennan, R. L. (1990). An investigation of classification consistency indexes estimated under alternative strong true score models. Journal of Educational Measurement, 27, 345–359.

  • Huynh, H. (1976). On the reliability of decisions in domain-referenced testing. Journal of Educational Measurement, 13, 253–264.

  • Huynh, H. (1990). Computation and statistical inference for decision consistency indexes based on the Rasch model. Journal of Educational Statistics, 15, 353–368.

  • Lee, W. (2010). Classification consistency and accuracy for complex assessments using item response theory. Journal of Educational Measurement, 47(1), 1–17.

  • Lee, W., Brennan, R. L., & Wan, L. (2009). Classification consistency and accuracy for complex assessments under the compound multinomial model. Applied Psychological Measurement, 33, 374–390.

  • Lee, W., Hanson, B. A., & Brennan, R. L. (2002). Estimating consistency and accuracy indices for multiple classifications. Applied Psychological Measurement, 26, 412–432.

  • Lee, W., & Kolen, M. J. (2008). IRT-CLASS: A computer program for item response theory classification consistency and accuracy (Version 2.0). Iowa City, IA: University of Iowa, Center for Advanced Studies in Measurement and Assessment. Available at http://www.education.uiowa.edu/casma

  • Li, S. (2006). Evaluating the consistency and accuracy of proficiency classifications using item response theory (Unpublished dissertation). Amherst, MA: University of Massachusetts.

  • Livingston, S. A., & Lewis, C. (1995). Estimating the consistency and accuracy of classifications based on test scores. Journal of Educational Measurement, 32, 179–197.

  • Muraki, E., & Bock, R. D. (2003). PARSCALE 4: IRT item analysis and test scoring for rating-scale data [Computer program]. Chicago, IL: Scientific Software International, Inc.

  • Rudner, L. M. (2001). Computing the expected proportions of misclassified examinees. Practical Assessment Research & Evaluation, 7(14). Available online: http://pareonline.net/getvn.asp?v=7&n=14

  • Rudner, L. M. (2005). Expected classification accuracy. Practical Assessment Research & Evaluation, 10(13). Available online: http://pareonline.net/getvn.asp?v=10&n=13

  • Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika (Monograph Supplement, 17).

  • Subkoviak, M. J. (1976). Estimating reliability from a single administration of a criterion-referenced test. Journal of Educational Measurement, 13, 265–276.

  • Swaminathan, H., Hambleton, R. K., & Algina, J. (1974). Reliability of criterion referenced tests: A decision-theoretic formulation. Journal of Educational Measurement, 11, 263–267.

  • Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analog for the 3PL model useful in testlet-based adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 245–269). Amsterdam: Kluwer Academic Publishers.

  • Wan, L., Brennan, R. L., & Lee, W. (2007). Estimating classification consistency for complex assessments (CASMA Research Report No. 22). Iowa City, IA: University of Iowa, Center for Advanced Studies in Measurement and Assessment. Available at http://www.education.uiowa.edu/casma

  • Wang, T., Kolen, M. J., & Harris, D. J. (2000). Psychometric properties of scale scores and performance levels for performance assessments using polytomous IRT. Journal of Educational Measurement, 37, 141–162.


Acknowledgment

The authors are grateful to the editor, Daniel Bolt, for valuable comments that strengthened the study considerably.

Author information

Corresponding author

Correspondence to Nina Deng.


Copyright information

© 2013 Springer Science+Business Media New York

About this paper

Cite this paper

Deng, N., Hambleton, R.K. (2013). Evaluating CTT- and IRT-Based Single-Administration Estimates of Classification Consistency and Accuracy. In: Millsap, R.E., van der Ark, L.A., Bolt, D.M., Woods, C.M. (eds) New Developments in Quantitative Psychology. Springer Proceedings in Mathematics & Statistics, vol 66. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-9348-8_15
