Evaluating CTT- and IRT-Based Single-Administration Estimates of Classification Consistency and Accuracy

  • Conference paper
In: New Developments in Quantitative Psychology

Part of the book series: Springer Proceedings in Mathematics & Statistics (PROMS, volume 66)

Abstract

The percentage of examinees who are classified consistently and accurately into proficiency levels is an important measurement property of tests used to classify candidates. Given suspected discrepancies between classical test theory (CTT)- and item response theory (IRT)-based single-administration estimates of decision consistency and accuracy (DC/DA), the two approaches were evaluated for accuracy and robustness under simulated conditions that varied test length, ability distribution, and the degree of local item dependence (LID). The CTT-based Livingston–Lewis method was found to underestimate the DC indices across all conditions and to be more sensitive to short tests and skewed ability distributions. The IRT-based Lee method showed small biases in most conditions, except under a high degree of LID. Violations of local independence had a much greater negative effect on the DA estimates than on the DC estimates for both methods.
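
To make the two indices concrete, the sketch below estimates DC and DA by brute-force simulation: it draws true proportion-correct scores, generates two parallel administrations with binomial error, and compares the resulting pass/fail classifications. The beta true-score distribution, binomial error model, cut score, and all names in the code are illustrative assumptions for exposition only; this is not the paper's Livingston–Lewis or Lee procedure, both of which recover these quantities analytically from a single administration.

```python
import numpy as np

# Brute-force check of decision consistency (DC) and decision accuracy (DA).
# All modeling choices below (beta true scores, binomial errors, the cut
# score) are illustrative assumptions, not the paper's estimation methods.
rng = np.random.default_rng(2013)

n_items = 40        # test length, one of the manipulated factors
cut_score = 24      # hypothetical raw passing score
n_examinees = 100_000

# True proportion-correct scores from a beta distribution (a common strong
# true-score assumption; skewing it mimics the ability-distribution factor).
true_p = rng.beta(8, 4, size=n_examinees)

# Two independent, parallel administrations with binomial error.
x1 = rng.binomial(n_items, true_p)
x2 = rng.binomial(n_items, true_p)

# Pass/fail classifications from observed scores and from true scores.
pass1 = x1 >= cut_score
pass2 = x2 >= cut_score
pass_true = true_p * n_items >= cut_score

dc = np.mean(pass1 == pass2)      # agreement across the two administrations
da = np.mean(pass1 == pass_true)  # agreement of observed with true status

print(f"DC = {dc:.3f}, DA = {da:.3f}")
```

Varying n_items or the beta parameters mimics the test-length and ability-distribution factors in the study; simulating LID would require correlated item errors, which the independent binomial draws here deliberately omit.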


References

  • American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: Author.

  • Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores (pp. 397–472). Reading, MA: Addison-Wesley.

  • Bourque, M. L., Goodman, D., Hambleton, R. K., & Han, N. (2004). Reliability estimates for the ABTE tests in elementary education, professional teaching knowledge, secondary mathematics and English/language arts (Final report). Leesburg, VA: Mid-Atlantic Psychometric Services.

  • Brennan, R. L. (2004). BB-CLASS: A computer program that uses the beta-binomial model for classification consistency and accuracy (Version 1.0, CASMA Research Report No. 9). Iowa City, IA: University of Iowa, Center for Advanced Studies in Measurement and Assessment. Available at http://www.education.uiowa.edu/casma

  • Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.

  • Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.

  • Deng, N. (2011). Evaluating IRT- and CTT-based methods of estimating classification consistency and accuracy indices from single administrations (Unpublished doctoral dissertation). Amherst, MA: University of Massachusetts.

  • Hambleton, R. K., & Novick, M. (1973). Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 10(3), 159–170.

  • Hanson, B. A., & Brennan, R. L. (1990). An investigation of classification consistency indexes estimated under alternative strong true score models. Journal of Educational Measurement, 27, 345–359.

  • Huynh, H. (1976). On the reliability of decisions in domain-referenced testing. Journal of Educational Measurement, 13, 253–264.

  • Huynh, H. (1990). Computation and statistical inference for decision consistency indexes based on the Rasch model. Journal of Educational Statistics, 15, 353–368.

  • Lee, W. (2010). Classification consistency and accuracy for complex assessments using item response theory. Journal of Educational Measurement, 47(1), 1–17.

  • Lee, W., Brennan, R. L., & Wan, L. (2009). Classification consistency and accuracy for complex assessments under the compound multinomial model. Applied Psychological Measurement, 33, 374–390.

  • Lee, W., Hanson, B. A., & Brennan, R. L. (2002). Estimating consistency and accuracy indices for multiple classifications. Applied Psychological Measurement, 26, 412–432.

  • Lee, W., & Kolen, M. J. (2008). IRT-CLASS: A computer program for item response theory classification consistency and accuracy (Version 2.0). Iowa City, IA: University of Iowa, Center for Advanced Studies in Measurement and Assessment. Available at http://www.education.uiowa.edu/casma

  • Li, S. (2006). Evaluating the consistency and accuracy of proficiency classifications using item response theory (Unpublished dissertation). Amherst, MA: University of Massachusetts.

  • Livingston, S. A., & Lewis, C. (1995). Estimating the consistency and accuracy of classifications based on test scores. Journal of Educational Measurement, 32, 179–197.

  • Muraki, E., & Bock, R. D. (2003). PARSCALE 4: IRT item analysis and test scoring for rating-scale data [Computer program]. Chicago, IL: Scientific Software International, Inc.

  • Rudner, L. M. (2001). Computing the expected proportions of misclassified examinees. Practical Assessment Research & Evaluation, 7(14). Available online: http://pareonline.net/getvn.asp?v=7&n=14

  • Rudner, L. M. (2005). Expected classification accuracy. Practical Assessment Research & Evaluation, 10(13). Available online: http://pareonline.net/getvn.asp?v=10&n=13

  • Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika (Monograph Supplement, 17).

  • Subkoviak, M. J. (1976). Estimating reliability from a single administration of a criterion-referenced test. Journal of Educational Measurement, 13, 265–276.

  • Swaminathan, H., Hambleton, R. K., & Algina, J. (1974). Reliability of criterion referenced tests: A decision-theoretic formulation. Journal of Educational Measurement, 11, 263–267.

  • Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analog for the 3PL model useful in testlet-based adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 245–269). Amsterdam: Kluwer Academic Publishers.

  • Wan, L., Brennan, R. L., & Lee, W. (2007). Estimating classification consistency for complex assessments (CASMA Research Report No. 22). Iowa City, IA: University of Iowa, Center for Advanced Studies in Measurement and Assessment. Available at http://www.education.uiowa.edu/casma

  • Wang, T., Kolen, M. J., & Harris, D. J. (2000). Psychometric properties of scale scores and performance levels for performance assessments using polytomous IRT. Journal of Educational Measurement, 37, 141–162.


Acknowledgment

The authors are grateful to the editor, Daniel Bolt, for valuable comments that strengthened the study considerably.

Author information

Corresponding author

Correspondence to Nina Deng.


Copyright information

© 2013 Springer Science+Business Media New York

About this paper

Cite this paper

Deng, N., Hambleton, R.K. (2013). Evaluating CTT- and IRT-Based Single-Administration Estimates of Classification Consistency and Accuracy. In: Millsap, R.E., van der Ark, L.A., Bolt, D.M., Woods, C.M. (eds) New Developments in Quantitative Psychology. Springer Proceedings in Mathematics & Statistics, vol 66. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-9348-8_15
