
Analyses of Testlet Data


Abstract

A testlet is a set of items linked by a common stimulus, and testlets are commonly used in educational and psychological tests. Such a linkage may make the items within a testlet locally dependent. There are three major approaches to analyzing testlet-based items. First, one can fit standard item response theory (IRT) models and ignore the possible local dependence. Second, one can transform the items in a testlet into a single super (polytomous) item and fit polytomous IRT models to the transformed data. Third, one can fit testlet response models, which were developed to account for the local dependence. This study compared the performance of these three approaches in recovering person measures and test reliability through simulations. The polytomous-item approach performed very well when data were generated from testlet response models or when data had chain effects between adjacent items. In contrast, fitting standard IRT models tended to overestimate test reliability when data were generated from testlet response models, and to underestimate or overestimate it when the data had chain effects. Likewise, fitting testlet response models might underestimate or overestimate test reliability when the data had chain effects. Thus, when person measures and their measurement precision (test reliability) are the major concern, the polytomous-item approach is recommended.
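
To make the comparison concrete, the sketch below writes out commonly used forms of the three approaches in LaTeX notation. The symbols (person i, item j, testlet d(j), testlet effect gamma, step parameters delta) are chosen here for illustration and are not taken from the paper itself; the testlet model shown is the familiar two-parameter logistic (2PL) variant, and the super-item model shown is the partial credit model, one typical choice for the polytomous-item approach.

```latex
% Illustrative model forms for the three approaches (notation assumed here).
% Person i, item j, testlet d(j) containing item j, binary response X_{ij}.

% (1) Standard 2PL IRT model, ignoring local dependence:
P(X_{ij} = 1 \mid \theta_i)
  = \frac{\exp\{a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}}

% (2) 2PL testlet response model: the person-specific testlet effect
%     \gamma_{i d(j)} absorbs the extra dependence among items in testlet d(j):
P(X_{ij} = 1 \mid \theta_i, \gamma_{i d(j)})
  = \frac{\exp\{a_j(\theta_i - b_j - \gamma_{i d(j)})\}}
         {1 + \exp\{a_j(\theta_i - b_j - \gamma_{i d(j)})\}}

% (3) Super-item (polytomous) approach: the testlet score S_{id} = 0, \dots, m_d
%     (number of correct responses in testlet d) is fitted with, for example,
%     the partial credit model with step parameters \delta_{dk}:
P(S_{id} = s \mid \theta_i)
  = \frac{\exp \sum_{k=0}^{s} (\theta_i - \delta_{dk})}
         {\sum_{r=0}^{m_d} \exp \sum_{k=0}^{r} (\theta_i - \delta_{dk})},
  \qquad \sum_{k=0}^{0} (\theta_i - \delta_{dk}) \equiv 0
```

Under the first approach the gamma term is simply absent; under the third, the item-level responses are recoded into testlet scores before estimation, which is what allows the within-testlet dependence to be absorbed into the polytomous category structure.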



Author information

Correspondence to Wen-Chung Wang.



Copyright information

© 2016 Springer Science+Business Media Singapore

About this paper

Cite this paper

Wang, WC., Jin, KY. (2016). Analyses of Testlet Data. In: Zhang, Q. (eds) Pacific Rim Objective Measurement Symposium (PROMS) 2015 Conference Proceedings. Springer, Singapore. https://doi.org/10.1007/978-981-10-1687-5_13

