
Methods from Item Response Theory: Going Beyond Traditional Validity and Reliability in Standardizing Assessments


In determining the effectiveness of educational interventions, the Gold Standard requires the use of tests and assessments of proven validity. Messick (1989) defined validity as “an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores” (p. 13). Education researchers wishing to evaluate the effectiveness of educational interventions and programs under the Gold Standard must either develop and validate their own tests and assessments or use ones developed and validated by others. As a result of the No Child Left Behind federal legislative mandate for Grades K-12 in the United States (NCLB, 2002), research on intervention programs that improve student learning in mathematics, reading, and science in Grades K-12 has one natural test of interest: the standardized examination used in the state to determine student proficiency status and school and district proficiency rates. Local school personnel and state education professionals are particularly interested in research showing improvements in student performance on these high-stakes tests. Other standardized assessments that can be used to show the effectiveness of an educational program or intervention are the National Assessment of Educational Progress (NAEP; US National Center for Education Statistics, n.d.), the ACT® (ACT, n.d.), and the SAT® (College Board, n.d.).

However, the use of state NCLB tests and these other assessments is precluded in many situations. For example, the educational program or intervention may target a subject area not covered by these assessments, such as history or foreign language study. Even if the subject area is mathematics, reading, or science, the goals of the intervention may not align with the underlying curriculum and goals of the NCLB tests in that subject area. For example, programs focusing on the development of problem-solving skills in mathematics may have different goals than the curriculum tested on the NAEP or the NCLB state assessment. These assessments would not be good measures of the effectiveness of this type of intervention program.
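The item response theory (IRT) methods referenced below (e.g., Birnbaum, 1968; Lord, 1980) model the probability of a correct response to a test item as a function of examinee ability and item parameters. As a minimal illustrative sketch, assuming the standard two-parameter logistic (2PL) form with hypothetical item parameters (neither the code nor the numbers come from the chapter), the following Python snippet computes response probabilities and Fisher information:

import math

def p_correct_2pl(theta, a, b):
    """Two-parameter logistic (2PL) item response function (Birnbaum, 1968):
    probability of a correct response given ability theta, item
    discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta:
    I(theta) = a^2 * P(theta) * (1 - P(theta))."""
    p = p_correct_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical (a, b) parameters for a three-item test.
items = [(1.2, -0.5), (0.8, 0.0), (1.5, 1.0)]

for theta in (-1.0, 0.0, 1.0):
    probs = [p_correct_2pl(theta, a, b) for a, b in items]
    test_info = sum(item_information_2pl(theta, a, b) for a, b in items)
    print(f"theta = {theta:+.1f}: P(correct) = {[round(p, 3) for p in probs]}, "
          f"test information = {test_info:.3f}")

Unlike a single test-wide reliability coefficient such as Cronbach's alpha (Cronbach, 1951), the test information function varies with ability: its reciprocal square root gives the standard error of measurement at each ability level (Lord, 1984), which is one sense in which IRT goes beyond traditional reliability.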


References

  • Ackerman, T. (1996). Graphical representation of multidimensional item response theory analyses. Applied Psychological Measurement, 20(4), 311–329.

  • ACT. (n.d.). Homepage. Retrieved July 11, 2008, from http://www.act.org/

  • Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 395–479). Reading, MA: Addison-Wesley.

  • Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443–459.

  • Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71(3), 425–440.

  • College Board. (n.d.). About the SAT. Retrieved May 15, 2008, from http://www.collegeboard.com/student/testing/sat/about.html

  • Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334.

  • Froelich, A. G. (2008). A new bias correction method for the DIMTEST procedure. Unpublished manuscript, Iowa State University, Ames.

  • Froelich, A. G., & Habing, B. (2008). Conditional covariance-based subtest selection for DIMTEST. Applied Psychological Measurement, 32(2), 138–155.

  • Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th edn., pp. 65–110). Westport, CT: American Council on Education & Praeger.

  • Humphreys, L. G. (1985). General intelligence: An integration of factor, test, and simplex theory. In B. B. Wolman (Ed.), Handbook of intelligence: Theories, measurements, and applications (pp. 201–224). New York: John Wiley & Sons.

  • Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th edn., pp. 17–64). Westport, CT: American Council on Education & Praeger.

  • Kim, H. R. (1994). New techniques for the dimensionality assessment of standardized test data. Unpublished doctoral dissertation, University of Illinois at Urbana-Champaign.

  • van der Linden, W. J., & Hambleton, R. K. (Eds.). (1997). Handbook of modern item response theory. New York: Springer.

  • Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.

  • Lord, F. M. (1984). Standard errors of measurement at different ability levels. Journal of Educational Measurement, 21(3), 239–243.

  • Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

  • Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd edn., pp. 13–103). New York: American Council on Education.

  • Mislevy, R. J., & Bock, R. D. (1984). Item operating characteristics of the Armed Services Aptitude Battery, Form 8A (Technical Report N00014-83-C-0283). Washington, DC: Office of Naval Research.

  • Mokken, R. J. (1971). A theory and procedure of scale analysis with applications in political research. The Hague, The Netherlands: Mouton.

  • Molenaar, I. W., & Sijtsma, K. (2000). User's manual MSP5 for Windows. Groningen, The Netherlands: iec ProGAMMA.

  • Nandakumar, R., & Stout, W. F. (1993). Refinements of Stout's procedure for assessing latent trait unidimensionality. Journal of Educational Statistics, 18(1), 41–68.

  • No Child Left Behind Act of 2001. Pub. L. No. 107–110, 115 Stat. 1425. (2002).

  • Reckase, M. D. (1997). A linear logistic multidimensional model for dichotomous item response data. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 271–286). New York: Springer.

  • Roussos, L. A., Stout, W. F., & Marden, J. I. (1998). Using new proximity measures with hierarchical cluster analysis to detect multidimensionality. Journal of Educational Measurement, 35(1), 1–30.

  • Sijtsma, K., & Molenaar, I. W. (2002). Introduction to nonparametric item response theory. Thousand Oaks, CA: Sage.

  • Stout, W. F. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52(4), 589–617.

  • Stout, W. F. (1990). A new item response theory modeling approach with applications to unidimensionality assessment and ability estimation. Psychometrika, 55(2), 293–325.

  • Stout, W. F., Froelich, A. G., & Gao, F. (2001). Using resampling to produce an improved DIMTEST procedure. In A. Boomsma, M. A. J. van Duijn, & T. A. B. Snijders (Eds.), Essays on item response theory (pp. 357–375). Dordrecht, The Netherlands: Springer.

  • Stout, W. F., Habing, B., Douglas, J., Kim, H. R., Roussos, L., & Zhang, J. (1996). Conditional covariance-based nonparametric multidimensionality assessment. Applied Psychological Measurement, 20(4), 331–354.

  • Traub, R. E. (1994). Reliability for the social sciences: Theory and applications. Thousand Oaks, CA: Sage.

  • United States National Center for Education Statistics. (n.d.). NAEP: The nation's report card. Retrieved July 11, 2008, from http://nces.ed.gov/nationsreportcard/

  • Zhang, J., & Stout, W. F. (1999a). Conditional covariance structure of generalized compensatory multidimensional items. Psychometrika, 64(2), 129–152.

  • Zhang, J., & Stout, W. F. (1999b). The theoretical DETECT index of dimensionality and its application to approximate simple structure. Psychometrika, 64(2), 213–249.

  • Zimowski, M., Muraki, E., Mislevy, R. J., & Bock, R. D. (2007). BILOG-MG 3 [Computer software]. Mooresville, IN: Scientific Software International. Available from http://www.ssicentral.com/irt/index.html


Author information

Correspondence to Amy G. Froelich.


Copyright information

© 2009 Springer Science+Business Media B.V.

Cite this chapter

Froelich, A.G. (2009). Methods from Item Response Theory: Going Beyond Traditional Validity and Reliability in Standardizing Assessments. In: Shelley, M.C., Yore, L.D., Hand, B. (eds) Quality Research in Literacy and Science Education. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-8427-0_14
