
Scoring and Scaling Educational Tests

  • Michael J. Kolen
  • Ye Tong
  • Robert L. Brennan
Chapter
Part of the Statistics for Social and Behavioral Sciences book series (SSBS)

Abstract

The numbers associated with examinee performance on educational or psychological tests are defined through the process of scaling. This process produces a score scale, and the scores reported to examinees are referred to as scale scores. Kolen (2006) used the term primary score scale, the focus of this chapter, for the scale that underlies the psychometric properties of a test.

A key component in developing a score scale is the examinee's raw score on the test, which is a function of that examinee's item scores. Raw scores range from a simple sum of the item scores to more complicated scores that depend on the entire pattern of item responses.
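For concreteness, the simplest case can be sketched in a few lines of Python; the function name `summed_raw_score` and the example item scores are hypothetical illustrations, not taken from the chapter.

```python
import numpy as np

def summed_raw_score(item_scores):
    """Simplest raw score: the sum of an examinee's item scores."""
    return float(np.sum(item_scores))

# Hypothetical examinee: five dichotomous items (0/1) and one item scored 0-4
print(summed_raw_score([1, 0, 1, 1, 0, 3]))  # -> 6.0
```

Pattern-based raw scores, by contrast, require a scoring model (for example, an IRT model) applied to the full vector of item responses rather than to their sum.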

Raw scores are transformed to scale scores to enhance the meaning of scores for test users. For example, raw scores might be transformed so that the resulting scale scores have predefined distributional properties for a particular group of examinees, referred to as a norm group. Normative information might be incorporated by constructing scale scores that are approximately normally distributed with a mean of 50 and a standard deviation of 10 for a national population of examinees. In addition, procedures can be used to incorporate content and score precision information into score scales.
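As a rough illustration of such a normative transformation, the sketch below (Python, with NumPy and SciPy assumed; the function `normalized_scale_scores` is hypothetical) converts percentile ranks in a norm group to unit-normal deviates and rescales them to a mean of 50 and a standard deviation of 10. It is a simplified version of the traditional normalized-score procedure, not the chapter's specific method.

```python
import numpy as np
from scipy.stats import norm

def normalized_scale_scores(raw_scores, target_mean=50.0, target_sd=10.0):
    """Map raw scores to scale scores that are approximately normally
    distributed, with the target mean and SD, in this norm group."""
    raw = np.asarray(raw_scores, dtype=float)
    # Percentile rank of each raw score: proportion scoring below it,
    # plus half the proportion obtaining exactly that score (handles ties
    # and keeps the rank strictly between 0 and 1).
    pct = np.array([np.mean(raw < x) + 0.5 * np.mean(raw == x) for x in raw])
    z = norm.ppf(pct)                       # unit-normal deviates
    return target_mean + target_sd * z      # linear rescaling

# Hypothetical norm group: summed scores on a simulated 40-item test
rng = np.random.default_rng(0)
raw = rng.binomial(n=40, p=0.6, size=1000)
scale = normalized_scale_scores(raw)
print(round(scale.mean(), 1), round(scale.std(), 1))  # approximately 50.0 10.0
```

In practice, the transformation is developed once in the norm group and then applied as a fixed raw-to-scale conversion for all subsequent examinees.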

Keywords

Scale score · Item response theory · Item score · Item type · Item response theory model

References

  1. ACT. (2001). EXPLORE technical manual. Iowa City, IA: Author.
  2. Allen, N. L., Carlson, J. E., & Zelenak, C. A. (1999). The NAEP 1996 technical report. Washington, DC: National Center for Education Statistics.
  3. Angoff, W. H. (1984). Scales, norms, and equivalent scores. Princeton, NJ: ETS. (Reprinted from Educational measurement, 2nd ed., pp. 508–600, by R. L. Thorndike, Ed., 1971, Washington, DC: American Council on Education)
  4. Ban, J.-C., & Lee, W.-C. (2007). Defining a score scale in relation to measurement error for mixed format tests (CASMA Research Report No. 24). Iowa City, IA: Center for Advanced Studies in Measurement and Assessment.
  5. Bock, R. D., Thissen, D., & Zimowski, M. F. (1997). IRT estimation of domain scores. Journal of Educational Measurement, 34(3), 197–211.
  6. Drasgow, F., Luecht, R. M., & Bennett, R. E. (2006). Technology and testing. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 471–515). Westport, CT: American Council on Education and Praeger.
  7. Ebel, R. L. (1962). Content standard test scores. Educational and Psychological Measurement, 22(1), 15–25.
  8. Flanagan, J. C. (1951). Units, scores, and norms. In E. F. Lindquist (Ed.), Educational measurement (pp. 695–763). Washington, DC: American Council on Education.
  9. Freeman, M. F., & Tukey, J. W. (1950). Transformations related to the angular and the square root. Annals of Mathematical Statistics, 21(4), 607–611.
  10. Hambleton, R. K., & Pitoniak, M. J. (2006). Setting performance standards. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 433–470). Westport, CT: American Council on Education and Praeger.
  11. Holland, P. W., & Dorans, N. J. (2006). Linking and equating. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 187–220). Westport, CT: American Council on Education and Praeger.
  12. Iowa Tests of Educational Development. (1958). Manual for the school administrator (Rev. ed.). Iowa City: State University of Iowa.
  13. Kolen, M. J. (1988). Defining score scales in relation to measurement error. Journal of Educational Measurement, 25(2), 97–110.
  14. Kolen, M. J. (2006). Scaling and norming. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 155–186). Westport, CT: American Council on Education and Praeger.
  15. Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York, NY: Springer-Verlag.
  16. Kolen, M. J., Hanson, B. A., & Brennan, R. L. (1992). Conditional standard errors of measurement for scale scores. Journal of Educational Measurement, 29(4), 285–307.
  17. Kolen, M. J., Zeng, L., & Hanson, B. A. (1996). Conditional standard errors of measurement for scale scores using IRT. Journal of Educational Measurement, 33(2), 129–140.
  18. Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
  19. Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile observed-score “equatings.” Applied Psychological Measurement, 8(4), 453–461.
  20. McCall, W. A. (1939). Measurement: A revision of how to measure in education. New York, NY: Macmillan.
  21. Muraki, E. (1993). Information functions of the generalized partial credit model. Applied Psychological Measurement, 14(4), 351–363.
  22. Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221–262). New York, NY: Macmillan.
  23. Pommerich, M., Nicewander, W. A., & Hanson, B. A. (1999). Estimating average domain scores. Journal of Educational Measurement, 36(3), 199–216.
  24. Rodriguez, M. C. (2003). Construct equivalence of multiple-choice and constructed-response items: A random effects synthesis of correlations. Journal of Educational Measurement, 40(2), 163–184.
  25. Rosa, K., Swygert, K. A., Nelson, L., & Thissen, D. (2001). Item response theory applied to combinations of multiple-choice and constructed-response items—Scale scores for patterns of summed scores. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 253–292). Mahwah, NJ: Erlbaum.
  26. Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 34(4, Pt. 1).
  27. Thissen, D., Nelson, L., & Swygert, K. A. (2001). Item response theory applied to combinations of multiple-choice and constructed-response items—Approximation methods for scale scores. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 293–341). Mahwah, NJ: Erlbaum.
  28. Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two categories. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 73–140). Mahwah, NJ: Erlbaum.
  29. Thissen, D., Pommerich, M., Billeaud, K., & Williams, V. S. L. (1995). Item response theory for scores on tests including polytomous items with ordered responses. Applied Psychological Measurement, 19(1), 39–49.
  30. Thissen, D., & Wainer, H. (Eds.). (2001). Test scoring. Mahwah, NJ: Erlbaum.
  31. Tong, Y., & Kolen, M. J. (2005). Assessing equating results on different equating criteria. Applied Psychological Measurement, 29(6), 418–432.
  32. Tong, Y., & Kolen, M. J. (2007). Comparisons of methodologies and results in vertical scaling for educational achievement tests. Applied Measurement in Education, 20(2), 227–253.
  33. van der Linden, W. J., & Hambleton, R. K. (Eds.). (1997). Handbook of modern item response theory. New York, NY: Springer-Verlag.
  34. Wainer, H., & Thissen, D. (1993). Combining multiple-choice and constructed-response test scores: Toward a Marxist theory of test construction. Applied Measurement in Education, 6(2), 103–118.
  35. Wang, T., Kolen, M. J., & Harris, D. J. (2000). Psychometric properties of scale scores and performance levels for performance assessments using polytomous IRT. Journal of Educational Measurement, 37(2), 141–162.
  36. Yen, W., & Fitzpatrick, A. R. (2006). Item response theory. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 111–153). Westport, CT: American Council on Education and Praeger.
  37. Zwick, R., Senturk, D., Wang, J., & Loomis, S. C. (2001). An investigation of alternative methods for item mapping in the National Assessment of Educational Progress. Educational Measurement: Issues and Practice, 20(2), 15–25.

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Michael J. Kolen (1)
  • Ye Tong (2)
  • Robert L. Brennan (3)

  1. University of Iowa, Iowa City, USA
  2. Iowa City, USA
  3. University of Iowa, Iowa City, USA
