Abstract
This chapter is devoted to score scales for tests. We discuss different scaling perspectives and describe the linear and nonlinear transformations used to construct score scales. We consider procedures for enhancing the meaning of scale scores, including the incorporation of normative, content, and score precision information. We then discuss procedures for maintaining score scales and for constructing scales for batteries and composites. We conclude with a section on vertical scaling that considers scaling designs and psychometric methods and reviews research on vertical scaling.
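As an illustrative sketch (not taken from the chapter), a common linear transformation fixes the mean and standard deviation of scale scores in a norming group: scale = A·raw + B, with A = target SD / raw SD and B = target mean − A·(raw mean). The function name and the example numbers below are the author's own, chosen only for illustration.

```python
import statistics

def linear_scale(raw_scores, target_mean, target_sd):
    """Linearly transform raw scores so the norming group attains a
    chosen scale-score mean and standard deviation (illustrative only)."""
    mean_raw = statistics.mean(raw_scores)
    sd_raw = statistics.pstdev(raw_scores)  # population SD of the norming group
    a = target_sd / sd_raw                  # slope of the linear transformation
    b = target_mean - a * mean_raw          # intercept
    # Scale scores are typically reported as rounded integers
    return [round(a * x + b) for x in raw_scores]

# A norming group with raw mean 20, placed on a scale with mean 100, SD 15
print(linear_scale([10, 20, 30], 100, 15))  # → [82, 100, 118]
```

Nonlinear transformations (e.g., normalized scores) follow the same logic but replace the linear map with one based on percentile ranks of the norming group.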
References
ACT. (2007). The ACT technical manual. Iowa City, IA: Author.
Allen, N. L., Carlson, J. E., & Zelenak, C. A. (1999). The NAEP 1996 technical report. Washington, DC: National Center for Education Statistics.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Andrews, K. M. (1995). The effects of scaling design and scaling method on the primary score scale associated with a multi-level achievement test. Unpublished doctoral dissertation, The University of Iowa, Iowa City.
Angoff, W. H. (1962). Scales with nonmeaningful origins and units of measurement. Educational & Psychological Measurement, 22, 27–34.
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508–600). Washington, DC: American Council on Education.
Ballou, D. (2009). Test scaling and value-added measurement. Education Finance and Policy, 4, 351–383.
Ban, J., & Lee, W. (2007). Defining a score scale in relation to measurement error for mixed format tests (CASMA Research Report Number 24). Iowa City, IA: Center for Advanced Studies in Measurement and Assessment.
Beaton, A. E., & Allen, N. L. (1992). Interpreting scales through scale anchoring. Journal of Educational Statistics, 17, 191–204.
Becker, D. F., & Forsyth, R. A. (1992). An empirical investigation of Thurstone and IRT methods of scaling achievement tests. Journal of Educational Measurement, 29, 341–354.
Betebenner, D. (2009). Norm-and criterion-referenced student growth. Educational Measurement: Issues and Practice, 28(4), 42–51.
Blanton, H., & Jaccard, J. (2006a). Arbitrary metrics in psychology. American Psychologist, 61, 27.
Blanton, H., & Jaccard, J. (2006b). Arbitrary metrics redux. American Psychologist, 61, 62.
Bock, R. D. (1983). The mental growth curve reexamined. In D. J. Weiss (Ed.), New horizons in testing (pp. 205–209). New York: Academic Press.
Bock, R. D., Mislevy, R., & Woodson, C. (1982). The next stage in educational assessment. Educational Researcher, 11(3), 4–11, 16.
Bock, R. D., Thissen, D., & Zimowski, M. F. (1997). IRT estimation of domain scores. Journal of Educational Measurement, 34, 197–211.
Bourque, M. L. (1996). Mathematics assessment. In N. L. Allen, J. E. Carlson, & C. A. Zelenak (Eds.), The NAEP 1996 technical report. Washington, DC: National Center for Education Statistics.
Bourque, M. L. (1996). NAEP Science assessment. In N. L. Allen, J. E. Carlson, & C. A. Zelenak (Eds.), The NAEP 1996 technical report. Washington, DC: National Center for Education Statistics.
Braun, H. I. (1988). Understanding scoring reliability: Experiments in calibrating essay readers. Journal of Educational Statistics, 13, 1–18.
Brennan, R. L. (Ed.). (1989). Methodology used in scaling the ACT Assessment and P-ACT+. Iowa City, IA: American College Testing.
Brennan, R. L. (2001). Generalizability theory. New York: Springer.
Brennan, R. L. (2011). Utility indexes for decisions about subscores (CASMA Research Report Number 33). Iowa City: University of Iowa.
Brennan, R. L., & Lee, W. (1999). Conditional scale-score standard errors of measurement under binomial and compound binomial assumptions. Educational and Psychological Measurement, 59(1), 5–24.
Briggs, D. C., & Weeks, J. P. (2009a). The impact of vertical scaling decisions on growth interpretations. Educational Measurement: Issues and Practice, 28(4), 3–14.
Briggs, D. C., & Weeks, J. P. (2009b). The sensitivity of value-added modeling to the creation of a vertical score scale. Education Finance and Policy, 4, 384–414.
Brookhart, S. M. (2009). Editorial. Educational Measurement: Issues and Practice, 28(4), 1–2.
Burket, G. R. (1984). Response to Hoover. Educational Measurement: Issues and Practice, 3(4), 15–16.
Camilli, G. (1988). Scale shrinkage and the estimation of latent distribution parameters. Journal of Educational Statistics, 13, 227–241.
Camilli, G. (1999). Measurement error, multidimensionality, and scale shrinkage: A reply to Yen and Burket. Journal of Educational Measurement, 36, 73–78.
Camilli, G., Yamamoto, K., & Wang, M. (1993). Scale shrinkage in vertical equating. Applied Psychological Measurement, 17, 379–388.
Carlson, J. E. (2011). Statistical models for vertical linking. In A. A. von Davier (Ed.), Statistical models for test equating, scaling, and linking (pp. 59–70). New York: Springer.
Chang, S. W. (2006). Methods in scaling the basic competence test. Educational and Psychological Measurement, 66, 907–929.
Cizek, G. J. (2001). Setting performance standards: Concepts, methods, and perspectives. Mahwah, NJ: Erlbaum.
Cizek, G. J. (2005). Adapting testing technology to serve accountability aims: The case of vertically moderated standard setting. Applied Measurement in Education, 18, 1–9.
Clemans, W. V. (1993). Item response theory, vertical scaling, and something’s awry in the state of test mark. Educational Assessment, 1, 329–347.
Clemans, W. V. (1996). Reply to Yen, Burket, and Fitzpatrick. Educational Assessment, 3, 192–206.
Cook, L. L. (1994). Recentering the SAT score scale: An overview and some policy considerations. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans.
Coombs, C. H., Dawes, R. M., & Tversky, A. (1970). Mathematical psychology: An elementary introduction. Englewood Cliffs, NJ: Prentice-Hall.
Council of Chief State School Officers (CCSSO) & National Governors Association (NGA). (2010). Common core state standards initiative. Washington, DC: Authors.
Custer, M., Omar, M. H., & Pomplun, M. (2006). Vertical scaling with the Rasch model utilizing default and tight convergence settings with WINSTEPS and BILOG-MG. Applied Measurement in Education, 19, 133–149.
de la Torre, J., & Patz, R. J. (2005). Making the most of what we have: A practical application of multidimensional item response theory in test scoring. Journal of Educational and Behavioral Statistics, 30, 295–311.
de la Torre, J., Song, H., & Hong, Y. (2011). A comparison of four methods of IRT subscoring. Applied Psychological Measurement, 35, 296–316.
Donlon, T. (Ed.). (1984). The College Board technical handbook for the Scholastic Aptitude Test and Achievement Tests. New York: College Entrance Examination Board.
Donoghue, J. R. (1996, April). Issues in item mapping: The maximum category information criterion and item mapping procedures for a composite scale. Paper presented at the Annual Meeting of the American Educational Research Association, New York.
Dorans, N. J. (2002). Recentering and realigning the SAT score distributions: How and why. Journal of Educational Measurement, 39, 59–84.
Drasgow, F., Luecht, R. M., & Bennett, R. E. (2006). Technology and testing. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 471–515). Westport, CT: American Council on Education and Praeger.
Ebel, R. L. (1962). Content standard test scores. Educational and Psychological Measurement, 22, 15–25.
Edwards, M. C., & Vevea, J. L. (2006). An empirical Bayes approach to subscore augmentation: How much strength can we borrow? Journal of Educational and Behavioral Statistics, 31, 241–259.
Embretson, S. E. (2006). The continued search for nonarbitrary metrics in psychology. American Psychologist, 61, 50–55.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Ercikan, K., Schwarz, R. D., Julian, M. W., Burket, G. R., Weber, M. M., & Link, V. (1998). Calibration and scoring of tests with multiple-choice and constructed-response item types. Journal of Educational Measurement, 35, 137–154.
Feldt, L. S. (1997). Can validity rise when reliability declines? Applied Measurement in Education, 10, 377–387.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). New York: Macmillan.
Feldt, L. S., & Qualls, A. L. (1998). Approximating scale score standard error of measurement from the raw score standard error. Applied Measurement in Education, 11, 159–177.
Flanagan, J. C. (1951). Units, scores, and norms. In E. F. Lindquist (Ed.), Educational measurement (pp. 695–763). Washington, DC: American Council on Education.
Forsyth, R. A. (1991). Do NAEP scales yield valid criterion-referenced interpretations? Educational Measurement: Issues and Practice, 10(3), 3–9, 16.
Forsyth, R., Saisangjan, U., & Gilmer, J. (1981). Some empirical results related to the robustness of the Rasch model. Applied Psychological Measurement, 5, 175–186.
Freeman, M. F., & Tukey, J. W. (1950). Transformations related to the angular and square root. Annals of Mathematical Statistics, 21, 607–611.
Gardner, E. F. (1962). Normative standard scores. Educational and Psychological Measurement, 22, 7–14.
Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.
Guskey, T. R. (1981). Comparison of a Rasch model scale and the grade-equivalent scale for vertical equating of test scores. Applied Psychological Measurement, 5, 187–201.
Gustafsson, J.-E. (1979). The Rasch model in vertical equating of tests: A critique of Slinde and Linn. Journal of Educational Measurement, 16, 153–158.
Guttman, L. (1944). A basis for scaling qualitative data. American Sociological Review, 9, 139–150.
Haberman, S. J. (2008a). Subscores and validity. (Research Report 08–64). Princeton, NJ: Educational Testing Service.
Haberman, S. J. (2008b). When can subscores have value? Journal of Educational and Behavioral Statistics, 33, 204–229.
Haberman, S. J., & Sinharay, S. (2010). Reporting of subscores using multidimensional item response theory. Psychometrika, 75, 209–227.
Haberman, S. J., Sinharay, S., & Puhan, G. (2009). Reporting subscores for institutions. British Journal of Mathematical and Statistical Psychology, 62, 79–95.
Haertel, E. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 65–110). Westport, CT: American Council on Education and Praeger.
Hambleton, R. K., Brennan, R. L., Brown, W., Dodd, B., Forsyth, R. A., Mehrens, W. A., et al. (2000). A response to “Setting reasonable and useful performance standards” in the National Academy of Sciences’ Grading the Nation’s Report Card. Educational Measurement: Issues and Practice, 19(2), 5–13.
Hambleton, R. K., & Pitoniak, M. J. (2006). Setting performance standards. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 433–470). Westport, CT: American Council on Education and Praeger.
Hanson, B. A. (2002). IRT command language (Version 0.020301, March 1, 2002). Monterey, CA: Author. http://www.b-a-h.com/software/irt/icl/index.html
Harris, D. J. (1991). A comparison of Angoff’s Design I and Design II for vertical equating using traditional and IRT methodology. Journal of Educational Measurement, 28, 221–235.
Harris, D. J. (2007). Practical issues in vertical scaling. In N. J. Dorans, M. Pommerich, & P. W. Holland (Eds.), Linking and aligning scores and scales (pp. 233–251). New York: Springer.
Harris, D. J., & Hoover, H. D. (1987). An application of the three-parameter IRT model to vertical equating. Applied Psychological Measurement, 11, 151–159.
Hendrickson, A. B., Cao, Y., Chae, S. E., & Li, D. (2006, April). Effect of base year on IRT vertical scaling from the common-item design. Paper presented at the annual meeting of the National Council for Measurement in Education, San Francisco, CA.
Hendrickson, A. B., Kolen, M. J., & Tong, Y. (2004, April). Comparison of IRT vertical scaling from scaling-test and common item designs. Paper presented at the annual meeting of the National Council for Measurement in Education, San Diego, CA.
Hendrickson, A. B., Wei, H., & Kolen, M. J. (2005, April). Dichotomous and polytomous scoring for IRT vertical scaling from scaling-test and common-item designs. Paper presented at the annual meeting of the National Council for Measurement in Education, Montreal, Canada.
Ho, A. D. (2009). A nonparametric framework for comparing trends and gaps across tests. Journal of Educational and Behavioral Statistics, 34, 201–228.
Ho, A. D., Lewis, D. M., & MacGregor Farris, J. L. (2009). The dependence of growth-model results on proficiency cut scores. Educational Measurement: Issues and Practice, 28(4), 15–26.
Holland, P. W. (2002). Two measures of change in the gaps between CDFs of test-score distributions. Journal of Educational & Behavioral Statistics, 27, 3–18.
Holmes, S. E. (1982). Unidimensionality and vertical equating with the Rasch model. Journal of Educational Measurement, 19, 139–147.
Hoover, H. D. (1984a). The most appropriate scores for measuring educational development in the elementary schools: GE’s. Educational Measurement: Issues & Practice, 3(4), 8–14.
Hoover, H. D. (1984b). Rejoinder to Burket. Educational Measurement: Issues and Practice, 3(4), 16–18.
Hoover, H. D. (1988). Growth expectations for low-achieving students: A reply to Yen. Educational Measurement: Issues and Practice, 7(4), 21–23.
Hoover, H. D., Dunbar, S. D., & Frisbie, D. A. (2003). The Iowa tests. Guide to development and research. Itasca, IL: Riverside Publishing.
Hoskens, M., Lewis, D. M., & Patz, R. J. (2003, April). Maintaining vertical scales using a common item design. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.
Humphry, S. M. (2011). The role of the unit in physics and psychometrics. Measurement: Interdisciplinary Research & Perspective, 9, 1–24.
Huynh, H. (1998). On score locations of binary and partial credit items and their applications to item mapping and criterion-referenced interpretation. Journal of Educational and Behavioral Statistics, 23, 35–56.
Huynh, H. (2006). A clarification on the response probability criterion RP67 for standard settings based on bookmark and item mapping. Educational Measurement: Issues and Practice, 25(2), 19–20.
Iowa Tests of Educational Development. (1958). Manual for school administrators. 1958 revision. Iowa City, IA: University of Iowa.
Ito, K., Sykes, R. C., & Yao, L. (2008). Concurrent and separate grade-groups linking procedures for vertical scaling. Applied Measurement in Education, 21, 187–206.
Jarjoura, D. (1985). Tolerance intervals for true scores. Journal of Educational Statistics, 10, 1–17.
Jodoin, M. G., Keller, L. A., & Swaminathan, H. (2003). A comparison of linear, fixed common item, and concurrent parameter estimation procedures in capturing academic growth. The Journal of Experimental Education, 71, 229–250.
Kahraman, N., & Thompson, T. (2011). Relating unidimensional IRT parameters to a multidimensional response space: A review of two alternative projection IRT models for scoring subscales. Journal of Educational Measurement, 48, 146–164.
Kane, M. T. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64, 425–461.
Kane, M. (2008). The benefits and limitations of formality. Measurement: Interdisciplinary Research & Perspective, 6, 101–108.
Kane, M., & Case, S. M. (2004). The reliability and validity of weighted composite scores. Applied Measurement in Education, 17, 221–240.
Kolen, M. J. (1981). Comparison of traditional and item response theory methods for equating tests. Journal of Educational Measurement, 18, 1–11.
Kolen, M. J. (1988). Defining score scales in relation to measurement error. Journal of Educational Measurement, 25, 97–110.
Kolen, M. J. (2001). Linking assessments effectively: Purpose and design. Educational Measurement: Issues and Practice, 20(1), 5–19.
Kolen, M. J. (2006). Scaling and norming. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 155–186). Westport, CT: American Council on Education and Praeger.
Kolen, M. J. (2011). Issues associated with vertical scales for PARCC assessments. Retrieved from Partnership for Assessment of Readiness for College and Careers (PARCC). http://www.parcconline.org/technical-advisory-committee
Kolen, M. J., & Hanson, B. A. (1989). Scaling the ACT Assessment. In R. L. Brennan (Ed.), Methodology used in scaling the ACT Assessment and P-ACT+ (pp. 35–55). Iowa City, IA: ACT Inc.
Kolen, M. J., Hanson, B. A., & Brennan, R. L. (1992). Conditional standard errors of measurement for scale scores. Journal of Educational Measurement, 29, 285–307.
Kolen, M. J., & Lee, W. (2011). Psychometric properties of raw and scale scores on mixed-format tests. Educational Measurement: Issues and Practice, 30(2), 15–24.
Kolen, M. J., & Tong, Y. (2010). Psychometric properties of IRT proficiency estimates. Educational Measurement: Issues and Practice, 29(3), 8–14.
Kolen, M. J., Tong, Y., & Brennan, R. L. (2011). Scoring and scaling educational tests. In A. A. von Davier (Ed.), Statistical models for test equating, scaling, and linking (pp. 43–58). New York: Springer.
Kolen, M. J., Wang, T., & Lee, W. (2012). Conditional standard errors of measurement for composite scores using IRT. International Journal of Testing, 12, 1–20.
Kolen, M. J., Zeng, L., & Hanson, B. A. (1996). Conditional standard errors of measurement for scale scores using IRT. Journal of Educational Measurement, 33, 129–140.
Lane, S., & Stone, C. A. (2006). Performance assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 387–431). Westport, CT: American Council on Education and Praeger.
Lee, W. (2007). Multinomial and compound multinomial error models for tests with complex item scoring. Applied Psychological Measurement, 31, 255–274.
Lee, W., Brennan, R. L., & Kolen, M. J. (2000). Estimators of conditional scale-score standard errors of measurement: A simulation study. Journal of Educational Measurement, 37, 1–20.
Lee, W., Brennan, R. L., & Kolen, M. J. (2006). Interval estimation for true raw and scale scores under the binomial error model. Journal of Educational and Behavioral Statistics, 31, 261–281.
Lei, P., & Zhao, Y. (2012). Effects of vertical scaling methods on linear growth estimation. Applied Psychological Measurement, 36, 21–39.
Li, Y., & Lissitz, R. W. (2012). Exploring the full-information bifactor model in vertical scaling with construct shift. Applied Psychological Measurement, 36, 3–20.
Lindquist, E. F. (1953). Selecting appropriate score scales for tests. Proceedings of the 1952 Invitational Conference on Testing Problems (pp. 34–40). Princeton, NJ: Educational Testing Service.
Lissitz, R. W., & Huynh, H. (2003). Vertical equating for state assessments: Issues and solutions in determination of adequate yearly progress and school accountability. Practical Assessment, Research & Evaluation, 8(10), 1–8.
Livingston, S. A., & Zieky, M. J. (1982). Passing scores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: Educational Testing Service.
Lohman, D. F., & Hagen, E. P. (2002). Cognitive abilities test. Form 6. Research handbook. Itasca, IL: Riverside Publishing.
Lord, F. M. (1965). A strong true score theory with applications. Psychometrika, 30, 239–270.
Lord, F. M. (1969). Estimating true-score distributions in psychological testing (an empirical Bayes estimation problem). Psychometrika, 34, 259–299.
Lord, F. M. (1975). Automated hypothesis tests and standard errors for nonstandard problems. The American Statistician, 29, 56–59.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Lord, F. M. (1986). Maximum likelihood and Bayesian parameter estimation in item response theory. Journal of Educational Measurement, 23, 157–162.
Loyd, B. H., & Hoover, H. D. (1980). Vertical equating using the Rasch Model. Journal of Educational Measurement, 17, 179–193.
Lyren, P. (2009). Reporting subscores from college admission tests. Practical Assessment, Research & Evaluation, 14(4), 3–12.
Martineau, J. A. (2006). Distorting value added: The use of longitudinal, vertically scaled student achievement data for growth-based, value-added accountability. Journal of Educational and Behavioral Statistics, 31, 35–62.
McCaffrey, D. F., Koretz, D., Lockwood, J. R., & Hamilton, L. S. (2004). Evaluating value-added models for teacher accountability. Santa Monica, CA: Rand.
McCall, W. A. (1939). Measurement. New York, NY: Macmillan.
Michell, J. (2008). Is psychometrics pathological science? Measurement: Interdisciplinary Research & Perspective, 6, 7–24.
Mislevy, R. J. (1987). Recent developments in item response theory with implications for teacher certification. In E. Z. Rothkopf (Ed.), Review of research in education (Vol. 14, pp. 239–275). Washington, DC: American Educational Research Association.
Mittman, A. (1958). An empirical study of methods of scaling achievement tests at the elementary grade level. Unpublished Doctoral Dissertation, The University of Iowa, Iowa City.
Moses, T., & Golub-Smith, M. (2011). A scaling method that produces scale score distributions with specific skewness and kurtosis (Research Memorandum 11–04). Princeton, NJ: Educational Testing Service.
Nitko, A. J. (1984). Defining “criterion-referenced test”. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 9–28). Baltimore, MD: Johns Hopkins.
Omar, M. H. (1996). An investigation into the reasons item response theory scales show smaller variability for higher achieving groups (Iowa Testing Programs Occasional Papers Number 39). Iowa City, IA: University of Iowa.
Omar, M. H. (1997, March). An investigation into the reasons why IRT theta scale shrinks for higher achieving groups. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.
Omar, M. H. (1998, April). Item parameter invariance assumption and its implications on vertical scaling of multilevel achievement test data. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
O’Sullivan, C. Y., Reese, C. M., & Mazzeo, J. (1997). NAEP 1996 science report card for the Nation and the States. Washington, DC: National Center for Education Statistics.
Paek, I., & Young, M. J. (2005). Investigation of student growth recovery in a fixed-item linking procedure with a fixed-person prior distribution for mixed-format test data. Applied Measurement in Education, 18, 199–215.
Patz, R. J. (2007). Vertical scaling in standards-based educational assessment and accountability systems. Washington, DC: Technical Issues in Large Scale Assessment (TILSA) State Collaborative on Assessment and Student Standards (SCASS) of the Council of Chief State School Officers (CCSSO).
Patz, R. J., & Yao, L. (2007a). Vertical scaling: Statistical models for measuring growth and achievement. In C. R. Rao & S. Sinharay (Eds.), Handbook of Statistics: Psychometrics (Vol. 26, pp. 955–975). Amsterdam: Elsevier.
Patz, R. J., & Yao, L. (2007b). Methods and models for vertical scaling. In N. J. Dorans, M. Pommerich, & P. W. Holland (Eds.), Linking and aligning scores and scales (pp. 252–272). New York: Springer.
Pellegrino, J. W., Jones, L. R., & Mitchell, K. J. (1999). Grading the nation’s report card: Evaluating NAEP and transforming the assessment of educational progress. Washington, DC: National Academy Press.
Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221–262). New York: Macmillan.
Phillips, S. E. (1983). Comparison of equipercentile and item response theory equating when the scaling test method is applied to a multilevel achievement battery. Applied Psychological Measurement, 7, 267–281.
Phillips, S. E. (1986). The effects of the deletion of misfitting persons on vertical equating via the Rasch model. Journal of Educational Measurement, 23, 107–118.
Phillips, S. E., & Clarizio, H. F. (1988a). Conflicting growth expectations cannot both be real: A rejoinder to Yen. Educational Measurement: Issues and Practice, 7(4), 18–19.
Phillips, S. E., & Clarizio, H. F. (1988b). Limitations of standard scores in individual achievement testing. Educational Measurement: Issues and Practice, 7(1), 8–15.
Pommerich, M. (2006). Validation of group domain score estimates using a test of domain. Journal of Educational Measurement, 43, 97–111.
Pommerich, M., Nicewander, W. A., & Hanson, B. A. (1999). Estimating average domain scores. Journal of Educational Measurement, 36, 199–216.
Pomplun, M., Omar, M. H., & Custer, M. (2004). A comparison of WINSTEPS and BILOG-MG for vertical scaling with the Rasch model. Educational and Psychological Measurement, 64, 600–616.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1989). Numerical recipes. The art of scientific computing (Fortran version). Cambridge, UK: Cambridge University Press.
Puhan, G., & Liang, L. (2011). Equating subscores under the nonequivalent anchor test (NEAT) design. Educational Measurement: Issues and Practice, 30(1), 23–35.
Puhan, G., Sinharay, S., Haberman, S. J., & Larkin, K. (2008). Comparison of subscores based on classical test theory methods (Research Report 08–54). Princeton, NJ: Educational Testing Service.
Puhan, G., Sinharay, S., Haberman, S. J., & Larkin, K. (2010). The utility of augmented subscores in a licensure exam: An evaluation of methods using empirical data. Applied Measurement in Education, 23, 266–285.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.
Raudenbush, S. W. (2004). What are value-added models estimating and what does it imply for statistical practice. Journal of Educational and Behavioral Statistics, 29, 121–129.
Reckase, M. D. (1998). Converting boundaries between National Assessment Governing Board performance categories to points on the National Assessment of Educational Progress score scale: The 1996 science NAEP process. Applied Measurement in Education, 11, 9–21.
Reckase, M. D. (2000). The evolution of the NAEP achievement levels setting process: A summary of the research and development efforts conducted by ACT. Iowa City, IA: ACT Inc.
Reckase, M. D. (2009). Multidimensional item response theory. New York: Springer.
Reckase, M. D., & Martineau, J. (2004). The vertical scaling of science achievement tests. Paper commissioned by the Committee on Test Design for K-12 Science Achievement, Center for Education, National Research Council, National Academy of Sciences.
Rodriguez, M. C. (2003). Construct equivalence of multiple-choice and constructed-response items: A random effects synthesis of correlations. Journal of Educational Measurement, 40, 163–184.
Rosa, K., Swygert, K. A., Nelson, L., & Thissen, D. (2001). Item response theory applied to combinations of multiple-choice and constructed-response items: Scale scores for patterns of summed scores. In D. Thissen & H. Wainer (Eds.), Test scoring. Mahwah, NJ: Erlbaum.
Rudner, L. M. (2001). Informed test component weighting. Educational Measurement: Issues and Practice, 20(1), 16–19.
Schulz, E. M., & Nicewander, W. A. (1997). Grade equivalent and IRT representations of growth. Journal of Educational Measurement, 34, 315–331.
Seltzer, M. H., Frank, K. A., & Bryk, A. S. (1994). The metric matters: The sensitivity of conclusions about growth in student achievement to choice of metric. Educational Evaluation & Policy Analysis, 16, 41–49.
Sinharay, S. (2010). How often do subscores have added value? Results from operational and simulated data. Journal of Educational Measurement, 47, 150–174.
Sinharay, S., & Haberman, S. J. (2011). Equating of augmented subscores. Journal of Educational Measurement, 48, 122–145.
Sinharay, S., Haberman, S. J., & Lee, Y. (2011). When does scale anchoring work? A case study. Journal of Educational Measurement, 48, 61–80.
Sinharay, S., Haberman, S. J., & Puhan, G. (2007). Subscores based on classical test theory: To report or not to report. Educational Measurement: Issues and Practice, 26(4), 21–28.
Sinharay, S., Haberman, S. J., & Wainer, H. (2011). Do adjusted subscores lack validity? Don’t blame the messenger. Educational and Psychological Measurement, 71, 789–797.
Sinharay, S., Puhan, G., & Haberman, S. J. (2010). Reporting diagnostic scores in educational testing: Temptations, pitfalls, and some solutions. Multivariate Behavioral Research, 45, 553–573.
Sinharay, S., Puhan, G., & Haberman, S. J. (2011). An NCME instructional module on subscores. Educational Measurement: Issues and Practice, 30(3), 29–40.
Skaggs, G., & Lissitz, R. W. (1986a). An exploration of the robustness of four test equating models. Applied Psychological Measurement, 10, 303–317.
Skaggs, G., & Lissitz, R. W. (1986b). IRT test equating: Relevant issues and a review of recent research. Review of Educational Research, 56, 495–529.
Skaggs, G., & Lissitz, R. W. (1988). Effect of examinee ability on test equating invariance. Applied Psychological Measurement, 12, 69–82.
Skorupski, W. P., & Carvajal, J. (2010). A comparison of approaches for improving the reliability of objective level scores. Educational and Psychological Measurement, 70, 357–375.
Slinde, J. A., & Linn, R. L. (1977). Vertically equated tests: Fact or phantom? Journal of Educational Measurement, 14, 23–32.
Slinde, J. A., & Linn, R. L. (1978). An exploration of the adequacy of the Rasch model for the problem of vertical equating. Journal of Educational Measurement, 15, 23–35.
Slinde, J. A., & Linn, R. L. (1979a). A note on vertical equating via the Rasch model for groups of quite different ability and tests of quite different difficulty. Journal of Educational Measurement, 16, 159–165.
Slinde, J. A., & Linn, R. L. (1979b). The Rasch model, objective measurement, equating, and robustness. Applied Psychological Measurement, 3, 437–452.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677–680.
Stevens, S. S. (1951). Mathematics, measurement and psychophysics. In S. S. Stevens (Ed.), Handbook of experimental psychology (pp. 1–49). New York, NY: Wiley.
Stone, C. A., Ye, F., Zhu, X., & Lane, S. (2010). Providing subscale scores for diagnostic information: A case study when the test is essentially unidimensional. Applied Measurement in Education, 23, 63–86.
Suppes, P., & Zinnes, J. L. (1963). Basic measurement theory. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. I, pp. 1–76). New York, NY: Wiley.
Sykes, R. C., & Hou, L. (2003). Weighting constructed-response items in IRT-based exams. Applied Measurement in Education, 16, 257–275.
Sykes, R. C., & Yen, W. M. (2000). The scaling of mixed-item-format tests with the one-parameter and two-parameter partial credit models. Journal of Educational Measurement, 37, 221–244.
Tate, R. L. (2004). Implications of multidimensionality for total score and subscore performance. Applied Measurement in Education, 17, 89–112.
Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two categories. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 73–140). Mahwah, NJ: Erlbaum.
Thissen, D., Wainer, H., & Wang, X.-B. (1994). Are tests comprising both multiple-choice and free-response items necessarily less unidimensional than multiple-choice tests? An analysis of two tests. Journal of Educational Measurement, 31, 113–123.
Thomasson, G. L., Bloxom, B., & Wise, L. (1994). Initial operational test and evaluation of forms 20, 21, and 22 of the Armed Services Vocational Aptitude Battery (ASVAB) (DMDC Technical Report 94–001). Monterey, CA: Defense Manpower Data Center.
Thurstone, L. L. (1925). A method of scaling psychological and educational tests. Journal of Educational Psychology, 16, 433–451.
Thurstone, L. L. (1927). The unit of measurement in educational scales. Journal of Educational Psychology, 18, 505–524.
Thurstone, L. L. (1928). The absolute zero in intelligence measurement. Psychological Review, 35, 175–197.
Thurstone, L. L. (1938). Primary mental abilities. Psychometric Monographs, No. 1.
Thurstone, L. L., & Ackerson, L. (1929). The mental growth curve for the Binet tests. Journal of Educational Psychology, 20, 569–583.
Tong, Y., & Kolen, M. J. (2007). Comparisons of methodologies and results in vertical scaling for educational achievement tests. Applied Measurement in Education, 20, 227–253.
Tong, Y., & Kolen, M. J. (2008, March). Maintenance of vertical scales. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY.
Tong, Y., & Kolen, M. J. (2009, April). A further look into the maintenance of vertical scales. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
Tong, Y., & Kolen, M. J. (2010). Scaling: An ITEMS module. Educational Measurement: Issues and Practice, 29(4), 39–48.
Traub, R. E. (1993). On the equivalence of the traits assessed by multiple-choice and constructed-response tests. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement (pp. 29–44). Hillsdale, NJ: Erlbaum.
Wainer, H. (2004). Introduction to a special issue of the Journal of Educational and Behavioral Statistics on value-added assessment. Journal of Educational and Behavioral Statistics, 29, 1–3.
Wainer, H., & Thissen, D. (1993). Combining multiple-choice and constructed-response test scores: Toward a Marxist theory of test construction. Applied Measurement in Education, 6, 103–118.
Wainer, H., & Thissen, D. (2001). True score theory: The traditional method. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 23–72). Mahwah, NJ: Erlbaum.
Wang, M. W., & Stanley, J. C. (1970). Differential weighting: A review of methods and empirical studies. Review of Educational Research, 40, 663–704.
Wang, S., & Jiao, H. (2009). Construct equivalence across grades in a vertical scale for a K-12 large-scale reading assessment. Educational and Psychological Measurement, 69, 760–777.
Wang, T., Kolen, M. J., & Harris, D. J. (2000). Psychometric properties of scale scores and performance levels for performance assessments using polytomous IRT. Journal of Educational Measurement, 37, 141–162.
Wilcox, R. R. (1981). A review of the beta-binomial model and its extensions. Journal of Educational Statistics, 6, 3–32.
Wilks, S. S. (1938). Weighting systems for linear functions of correlated variables when there is no dependent variable. Psychometrika, 3, 23–40.
Williams, V. S. L., Pommerich, M., & Thissen, D. (1998). A comparison of developmental scales based on Thurstone methods and item response theory. Journal of Educational Measurement, 35, 93–107.
Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14, 97–116.
Yao, L., & Boughton, K. A. (2007). A multidimensional item response modeling approach for improving subscale proficiency estimation and classification. Applied Psychological Measurement, 31, 83–105.
Yen, W. M. (1985). Increasing item complexity: A possible cause of scale shrinkage for unidimensional item response theory. Psychometrika, 50, 399–410.
Yen, W. M. (1986). The choice of scale for educational measurement: An IRT perspective. Journal of Educational Measurement, 23, 299–325.
Yen, W. M. (1988). Normative growth expectations must be realistic: A response to Phillips and Clarizio. Educational Measurement: Issues and Practice, 7(4), 16–17.
Yen, W. M. (2007). Vertical scaling and No Child Left Behind. In N. J. Dorans, M. Pommerich, & P. W. Holland (Eds.), Linking and aligning scores and scales (pp. 273–283). New York, NY: Springer.
Yen, W. M., & Burket, G. R. (1997). Comparison of item response theory and Thurstone methods of vertical scaling. Journal of Educational Measurement, 34, 293–313.
Yen, W. M., Burket, G. R., & Fitzpatrick, A. R. (1996). Response to Clemans. Educational Assessment, 3, 181–190.
Young, M. J. (2006). Vertical scales. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 469–485). Mahwah, NJ: Erlbaum.
Zwick, R. (1992). Statistical and psychometric issues in the measurement of educational achievement trends: Examples from the National Assessment of Educational Progress. Journal of Educational Statistics, 17, 205–218.
Zwick, R., Senturk, D., Wang, J., & Loomis, S. C. (2001). An investigation of alternative methods for item mapping in the National Assessment of Educational Progress. Educational Measurement: Issues and Practice, 20, 15–25.
© 2014 Springer Science+Business Media New York
Kolen, M.J., Brennan, R.L. (2014). Score Scales. In: Test Equating, Scaling, and Linking. Statistics for Social and Behavioral Sciences. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-0317-7_9
Print ISBN: 978-1-4939-0316-0
Online ISBN: 978-1-4939-0317-7