Abstract
This chapter is devoted to score scales for tests. We discuss different scaling perspectives and describe the linear and nonlinear transformations used to construct score scales. We consider procedures for enhancing the meaning of scale scores, including the incorporation of normative, content, and score precision information. We then discuss procedures for maintaining score scales and for constructing scales for batteries and composites. We conclude with a section on vertical scaling that considers scaling designs and psychometric methods and reviews research on vertical scaling.
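As an illustrative sketch (not taken from the chapter), a common linear transformation fixes the mean and standard deviation of scale scores in a norming group: scale = A·raw + B, with A = target SD / raw SD and B = target mean − A·(raw mean). The function name and the example numbers below are the author's own, chosen only for illustration.

```python
import statistics

def linear_scale(raw_scores, target_mean, target_sd):
    """Linearly transform raw scores so the norming group attains a
    chosen scale-score mean and standard deviation (illustrative only)."""
    mean_raw = statistics.mean(raw_scores)
    sd_raw = statistics.pstdev(raw_scores)  # population SD of the norming group
    a = target_sd / sd_raw                  # slope of the linear transformation
    b = target_mean - a * mean_raw          # intercept
    # Scale scores are typically reported as rounded integers
    return [round(a * x + b) for x in raw_scores]

# A norming group with raw mean 20, placed on a scale with mean 100, SD 15
print(linear_scale([10, 20, 30], 100, 15))  # → [82, 100, 118]
```

Nonlinear transformations (e.g., normalized scores) follow the same logic but replace the linear map with one based on percentile ranks of the norming group.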
References
ACT. (2007). The ACT technical manual. Iowa City, IA: Author.
Allen, N. L., Carlson, J. E., & Zelenak, C. A. (1999). The NAEP 1996 technical report. Washington, DC: National Center for Education Statistics.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Andrews, K. M. (1995). The effects of scaling design and scaling method on the primary score scale associated with a multi-level achievement test. Unpublished doctoral dissertation, The University of Iowa, Iowa City.
Angoff, W. H. (1962). Scales with nonmeaningful origins and units of measurement. Educational & Psychological Measurement, 22, 27–34.
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508–600). Washington, DC: American Council on Education.
Ballou, D. (2009). Test scaling and value-added measurement. Education Finance and Policy, 4, 351–383.
Ban, J., & Lee, W. (2007). Defining a score scale in relation to measurement error for mixed format tests (CASMA Research Report Number 24). Iowa City, IA: Center for Advanced Studies in Measurement and Assessment.
Beaton, A. E., & Allen, N. L. (1992). Interpreting scales through scale anchoring. Journal of Educational Statistics, 17, 191–204.
Becker, D. F., & Forsyth, R. A. (1992). An empirical investigation of Thurstone and IRT methods of scaling achievement tests. Journal of Educational Measurement, 29, 341–354.
Betebenner, D. (2009). Norm-and criterion-referenced student growth. Educational Measurement: Issues and Practice, 28(4), 42–51.
Blanton, H., & Jaccard, J. (2006a). Arbitrary metrics in psychology. American Psychologist, 61, 27.
Blanton, H., & Jaccard, J. (2006b). Arbitrary metrics redux. American Psychologist, 61, 62.
Bock, R. D. (1983). The mental growth curve reexamined. In D. J. Weiss (Ed.), New horizons in testing (pp. 205–209). New York: Academic Press.
Bock, R. D., Mislevy, R., & Woodson, C. (1982). The next stage in educational assessment. Educational Researcher, 11(3), 4–11, 16.
Bock, R. D., Thissen, D., & Zimowski, M. F. (1997). IRT estimation of domain scores. Journal of Educational Measurement, 34, 197–211.
Bourque, M. L. (1996). Mathematics assessment. In N. L. Allen, J. E. Carlson, & C. A. Zelenak (Eds.), The NAEP 1996 technical report. Washington, DC: National Center for Education Statistics.
Bourque, M. L. (1996). NAEP Science assessment. In N. L. Allen, J. E. Carlson, & C. A. Zelenak (Eds.), The NAEP 1996 technical report. Washington, DC: National Center for Education Statistics.
Braun, H. I. (1988). Understanding scoring reliability: Experiments in calibrating essay readers. Journal of Educational Statistics, 13, 1–18.
Brennan, R. L. (Ed.). (1989). Methodology used in scaling the ACT Assessment and P-ACT+. Iowa City, IA: American College Testing.
Brennan, R. L. (2001). Generalizability theory. New York: Springer.
Brennan, R. L. (2011). Utility indexes for decisions about subscores (CASMA Research Report Number 33). Iowa City: University of Iowa.
Brennan, R. L., & Lee, W. (1999). Conditional scale-score standard errors of measurement under binomial and compound binomial assumptions. Educational and Psychological Measurement, 59(1), 5–24.
Briggs, D. C., & Weeks, J. P. (2009a). The impact of vertical scaling decisions on growth interpretations. Educational Measurement: Issues and Practice, 28(4), 3–14.
Briggs, D. C., & Weeks, J. P. (2009b). The sensitivity of value-added modeling to the creation of a vertical score scale. Education Finance and Policy, 4, 384–414.
Brookhart, S. M. (2009). Editorial. Educational Measurement: Issues and Practice, 28(4), 1–2.
Burket, G. R. (1984). Response to Hoover. Educational Measurement: Issues and Practice, 3(4), 15–16.
Camilli, G. (1988). Scale shrinkage and the estimation of latent distribution parameters. Journal of Educational Statistics, 13, 227–241.
Camilli, G. (1999). Measurement error, multidimensionality, and scale shrinkage: A reply to Yen and Burket. Journal of Educational Measurement, 36, 73–78.
Camilli, G., Yamamoto, K., & Wang, M. (1993). Scale shrinkage in vertical equating. Applied Psychological Measurement, 17, 379–388.
Carlson, J. E. (2011). Statistical models for vertical linking. In A. A. von Davier (Ed.), Statistical models for test equating, scaling, and linking (pp. 59–70). New York: Springer.
Chang, S. W. (2006). Methods in scaling the basic competence test. Educational and Psychological Measurement, 66, 907–929.
Cizek, G. J. (2001). Setting performance standards: Concepts, methods, and perspectives. Mahwah, NJ: Erlbaum.
Cizek, G. J. (2005). Adapting testing technology to serve accountability aims: The case of vertically moderated standard setting. Applied Measurement in Education, 18, 1–9.
Clemans, W. V. (1993). Item response theory, vertical scaling, and something’s awry in the state of test mark. Educational Assessment, 1, 329–347.
Clemans, W. V. (1996). Reply to Yen, Burket, and Fitzpatrick. Educational Assessment, 3, 192–206.
Cook, L. L. (1994). Recentering the SAT score scale: An overview and some policy considerations. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans.
Coombs, C. H., Dawes, R. M., & Tversky, A. (1970). Mathematical psychology: An elementary introduction. Englewood Cliffs, NJ: Prentice-Hall.
Council of Chief State School Officers (CCSSO) & National Governors Association (NGA). (2010). Common core state standards initiative. Washington, DC: Authors.
Custer, M., Omar, M. H., & Pomplun, M. (2006). Vertical scaling with the Rasch model utilizing default and tight convergence settings with WINSTEPS and BILOG-MG. Applied Measurement in Education, 19, 133–149.
de la Torre, J., & Patz, R. J. (2005). Making the most of what we have: A practical application of multidimensional item response theory in test scoring. Journal of Educational and Behavioral Statistics, 30, 295–311.
de la Torre, J., Song, H., & Hong, Y. (2011). A comparison of four methods of IRT subscoring. Applied Psychological Measurement, 35, 296–316.
Donlon, T. (Ed.). (1984). The College Board technical handbook for the Scholastic Aptitude Test and Achievement Tests. New York: College Entrance Examination Board.
Donoghue, J. R. (1996, April). Issues in item mapping: The maximum category information criterion and item mapping procedures for a composite scale. Paper presented at the Annual Meeting of the American Educational Research Association, New York.
Dorans, N. J. (2002). Recentering and realigning the SAT score distributions: How and why. Journal of Educational Measurement, 39, 59–84.
Drasgow, F., Luecht, R. M., & Bennett, R. E. (2006). Technology and testing. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 471–515). Westport, CT: American Council on Education and Praeger.
Ebel, R. L. (1962). Content standard test scores. Educational and Psychological Measurement, 22, 15–25.
Edwards, M. C., & Vevea, J. L. (2006). An empirical Bayes approach to subscore augmentation: How much strength can we borrow? Journal of Educational and Behavioral Statistics, 31, 241–259.
Embretson, S. E. (2006). The continued search for nonarbitrary metrics in psychology. American Psychologist, 61, 50–55.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Ercikan, K., Schwarz, R. D., Julian, M. W., Burket, G. R., Weber, M. M., & Link, V. (1998). Calibration and scoring of tests with multiple-choice and constructed-response item types. Journal of Educational Measurement, 35, 137–154.
Feldt, L. S. (1997). Can validity rise when reliability declines? Applied Measurement in Education, 10, 377–387.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). New York: Macmillan.
Feldt, L. S., & Qualls, A. L. (1998). Approximating scale score standard error of measurement from the raw score standard error. Applied Measurement in Education, 11, 159–177.
Flanagan, J. C. (1951). Units, scores, and norms. In E. F. Lindquist (Ed.), Educational measurement (pp. 695–763). Washington, DC: American Council on Education.
Forsyth, R. A. (1991). Do NAEP scales yield valid criterion-referenced interpretations? Educational Measurement: Issues and Practice, 10(3), 3–9, 16.
Forsyth, R., Saisangjan, U., & Gilmer, J. (1981). Some empirical results related to the robustness of the Rasch model. Applied Psychological Measurement, 5, 175–186.
Freeman, M. F., & Tukey, J. W. (1950). Transformations related to the angular and square root. Annals of Mathematical Statistics, 21, 607–611.
Gardner, E. F. (1962). Normative standard scores. Educational and Psychological Measurement, 22, 7–14.
Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.
Guskey, T. R. (1981). Comparison of a Rasch model scale and the grade-equivalent scale for vertical equating of test scores. Applied Psychological Measurement, 5, 187–201.
Gustafsson, J.-E. (1979). The Rasch model in vertical equating of tests: A critique of Slinde and Linn. Journal of Educational Measurement, 16, 153–158.
Guttman, L. (1944). A basis for scaling qualitative data. American Sociological Review, 9, 139–150.
Haberman, S. J. (2008a). Subscores and validity. (Research Report 08–64). Princeton, NJ: Educational Testing Service.
Haberman, S. J. (2008b). When can subscores have value? Journal of Educational and Behavioral Statistics, 33, 204–229.
Haberman, S. J., & Sinharay, S. (2010). Reporting of subscores using multidimensional item response theory. Psychometrika, 75, 209–227.
Haberman, S. J., Sinharay, S., & Puhan, G. (2009). Reporting subscores for institutions. British Journal of Mathematical and Statistical Psychology, 62, 79–95.
Haertel, E. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 65–110). Westport, CT: American Council on Education and Praeger.
Hambleton, R. K., Brennan, R. L., Brown, W., Dodd, B., Forsyth, R. A., Mehrens, W. A., et al. (2000). A response to “Setting reasonable and useful performance standards” in the National Academy of Sciences’ Grading the Nation’s Report Card. Educational Measurement: Issues and Practice, 19(2), 5–13.
Hambleton, R. K., & Pitoniak, M. J. (2006). Setting performance standards. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 433–470). Westport, CT: American Council on Education and Praeger.
Hanson, B. A. (2002). IRT command language (Version 0.020301, March 1, 2002). Monterey, CA: Author. http://www.b-a-h.com/software/irt/icl/index.html
Harris, D. J. (1991). A comparison of Angoff’s Design I and Design II for vertical equating using traditional and IRT methodology. Journal of Educational Measurement, 28, 221–235.
Harris, D. J. (2007). Practical issues in vertical scaling. In N. J. Dorans, M. Pommerich, & P. W. Holland (Eds.), Linking and aligning scores and scales (pp. 233–251). New York: Springer.
Harris, D. J., & Hoover, H. D. (1987). An application of the three-parameter IRT model to vertical equating. Applied Psychological Measurement, 11, 151–159.
Hendrickson, A. B., Cao, Y., Chae, S. E., & Li, D. (2006, April). Effect of base year on IRT vertical scaling from the common-item design. Paper presented at the annual meeting of the National Council for Measurement in Education, San Francisco, CA.
Hendrickson, A. B., Kolen, M. J., & Tong, Y. (2004, April). Comparison of IRT vertical scaling from scaling-test and common item designs. Paper presented at the annual meeting of the National Council for Measurement in Education, San Diego, CA.
Hendrickson, A. B., Wei, H., & Kolen, M. J. (2005, April). Dichotomous and polytomous scoring for IRT vertical scaling from scaling-test and common-item designs. Paper presented at the annual meeting of the National Council for Measurement in Education, Montreal, Canada.
Ho, A. D. (2009). A nonparametric framework for comparing trends and gaps across tests. Journal of Educational and Behavioral Statistics, 34, 201–228.
Ho, A. D., Lewis, D. M., & MacGregor Farris, J. L. (2009). The dependence of growth-model results on proficiency cut scores. Educational Measurement: Issues and Practice, 28(4), 15–26.
Holland, P. W. (2002). Two measures of change in the gaps between CDFs of test-score distributions. Journal of Educational & Behavioral Statistics, 27, 3–18.
Holmes, S. E. (1982). Unidimensionality and vertical equating with the Rasch model. Journal of Educational Measurement, 19, 139–147.
Hoover, H. D. (1984a). The most appropriate scores for measuring educational development in the elementary schools: GE’s. Educational Measurement: Issues & Practice, 3(4), 8–14.
Hoover, H. D. (1984b). Rejoinder to Burket. Educational Measurement: Issues and Practice, 3(4), 16–18.
Hoover, H. D. (1988). Growth expectations for low-achieving students: A reply to Yen. Educational Measurement: Issues and Practice, 7(4), 21–23.
Hoover, H. D., Dunbar, S. D., & Frisbie, D. A. (2003). The Iowa tests. Guide to development and research. Itasca, IL: Riverside Publishing.
Hoskens, M., Lewis, D. M., & Patz, R. J. (2003, April). Maintaining vertical scales using a common item design. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.
Humphry, S. M. (2011). The role of the unit in physics and psychometrics. Measurement: Interdisciplinary Research & Perspective, 9, 1–24.
Huynh, H. (1998). On score locations of binary and partial credit items and their applications to item mapping and criterion-referenced interpretation. Journal of Educational and Behavioral Statistics, 23, 35–56.
Huynh, H. (2006). A clarification on the response probability criterion RP67 for standard settings based on bookmark and item mapping. Educational Measurement: Issues and Practice, 25(2), 19–20.
Iowa Tests of Educational Development. (1958). Manual for school administrators. 1958 revision. Iowa City, IA: University of Iowa.
Ito, K., Sykes, R. C., & Yao, L. (2008). Concurrent and separate grade-groups linking procedures for vertical scaling. Applied Measurement in Education, 21, 187–206.
Jarjoura, D. (1985). Tolerance intervals for true scores. Journal of Educational Statistics, 10, 1–17.
Jodoin, M. G., Keller, L. A., & Swaminathan, H. (2003). A comparison of linear, fixed common item, and concurrent parameter estimation procedures in capturing academic growth. The Journal of Experimental Education, 71, 229–250.
Kahraman, N., & Thompson, T. (2011). Relating unidimensional IRT parameters to a multidimensional response space: A review of two alternative projection IRT models for scoring subscales. Journal of Educational Measurement, 48, 146–164.
Kane, M. T. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64, 425–461.
Kane, M. (2008). The benefits and limitations of formality. Measurement: Interdisciplinary Research & Perspective, 6, 101–108.
Kane, M., & Case, S. M. (2004). The reliability and validity of weighted composite scores. Applied Measurement in Education, 17, 221–240.
Kolen, M. J. (1981). Comparison of traditional and item response theory methods for equating tests. Journal of Educational Measurement, 18, 1–11.
Kolen, M. J. (1988). Defining score scales in relation to measurement error. Journal of Educational Measurement, 25, 97–110.
Kolen, M. J. (2001). Linking assessments effectively: Purpose and design. Educational Measurement: Issues and Practice, 20(1), 5–19.
Kolen, M. J. (2006). Scaling and norming. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 155–186). Westport, CT: American Council on Education and Praeger.
Kolen, M. J. (2011). Issues associated with vertical scales for PARCC assessments. Retrieved from Partnership for Assessment of Readiness for College and Careers (PARCC). http://www.parcconline.org/technical-advisory-committee
Kolen, M. J., & Hanson, B. A. (1989). Scaling the ACT Assessment. In R. L. Brennan (Ed.), Methodology used in scaling the ACT Assessment and P-ACT+ (pp. 35–55). Iowa City, IA: ACT Inc.
Kolen, M. J., Hanson, B. A., & Brennan, R. L. (1992). Conditional standard errors of measurement for scale scores. Journal of Educational Measurement, 29, 285–307.
Kolen, M. J., & Lee, W. (2011). Psychometric properties of raw and scale scores on mixed-format tests. Educational Measurement: Issues and Practice, 30(2), 15–24.
Kolen, M. J., & Tong, Y. (2010). Psychometric properties of IRT proficiency estimates. Educational Measurement: Issues and Practice, 29(3), 8–14.
Kolen, M. J., Tong, Y., & Brennan, R. L. (2011). Scoring and scaling educational tests. In A. A. von Davier (Ed.), Statistical models for test equating, scaling, and linking (pp. 43–58). New York: Springer.
Kolen, M. J., Wang, T., & Lee, W. (2012). Conditional standard errors of measurement for composite scores using IRT. International Journal of Testing, 12, 1–20.
Kolen, M. J., Zeng, L., & Hanson, B. A. (1996). Conditional standard errors of measurement for scale scores using IRT. Journal of Educational Measurement, 33, 129–140.
Lane, S., & Stone, C. A. (2006). Performance assessment. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 387–431). Westport, CT: American Council on Education and Praeger.
Lee, W. (2007). Multinomial and compound multinomial error models for tests with complex item scoring. Applied Psychological Measurement, 31, 255–274.
Lee, W., Brennan, R. L., & Kolen, M. J. (2000). Estimators of conditional scale-score standard errors of measurement: A simulation study. Journal of Educational Measurement, 37, 1–20.
Lee, W., Brennan, R. L., & Kolen, M. J. (2006). Interval estimation for true raw and scale scores under the binomial error model. Journal of Educational and Behavioral Statistics, 31, 261–281.
Lei, P., & Zhao, Y. (2012). Effects of vertical scaling methods on linear growth estimation. Applied Psychological Measurement, 36, 21–39.
Li, Y., & Lissitz, R. W. (2012). Exploring the full-information bifactor model in vertical scaling with construct shift. Applied Psychological Measurement, 36, 3–20.
Lindquist, E. F. (1953). Selecting appropriate score scales for tests. Proceedings of the 1952 Invitational Conference on Testing Problems (pp. 34–40). Princeton, NJ: Educational Testing Service.
Lissitz, R. W., & Huynh, H. (2003). Vertical equating for state assessments: Issues and solutions in determination of adequate yearly progress and school accountability. Practical Assessment, Research & Evaluation, 8(10), 1–8.
Livingston, S. A., & Zieky, M. J. (1982). Passing scores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: Educational Testing Service.
Lohman, D. F., & Hagen, E. P. (2002). Cognitive abilities test. Form 6. Research handbook. Itasca, IL: Riverside Publishing.
Lord, F. M. (1965). A strong true score theory with applications. Psychometrika, 30, 239–270.
Lord, F. M. (1969). Estimating true-score distributions in psychological testing (an empirical Bayes estimation problem). Psychometrika, 34, 259–299.
Lord, F. M. (1975). Automated hypothesis tests and standard errors for nonstandard problems. The American Statistician, 29, 56–59.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Lord, F. M. (1986). Maximum likelihood and Bayesian parameter estimation in item response theory. Journal of Educational Measurement, 23, 157–162.
Loyd, B. H., & Hoover, H. D. (1980). Vertical equating using the Rasch Model. Journal of Educational Measurement, 17, 179–193.
Lyren, P. (2009). Reporting subscores from college admission tests. Practical Assessment, Research & Evaluation, 14(4), 3–12.
Martineau, J. A. (2006). Distorting value added: The use of longitudinal, vertically scaled student achievement data for growth-based, value-added accountability. Journal of Educational and Behavioral Statistics, 31, 35–62.
McCaffrey, D. F., Koretz, D., Lockwood, J. R., & Hamilton, L. S. (2004). Evaluating value-added models for teacher accountability. Santa Monica, CA: Rand.
McCall, W. A. (1939). Measurement. New York, NY: Macmillan.
Michell, J. (2008). Is psychometrics pathological science? Measurement: Interdisciplinary Research & Perspective, 6, 7–24.
Mislevy, R. J. (1987). Recent developments in item response theory with implications for teacher certification. In E. Z. Rothkopf (Ed.), Review of research in education (Vol. 14, pp. 239–275). Washington, DC: American Educational Research Association.
Mittman, A. (1958). An empirical study of methods of scaling achievement tests at the elementary grade level. Unpublished Doctoral Dissertation, The University of Iowa, Iowa City.
Moses, T., & Golub-Smith, M. (2011). A scaling method that produces scale score distributions with specific skewness and kurtosis (Research Memorandum 11–04). Princeton, NJ: Educational Testing Service.
Nitko, A. J. (1984). Defining “criterion-referenced test”. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 9–28). Baltimore, MD: Johns Hopkins.
Omar, M. H. (1996). An investigation into the reasons item response theory scales show smaller variability for higher achieving groups (Iowa Testing Programs Occasional Papers Number 39). Iowa City, IA: University of Iowa.
Omar, M. H. (1997, March). An investigation into the reasons why IRT theta scale shrinks for higher achieving groups. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.
Omar, M. H. (1998, April). Item parameter invariance assumption and its implications on vertical scaling of multilevel achievement test data. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
O’Sullivan, C. Y., Reese, C. M., & Mazzeo, J. (1997). NAEP 1996 science report card for the Nation and the States. Washington, DC: National Center for Education Statistics.
Paek, I., & Young, M. J. (2005). Investigation of student growth recovery in a fixed-item linking procedure with a fixed-person prior distribution for mixed-format test data. Applied Measurement in Education, 18, 199–215.
Patz, R. J. (2007). Vertical scaling in standards-based educational assessment and accountability systems. Washington, DC: Technical Issues in Large Scale Assessment (TILSA) State Collaborative on Assessment and Student Standards (SCASS) of the Council of Chief State School Officers (CCSSO).
Patz, R. J., & Yao, L. (2007a). Vertical scaling: Statistical models for measuring growth and achievement. In C. R. Rao & S. Sinharay (Eds.), Handbook of Statistics: Psychometrics (Vol. 26, pp. 955–975). Amsterdam: Elsevier.
Patz, R. J., & Yao, L. (2007b). Methods and models for vertical scaling. In N. J. Dorans, M. Pommerich, & P. W. Holland (Eds.), Linking and aligning scores and scales (pp. 252–272). New York: Springer.
Pellegrino, J. W., Jones, L. R., & Mitchell, K. J. (1999). Grading the nation’s report card: Evaluating NAEP and transforming the assessment of educational progress. Washington, DC: National Academy Press.
Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221–262). New York: Macmillan.
Phillips, S. E. (1983). Comparison of equipercentile and item response theory equating when the scaling test method is applied to a multilevel achievement battery. Applied Psychological Measurement, 7, 267–281.
Phillips, S. E. (1986). The effects of the deletion of misfitting persons on vertical equating via the Rasch model. Journal of Educational Measurement, 23, 107–118.
Phillips, S. E., & Clarizio, H. F. (1988a). Conflicting growth expectations cannot both be real: A rejoinder to Yen. Educational Measurement: Issues and Practice, 7(4), 18–19.
Phillips, S. E., & Clarizio, H. F. (1988b). Limitations of standard scores in individual achievement testing. Educational Measurement: Issues and Practice, 7(1), 8–15.
Pommerich, M. (2006). Validation of group domain score estimates using a test of domain. Journal of Educational Measurement, 43, 97–111.
Pommerich, M., Nicewander, W. A., & Hanson, B. A. (1999). Estimating average domain scores. Journal of Educational Measurement, 36, 199–216.
Pomplun, M., Omar, M. H., & Custer, M. (2004). A comparison of WINSTEPS and BILOG-MG for vertical scaling with the Rasch model. Educational and Psychological Measurement, 64, 600–616.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1989). Numerical recipes. The art of scientific computing (Fortran version). Cambridge, UK: Cambridge University Press.
Puhan, G., & Liang, L. (2011). Equating subscores under the nonequivalent anchor test (NEAT) design. Educational Measurement: Issues and Practice, 30(1), 23–35.
Puhan, G., Sinharay, S., Haberman, S. J., & Larkin, K. (2008). Comparison of subscores based on classical test theory methods (Research Report 08–54). Princeton, NJ: Educational Testing Service.
Puhan, G., Sinharay, S., Haberman, S. J., & Larkin, K. (2010). The utility of augmented subscores in a licensure exam: An evaluation of methods using empirical data. Applied Measurement in Education, 23, 266–285.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.
Raudenbush, S. W. (2004). What are value-added models estimating and what does it imply for statistical practice. Journal of Educational and Behavioral Statistics, 29, 121–129.
Reckase, M. D. (1998). Converting boundaries between National Assessment Governing Board performance categories to points on the National Assessment of Educational Progress score scale: The 1996 science NAEP process. Applied Measurement in Education, 11, 9–21.
Reckase, M. D. (2000). The evolution of the NAEP achievement levels setting process: A summary of the research and development efforts conducted by ACT. Iowa City, IA: ACT Inc.
Reckase, M. D. (2009). Multidimensional item response theory. New York: Springer.
Reckase, M. D., & Martineau, J. (2004). The vertical scaling of science achievement tests. Paper commissioned by the Committee on Test Design for K-12 Science Achievement, Center for Education, National Research Council, National Academy of Sciences.
Rodriguez, M. C. (2003). Construct equivalence of multiple-choice and constructed-response items: A random effects synthesis of correlations. Journal of Educational Measurement, 40, 163–184.
Rosa, K., Swygert, K. A., Nelson, L., & Thissen, D. (2001). Item response theory applied to combinations of multiple-choice and constructed-response items: Scale scores for patterns of summed scores. In D. Thissen & H. Wainer (Eds.), Test scoring. Mahwah, NJ: Erlbaum.
Rudner, L. M. (2001). Informed test component weighting. Educational Measurement: Issues and Practice, 20(1), 16–19.
Schulz, E. M., & Nicewander, W. A. (1997). Grade equivalent and IRT representations of growth. Journal of Educational Measurement, 34, 315–331.
Seltzer, M. H., Frank, K. A., & Bryk, A. S. (1994). The metric matters: The sensitivity of conclusions about growth in student achievement to choice of metric. Educational Evaluation & Policy Analysis, 16, 41–49.
Sinharay, S. (2010). How often do subscores have added value? Results from operational and simulated data. Journal of Educational Measurement, 47, 150–174.
Sinharay, S., & Haberman, S. J. (2011). Equating of augmented subscores. Journal of Educational Measurement, 48, 122–145.
Sinharay, S., Haberman, S. J., & Lee, Y. (2011). When does scale anchoring work? A case study. Journal of Educational Measurement, 48, 61–80.
Sinharay, S., Haberman, S. J., & Puhan, G. (2007). Subscores based on classical test theory: To report or not to report. Educational Measurement: Issues and Practice, 26(4), 21–28.
Sinharay, S., Haberman, S. J., & Wainer, H. (2011). Do adjusted subscores lack validity? Don’t blame the messenger. Educational and Psychological Measurement, 71, 789–797.
Sinharay, S., Puhan, G., & Haberman, S. J. (2010). Reporting diagnostic scores in educational testing: Temptations, pitfalls, and some solutions. Multivariate Behavioral Research, 45, 553–573.
Sinharay, S., Puhan, G., & Haberman, S. J. (2011). An NCME instructional module on subscores. Educational Measurement: Issues and Practice, 30(3), 29–40.
Skaggs, G., & Lissitz, R. W. (1986a). An exploration of the robustness of four test equating models. Applied Psychological Measurement, 10, 303–317.
Skaggs, G., & Lissitz, R. W. (1986b). IRT test equating: Relevant issues and a review of recent research. Review of Educational Research, 56, 495–529.
Skaggs, G., & Lissitz, R. W. (1988). Effect of examinee ability on test equating invariance. Applied Psychological Measurement, 12, 69–82.
Skorupski, W. P., & Carvajal, J. (2010). A comparison of approaches for improving the reliability of objective level scores. Educational and Psychological Measurement, 70, 357–375.
Slinde, J. A., & Linn, R. L. (1977). Vertically equated tests: Fact or phantom? Journal of Educational Measurement, 14, 23–32.
Slinde, J. A., & Linn, R. L. (1978). An exploration of the adequacy of the Rasch model for the problem of vertical equating. Journal of Educational Measurement, 15, 23–35.
Slinde, J. A., & Linn, R. L. (1979a). A note on vertical equating via the Rasch model for groups of quite different ability and tests of quite different difficulty. Journal of Educational Measurement, 16, 159–165.
Slinde, J. A., & Linn, R. L. (1979b). The Rasch model, objective measurement, equating, and robustness. Applied Psychological Measurement, 3, 437–452.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677–680.
Stevens, S. S. (1951). Mathematics, measurement and psychophysics. In S. S. Stevens (Ed.), Handbook of experimental psychology (pp. 1–49). New York, NY: Wiley.
Stone, C. A., Ye, F., Zhu, X., & Lane, S. (2010). Providing subscale scores for diagnostic information: A case study when the test is essentially unidimensional. Applied Measurement in Education, 23, 63–86.
Suppes, P., & Zinnes, J. L. (1963). Basic measurement theory. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. I, pp. 1–76). New York, NY: Wiley.
Sykes, R. C., & Hou, L. (2003). Weighting constructed-response items in IRT-based exams. Applied Measurement in Education, 16, 257–275.
Sykes, R. C., & Yen, W. M. (2000). The scaling of mixed-item-format tests with the one-parameter and two-parameter partial credit models. Journal of Educational Measurement, 37, 221–244.
Tate, R. L. (2004). Implications of multidimensionality for total score and subscore performance. Applied Measurement in Education, 17, 89–112.
Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two categories. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 73–140). Mahwah, NJ: Erlbaum.
Thissen, D., Wainer, H., & Wang, X.-B. (1994). Are tests comprising both multiple-choice and free-response items necessarily less unidimensional than multiple-choice tests? An analysis of two tests. Journal of Educational Measurement, 31, 113–123.
Thomasson, G. L., Bloxom, B., & Wise, L. (1994). Initial operational test and evaluation of forms 20, 21, and 22 of the Armed Services Vocational Aptitude Battery (ASVAB) (DMDC Technical Report 94–001). Monterey, CA: Defense Manpower Data Center.
Thurstone, L. L. (1925). A method of scaling psychological and educational tests. Journal of Educational Psychology, 16, 433–451.
Thurstone, L. L. (1927). The unit of measurement in educational scales. Journal of Educational Psychology, 18, 505–524.
Thurstone, L. L. (1928). The absolute zero in intelligence measurement. Psychological Review, 35, 175–197.
Thurstone, L. L. (1938). Primary mental abilities. Psychometric Monographs, No. 1.
Thurstone, L. L., & Ackerson, L. (1929). The mental growth curve for the Binet tests. Journal of Educational Psychology, 20, 569–583.
Tong, Y., & Kolen, M. J. (2007). Comparisons of methodologies and results in vertical scaling for educational achievement tests. Applied Measurement in Education, 20, 227–253.
Tong, Y., & Kolen, M. J. (2008, March). Maintenance of vertical scales. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY.
Tong, Y., & Kolen, M. J. (2009, April). A further look into the maintenance of vertical scales. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
Tong, Y., & Kolen, M. J. (2010). Scaling: An ITEMS module. Educational Measurement: Issues and Practice, 29(4), 39–48.
Traub, R. E. (1993). On the equivalence of the traits assessed by multiple-choice and constructed-response tests. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement (pp. 29–44). Hillsdale, NJ: Erlbaum.
Wainer, H. (2004). Introduction to a special issue of the Journal of Educational and Behavioral Statistics on value-added assessment. Journal of Educational and Behavioral Statistics, 29, 1–3.
Wainer, H., & Thissen, D. (1993). Combining multiple-choice and constructed-response test scores: Toward a Marxist theory of test construction. Applied Measurement in Education, 6, 103–118.
Wainer, H., & Thissen, D. (2001). True score theory: The traditional method. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 23–72). Mahwah, NJ: Erlbaum.
Wang, M. W., & Stanley, J. C. (1970). Differential weighting: A review of methods and empirical studies. Review of Educational Research, 40, 663–704.
Wang, S., & Jiao, H. (2009). Construct equivalence across grades in a vertical scale for a K-12 large-scale reading assessment. Educational and Psychological Measurement, 69, 760–777.
Wang, T., Kolen, M. J., & Harris, D. J. (2000). Psychometric properties of scale scores and performance levels for performance assessments using polytomous IRT. Journal of Educational Measurement, 37, 141–162.
Wilcox, R. R. (1981). A review of the beta-binomial model and its extensions. Journal of Educational Statistics, 6, 3–32.
Wilks, S. S. (1938). Weighting systems for linear functions of correlated variables when there is no dependent variable. Psychometrika, 3, 23–40.
Williams, V. S. L., Pommerich, M., & Thissen, D. (1998). A comparison of developmental scales based on Thurstone methods and item response theory. Journal of Educational Measurement, 35, 93–107.
Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14, 97–116.
Yao, L., & Boughton, K. A. (2007). A multidimensional item response modeling approach for improving subscale proficiency estimation and classification. Applied Psychological Measurement, 31, 83–105.
Yen, W. M. (1985). Increasing item complexity: A possible cause of scale shrinkage for unidimensional item response theory. Psychometrika, 50, 399–410.
Yen, W. M. (1986). The choice of scale for educational measurement: An IRT perspective. Journal of Educational Measurement, 23, 299–325.
Yen, W. M. (1988). Normative growth expectations must be realistic: A response to Phillips and Clarizio. Educational Measurement: Issues and Practice, 7(4), 16–17.
Yen, W. M. (2007). Vertical scaling and No Child Left Behind. In N. J. Dorans, M. Pommerich, & P. W. Holland (Eds.), Linking and aligning scores and scales (pp. 273–283). New York, NY: Springer.
Yen, W. M., & Burket, G. R. (1997). Comparison of item response theory and Thurstone methods of vertical scaling. Journal of Educational Measurement, 34, 293–313.
Yen, W. M., Burket, G. R., & Fitzpatrick, A. R. (1996). Response to Clemans. Educational Assessment, 3, 181–190.
Young, M. J. (2006). Vertical scales. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 469–485). Mahwah, NJ: Erlbaum.
Zwick, R. (1992). Statistical and psychometric issues in the measurement of educational achievement trends: Examples from the National Assessment of Educational Progress. Journal of Educational Statistics, 17, 205–218.
Zwick, R., Senturk, D., Wang, J., & Loomis, S. C. (2001). An investigation of alternative methods for item mapping in the National Assessment of Educational Progress. Educational Measurement: Issues and Practice, 20, 15–25.
© 2014 Springer Science+Business Media New York
Kolen, M.J., Brennan, R.L. (2014). Score Scales. In: Test Equating, Scaling, and Linking. Statistics for Social and Behavioral Sciences. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-0317-7_9
Print ISBN: 978-1-4939-0316-0
Online ISBN: 978-1-4939-0317-7