
Score Scales

Chapter in Test Equating, Scaling, and Linking

Part of the book series: Statistics for Social and Behavioral Sciences (SSBS)


Abstract

This chapter is devoted to score scales for tests. We discuss different scaling perspectives. We describe linear and nonlinear transformations that are used to construct score scales, and we consider procedures for enhancing the meaning of scale scores by incorporating normative, content, and score precision information. We discuss procedures for maintaining score scales, as well as scales for batteries and composites. We conclude with a section on vertical scaling that considers scaling designs and psychometric methods and reviews research on vertical scaling.
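The transformations themselves are developed in the chapter text; as a rough illustration only, the sketch below shows two textbook-style ways to build a score scale from raw scores: a linear transformation to a chosen mean and standard deviation, and a nonlinear (normalizing) transformation based on mid-percentile ranks. The target mean of 150, standard deviation of 10, and the 100-200 reporting range are arbitrary illustrative assumptions, not values taken from the chapter.

# Minimal sketch (illustrative assumptions, not the authors' procedure) of
# linear and normalizing score-scale transformations.

import statistics
from statistics import NormalDist

def linear_scale(raw_scores, target_mean=150.0, target_sd=10.0, lo=100, hi=200):
    """Linearly rescale raw scores to the target mean and SD,
    then round and truncate to the reporting range [lo, hi]."""
    mean = statistics.fmean(raw_scores)
    sd = statistics.pstdev(raw_scores)
    scaled = []
    for x in raw_scores:
        z = (x - mean) / sd
        s = target_mean + target_sd * z
        scaled.append(min(hi, max(lo, round(s))))
    return scaled

def normalized_scale(raw_scores, target_mean=150.0, target_sd=10.0, lo=100, hi=200):
    """Nonlinear (normalizing) transformation: convert each raw score's
    mid-percentile rank to a normal deviate via the inverse normal CDF,
    then apply the same linear rescaling, rounding, and truncation."""
    n = len(raw_scores)
    nd = NormalDist()
    scaled = []
    for x in raw_scores:
        below = sum(1 for y in raw_scores if y < x)
        equal = sum(1 for y in raw_scores if y == x)
        pr = (below + 0.5 * equal) / n   # mid-percentile rank, strictly in (0, 1)
        z = nd.inv_cdf(pr)
        s = target_mean + target_sd * z
        scaled.append(min(hi, max(lo, round(s))))
    return scaled

if __name__ == "__main__":
    raws = [12, 15, 15, 18, 20, 22, 25, 25, 27, 30]
    print(linear_scale(raws))       # linear scale scores
    print(normalized_scale(raws))   # normalized scale scores

The linear version preserves the shape of the raw-score distribution, whereas the normalizing version forces the scale scores toward a normal distribution; which is preferable depends on the scaling perspective adopted, a choice the chapter takes up directly.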



Author information


Corresponding author

Correspondence to Michael J. Kolen.


Copyright information

© 2014 Springer Science+Business Media New York


Cite this chapter

Kolen, M.J., Brennan, R.L. (2014). Score Scales. In: Test Equating, Scaling, and Linking. Statistics for Social and Behavioral Sciences. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-0317-7_9

