Potential sources of invalidity when using teacher value-added and principal observational estimates: artificial inflation, deflation, and conflation

  • Audrey Amrein-Beardsley
  • Tray J. Geiger


Contemporary teacher evaluation policies are built upon multiple-measure systems that rely primarily on teacher-level value-added and observational estimates. However, researchers have not yet investigated how using these indicators to evaluate teachers might distort validity, especially when one indicator seemingly trumps, or is trusted over, the other. Accordingly, in this conceptual piece, we introduce and begin to establish evidence for three conceptual terms related to the validity of the inferences derived via these two measures in the context of teacher evaluation: (1) artificial inflation, (2) artificial deflation, and (3) artificial conflation. We define these terms by illustrating how those with the power to evaluate teachers (e.g., principals) within such contemporary evaluation systems might (1) artificially inflate or (2) artificially deflate observational estimates when used alongside their value-added counterparts, or (3) artificially conflate both estimates to purposefully (albeit perhaps naïvely) exaggerate perceptions of validity.


Keywords: Accountability; Educational policy; Educational reform; Teacher evaluation; Validity




Copyright information

© Springer Nature B.V. 2019

Authors and Affiliations

  1. Educational Policy and Evaluation, Mary Lou Fulton Teachers College, Arizona State University, Tempe, USA