Classroom observation systems in context: A case for the validation of observation systems

Published in Educational Assessment, Evaluation and Accountability.

Abstract

Researchers and practitioners sometimes presume that using a previously “validated” instrument will produce “valid” scores; however, contemporary views of validity suggest there are many reasons this assumption can be faulty. To demonstrate some of the problems with this view, and to support comparisons of observation protocols across contexts, we introduce and define the conceptual tool of an observation system. We then describe psychometric evidence for a popular teacher observation instrument, Charlotte Danielson’s Framework for Teaching, in three use contexts: a lower-stakes research context, a lower-stakes practice-based context, and a higher-stakes practice-based context. Despite sharing a common instrument, the three observation systems and their associated use contexts combine to produce different average teacher scores, different score distributions, and different levels of score precision. However, all three systems produce higher average scores in the classroom environment domain than in the instructional domain, and all three sets of scores support a one-factor model, whereas the Framework posits four factors. We discuss how the dependencies between aspects of observation systems and practical constraints leave researchers with significant validation challenges and opportunities.

Notes

  1. The different patterns of missingness in observation scores may be explained in several ways. First, some records were originally considered “missing data” because they were incomplete. Specifically, the electronic rating system allowed the planning and preparation domain to be rated before the other domains, so administrators may have entered ratings for the same lesson in separate records. To consolidate multiple entries for the same observation, we combined scores from records with the same teacher and rater IDs that were entered within a week of one another; records were combined only if any overlapping component ratings were consistent with one another. Second, administrators may have conducted informal “walk-through” classroom visits in which they did not rate all of the components, which could also leave incomplete records in the system. To remove records from informal walk-throughs, we dropped observations with scores from only one domain. A sketch of this record-cleaning logic appears after these notes.

  2. We also ran t tests that account for the correlation induced by multiple lessons per teacher and repeated ratings by each rater. We estimated the mean and the standard error of the mean for each component and domain using nested or crossed random effects models. Specifically, for UTQ we used a crossed random effects model with teacher, rater, and teacher-by-rater random effects; for LAUSD we used a nested random effects model with rater effects and teacher effects nested within raters. The results were similar to those from the simple t tests, except that all of the differences in mean scores between the two practice-based contexts became significant. Estimates from the t tests that account for clustering are available upon request. A sketch of such models appears after these notes.

  3. We also cannot calculate inter-rater reliabilities because teacher performances in the practice-based contexts were not double-scored.

  4. Eigenvalues and scree plots for all contexts are available upon request; a sketch of how such eigenvalues can be computed appears after these notes.

  5. FFT was developed at Educational Testing Service and was used (with some differences) as a part of the Praxis III assessment for beginning teachers in the U.S. states of Ohio and Arkansas.
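
The record-cleaning steps described in Note 1 can be expressed as a short data-manipulation routine. The following is a minimal sketch, not the authors' code: the long-format table and its columns (teacher_id, rater_id, obs_date, domain, component, score) are hypothetical, and the one-week window and the handling of inconsistent overlapping ratings are simplified.

```python
import pandas as pd

def clean_observation_records(records: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the cleaning rules in Note 1 (hypothetical column names)."""
    df = records.copy()
    df["obs_date"] = pd.to_datetime(df["obs_date"])

    # Approximate "entered within a week" by grouping each teacher-rater
    # pair's records into calendar weeks.
    df["week"] = df["obs_date"].dt.to_period("W")
    keys = ["teacher_id", "rater_id", "week"]

    # Keep only groups whose overlapping component ratings agree; groups with
    # conflicting ratings on the same component are dropped here for simplicity.
    consistent = df.groupby(keys).filter(
        lambda g: (g.groupby("component")["score"].nunique() <= 1).all()
    )

    # Collapse split entries for the same lesson into one row per component.
    merged = (
        consistent.groupby(keys + ["domain", "component"], as_index=False)["score"]
        .first()
    )

    # Drop informal walk-throughs: observations rated on only one domain.
    n_domains = merged.groupby(keys)["domain"].transform("nunique")
    return merged.loc[n_domains > 1].drop(columns="week")
```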
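
The clustering-adjusted means in Note 2 can be estimated with intercept-only mixed models. Below is a minimal sketch using statsmodels, not the authors' code; the data frame and its columns (score, teacher_id, rater_id) are placeholders, and the teacher-by-rater interaction component is mentioned in a comment rather than fit.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder long-format data; replace with real component- or domain-level scores.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "teacher_id": np.repeat(np.arange(40), 6),
    "rater_id": rng.integers(0, 10, size=240),
    "score": rng.normal(2.5, 0.5, size=240),
})

# Crossed random effects (UTQ-style): teacher and rater enter as variance
# components within a single artificial grouping level.
df["one_group"] = 1
crossed = smf.mixedlm(
    "score ~ 1",
    data=df,
    groups="one_group",
    re_formula="0",  # no random intercept for the artificial single group
    vc_formula={
        "teacher": "0 + C(teacher_id)",
        "rater": "0 + C(rater_id)",
        # A teacher-by-rater component, as described in Note 2, could be added
        # as another entry, e.g. "0 + C(teacher_id):C(rater_id)".
    },
).fit()

# Nested random effects (LAUSD-style): rater random intercepts with teacher
# effects nested within raters.
nested = smf.mixedlm(
    "score ~ 1",
    data=df,
    groups="rater_id",
    vc_formula={"teacher": "0 + C(teacher_id)"},
).fit()

# The fitted intercept and its standard error give the clustering-adjusted
# mean score and the standard error of the mean.
print(crossed.params["Intercept"], crossed.bse["Intercept"])
print(nested.params["Intercept"], nested.bse["Intercept"])
```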
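
The eigenvalues and scree plots mentioned in Note 4 come from the exploratory factor analyses behind the one-factor result reported in the abstract. The following minimal sketch, not the authors' code, shows how such eigenvalues can be computed from a component-score correlation matrix; the scores matrix (one row per rated lesson, one column per FFT component) is a placeholder. A single dominant eigenvalue would be consistent with a one-factor solution.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder matrix of component ratings; replace with the real score matrix
# for a given observation system and context.
scores = np.random.default_rng(0).normal(loc=2.5, scale=0.5, size=(200, 8))

corr = np.corrcoef(scores, rowvar=False)       # component correlation matrix
eigenvalues = np.linalg.eigvalsh(corr)[::-1]   # sorted from largest to smallest

plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker="o")
plt.axhline(1.0, linestyle="--")               # Kaiser criterion reference line
plt.xlabel("Factor number")
plt.ylabel("Eigenvalue")
plt.title("Scree plot of component correlation eigenvalues")
plt.show()
```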


Funding

This study was supported by grants from the W.T. Grant Foundation (grant no. 181068) and the Bill and Melinda Gates Foundation (grant no. OPP52048). We thank the administrators, teachers, and staff of the Los Angeles Unified School District (LAUSD) and three large southern districts for making the data available for this study. The opinions expressed herein are those of the authors and not those of the funding agencies or participants.

Author information

Correspondence to Courtney A. Bell.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Liu, S., Bell, C.A., Jones, N.D. et al. Classroom observation systems in context: A case for the validation of observation systems. Educ Asse Eval Acc 31, 61–95 (2019). https://doi.org/10.1007/s11092-018-09291-3

