How Reliable are Students’ Evaluations of Teaching (SETs)? A Study to Test Student’s Reproducibility and Repeatability

Vanacore, Amalia; Pellegrino, Maria Sole

doi:10.1007/s11205-018-02055-y

Original Research
Published: 08 January 2019

Volume 146, pages 77–89, (2019)

Social Indicators Research

Amalia Vanacore¹ &
Maria Sole Pellegrino¹

Abstract

Students’ Evaluations of Teaching (SETs) are widely used as measures of teaching quality in Higher Education. A review of the specialized literature shows that researchers widely debate whether SETs can be considered reliable measures of teaching quality. Although the controversy mainly concerns the role of students as assessors of teaching quality, most research studies on SETs focus on the design and validation of the evaluation procedure; even when the need to measure SET reliability is recognized, it is generally assessed indirectly for the whole group of students by measuring inter-student agreement. In this paper the focus is on the direct assessment of the reliability of each student as a measurement instrument of teaching quality. An agreement-based approach is adopted to assess each student’s ability to provide consistent and stable evaluations; sampling uncertainty is accounted for by building non-parametric bootstrap confidence intervals for the adopted agreement coefficients.
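
As an illustration of the kind of procedure the abstract describes, the sketch below computes an agreement coefficient between two replicated sets of ratings given by one student and builds a non-parametric bootstrap confidence interval around it. The abstract does not say which agreement coefficient or which bootstrap variant the authors use, so unweighted Cohen's kappa and a simple percentile interval are assumptions made here for illustration only. The function names, the ratings and the 4-point scale are hypothetical, not taken from the article.

```python
import random
from collections import Counter


def cohen_kappa(first, second):
    """Unweighted Cohen's kappa between two equally long lists of ratings.

    Assumption: the article's actual coefficient is not stated in the
    abstract; Cohen's kappa is used here only as a familiar stand-in.
    """
    n = len(first)
    if n == 0 or n != len(second):
        raise ValueError("rating lists must be non-empty and of equal length")
    # Observed proportion of items rated identically on both occasions.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Agreement expected by chance from the two marginal distributions.
    freq_first = Counter(first)
    freq_second = Counter(second)
    expected = sum(freq_first[c] * freq_second[c] for c in freq_first) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both lists use one identical category;
        # returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def bootstrap_ci(first, second, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample rated items with replacement."""
    rng = random.Random(seed)
    n = len(first)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([first[i] for i in idx],
                                 [second[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Hypothetical data: one student rates the same ten items on two occasions
# using a 4-point scale (a repeatability check).
first_pass = [1, 2, 2, 3, 4, 4, 3, 2, 1, 3]
second_pass = [1, 2, 3, 3, 4, 4, 3, 2, 2, 3]

kappa = cohen_kappa(first_pass, second_pass)
lower, upper = bootstrap_ci(first_pass, second_pass)
print(f"kappa = {kappa:.3f}, 95% percentile bootstrap CI = [{lower:.3f}, {upper:.3f}]")
```

With only ten rated items the interval will be wide; an ordinal rating scale would usually call for a weighted coefficient, which this sketch leaves out.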

Acknowledgements

The authors express their gratitude to the anonymous reviewers for their positive comments and helpful suggestions, which contributed significantly to the improvement of this article.

Author information

Authors and Affiliations

Department of Industrial Engineering, University of Naples “Federico II”, p.le Tecchio 80, 80125, Naples, Italy
Amalia Vanacore & Maria Sole Pellegrino

Corresponding author

Correspondence to Amalia Vanacore.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Vanacore, A., Pellegrino, M.S. How Reliable are Students’ Evaluations of Teaching (SETs)? A Study to Test Student’s Reproducibility and Repeatability. Soc Indic Res 146, 77–89 (2019). https://doi.org/10.1007/s11205-018-02055-y

Accepted: 20 December 2018
Published: 08 January 2019
Issue Date: November 2019
DOI: https://doi.org/10.1007/s11205-018-02055-y
