## Abstract

Students’ Evaluations of Teaching (SETs) are widely used as measures of teaching quality in Higher Education. The specialized literature shows that researchers still debate whether SETs can be considered reliable measures of teaching quality. Although the controversy mainly concerns the role of students as assessors of teaching quality, most research studies on SETs focus on the design and validation of the evaluation procedure; even when the need to measure SETs reliability is recognized, it is generally assessed only indirectly, for the whole group of students, by measuring inter-student agreement. This paper focuses instead on the direct assessment of the reliability of each student as a measurement instrument of teaching quality. An agreement-based approach is adopted to assess each student’s ability to provide consistent and stable evaluations; sampling uncertainty is accounted for by building non-parametric bootstrap confidence intervals for the adopted agreement coefficients.

## Introduction

Students’ Evaluations of Teaching (SETs) have dominated worldwide for the past 40 years as the most common measure of teaching effectiveness and course quality in Higher Education (hereafter, HE) institutions (e.g. Centra 1979; Emery et al. 2003; Onwuegbuzie et al. 2009; Pounder 2008; Sarnacchiaro and D’Ambra 2012; Seldin 1999; Sliusarenko 2013; Wright 2006). SETs commonly form the basis for building teaching quality indicators and taking major formative and summative academic decisions (Berk 2005; Gravestock and Gregor-Greenleaf 2008; Onwuegbuzie et al. 2009; Sliusarenko 2013).

Because of their popularity, SETs have drawn a great deal of attention in the specialized literature. Over the years, many research studies have addressed SETs from several angles, including the design of the questionnaire and its validity (Bassi et al. 2017; Lalla et al. 2005; Martínez-Gómez et al. 2011; Onwuegbuzie et al. 2009), the relationship between indicators of quality research productivity and SETs (Stack 2003), the significance of some factors as predictors of participation in online evaluations (Adams and Umbach 2012), and the influence of selection bias in course evaluation (Goos and Salomons 2017; Wolbring and Treischl 2016), to cite just some of the most relevant ones.

Researchers have long discussed the reliability of SETs as a tool for teaching quality evaluation, but the debate is still open. Some researchers assuage the concern, arguing that no better option provides the same sort of quantifiable and reliable measures of teaching effectiveness (Abrami 2001; Aleamoni 1999; Feldman 1977; Marsh 1984, 1987; McKeachie 1997), whereas others keep the debate alive by pointing out significant factors that affect SETs reliability (Hornstein 2017). Some critical points are briefly discussed in the following.

Firstly, there is evidence that SETs are biased by students’ satisfaction with their own performance (Porter and Umbach 2006; Porter and Whitcomb 2005). Indeed, each student provides evaluations on the basis of her/his previous classroom experience, which can vary substantially depending on the individual progress toward a degree as well as on the attended college or university (Ackerman et al. 2009). The second criticism is that SETs may be biased upward/downward depending on the presence of students (e.g. commuting students or students with different cultural background) who are satisfied/dissatisfied because of general reasons not specifically related to the teaching quality (Davies et al. 2007; Sliusarenko 2013). In addition, there may be some personal characteristics (e.g. gender, age and nationality; Boring et al. 2016; Dey 1997; Fidelman 2007; Kherfi 2011; Porter and Umbach 2006; Porter and Whitcomb 2005; Thorpe 2002) or logistic factors (e.g. class size and level of course; Feldman 1984; Kuo 2007; Shapiro 1990) or even ascribed professor characteristics (e.g. gender and race; Boring et al. 2016; Feldman 1993; Stonebraker and Stone 2015) that could influence SETs.

The presence of these factors calls into question the adoption of SETs as a suitable tool for evaluating teaching quality and thus motivates the assessment of SETs reliability.

The specialized literature commonly relates SETs reliability to internal consistency and/or inter-student reliability. Specifically, internal consistency focuses on the questionnaire adopted to collect SETs and evaluates the degree to which differently worded questions on the same teaching quality factor produce the same results (e.g. Rindermann and Schofield 2001; Zhao and Gallant 2012). Inter-student reliability, instead, focuses on the homogeneity of the whole class of students, in order to investigate whether all students interpret the questions in the same way; it is assessed as the degree to which the students attending the same course in the same classroom agree in the evaluations they provide simultaneously (e.g. Burke et al. 1999; Feistauer and Richter 2017; James et al. 1984; Lüdtke et al. 2006; Marasini et al. 2014; Morley 2012).

While the characteristics of the measurement procedure (i.e. format for data collection and administering procedure) are taken into account when assessing SETs reliability, those pertaining to the measurement instrument (i.e. the students) are rarely accounted for, although students may be a main source of measurement error (Alwin 1989). As a matter of fact, the stability of a student’s evaluations over different occasions (i.e. student repeatability) has been investigated by only a few researchers (e.g. Marsh and Overall 1981; Ting 1999; Vanacore and Pellegrino 2017), whereas the consistency of a student’s evaluations over different rating scales (i.e. student reproducibility) has not yet been investigated. Moreover, to the best of our knowledge, no research study has addressed the problem of a fair characterization of the extent of student agreement that also takes sampling uncertainty into account.

Our study aims at investigating the specific abilities of students as assessors of teaching quality by measuring each student’s reliability in terms of both repeatability and reproducibility.

Both abilities can be properly measured by agreement (e.g. Bi and Kuesten 2012; De Mast and Van Wieringen 2007; Falotico and Quatto 2015; Pinto et al. 2014; Rossi 2001; Watson and Petrie 2010). Operationally, we propose to assess student repeatability (or intra-student agreement over time) and reproducibility (or intra-student agreement over scales) via agreement coefficients formulated for nominal and ordinal data, and to interpret the coefficient values via a benchmarking procedure based on bootstrap confidence intervals in order to account for sampling uncertainty.

The remainder of the paper is organized as follows: the adopted intra-student agreement coefficient and the non-parametric benchmarking procedure to characterize the extent of student repeatability and reproducibility are illustrated in Sect. 2 and 3, respectively; in Sect. 4 a real case study is fully described; conclusions are summarized in Sect. 5.

## Measuring Student Reproducibility and Student Repeatability

Student reproducibility and student repeatability will be evaluated using the linearly weighted version of the uniform \(\kappa\) coefficient (Gwet 2014), a normalized difference between the observed proportion of agreement and that expected under the assumption of uniform chance measurements (i.e. maximally uninformative measurements, conceived of as equally and identically distributed). The uniform \(\kappa\) is also known in the literature, and is hereafter referred to, as the Brennan–Prediger coefficient, since Brennan and Prediger (1981) proposed it as an alternative to Cohen’s Kappa coefficient (Cohen 1960).

Let \(n_{S}\) be the number of items evaluated on the same occasion over two different rating scales with \(k \ge 3\) classification categories and \(n_{T}\) be the number of items evaluated over two different replications (i.e. at different times) using the same ordinal scale with *k* classification categories; let \(n_{S_{ij}}\) be the number of items classified into the *i*th category on the first rating scale and into the *j*th category on the second rating scale, and \(n_{T_{ij}}\) the number of items classified into the *i*th category during the first replication and into the *j*th category during the second replication. The observed overall proportion of agreement over scales, \({\hat{p}}_{S}\), and the observed overall proportion of agreement over time, \({\hat{p}}_{T}\), are given by:

\[ {\hat{p}}_{S} = \sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\,\frac{n_{S_{ij}}}{n_{S}}, \qquad {\hat{p}}_{T} = \sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\,\frac{n_{T_{ij}}}{n_{T}} \]

where \(w_{ij} = 1 - |i-j|/(k-1)\) is the linear weight for agreement, used to take into account that on an ordinal scale some disagreements are more serious than others.

The expected proportion of agreement under the assumption of uniform chance measurements, denoted here by \(p_{c}\), is given by:

\[ p_{c} = \frac{1}{k^{2}}\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij} \]

Thus, the corresponding weighted Brennan–Prediger coefficients to estimate intra-student agreement over scales and over time are respectively formulated as:

\[ BP_{w_{S}} = \frac{{\hat{p}}_{S} - p_{c}}{1 - p_{c}}, \qquad BP_{w_{T}} = \frac{{\hat{p}}_{T} - p_{c}}{1 - p_{c}} \]

The weighted Brennan–Prediger coefficient ranges from −1 to +1: when the observed agreement equals chance agreement, \(BP_{w}=0\); when the observed agreement is greater than chance agreement, \(BP_{w}\) takes positive values; when the observed agreement is less than chance agreement, the coefficient takes negative values, which can be interpreted as disagreement. Since \(BP_{w}\) is mainly used as a measure of agreement, the condition of disagreement is only of academic interest.
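The computation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `linear_weights` and `weighted_bp`, the 0-based category indices and the example ratings are all assumptions introduced here.

```python
def linear_weights(k):
    """Linear agreement weights w_ij = 1 - |i - j| / (k - 1) for k ordinal categories."""
    return [[1.0 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]


def weighted_bp(first, second, k):
    """Linearly weighted Brennan-Prediger coefficient for two paired rating vectors.

    `first` and `second` hold category indices in 0..k-1 for the same items
    (two scales for reproducibility, two sessions for repeatability).
    """
    if len(first) != len(second) or not first:
        raise ValueError("rating vectors must be non-empty and of equal length")
    w = linear_weights(k)
    n = len(first)
    # Observed weighted proportion of agreement.
    p_obs = sum(w[i][j] for i, j in zip(first, second)) / n
    # Expected agreement under uniform chance measurements.
    p_chance = sum(sum(row) for row in w) / k**2
    return (p_obs - p_chance) / (1.0 - p_chance)


# Hypothetical ratings of five items on a 0-10 scale (k = 11), for illustration only.
session_1 = [8, 7, 9, 6, 10]
session_2 = [8, 6, 9, 7, 10]
print(round(weighted_bp(session_1, session_2, k=11), 3))
```

With linear weights and \(k = 11\) categories the chance term equals \(7/11 \approx 0.636\), so even a few one-step disagreements leave the coefficient high, as in the hypothetical example.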

The approach currently adopted to characterize the extent of agreement is based upon a straight comparison between the estimated coefficient and an adopted benchmark scale. According to the Landis and Koch (1977) scale (Table 1), by far the most widely used guideline for interpreting agreement coefficients (Altaye et al. 2001; Blackman and Koval 2000; Bland 2008; Chmura Kraemer et al. 2002; Fleiss et al. 2013; Hallgren 2012; Klar et al. 2002; Watson and Petrie 2010), only an index value over 0.8 represents Almost perfect/perfect agreement; an index value less than 0.2 represents Slight agreement; all the other intermediate values represent Fair (\(0.2 < BP_{w} \le 0.4\)), Moderate (\(0.4 < BP_{w} \le 0.6\)) and Substantial (\(0.6 < BP_{w} \le 0.8\)) agreement, respectively.

## Non-parametric Bootstrap Confidence Intervals for Weighted Brennan–Prediger coefficient

Though commonly adopted by practitioners, benchmark scales are widely criticized in the literature, and some researchers warn that their uncritical application may lead to practically questionable decisions (Sim and Wright 2005). Indeed, such a deterministic approach to benchmarking does not account for the influence of experimental conditions on the estimated coefficient and, thus, does not allow for a statistical characterization of the extent of agreement. This criticism may be overcome by benchmarking the lower bound of the Confidence Interval (CI) of the agreement coefficient rather than its point estimate.
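The difference between benchmarking a point estimate and benchmarking the lower CI bound can be illustrated with a small Python sketch. The function name `benchmark` and the numeric values are hypothetical; the thresholds follow the ranges quoted in the text, and values at or below 0.2 (including negative ones) are all labelled Slight here, which is an assumption of this sketch.

```python
def benchmark(value):
    """Map an agreement coefficient to the benchmark label described in the text."""
    if value > 0.8:
        return "Almost perfect"
    if value > 0.6:
        return "Substantial"
    if value > 0.4:
        return "Moderate"
    if value > 0.2:
        return "Fair"
    # Assumption: everything at or below 0.2 is lumped into the lowest class.
    return "Slight"


# Hypothetical point estimate and lower CI bound for one student.
point_estimate, lower_bound = 0.65, 0.52
print(benchmark(point_estimate))  # label from the deterministic approach
print(benchmark(lower_bound))     # label when sampling uncertainty is considered
```

In this hypothetical case the point estimate falls in the Substantial range while the lower bound only supports Moderate agreement, which is the kind of discrepancy the inferential benchmarking is meant to expose.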

Since the statistical distribution of \(BP_{w}\) for small samples is unknown, a non-parametric approach based on bootstrap resampling is recommended (Ukoumunne et al. 2003; Klar et al. 2002).

In spite of its higher computational complexity, the Bias-Corrected and Accelerated bootstrap (BCa) confidence interval adjusts for any bias in the bootstrap distribution, taking into account not only the lack of symmetry but also the fact that skewness might change as the coefficient value varies; moreover, it generally has a smaller coverage error than the percentile bootstrap and Bias-Corrected bootstrap intervals, decreasing for \(\alpha <0.025\) as \(\alpha\) tends to 0 (Carpenter and Bithell 2000).

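The lower and upper bounds of the two-sided \(100(1-2\alpha )\%\) BCa confidence interval for \(BP_{w}\) are given by:

\[ LB = G^{-1}\left(\varPhi\left(b + \frac{b + z_{\alpha}}{1 - a\,(b + z_{\alpha})}\right)\right) \]

and

\[ UB = G^{-1}\left(\varPhi\left(b + \frac{b + z_{1-\alpha}}{1 - a\,(b + z_{1-\alpha})}\right)\right) \]

where \(G^{-1}\) denotes here the quantile function of the bootstrap distribution of \(BP_{w}\), \(\varPhi\) is the standard normal CDF, \(z_{\alpha }\) is the \(\alpha\) percentile of the standard normal distribution, *a* is the acceleration parameter and *b* is the bias correction parameter.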
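A minimal Python sketch of a BCa interval for the linearly weighted Brennan–Prediger coefficient is given below. It is not the Mathematica procedure used in the case study. The text does not say how *a* and *b* were estimated, so the sketch assumes the usual textbook choices: *b* from the share of bootstrap replicates below the point estimate, and *a* from leave-one-out (jackknife) estimates. The names `bp_linear` and `bca_interval`, the simple quantile rule and the example data are hypothetical.

```python
import random
from statistics import NormalDist


def bp_linear(pairs, k=11):
    """Linearly weighted Brennan-Prediger coefficient for (first, second) rating pairs."""
    p_obs = sum(1.0 - abs(i - j) / (k - 1) for i, j in pairs) / len(pairs)
    p_chance = sum(1.0 - abs(i - j) / (k - 1) for i in range(k) for j in range(k)) / k**2
    return (p_obs - p_chance) / (1.0 - p_chance)


def bca_interval(pairs, statistic, alpha=0.025, n_boot=1500, seed=0):
    """Two-sided (1 - 2*alpha) BCa bootstrap interval for statistic(pairs); pairs is a list."""
    norm = NormalDist()
    rng = random.Random(seed)
    n = len(pairs)
    theta_hat = statistic(pairs)

    # Bootstrap distribution: resample the rated items with replacement.
    boot = sorted(
        statistic([pairs[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )

    # Bias correction b (assumed estimator): normal quantile of the share of
    # replicates below the point estimate, clamped away from 0 and 1.
    below = sum(t < theta_hat for t in boot)
    share = min(max(below / n_boot, 1 / (n_boot + 1)), n_boot / (n_boot + 1))
    b = norm.inv_cdf(share)

    # Acceleration a (assumed estimator): skewness of the jackknife estimates.
    jack = [statistic(pairs[:i] + pairs[i + 1:]) for i in range(n)]
    mean_jack = sum(jack) / n
    num = sum((mean_jack - t) ** 3 for t in jack)
    den = 6.0 * sum((mean_jack - t) ** 2 for t in jack) ** 1.5
    a = num / den if den > 0 else 0.0

    def bound(level):
        z = norm.inv_cdf(level)
        adjusted = norm.cdf(b + (b + z) / (1.0 - a * (b + z)))
        # Simple empirical quantile of the sorted bootstrap replicates.
        index = min(max(int(adjusted * n_boot), 0), n_boot - 1)
        return boot[index]

    return bound(alpha), bound(1.0 - alpha)


# Hypothetical paired ratings of 20 items on a 0-10 scale, for illustration only.
pairs = [(8, 8), (7, 6), (9, 9), (6, 7), (10, 10), (5, 5), (8, 9), (7, 7), (9, 8), (6, 6),
         (8, 8), (4, 6), (7, 7), (9, 9), (10, 9), (6, 5), (8, 8), (7, 8), (5, 5), (9, 9)]
lower, upper = bca_interval(pairs, bp_linear)
print(round(bp_linear(pairs), 3), round(lower, 3), round(upper, 3))
```

Only `lower` would then be compared with the benchmark thresholds. The fixed seed makes the sketch reproducible; a different seed or number of replications changes the bounds slightly.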

## Case Study

The case study consists of three supervised experiments (hereafter E.I, E.II and E.III) carried out on the same teaching course over three successive academic years (from 2013 to 2016).

Nothing changed in the course organization during these 3 years: teacher, teaching materials and topics were the same. The evaluated course is an elective scheduled in the final year of the students’ degree programme; students therefore chose to attend it out of interest in the topic.

### Overview of the Experiments

Each experiment comprised three evaluation sessions and was carried out in a class with more than 20 students. In particular, all participants had obtained their first-level degree in Management Engineering from the same University; being homogeneous in curriculum and instruction, they can reasonably be assumed interchangeable.

The number of students participating in the classroom in each session of each experiment is reported in Table 2.

Only the students who participated in all sessions and rated all quality items were considered in the case study; thus, the rater candidates for the first, second and third experiment numbered 17, 18 and 17, respectively.

Students’ ratings of the perceived course quality were collected via three evaluation sheets: a Numeric Rating Scale, a Verbal Rating Scale and a Visual Analogue Scale (hereafter NRS, VRS, VAS, respectively); each evaluation sheet included twenty statements (reported in Table 3) concerning Learning/Value, Organization/Clarity, Individual Rapport, Workload/Difficulty, Breadth of Coverage, Enthusiasm, Interaction, Examinations and Assignments/Readings, which are the nine factors of the SEEQ questionnaire, one of the most widely used and universally accepted instruments to collect SETs (Coffey and Gibbs 2001; Grammatikopoulos et al. 2015; Marsh 1982, 1983, 1984, 1987; Marsh and Dunkin 1992; Marsh and Roche 1993, 1997).

The evaluation sheets differed from each other in the adopted rating scale: NRS used a numeric scale with 11 categories (whose grades range from 0 to 10); VRS used a verbal 4-point scale (with agreement grades: “strongly disagreeing with the statement”, “disagreeing with the statement”, “agreeing with the statement” and “strongly agreeing with the statement”); whereas VAS used the Visual Analogue Scale, a bipolar (continuous) scale whose anchor points were “NO” and “YES” (Aitken 1969). For comparability purposes, students’ evaluations on the VAS were rescaled into 11 equal-length segments, using 10 cutoff points.

The first evaluation session of each experiment (S.1) took place at the mid-term of the course and the second evaluation session (S.2) took place one lesson later. In both sessions, the students rated the course quality by filling in two evaluation sheets, the NRS and the VRS. Between S.1 and S.2 there was no new lesson and no interaction with the teacher; therefore, no change in evaluation was expected.

During the third evaluation session of each experiment (S.3), which took place four lessons later, the students rated the course by filling in the NRS, VRS and VAS evaluation sheets, administered separately.

In order to guarantee evaluation traceability while preserving anonymity, each student signed her/his evaluation sheets with a nickname, which made it possible to match each student’s ratings in order to compute intra-student agreement.
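The rescaling of a continuous VAS mark into 11 equal-length segments can be sketched as follows. The line length of 100 units, the left-closed segments and the name `vas_to_category` are assumptions made for illustration; the text only states that 10 cutoff points were used.

```python
def vas_to_category(position, length=100.0, n_categories=11):
    """Map a mark on a VAS line to one of n equal-length segments, numbered 0..n-1."""
    if not 0.0 <= position <= length:
        raise ValueError("the mark must lie on the line")
    width = length / n_categories
    # Segments are closed on the left; the right end of the line joins the last segment.
    return min(int(position / width), n_categories - 1)


# Hypothetical marks measured from the "NO" anchor on a line of 100 units.
for mark in (0.0, 9.0, 50.0, 100.0):
    print(mark, vas_to_category(mark))
```

After this step the VAS ratings take values from 0 to 10, like the NRS grades, so the same weighted agreement coefficient can be applied to both.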

### Results of the Experiments

According to the purpose of the case study, the evaluations provided during S.1 and S.2 on NRS were used to estimate \(BP_{w_{T}}\), whereas the ratings collected during S.3 on NRS and VAS were used to estimate \(BP_{w_{S}}\).

For each student participating in E.I, E.II and E.III, the 95% two-sided BCa confidence intervals for \(BP_{w_{S}}\) and \(BP_{w_{T}}\), obtained with B=1500 bootstrap replications, were built. The results, obtained by implementing the inferential procedure in Mathematica (Version 11.0, Wolfram Research, Inc., Champaign, IL, USA), are reported in Table 4 for \(BP_{w_{S}}\) and for \(BP_{w_{T}}\).

In Fig. 1, as an example, the BCa CIs are graphically reported against the Landis and Koch’s benchmark ranges (Table 1) together with \(BP_{w_{S}}\) and \(BP_{w_{T}}\) point estimates.

A coefficient value greater than 0.6 is generally considered an acceptable level of reliability. Thus, coherently with the specialized literature, the minimum acceptable reproducibility and repeatability levels for assuming the student a reliable assessor of teaching quality is here established as Substantial.

Assuming a confidence level of 95%, the null hypothesis that the reproducibility or repeatability level is at least Substantial can be accepted only for those students with the lower bound of the CI lying in the region over 0.6 (e.g. reproducibility of student 1 of E.I). Therefore, the null hypothesis that the reproducibility level is at least Substantial can be accepted for 75% of the students participating in the case study (i.e. 39 out of the 52 involved students), whereas the null hypothesis that the repeatability level is at least Substantial can be accepted for only 61.5% of the involved students (i.e. 32 out of 52).

However, focusing on each experiment, it is highlighted that the reproducibility and repeatability levels are quite similar but not stable over the years: they are intrinsic abilities of each rater and vary across students. Indeed, the percentages of at least substantially reproducible students are 82% for E.I (i.e. 14 students), 56% for E.II (i.e. 10 students) and 88% for E.III (i.e. 15 students); whereas the percentages of at least substantially repeatable students are 65% for E.I and E.III (i.e. 11 students) and 56% for E.II (i.e. 10 students).

Therefore, according to the study results, it is reasonable to claim that the students perform better on reproducibility than on repeatability: the students are able to provide more reliable evaluations over different rating scales than over time.

In Fig. 2 the lower bounds of the BCa confidence interval for \(BP_{w_{S}}\) and \(BP_{w_{T}}\) are plotted against the 5 regions of agreement according to Landis and Koch benchmark scale; the students selected as reliable are the ones contained in the black squares in the top right corner, coherently with the criterion that only those students who are able to provide both substantially reproducible and repeatable evaluations can be assumed substantially reliable assessors of teaching quality.

Specifically, the reliable students are only 26 (50% of the ones involved in the case study), including 9 students participating in E.I (i.e. 53% of the students involved in E.I), 7 students participating in E.II (i.e. 39% of E.II students) and 10 students participating in E.III (i.e. 59% of E.III students).
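The selection rule and the reported percentages can be checked with a short Python sketch. The per-experiment counts are those given in the text; the function names and the two lower bounds passed to `is_reliable` are hypothetical.

```python
def is_reliable(lower_reproducibility, lower_repeatability, threshold=0.6):
    """A student is retained only if both lower CI bounds lie above the threshold."""
    return lower_reproducibility > threshold and lower_repeatability > threshold


# Counts reported in the text, per experiment.
students = {"E.I": 17, "E.II": 18, "E.III": 17}
reproducible = {"E.I": 14, "E.II": 10, "E.III": 15}
repeatable = {"E.I": 11, "E.II": 10, "E.III": 11}
reliable = {"E.I": 9, "E.II": 7, "E.III": 10}


def percentages(counts):
    """Percentage of students per experiment and overall, to one decimal place."""
    result = {e: round(100.0 * counts[e] / students[e], 1) for e in students}
    result["all"] = round(100.0 * sum(counts.values()) / sum(students.values()), 1)
    return result


for label, counts in [("reproducible", reproducible),
                      ("repeatable", repeatable),
                      ("reliable", reliable)]:
    print(label, percentages(counts))

# Hypothetical lower bounds for two students, for illustration only.
print(is_reliable(0.72, 0.64), is_reliable(0.72, 0.55))
```

The strict inequality mirrors the requirement that the lower bound lies in the region over 0.6; a student who is substantially reproducible but not substantially repeatable is not retained.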

## Conclusions

As far back as the 1970s many research studies supported the validity of SETs as a measure of teaching quality/effectiveness, while many others doubted their validity. As Abrami et al. (1990) point out, “Student ratings are seldom criticized as measures of student satisfaction with instruction” but they “are often criticized as measures of instructional effectiveness” (p. 219).

When assessing SETs reliability, the majority of studies focus on inter-student reliability, whereas scarce attention is devoted to the assessment of intra-student reliability, which is generally measured only in terms of agreement between replicated evaluations (test-retest comparison). Here, SETs reliability depends on each student’s ability to provide both stable (repeatability) and consistent (reproducibility) evaluations of teaching quality.

The results highlight that students perform differently on reproducibility and repeatability, their evaluations being generally more consistent over different rating scales than stable over time on the numeric rating scale. Only 50% of the involved students perform satisfactorily enough on both reproducibility and repeatability to be assumed substantially reliable assessors of teaching quality. These results cannot of course be generalized since, although the experiments were repeated over three academic years, they were limited to only one teaching course.

However, it is worth pointing out that all three experiments were carried out in the best case of classes with little more than 20 students in the last year of their degree programme, all trained in teaching quality evaluation and homogeneous in curriculum and instruction. It is reasonable to suppose that in the worst case of large classes of students attending mandatory first-year courses, never involved in teaching quality evaluation and not homogeneous in curriculum and instruction, student reliability would decrease, because such students have little knowledge of the object of the evaluation they are asked for (i.e. university teaching quality) and, in addition, little or no expertise in the evaluation procedure.

These findings agree with previous studies claiming that SETs generally do not accurately measure teaching quality. Even though the assessment of teaching quality via SETs is an easy and low-cost measurement procedure that looks “objective” and is thus understandably preferred by administrations, the obtained results question the validity of SETs, implying that institutions should be cautious when using SETs as the basis for taking formative and/or summative academic decisions. Low student reliability severely impacts SETs validity and thus the formative decisions taken on their basis, with negative consequences for the quality of future courses. Unreliable SETs also negatively impact summative decisions, since professor promotion and tenure would unfairly rely on students with poor evaluative abilities.

Extending this study by replicating the experiments over successive academic years and involving different classes, teaching courses and universities could provide useful guidelines for an informed adoption of SETs in the light of different student evaluative performance.

## References

Abrami, P. C. (2001). Improving judgments about teaching effectiveness using teacher rating forms. *New Directions for Institutional Research*, *2001*(109), 59–87.

Abrami, P. C., d’Apollonia, S., & Cohen, P. A. (1990). Validity of student ratings of instruction: What we know and what we do not. *Journal of Educational Psychology*, *82*(2), 219–231.

Ackerman, D., Gross, B. L., & Vigneron, F. (2009). Peer observation reports and student evaluations of teaching: Who are the experts? *Alberta Journal of Educational Research*, *55*(1), 18–39.

Adams, M. J., & Umbach, P. D. (2012). Nonresponse and online student evaluations of teaching: Understanding the influence of salience, fatigue, and academic environments. *Research in Higher Education*, *53*(5), 576–591.

Aitken, R. (1969). Measurement of feelings using visual analogue scales. *Proceedings of the Royal Society of Medicine*, *62*(10), 989–993.

Aleamoni, L. M. (1999). Student rating myths versus research facts from 1924 to 1998. *Journal of Personnel Evaluation in Education*, *13*(2), 153–166.

Altaye, M., Donner, A., & Eliasziw, M. (2001). A general goodness-of-fit approach for inference procedures concerning the kappa statistic. *Statistics in Medicine*, *20*(16), 2479–2488.

Alwin, D. F. (1989). Problems in the estimation and interpretation of the reliability of survey data. *Quality and Quantity*, *23*(3–4), 277–331.

Bassi, F., Clerci, R., & Aquario, D. (2017). Students evaluation of teaching at a large Italian university: Measurement scale validation. *Electronic Journal of Applied Statistical Analysis*, *10*(1), 93–117.

Berk, R. A. (2005). Survey of 12 strategies to measure teaching effectiveness. *International Journal of Teaching and Learning in Higher Education*, *17*(1), 48–62.

Bi, J., & Kuesten, C. (2012). Intraclass correlation coefficient (ICC): A framework for monitoring and assessing performance of trained sensory panels and panelists. *Journal of Sensory Studies*, *27*(5), 352–364.

Blackman, N. J. M., & Koval, J. J. (2000). Interval estimation for Cohen’s kappa as a measure of agreement. *Statistics in Medicine*, *19*(5), 723–741.

Bland, J. (2008). *Measurement in health and disease. Cohens kappa*. New York: University of York, Department of Health Sciences.

Boring, A., Ottoboni, K., & Stark, P. B. (2016). Student evaluations of teaching (mostly) do not measure teaching effectiveness. *ScienceOpen Research*, *10*, 1–11.

Brennan, R. L., & Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. *Educational and Psychological Measurement*, *41*(3), 687–699.

Burke, M. J., Finkelstein, L. M., & Dusig, M. S. (1999). On average deviation indices for estimating interrater agreement. *Organizational Research Methods*, *2*(1), 49–68.

Carpenter, J., & Bithell, J. (2000). Bootstrap confidence intervals: When, which, what? A practical guide for medical statisticians. *Statistics in Medicine*, *19*(9), 1141–1164.

Centra, J. A. (1979). *Determining faculty effectiveness. Assessing teaching, research, and service for personnel decisions and improvement*. Hamilton: ERIC.

Chmura Kraemer, H., Periyakoil, V. S., & Noda, A. (2002). Kappa coefficients in medical research. *Statistics in Medicine*, *21*(14), 2109–2129.

Coffey, M., & Gibbs, G. (2001). The evaluation of the student evaluation of educational quality questionnaire (SEEQ) in UK higher education. *Assessment & Evaluation in Higher Education*, *26*(1), 89–93.

Cohen, J. (1960). A coefficient of agreement for nominal scales. *Educational and Psychological Measurement*, *20*(1), 37–46.

Davies, M., Hirschberg, J., Lye, J., Johnston, C., & McDonald, I. (2007). Systematic influences on teaching evaluations: The case for caution. *Australian Economic Papers*, *46*(1), 18–38.

De Mast, J., & Van Wieringen, W. N. (2007). Measurement system analysis for categorical measurements: Agreement and kappa-type indices. *Journal of Quality Technology*, *39*(3), 191–202.

Dey, E. L. (1997). Working with low survey response rates: The efficacy of weighting adjustments. *Research in Higher Education*, *38*(2), 215–227.

Emery, C. R., Kramer, T. R., & Tian, R. G. (2003). Return to academic standards: A critique of student evaluations of teaching effectiveness. *Quality Assurance in Education*, *11*(1), 37–46.

Falotico, R., & Quatto, P. (2015). Fleiss kappa statistic without paradoxes. *Quality & Quantity*, *49*(2), 463–470.

Feistauer, D., & Richter, T. (2017). How reliable are students evaluations of teaching quality? A variance components approach. *Assessment & Evaluation in Higher Education*, *42*(8), 1263–1279.

Feldman, K. A. (1977). Consistency and variability among college students in rating their teachers and courses: A review and analysis. *Research in Higher Education*, *6*(3), 223–274.

Feldman, K. A. (1984). Class size and college students’ evaluations of teachers and courses: A closer look. *Research in Higher Education*, *21*(1), 45–116.

Feldman, K. A. (1993). College students’ views of male and female college teachers: Part II. Evidence from students’ evaluations of their classroom teachers. *Research in Higher Education*, *34*(2), 151–211.

Fidelman, C. G. (2007). *Course evaluation surveys: In-class paper surveys versus voluntary online surveys*. Palamedu: Boston College.

Fleiss, J. L., Levin, B., & Paik, M. C. (2013). *Statistical methods for rates and proportions*. New York: Wiley.

Goos, M., & Salomons, A. (2017). Measuring teaching quality in higher education: Assessing selection bias in course evaluations. *Research in Higher Education*, *58*(4), 341–364.

Grammatikopoulos, V., Linardakis, M., Gregoriadis, A., & Oikonomidis, V. (2015). Assessing the students evaluations of educational quality (SEEQ) questionnaire in Greek higher education. *Higher Education*, *70*(3), 395–408.

Gravestock, P., & Gregor-Greenleaf, E. (2008). *Student course evaluations: Research, models and trends*. Princeton: Citeseer.

Gwet, K. L. (2014). *Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters*. Wright City: Advanced Analytics, LLC.

Hallgren, K. A. (2012). Computing inter-rater reliability for observational data: An overview and tutorial. *Tutorials in Quantitative Methods for Psychology*, *8*(1), 23–34.

Hornstein, H. A. (2017). Student evaluations of teaching are an inadequate assessment tool for evaluating faculty performance. *Cogent Education*, *4*(1), 1–8.

James, L. R., Demaree, R. G., & Wolf, G. (1984). Estimating within-group interrater reliability with and without response bias. *Journal of Applied Psychology*, *69*(1), 85–98.

Kherfi, S. (2011). Whose opinion is it anyway? Determinants of participation in student evaluation of teaching. *Journal of Economic Education*, *42*(1), 19–30.

Klar, N., Lipsitz, S. R., Parzen, M., & Leong, T. (2002). An exact bootstrap confidence interval for \(\kappa\) in small samples. *Journal of the Royal Statistical Society: Series D (The Statistician)*, *51*(4), 467–478.

Kuo, W. (2007). How reliable is teaching evaluation? The relationship of class size to teaching evaluation scores. *IEEE Transactions on Reliability*, *56*(2), 178–181.

Lalla, M., Facchinetti, G., & Mastroleo, G. (2005). Ordinal scales and fuzzy set systems to measure agreement: An application to the evaluation of teaching activity. *Quality and Quantity*, *38*(5), 577–601.

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. *Biometrics*, *33*(1), 159–174.

Lüdtke, O., Trautwein, U., Kunter, M., & Baumert, J. (2006). Reliability and agreement of student ratings of the classroom environment: A reanalysis of TIMSS data. *Learning Environments Research*, *9*(3), 215–230.

Marasini, D., Quatto, P., & Ripamonti, E. (2014). A measure of ordinal concordance for the evaluation of university courses. *Procedia Economics and Finance*, *17*, 39–46.

Marsh, H. W. (1982). SEEQ: A reliable, valid, and useful instrument for collecting students’ evaluations of university teaching. *British Journal of Educational Psychology*, *52*(1), 77–95.

Marsh, H. W. (1983). Multidimensional ratings of teaching effectiveness by students from different academic settings and their relation to student/course/instructor characteristics. *Journal of Educational Psychology*, *75*(1), 150–166.

Marsh, H. W. (1984). Students’ evaluations of university teaching: Dimensionality, reliability, validity, potential baises, and utility. *Journal of Educational Psychology*, *76*(5), 707–754.

Marsh, H. W. (1987). Students’ evaluations of university teaching: Research findings, methodological issues, and directions for future research. *International Journal of Educational Research*, *11*(3), 253–388.

Marsh, H. W., & Dunkin, M. (1992). Students’ evaluations of university teaching: A multidimensional perspective. In J. C. Smart (Ed.), *Higher education: Handbook of theory and research* (Vol. 8, pp. 143–223). New York: Agathon Press.

Marsh, H. W., & Overall, J. (1981). The relative influence of course level, course type, and instructor on students’ evaluations of college teaching. *American Educational Research Journal*, *18*(1), 103–112.

Marsh, H. W., & Roche, L. (1993). The use of students evaluations and an individually structured intervention to enhance university teaching effectiveness. *American Educational Research Journal*, *30*(1), 217–251.

Marsh, H. W., & Roche, L. A. (1997). Making students’ evaluations of teaching effectiveness effective: The critical issues of validity, bias, and utility. *American Psychologist*, *52*(11), 1187–1197.

Martínez-Gómez, M., Sierra, J. M. C., Jabaloyes, J., & Zarzo, M. (2011). A multivariate method for analyzing and improving the use of student evaluation of teaching questionnaires: A case study. *Quality & Quantity*, *45*(6), 1415–1427.

McKeachie, W. J. (1997). *Student ratings: The validity of use*. Washington: American Psychological Association.

Morley, D. D. (2012). Claims about the reliability of student evaluations of instruction: The ecological fallacy rides again. *Studies in Educational Evaluation*, *38*(1), 15–20.

Onwuegbuzie, A. J., Daniel, L. G., & Collins, K. M. (2009). A meta-validation model for assessing the score-validity of student teaching evaluations. *Quality & Quantity*, *43*(2), 197–209.

Pinto, F. S. T., Fogliatto, F. S., & Qannari, E. M. (2014). A method for panelists consistency assessment in sensory evaluations based on the cronbachs alpha coefficient. *Food Quality and Preference*, *32*, 41–47.

Porter, S. R., & Umbach, P. D. (2006). Student survey response rates across institutions: Why do they vary? *Research in Higher Education*, *47*(2), 229–247.

Porter, S. R., & Whitcomb, M. E. (2005). Non-response in student surveys: The role of demographics, engagement and personality. *Research in Higher Education*, *46*(2), 127–152.

Pounder, J. S. (2008). Transformational classroom leadership: A novel approach to evaluating classroom performance. *Assessment & Evaluation in Higher Education*, *33*(3), 233–243.

Rindermann, H., & Schofield, N. (2001). Generalizability of multidimensional student ratings of university instruction across courses and teachers. *Research in Higher Education*, *42*(4), 377–399.

Rossi, F. (2001). Assessing sensory panelist performance using repeatability and reproducibility measures. *Food Quality and Preference*, *12*(5), 467–479.

Sarnacchiaro, P., & D’Ambra, L. (2012). Students’ evaluations of university teaching: A structural equation modeling analysis. *Electronic Journal of Applied Statistical Analysis*, *5*(3), 406–412.

Seldin, P. (1999). *Changing practices in evaluating teaching: A practical guide to improved faculty performance and promotion/tenure decisions* (Vol. 10). San Francisco: Jossey-Bass.

Shapiro, E. G. (1990). Effect of instructor and class characteristics on students’ class evaluations. *Research in Higher Education*, *31*(2), 135–148.

Sim, J., & Wright, C. C. (2005). The kappa statistic in reliability studies: Use, interpretation, and sample size requirements. *Physical Therapy*, *85*(3), 257–268.

Sliusarenko, T. (2013).

*Quantitative assessment of course evaluations*. PhD Thesis (PhD-2013-318), Technical University of Denmark (DTU).Stack, S. (2003). Research productivity and student evaluation of teaching in social science classes: A research note.

*Research in Higher Education*,*44*(5), 539–556.Stonebraker, R. J., & Stone, G. S. (2015). Too old to teach? The effect of age on college and university professors.

*Research in Higher Education*,*56*(8), 793–812.Thorpe, S. W. (2002).

*Online student evaluation of instruction: An investigation of non-response bias*. AIR 2002 forum paper.Ting, K. F. (1999). Measuring teaching quality in Hong Kong’s higher education: Reliability and validity of student ratings. In J. James (Ed.), Quality in teaching and learning in higher education (pp. 46–54). Hong Kong: Hong Kong Polytechnic University.

Ukoumunne, O. C., Davison, A. C., Gulliford, M. C., & Chinn, S. (2003). Non-parametric bootstrap confidence intervals for the intraclass correlation coefficient. *Statistics in Medicine*, *22*(24), 3805–3821.

Vanacore, A., & Pellegrino, M. S. (2017). An agreement-based approach for reliability assessment of students’ evaluations of teaching. In *Proceedings of the 3rd international conference on higher education advances* (pp. 1286–1293). Editorial Universitat Politècnica de València.

Watson, P., & Petrie, A. (2010). Method agreement analysis: A review of correct methodology. *Theriogenology*, *73*(9), 1167–1179.

Wolbring, T., & Treischl, E. (2016). Selection bias in students’ evaluation of teaching. *Research in Higher Education*, *57*(1), 51–71.

Wright, R. E. (2006). Student evaluations of faculty: Concerns raised in the literature, and possible solutions. *College Student Journal*, *40*(2), 417.

Zhao, J., & Gallant, D. J. (2012). Student evaluation of instruction in higher education: Exploring issues of validity and reliability. *Assessment & Evaluation in Higher Education*, *37*(2), 227–235.

## Acknowledgements

The authors express their gratitude to the anonymous reviewers for their positive comments and helpful suggestions which contributed significantly to the improvement of this article.

## About this article

### Cite this article

Vanacore, A., Pellegrino, M.S. How Reliable are Students’ Evaluations of Teaching (SETs)? A Study to Test Student’s Reproducibility and Repeatability. *Soc Indic Res* **146**, 77–89 (2019). https://doi.org/10.1007/s11205-018-02055-y

### Keywords

- Students’ Evaluations of Teaching
- Intra-rater agreement
- Student reproducibility
- Student repeatability