Abstract
It is crucial that scholars, but also policymakers and practitioners, have a critical understanding of the conceptual and technical limitations of impact evaluations, as well as the ways that they are necessarily affected by organizational and political dynamics. To that end, this chapter directs its attention both to their most common form—regression analysis—and to the form that is seen to be more robust, namely randomized control trials (RCTs). The methodological assumptions of each are discussed first in conceptual terms before moving on to a review of some more technical issues. The final section of the chapter considers how the production of impact evaluations is affected by organizational and political dynamics. In all, this chapter advocates a critical understanding of impact evaluations in five senses: conceptually, technically, contextually, organizationally, and politically.
I am grateful to Steve Klees, Adrián Zancajo, and Charles Blake for their helpful feedback on an earlier version of this chapter.
Notes
- 1.
It should also be acknowledged that cost and the difficulty of executing such studies are deterrents as well.
- 2.
This problem is known as endogeneity.
- 3.
Researchers do not always assume linear relationships among the variables, as when they include squared terms or use logarithms. However, as discussed below, even when nonlinear relationships are modeled, there are still inconsistencies across studies, and in no case can we be sure that all of the interrelationships have been modeled correctly to reflect their relationships in the real world.
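To make the point about logarithms concrete, the following sketch (with hypothetical, noiseless numbers) fits a power-law relationship, y = 2x^0.5, by ordinary least squares after a log transform. The transform linearizes this particular relationship exactly; whether any such transformation matches the real-world relationship is precisely what cannot be verified across studies.

```python
import math

# Hypothetical data generated from a power law, y = 2 * x^0.5,
# which is nonlinear in x but linear after taking logarithms:
# log(y) = log(2) + 0.5 * log(x)
xs = [1, 2, 4, 8, 16, 32]
ys = [2 * x ** 0.5 for x in xs]

lx = [math.log(x) for x in xs]
ly = [math.log(y) for y in ys]

# Simple OLS slope and intercept on the log-log scale.
mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
slope = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sum((a - mx) ** 2 for a in lx)
intercept = my - slope * mx

print(round(slope, 3))                 # exponent recovered: 0.5
print(round(math.exp(intercept), 3))   # multiplier recovered: 2.0
```

With noiseless data the fit is exact; with real data, a misspecified functional form silently biases every coefficient estimate.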
- 4.
To be sure, accounting for all possible causes of social phenomena is a challenge faced by all methodologies. The point here is that regression analysis presumes that all relevant variables are included if we are to obtain accurate estimates of regression coefficients.
- 5.
Although we do not deal here with variations of regression analysis, the critiques outlined here apply to these variations and the methods that are built upon them. These variations and other methods include regression discontinuity, difference in differences, path analysis, hierarchical linear modeling, and structural equation modeling, among others (Klees, 2016; Schlotter, Schwerdt, & Woessmann, 2011).
- 6.
Interestingly, Rodrik (2008) provides a personal anecdote about how standards have changed: “When I was a young assistant professor [in the 1980s], one could still publish econometric results in top journals with nary a word on the endogeneity of regressors. If one went so far as to instrument for patently endogenous variables, it was often enough to state that you were doing [instrumental variables], with the list of instruments tacked into a footnote at the bottom of a table. No more. A large chunk of the typical empirical—but non-experimental—paper today is devoted to discussing issues having to do with endogeneity, omitted variables, and measurement error. The identification strategy is made explicit, and is often at the core of the paper. Robustness issues take a whole separate section. Possible objections are anticipated, and counter-arguments are advanced. In other words, considerable effort is devoted to convincing the reader of the internal validity of the study” (p. 20).
- 7.
This quote also mentions sample size, an issue that will be addressed later in this chapter, in the section on more technical issues.
- 8.
Selection bias can also occur when participants are allowed to self-select into a treatment, as discussed further in the section on “randomizing an unrepresentative sample.”
- 9.
As evidence on this point, a meta-evaluation of 213 studies of school-based interventions (47% of which employed randomized designs) found that fewer than 1% of treatment and control groups were identical at the outset (Durlak, Weissberg, Dymnicki, Taylor, & Schellinger, 2011).
- 10.
Interestingly, Cartwright (2007, p. 20) suggests that expert judgment is more useful for addressing threats to internal validity than is pursuit of statistical controls.
- 11.
The four threats addressed here are all threats to internal validity. As noted in this section, when these threats—and therefore internal validity—are an issue, one cannot be sure that the effects observed are due to the intervention under study. While there are many more potential threats to internal validity that affect both regression analysis and RCTs (see, e.g., Mertens, 2005, pp. 121–124), these four have been addressed here because they can affect RCTs in spite of randomization.
- 12.
Although it doesn’t solve the problem of outliers, researchers often look at subpopulations within the sample (e.g., according to gender, ethnicity, class, etc.) to see how outcomes vary across them.
- 13.
For additional discussion of threats to internal and external validity see, for example, Mertens (2005). The goal here is not to speak to all possible threats to internal and external validity but rather to address some of the problematic assumptions and practices that are associated with RCTs in practice.
- 14.
In a point that connects with the next subsection, Deaton and Cartwright (2016) underscore that “without a clear idea of how to characterize the population of individuals in the trial, whether we are looking for an [average treatment effect] or to identify causality, and for which groups enrolled in the trial the results are supposed to hold, we have no basis for thinking about how to use the trial results in other contexts” (p. 27).
- 15.
See the concluding chapter for a discussion of alternative methods, including realist evaluation, which focuses on understanding the mechanisms and contexts in which interventions are likely to work, as opposed to focusing only on the outcomes of an intervention, as is the case with impact evaluations.
- 16.
To be clear, Klees (2016) argues that quantitative data are still important but that we can only rely on simple correlations and cross tabulations.
- 17.
The language of rejecting the null hypothesis might sound odd, but as Steidl, Hayes, and Schauber (1997) remind us: “In the framework of the hypothetico-deductive method, research hypotheses can never be proven; rather, they can only be disproved (rejected) with the tools of statistical inference” (p. 271).
- 18.
Another way of stating this definition would be: the mean difference divided by the standard error of the mean difference.
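As an illustration of that definition, the minimal computation below uses made-up scores for two hypothetical groups: the t statistic is simply the mean difference divided by the standard error of that difference (here in its unpooled, Welch form).

```python
import math
import statistics

# Two hypothetical groups of outcome scores (invented numbers).
treatment = [12, 15, 14, 16, 13, 15, 14, 17]
control = [11, 13, 12, 14, 12, 13, 11, 14]

mean_diff = statistics.mean(treatment) - statistics.mean(control)

# Standard error of the mean difference (unpooled / Welch form).
se = math.sqrt(statistics.variance(treatment) / len(treatment)
               + statistics.variance(control) / len(control))

t = mean_diff / se
print(round(t, 2))  # → 2.83
```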
- 19.
Carver (1978) makes a further and particularly interesting point on researcher discretion when it comes to sample size: “Since the researcher often has complete control over the number of subjects sampled, one of the most important variables affecting the results of the research, the subjective judgment of the experimenter in choosing sample size, is usually not controlled. Controlling experimenter bias is a much discussed problem, but not enough is said about the experimenter’s ability to increase the odds of getting statistically significant results simply by increasing the number of subjects in an experiment” (p. 385).
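Carver's point can be seen arithmetically. Holding a substantively tiny mean difference and the standard deviation fixed, the t statistic grows with the square root of the sample size, so a trivial effect can be pushed past the conventional threshold simply by enrolling more subjects. The numbers below are hypothetical.

```python
import math

# A fixed, substantively tiny difference: 0.05 SD between groups.
mean_diff, sd = 0.05, 1.0

def t_stat(n):
    # t for a two-group comparison with n subjects per group and
    # equal SDs: t = diff / sqrt(sd^2/n + sd^2/n).
    return mean_diff / math.sqrt(2 * sd ** 2 / n)

# The identical effect crosses the conventional |t| > 1.96 threshold
# once enough subjects are sampled.
for n in (100, 1_000, 10_000):
    print(n, round(t_stat(n), 2))  # → 0.35, then 1.12, then 3.54
```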
- 20.
For additional discussion of the shortcomings of hypothesis testing, see Levine et al. (2008).
- 21.
See Steidl et al. (1997) for a good discussion of power analysis.
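As a rough illustration of what a power analysis yields, the sketch below uses the standard normal-approximation formula for a two-sample comparison (alpha = .05 two-sided, power = .80). It is an approximation for illustration only, not a substitute for the fuller treatment in Steidl et al. (1997).

```python
import math

def n_per_group(d, z_alpha=1.96, z_beta=0.84):
    """Approximate n per group for a two-sample t test:
    n = 2 * (z_alpha + z_beta)^2 / d^2
    (alpha = .05 two-sided, power = .80)."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

# Detecting a small standardized effect (d = 0.2) takes far larger
# samples than detecting a large one (d = 0.8).
print(n_per_group(0.2))  # → 392 per group
print(n_per_group(0.8))  # → 25 per group
```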
- 22.
O’Boyle and Aguinis (2012) suggest instead the use of the Paretian (power law) distribution.
- 23.
On test results and normality, Goertzel (n.d.) explains: “If a large number of people fill out a typical multiple choice test such as the Scholastic Aptitude Test (or a typical sociological questionnaire with precoded responses such as ‘strongly agree, agree’) at random using a perfect die, the scores are very likely to be normally distributed. This is true because many more combinations of response give a sum that is close to the theoretical mean than give a score that is close to either extreme” (p. 6).
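Goertzel's claim is easy to simulate. In the sketch below, 10,000 hypothetical test-takers guess at random on a 100-item, five-choice test; by the central limit theorem, the resulting scores cluster symmetrically around the theoretical mean of 20, approximating a bell curve.

```python
import random
import statistics

random.seed(0)

# 10,000 hypothetical test-takers guessing at random on a
# 100-item test with 5 choices (p = 0.2 per item).
scores = [sum(random.random() < 0.2 for _ in range(100)) for _ in range(10_000)]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)

# Share of scores falling within one SD of the mean, which for a
# bell-shaped distribution is in the neighborhood of two-thirds.
within_1sd = sum(mean - sd <= s <= mean + sd for s in scores) / len(scores)
print(round(mean, 1), round(sd, 1), round(within_1sd, 2))
```

The normality here reflects the summing of many independent item responses, not anything about the test-takers themselves, which is exactly Goertzel's point.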
- 24.
This is done using the analysis of covariance statistical procedure. Sometimes additional transformations of the data are made by converting outcomes to normalized Z scores in order to compare outcomes from different tests. See Pogrow (2017) for more.
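A minimal sketch of that Z-score conversion, using invented scores from two differently scaled tests: after standardization, both sets sit on a mean-zero, SD-one scale, which is what allows outcomes from different tests to be compared at all.

```python
import statistics

def z_scores(raw):
    """Convert raw scores to standardized Z scores (mean 0, SD 1)."""
    m = statistics.mean(raw)
    s = statistics.stdev(raw)
    return [(x - m) / s for x in raw]

# Hypothetical raw scores from two differently scaled tests.
math_test = [55, 60, 65, 70, 75]     # out of 100
reading_test = [12, 14, 16, 18, 20]  # out of 25

# Both lines print the identical standardized list.
print([round(z, 2) for z in z_scores(math_test)])     # → [-1.26, -0.63, 0.0, 0.63, 1.26]
print([round(z, 2) for z in z_scores(reading_test)])  # → [-1.26, -0.63, 0.0, 0.63, 1.26]
```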
- 25.
See Pogrow (2017) for a discussion of examples of where small effect sizes have been highlighted by researchers and the media and in ways that contradict how effect sizes should be interpreted.
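For readers unfamiliar with how an effect size is computed, the sketch below calculates Cohen's d, the standardized mean difference using the pooled standard deviation, on invented data. Cohen's (1988) oft-cited benchmarks treat roughly 0.2 as small, 0.5 as medium, and 0.8 as large, though Pogrow's point is precisely that such labels are frequently misapplied.

```python
import math
import statistics

def cohens_d(a, b):
    """Standardized mean difference using the pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)

# Hypothetical outcome scores for two groups.
treatment = [14, 15, 13, 16, 15, 14]
control = [13, 14, 12, 15, 14, 13]

d = cohens_d(treatment, control)
print(round(d, 2))  # → 0.95
```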
- 26.
It is because journals are more willing to publish evaluations with positive or negative results than with null results that one must also exercise caution when considering the findings of meta-analyses or literature reviews, a point that has been made by Glewwe (2014).
References
Anderson, D., Burnham, K., Gould, W., & Cherry, S. (2001). Concerns about finding effects that are actually spurious. Wildlife Society Bulletin, 29(1), 311–316.
Baker, J. (2000). Evaluating the impact of development projects on poverty: A handbook for practitioners. Washington, DC: World Bank.
Banerjee, A., Banerji, R., Berry, J., Duflo, E., Kannan, H., Mukerji, S., Shotland, M., & Walton, M. (2017). From proof of concept to scalable policies: Challenges and solutions, with an application. NBER Working Paper No. 22931. Retrieved from https://economics.mit.edu/files/12359
Banerjee, A., & He, R. (2008). Making aid work. In W. Easterly (Ed.), Reinventing foreign aid (pp. 47–92). Cambridge, MA: MIT.
Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association, 33(203), 526–536.
Biglan, A., Ary, D., & Wagenaar, A. (2000). The value of interrupted time-series experiments for community intervention research. Prevention Science, 1(1), 31–49.
Boruch, R., Rindskopf, D., Anderson, P., Amidjaya, I., & Jansson, D. (1979). Randomized experiments for evaluating and planning local programs: A summary on appropriateness and feasibility. Public Administration Review, 39(1), 36–40.
Braun, A., Ball, S., Maguire, M., & Hoskins, K. (2011). Taking context seriously: Towards explaining policy enactments in the secondary school. Discourse: Studies in the Cultural Politics of Education, 32(4), 585–596.
Burde, D. (2012). Assessing impact and bridging methodological divides: Randomized trials in countries affected by conflict. Comparative Education Review, 56(3), 448–473.
Cartwright, N. (2007). Are RCTs the gold standard? Centre for Philosophy of Natural and Social Science. Technical Report 01/07. London: London School of Economics. Retrieved from http://www.lse.ac.uk/CPNSS/research/concludedResearchProjects/ContingencyDissentInScience/DP/Cartwright.pdf
Carver, R. (1978). The case against statistical significance testing. Harvard Educational Review, 48(3), 378–399.
Carver, R. (1993). The case against statistical significance testing, revisited. The Journal of Experimental Education, 61(4), 287–292.
Castillo, N., & Wagner, D. (2013). Gold standard? The use of randomized controlled trials for international educational policy. Comparative Education Review, 58(1), 166–173.
Clay, R. (2010). More than one way to measure: Randomized clinical trials have their place, but critics argue that researchers would get better results if they also embraced other methodologies. Monitor on Psychology, 41 (8), 52. Retrieved from http://www.apa.org/monitor/2010/09/trials.aspx
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum.
Concato, J., Shah, N., & Horwitz, R. (2000). Randomized, controlled trials, observational studies, and the hierarchy of research designs. The New England Journal of Medicine, 342(25), 1887–1892.
Cook, T. (2001). Science phobia: Why education researchers reject randomized experiments. Education Next, (Fall), 63–68. Retrieved from http://www.indiana.edu/~educy520/readings/cook01_ed_research.pdf
Cook, T. (2004). Why have educational evaluators chosen not to do randomized experiments? Annals of the American Academy of Political and Social Science, 589(Sep.), 114–149.
Deaton, A. (2010). Instruments, randomization, and learning about development. Journal of Economic Literature, 48(2), 424–455.
Deaton, A., & Cartwright, N. (2016a). The limitations of randomised controlled trials. Vox. Retrieved from http://voxeu.org/article/limitations-randomised-controlled-trials
Deaton, A., & Cartwright, N. (2016b). Understanding and misunderstanding randomized controlled trials. NBER Working Paper No. 22595. Retrieved from http://www.nber.org/papers/w22595
De Boer, M., Waterlander, W., Kuijper, L., Steenhuis, I., & Twisk, J. (2015). Testing for baseline differences in randomized controlled trials: An unhealthy research behavior that is hard to eradicate. International Journal of Behavioral Nutrition and Physical Activity, 12(4). Retrieved from https://ijbnpa.biomedcentral.com/articles/10.1186/s12966-015-0162-z
Duflo, E., & Kremer, M. (2003, July 15–16). Use of randomization in the evaluation of development effectiveness. Paper prepared for the World Bank Operations Evaluation Department Conference on Evaluation and Development Effectiveness, Washington, DC. Retrieved from https://economics.mit.edu/files/765
Durlak, J. (2009). How to select, calculate, and interpret effect sizes. Journal of Pediatric Psychology, 34(9), 917–928.
Durlak, J., Weissberg, R., Dymnicki, A., Taylor, R., & Schellinger, K. (2011). The impact of enhancing students’ social and emotional learning: A meta-analysis of school-based universal interventions. Child Development, 82(1), 405–432.
Everett, B., Rehkopf, D., & Rogers, R. (2013). The nonlinear relationship between education and mortality: An examination of cohort, race/ethnic, and gender differences. Population Research Policy Review, 32 (6). Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3839428/
Feldman, A. & Haskins, R. (2016). Low-cost randomized controlled trials. Evidence-Based Policymaking Collaborative. Retrieved from http://www.evidencecollaborative.org/toolkits/low-cost-randomized-controlled-trials
Fendler, L., & Muzzafar, I. (2008). The history of the bell curve: Sorting and the idea of normal. Educational Theory, 58(1), 63–82.
Flack, V., & Chang, P. (1987). Frequency of selecting noise variables in subset regression analysis: A simulation study. The American Statistician, 41(1), 84–86.
Freedman, D. (1983). A note on screening regression equations. The American Statistician, 37(2), 152–155.
Ganimian, A. (2017). Not drawn to scale? RCTs and education reform in developing countries. Research on improving systems of education. Retrieved from http://www.riseprogramme.org/content/not-drawn-scale-rcts-and-education-reform-developing-countries
Garbarino, S., & Holland, J. (2009). Quantitative and qualitative methods in impact evaluation and measuring results. Governance and Social Development Resource Centre. UK Department for International Development. Retrieved from http://www.gsdrc.org/docs/open/eirs4.pdf
Gertler, P., Martinez, S., Premand, P., Rawlings, L., & Vermeersch, C. (2016). Impact evaluation in practice (2nd ed.). Washington, DC: World Bank.
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Kruger, L. (1989). The empire of chance: How probability changed science and everyday life. New York: Cambridge University Press.
Ginsburg, A., & Smith, M. (2016). Do randomized controlled trials meet the “gold standard”? A study of the usefulness of RCTs in the What Works Clearinghouse. American Enterprise Institute. Retrieved from https://www.carnegiefoundation.org/wp-content/uploads/2016/03/Do-randomized-controlled-trials-meet-the-gold-standard.pdf
Glewwe, P. (Ed.). (2014). Education policy in developing countries. Chicago: University of Chicago.
Goertzel, T. (n.d.). The myth of the bell curve. Retrieved from http://crab.rutgers.edu/~goertzel/normalcurve.htm
Gorard, S., & Taylor, C. (2004). Combining methods in educational and social research. New York: Open University.
Greenland, S., Senn, S., Rothman, K., Carlin, J., Poole, C., Goodman, S., & Altman, D. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350.
Johnson, D. (1999). The insignificance of statistical significance testing. Journal of Wildlife Management, 63(3), 763–772.
Klees, S. (2016). Inferences from regression analysis: Are they valid? Real-World Economics Review, 74, 85–97. Retrieved from http://www.paecon.net/PAEReview/issue74/Klees74.pdf
Klees, S., & Edwards, D. B., Jr. (2014). Knowledge production and technologies of governance. In T. Fenwick, E. Mangez, & J. Ozga (Eds.), World yearbook of education 2014: Governing knowledge: Comparison, knowledge-based technologies and expertise in the regulation of education (pp. 31–43). New York: Routledge.
Komatsu, H., & Rappleye, J. (2017). A new global policy regime founded on invalid statistics? Hanushek, Woessmann, PISA, and economic growth. Comparative Education, 53(2), 166–191.
Kremer, M. (2003). Randomized evaluations of educational programs in developing countries: Some lessons. The American Economic Review, 93(2), 102–106.
Lareau, A. (2009). Narrow questions, narrow answers: The limited value of randomized controlled trials for education research. In P. Walters, A. Lareau, & S. Ranis (Eds.), Education research on trial: Policy reform and the call for scientific rigor (pp. 145–162). New York: Routledge.
Leamer, E. (1983). Let’s take the con out of econometrics. The American Economic Review, 73(1), 31–43.
Leamer, E. (2010). Tantalus on the road to asymptopia. The Journal of Economic Perspectives, 24(2), 31–46.
Levine, T., Weber, R., Hullett, C., Park, H., & Lindsey, L. (2008). A critical assessment of null hypothesis significance testing in quantitative communication research. Human Communication Research, 34(2), 171–187.
Levy, S. (2006). Progress against poverty: Sustaining Mexico’s Progresa-Oportunidades program. Washington, DC: Brookings Institution Press.
Luecke, D., & McGinn, N. (1975). Regression analyses and education production functions: Can they be trusted? Harvard Educational Review, 45(3), 325–350.
Lykken, D. (1968). Statistical significance in psychological research. Psychological Bulletin, 70(3), 151–159.
McLaughlin, M. (1987). Learning from experience: Lessons from policy implementation. Educational Evaluation and Policy Analysis, 9(2), 171–178.
McLaughlin, M. (1990). The Rand change agent study revisited: Macro perspectives and micro realities. Educational Researcher, 19(9), 11–16.
Meehl, P. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46(4), 806–834.
Meehl, P. (1986). What social scientists don’t understand. In D. Fiske & R. Shweder (Eds.), Metatheory in social science: Pluralisms and subjectivities (pp. 315–338). Chicago: University of Chicago.
Meehl, P. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 393–425). Mahwah, NJ: Erlbaum.
Mertens, D. (2005). Research and evaluation in education and psychology: Integrating diversity with quantitative, qualitative, and mixed methods (2nd ed.). London: Sage.
Miguel, E., & Kremer, M. (2004). Worms: Identifying impacts on education and health in the presence of treatment externalities. Econometrica, 72(1), 159–217.
Novella, S. (2016). P value under fire. Science-Based Medicine. Retrieved from https://sciencebasedmedicine.org/p-value-under-fire/
Nuzzo, R. (2015). Scientists perturbed by loss of stat tools to sift research fudge from fact. Scientific American. Retrieved from https://www.scientificamerican.com/article/scientists-perturbed-by-loss-of-stat-tools-to-sift-research-fudge-from-fact/
O’Boyle, E., & Aguinis, H. (2012). The best and the rest: Revisiting the norm of normality of individual performance. Personnel Psychology, 65(1), 79–119.
Peck, J., & Theodore, N. (2015). Fast policy: Experimental statecraft at the thresholds of neoliberalism. Minneapolis: University of Minnesota.
Pogrow, S. (2017). The failure of the U.S. education research establishment to identify effective practices: Beware effective practices policies. Education Policy Analysis Archives, 25(5), 1–19. Retrieved from https://epaa.asu.edu/ojs/article/view/2517
Pritchett, L. (n.d.). “The evidence” about “what works” in education: Graphs to illustrate external validity and construct validity. Research on Improving Systems of Education. Retrieved from https://www.cgdev.org/publication/evidence-about-what-works-education-graphs-illustrate-external-validity
Pritchett, L. (2002). It pays to be ignorant: A simple political economy of rigorous program evaluation. The Journal of Policy Reform, 5(4), 251–269.
Rodrik, D. (2008). The new development economics: We shall experiment, but how shall we learn? Faculty Research Working Papers Series. RWP08-055. John F. Kennedy School of Government. Harvard University. Retrieved from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1296115
Romero, M., Sandefur, J., & Sandholtz, W. (2017). Can outsourcing improve Liberia’s schools? Preliminary results from year one of a three-year randomized evaluation of partnership schools for Liberia. Washington, DC: Center for Global Development. Retrieved from https://www.cgdev.org/sites/default/files/partnership-schools-for-liberia.pdf
Rust, V., Soumare, A., Pescador, O., & Shibuya, M. (1999). Research strategies in comparative education. Comparative Education Review, 43(1), 86–109.
Sanson-Fisher, R., Bonevski, B., Green, L., & D’Este, C. (2007). Limitations of the randomized controlled trial in evaluating population-based health interventions. American Journal of Preventative Medicine, 33(2), 155–161.
Schlotter, M., Schwerdt, G., & Woessmann, L. (2011). Econometric methods for causal evaluation of education policies and practices: A non-technical guide. Education Economics, 19(2), 109–137.
Schanzenbach, D. (2012). Limitations of experiments in education research. Education Finance and Policy, 7(2), 219–232.
Steidl, R., Hayes, J., & Schauber, E. (1997). Statistical power analysis in wildlife research. The Journal of Wildlife Management, 61(2), 270–279.
Uriel, E. (2013). Hypothesis testing in the multiple regression model. Retrieved from https://www.uv.es/uriel/4%20Hypothesis%20testing%20in%20the%20multiple%20regression%20model.pdf
Vivalt, E. (2015). How much can we generalize from impact evaluations? New York University. Retrieved from https://pdfs.semanticscholar.org/6545/a87feaec7d6d0ba462860b3d1bb721d9da39.pdf
Wang, L., & Guo, K. (2018). Shadow education of mathematics in China. In Y. Cao & F. Leung (Eds.), The 21st century mathematics education in China (pp. 93–106). Berlin: Springer.
Weiss, R., & Rein, M. (1970). The evaluation of broad-aim programs: Experimental design, its difficulties, and an alternative. Administrative Science Quarterly, 15(1), 97–109.
Williams, W., & Evans, J. (1969). The politics of evaluation: The case of Head Start. Annals of the American Academy of Political and Social Sciences, 385, 118–132.
Copyright information
© 2018 The Author(s)
About this chapter
Cite this chapter
Edwards, D.B. (2018). Critically Understanding Impact Evaluations: Technical, Methodological, Organizational, and Political Issues. In: Global Education Policy, Impact Evaluations, and Alternatives. Palgrave Macmillan, Cham. https://doi.org/10.1007/978-3-319-75142-9_2
DOI: https://doi.org/10.1007/978-3-319-75142-9_2
Publisher Name: Palgrave Macmillan, Cham
Print ISBN: 978-3-319-75141-2
Online ISBN: 978-3-319-75142-9
eBook Packages: Political Science and International Studies (R0)