Abstract
It is crucial that scholars, but also policymakers and practitioners, have a critical understanding of the conceptual and technical limitations of impact evaluations, as well as the ways that they are necessarily affected by organizational and political dynamics. To that end, this chapter directs its attention both to their most common form—regression analysis—and to the form that is seen to be more robust, namely randomized control trials (RCTs). The methodological assumptions of each are discussed first in conceptual terms before moving on to a review of some more technical issues. The final section of the chapter considers how the production of impact evaluations is affected by organizational and political dynamics. In all, this chapter advocates a critical understanding of impact evaluations in five senses: conceptually, technically, contextually, organizationally, and politically.
I am grateful to Steve Klees, Adrián Zancajo, and Charles Blake for their helpful feedback on an earlier version of this chapter.
Notes
- 1.
It should also be acknowledged that cost and the difficulty of executing such studies are deterrents as well.
- 2.
This problem is known as endogeneity.
- 3.
Researchers do not always assume linear relationships among the variables, as when they include squared terms or use logarithms. However, as discussed below, even when nonlinear relationships are modeled, there are still inconsistencies across studies, and in no case can we be sure that all of the interrelationships have been modeled correctly to reflect their relationships in the real world.
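To make the point about logarithms concrete, the following sketch (with hypothetical, noiseless numbers) fits a power-law relationship, y = 2x^0.5, by ordinary least squares after a log transform. The transform linearizes this particular relationship exactly; whether any such transformation matches the real-world relationship is precisely what cannot be verified across studies.

```python
import math

# Hypothetical data generated from a power law, y = 2 * x^0.5,
# which is nonlinear in x but linear after taking logarithms:
# log(y) = log(2) + 0.5 * log(x)
xs = [1, 2, 4, 8, 16, 32]
ys = [2 * x ** 0.5 for x in xs]

lx = [math.log(x) for x in xs]
ly = [math.log(y) for y in ys]

# Simple OLS slope and intercept on the log-log scale.
mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
slope = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sum((a - mx) ** 2 for a in lx)
intercept = my - slope * mx

print(round(slope, 3))                 # exponent recovered: 0.5
print(round(math.exp(intercept), 3))   # multiplier recovered: 2.0
```

With noiseless data the fit is exact; with real data, a misspecified functional form silently biases every coefficient estimate.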
- 4.
To be sure, accounting for all possible causes of social phenomena is a challenge faced by all methodologies. The point here is that regression analysis presumes that all relevant variables are included if we are to obtain accurate estimates of regression coefficients.
- 5.
Although we do not deal here with variations of regression analysis, the critiques outlined here apply to these variations and the methods that are built upon them. These variations and other methods include regression discontinuity, difference in differences, path analysis, hierarchical linear modeling, and structural equation modeling, among others (Klees, 2016; Schlotter, Schwerdt, & Woessmann, 2011).
- 6.
Interestingly, Rodrik (2008) provides a personal anecdote about how standards have changed: “When I was a young assistant professor [in the 1980s], one could still publish econometric results in top journals with nary a word on the endogeneity of regressors. If one went so far as to instrument for patently endogenous variables, it was often enough to state that you were doing [instrumental variables], with the list of instruments tacked into a footnote at the bottom of a table. No more. A large chunk of the typical empirical—but non-experimental—paper today is devoted to discussing issues having to do with endogeneity, omitted variables, and measurement error. The identification strategy is made explicit, and is often at the core of the paper. Robustness issues take a whole separate section. Possible objections are anticipated, and counter-arguments are advanced. In other words, considerable effort is devoted to convincing the reader of the internal validity of the study” (p. 20).
- 7.
This quote also mentions sample size, an issue that will be addressed later in this chapter, in the section on more technical issues.
- 8.
Selection bias can also occur when participants are allowed to self-select into a treatment, as discussed further in the section on “randomizing an unrepresentative sample.”
- 9.
As evidence on this point, a meta-evaluation of 213 studies of school-based interventions (47% of which employed randomized designs) found that fewer than 1% of treatment and control groups were identical at the outset (Durlak, Weissberg, Dymnicki, Taylor, & Schellinger, 2011).
- 10.
Interestingly, Cartwright (2007, p. 20) suggests that expert judgment is more useful for addressing threats to internal validity than is pursuit of statistical controls.
- 11.
The four threats addressed here are all threats to internal validity. As noted in this section, when these threats—and therefore internal validity—are an issue, one cannot be sure that the effects observed are due to the intervention under study. While there are many more potential threats to internal validity that affect both regression analysis and RCTs (see, e.g., Mertens, 2005, pp. 121–124), these four have been addressed here because they can affect RCTs in spite of randomization.
- 12.
Although it doesn’t solve the problem of outliers, researchers often look at subpopulations within the sample (e.g., according to gender, ethnicity, class, etc.) to see how outcomes vary across them.
- 13.
For additional discussion of threats to internal and external validity see, for example, Mertens (2005). The goal here is not to speak to all possible threats to internal and external validity but rather to address some of the problematic assumptions and practices that are associated with RCTs in practice.
- 14.
In a point that connects with the next subsection, Deaton and Cartwright (2016) underscore that “without a clear idea of how to characterize the population of individuals in the trial, whether we are looking for an [average treatment effect] or to identify causality, and for which groups enrolled in the trial the results are supposed to hold, we have no basis for thinking about how to use the trial results in other contexts” (p. 27).
- 15.
See the concluding chapter for a discussion of alternative methods, including realist evaluation, which focuses on understanding the mechanisms and contexts in which interventions are likely to work, as opposed to focusing only on the outcomes of an intervention, as is the case with impact evaluations.
- 16.
To be clear, Klees (2016) argues that quantitative data are still important but that we can only rely on simple correlations and cross tabulations.
- 17.
The language of rejecting the null hypothesis might sound odd, but as Steidl, Hayes, and Schauber (1997) remind us: “In the framework of the hypothetico-deductive method, research hypotheses can never be proven; rather, they can only be disproved (rejected) with the tools of statistical inference” (p. 271).
- 18.
Another way of stating this definition would be: the mean difference divided by the standard error of the mean difference.
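As an illustration of that definition, the minimal computation below uses made-up scores for two hypothetical groups: the t statistic is simply the mean difference divided by the standard error of that difference (here in its unpooled, Welch form).

```python
import math
import statistics

# Two hypothetical groups of outcome scores (invented numbers).
treatment = [12, 15, 14, 16, 13, 15, 14, 17]
control = [11, 13, 12, 14, 12, 13, 11, 14]

mean_diff = statistics.mean(treatment) - statistics.mean(control)

# Standard error of the mean difference (unpooled / Welch form).
se = math.sqrt(statistics.variance(treatment) / len(treatment)
               + statistics.variance(control) / len(control))

t = mean_diff / se
print(round(t, 2))  # → 2.83
```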
- 19.
Carver (1978) makes a further and particularly interesting point on researcher discretion when it comes to sample size: “Since the researcher often has complete control over the number of subjects sampled, one of the most important variables affecting the results of the research, the subjective judgment of the experimenter in choosing sample size, is usually not controlled. Controlling experimenter bias is a much discussed problem, but not enough is said about the experimenter’s ability to increase the odds of getting statistically significant results simply by increasing the number of subjects in an experiment” (p. 385).
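Carver's point can be seen arithmetically. Holding a substantively tiny mean difference and the standard deviation fixed, the t statistic grows with the square root of the sample size, so a trivial effect can be pushed past the conventional threshold simply by enrolling more subjects. The numbers below are hypothetical.

```python
import math

# A fixed, substantively tiny difference: 0.05 SD between groups.
mean_diff, sd = 0.05, 1.0

def t_stat(n):
    # t for a two-group comparison with n subjects per group and
    # equal SDs: t = diff / sqrt(sd^2/n + sd^2/n).
    return mean_diff / math.sqrt(2 * sd ** 2 / n)

# The identical effect crosses the conventional |t| > 1.96 threshold
# once enough subjects are sampled.
for n in (100, 1_000, 10_000):
    print(n, round(t_stat(n), 2))  # → 0.35, then 1.12, then 3.54
```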
- 20.
For additional discussion of the shortcomings of hypothesis testing, see Levine et al. (2008).
- 21.
See Steidl et al. (1997) for a good discussion of power analysis.
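As a rough illustration of what a power analysis yields, the sketch below uses the standard normal-approximation formula for a two-sample comparison (alpha = .05 two-sided, power = .80). It is an approximation for illustration only, not a substitute for the fuller treatment in Steidl et al. (1997).

```python
import math

def n_per_group(d, z_alpha=1.96, z_beta=0.84):
    """Approximate n per group for a two-sample t test:
    n = 2 * (z_alpha + z_beta)^2 / d^2
    (alpha = .05 two-sided, power = .80)."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

# Detecting a small standardized effect (d = 0.2) takes far larger
# samples than detecting a large one (d = 0.8).
print(n_per_group(0.2))  # → 392 per group
print(n_per_group(0.8))  # → 25 per group
```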
- 22.
O’Boyle and Aguinis (2012) suggest instead the use of the Paretian (power law) distribution.
- 23.
On test results and normality, Goertzel (n.d.) explains: “If a large number of people fill out a typical multiple choice test such as the Scholastic Aptitude Test (or a typical sociological questionnaire with precoded responses such as ‘strongly agree, agree’) at random using a perfect die, the scores are very likely to be normally distributed. This is true because many more combinations of response give a sum that is close to the theoretical mean than give a score that is close to either extreme” (p. 6).
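Goertzel's claim is easy to simulate. In the sketch below, 10,000 hypothetical test-takers guess at random on a 100-item, five-choice test; by the central limit theorem, the resulting scores cluster symmetrically around the theoretical mean of 20, approximating a bell curve.

```python
import random
import statistics

random.seed(0)

# 10,000 hypothetical test-takers guessing at random on a
# 100-item test with 5 choices (p = 0.2 per item).
scores = [sum(random.random() < 0.2 for _ in range(100)) for _ in range(10_000)]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)

# Share of scores falling within one SD of the mean, which for a
# bell-shaped distribution is in the neighborhood of two-thirds.
within_1sd = sum(mean - sd <= s <= mean + sd for s in scores) / len(scores)
print(round(mean, 1), round(sd, 1), round(within_1sd, 2))
```

The normality here reflects the summing of many independent item responses, not anything about the test-takers themselves, which is exactly Goertzel's point.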
- 24.
This is done using the analysis of covariance statistical procedure. Sometimes additional transformations of the data are made by converting outcomes to normalized Z scores in order to compare outcomes from different tests. See Pogrow (2017) for more.
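A minimal sketch of that Z-score conversion, using invented scores from two differently scaled tests: after standardization, both sets sit on a mean-zero, SD-one scale, which is what allows outcomes from different tests to be compared at all.

```python
import statistics

def z_scores(raw):
    """Convert raw scores to standardized Z scores (mean 0, SD 1)."""
    m = statistics.mean(raw)
    s = statistics.stdev(raw)
    return [(x - m) / s for x in raw]

# Hypothetical raw scores from two differently scaled tests.
math_test = [55, 60, 65, 70, 75]     # out of 100
reading_test = [12, 14, 16, 18, 20]  # out of 25

# Both lines print the identical standardized list.
print([round(z, 2) for z in z_scores(math_test)])     # → [-1.26, -0.63, 0.0, 0.63, 1.26]
print([round(z, 2) for z in z_scores(reading_test)])  # → [-1.26, -0.63, 0.0, 0.63, 1.26]
```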
- 25.
See Pogrow (2017) for a discussion of examples of where small effect sizes have been highlighted by researchers and the media and in ways that contradict how effect sizes should be interpreted.
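For readers unfamiliar with how an effect size is computed, the sketch below calculates Cohen's d, the standardized mean difference using the pooled standard deviation, on invented data. Cohen's (1988) oft-cited benchmarks treat roughly 0.2 as small, 0.5 as medium, and 0.8 as large, though Pogrow's point is precisely that such labels are frequently misapplied.

```python
import math
import statistics

def cohens_d(a, b):
    """Standardized mean difference using the pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)

# Hypothetical outcome scores for two groups.
treatment = [14, 15, 13, 16, 15, 14]
control = [13, 14, 12, 15, 14, 13]

d = cohens_d(treatment, control)
print(round(d, 2))  # → 0.95
```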
- 26.
It is because journals are more willing to publish evaluations with positive or negative results than with null results that one must also exercise caution when considering the findings of meta-analyses or literature reviews, a point that has been made by Glewwe (2014).
References
Anderson, D., Burnham, K., Gould, W., & Cherry, S. (2001). Concerns about finding effects that are actually spurious. Wildlife Society Bulletin, 29(1), 311–316.
Baker, J. (2000). Evaluating the impact of development projects on poverty: A handbook for practitioners. Washington, DC: World Bank.
Banerjee, A., Banerji, R., Berry, J., Duflo, E., Kannan, H., Mukerji, S., Shotland, M., & Walton, M. (2017). From proof of concept to scalable policies: Challenges and solutions, with an application. NBER Working Paper No. 22931. Retrieved from https://economics.mit.edu/files/12359
Banerjee, A., & He, R. (2008). Making aid work. In W. Easterly (Ed.), Reinventing foreign aid (pp. 47–92). Cambridge, MA: MIT.
Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association, 33(203), 526–536.
Biglan, A., Ary, D., & Wagenaar, A. (2000). The value of interrupted time-series experiments for community intervention research. Prevention Science, 1(1), 31–49.
Boruch, R., Rindskopf, D., Anderson, P., Amidjaya, I., & Jansson, D. (1979). Randomized experiments for evaluating and planning local programs: A summary on appropriateness and feasibility. Public Administration Review, 39(1), 36–40.
Braun, A., Ball, S., Maguire, M., & Hoskins, K. (2011). Taking context seriously: Towards explaining policy enactments in the secondary school. Discourse: Studies in the Cultural Politics of Education, 32(4), 585–596.
Burde, D. (2012). Assessing impact and bridging methodological divides: Randomized trials in countries affected by conflict. Comparative Education Review, 56(3), 448–473.
Cartwright, N. (2007). Are RCTs the gold standard? Centre for Philosophy of Natural and Social Science. Technical Report 01/07. London: London School of Economics. Retrieved from http://www.lse.ac.uk/CPNSS/research/concludedResearchProjects/ContingencyDissentInScience/DP/Cartwright.pdf
Carver, R. (1978). The case against statistical significance testing. Harvard Educational Review, 48(3), 378–399.
Carver, R. (1993). The case against statistical significance testing, revisited. The Journal of Experimental Education, 61(4), 287–292.
Castillo, N., & Wagner, D. (2013). Gold standard? The use of randomized controlled trials for international educational policy. Comparative Education Review, 58(1), 166–173.
Clay, R. (2010). More than one way to measure: Randomized clinical trials have their place, but critics argue that researchers would get better results if they also embraced other methodologies. Monitor on Psychology, 41 (8), 52. Retrieved from http://www.apa.org/monitor/2010/09/trials.aspx
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum.
Concato, J., Shah, N., & Horwitz, R. (2000). Randomized, controlled trials, observational studies, and the hierarchy of research designs. The New England Journal of Medicine, 342(25), 1887–1892.
Cook, T. (2001). Science phobia: Why education researchers reject randomized experiments. Education Next, (Fall), 63–68. Retrieved from http://www.indiana.edu/~educy520/readings/cook01_ed_research.pdf
Cook, T. (2004). Why have educational evaluators chosen not to do randomized experiments? Annals of the American Academy of Political and Social Science, 589(Sep.), 114–149.
Deaton, A. (2010). Instruments, randomization, and learning about development. Journal of Economic Literature, 48(2), 424–455.
Deaton, A., & Cartwright, N. (2016a). The limitations of randomised controlled trials. Vox. Retrieved from http://voxeu.org/article/limitations-randomised-controlled-trials
Deaton, A., & Cartwright, N. (2016b). Understanding and misunderstanding randomized controlled trials. NBER Working Paper No. 22595. Retrieved from http://www.nber.org/papers/w22595
De Boer, M., Waterlander, W., Kuijper, L., Steenhuis, I., & Twisk, J. (2015). Testing for baseline differences in randomized controlled trials: An unhealthy research behavior that is hard to eradicate. International Journal of Behavioral Nutrition and Physical Activity, 12(4). Retrieved from https://ijbnpa.biomedcentral.com/articles/10.1186/s12966-015-0162-z
Duflo, E., & Kremer, M. (2003, July 15–16). Use of randomization in the evaluation of development effectiveness. Paper prepared for the World Bank Operations Evaluation Department Conference on Evaluation and Development Effectiveness, Washington, DC. Retrieved from https://economics.mit.edu/files/765
Durlak, J. (2009). How to select, calculate, and interpret effect sizes. Journal of Pediatric Psychology, 34(9), 917–928.
Durlak, J., Weissberg, R., Dymnicki, A., Taylor, R., & Schellinger, K. (2011). The impact of enhancing students’ social and emotional learning: A meta-analysis of school-based universal interventions. Child Development, 82(1), 405–432.
Everett, B., Rehkopf, D., & Rogers, R. (2013). The nonlinear relationship between education and mortality: An examination of cohort, race/ethnic, and gender differences. Population Research Policy Review, 32 (6). Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3839428/
Feldman, A. & Haskins, R. (2016). Low-cost randomized controlled trials. Evidence-Based Policymaking Collaborative. Retrieved from http://www.evidencecollaborative.org/toolkits/low-cost-randomized-controlled-trials
Fendler, L., & Muzzafar, I. (2008). The history of the bell curve: Sorting and the idea of normal. Educational Theory, 58(1), 63–82.
Flack, V., & Chang, P. (1987). Frequency of selecting noise variables in subset regression analysis: A simulation study. The American Statistician, 41(1), 84–86.
Freedman, D. (1983). A note on screening regression equations. The American Statistician, 37(2), 152–155.
Ganimian, A. (2017). Not drawn to scale? RCTs and education reform in developing countries. Research on improving systems of education. Retrieved from http://www.riseprogramme.org/content/not-drawn-scale-rcts-and-education-reform-developing-countries
Garbarino, S., & Holland, J. (2009). Quantitative and qualitative methods in impact evaluation and measuring results. Governance and Social Development Resource Centre. UK Department for International Development. Retrieved from http://www.gsdrc.org/docs/open/eirs4.pdf
Gertler, P., Martinez, S., Premand, P., Rawlings, L., & Vermeersch, C. (2016). Impact evaluation in practice (2nd ed.). Washington, DC: World Bank.
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Kruger, L. (1989). The empire of chance: How probability changed science and everyday life. New York: Cambridge University Press.
Ginsburg, A., & Smith, M. (2016). Do randomized controlled trials meet the “gold standard”? A study of the usefulness of RCTs in the What Works Clearinghouse. American Enterprise Institute. Retrieved from https://www.carnegiefoundation.org/wp-content/uploads/2016/03/Do-randomized-controlled-trials-meet-the-gold-standard.pdf
Glewwe, P. (Ed.). (2014). Education policy in developing countries. Chicago: University of Chicago.
Goertzel, T. (n.d.). The myth of the bell curve. Retrieved from http://crab.rutgers.edu/~goertzel/normalcurve.htm
Gorard, S., & Taylor, C. (2004). Combining methods in educational and social research. New York: Open University.
Greenland, S., Senn, S., Rothman, K., Carlin, J., Poole, C., Goodman, S., & Altman, D. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350.
Johnson, D. (1999). The insignificance of statistical significance testing. Journal of Wildlife Management, 63(3), 763–772.
Klees, S. (2016). Inferences from regression analysis: Are they valid? Real-World Economics Review, 74, 85–97. Retrieved from http://www.paecon.net/PAEReview/issue74/Klees74.pdf
Klees, S., & Edwards, D. B., Jr. (2014). Knowledge production and technologies of governance. In T. Fenwick, E. Mangez, & J. Ozga (Eds.), World yearbook of education 2014: Governing knowledge: Comparison, knowledge-based technologies and expertise in the regulation of education (pp. 31–43). New York: Routledge.
Komatsu, H., & Rappleye, J. (2017). A new global policy regime founded on invalid statistics? Hanushek, Woessmann, PISA, and economic growth. Comparative Education, 53(2), 166–191.
Kremer, M. (2003). Randomized evaluations of educational programs in developing countries: Some lessons. The American Economic Review, 93(2), 102–106.
Lareau, A. (2009). Narrow questions, narrow answers: The limited value of randomized controlled trials for education research. In P. Walters, A. Lareau, & S. Ranis (Eds.), Education research on trial: Policy reform and the call for scientific rigor (pp. 145–162). New York: Routledge.
Leamer, E. (1983). Let’s take the con out of econometrics. The American Economic Review, 73(1), 31–43.
Leamer, E. (2010). Tantalus on the road to asymptopia. The Journal of Economic Perspectives, 24(2), 31–46.
Levine, T., Weber, R., Hullett, C., Park, H., & Lindsey, L. (2008). A critical assessment of null hypothesis significance testing in quantitative communication research. Human Communication Research, 34(2), 171–187.
Levy, S. (2006). Progress against poverty: Sustaining Mexico’s Progresa-Oportunidades program. Washington, DC: Brookings Institution Press.
Luecke, D., & McGinn, N. (1975). Regression analyses and education production functions: Can they be trusted? Harvard Educational Review, 45(3), 325–350.
Lykken, D. (1968). Statistical significance in psychological research. Psychological Bulletin, 70(3), 151–159.
McLaughlin, M. (1987). Learning from experience: Lessons from policy implementation. Educational Evaluation and Policy Analysis, 9(2), 171–178.
McLaughlin, M. (1990). The Rand change agent study revisited: Macro perspectives and micro realities. Educational Researcher, 19(9), 11–16.
Meehl, P. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46(4), 806–834.
Meehl, P. (1986). What social scientists don’t understand. In D. Fiske & R. Shweder (Eds.), Metatheory in social science: Pluralisms and subjectivities (pp. 315–338). Chicago: University of Chicago.
Meehl, P. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 393–425). Mahwah, NJ: Erlbaum.
Mertens, D. (2005). Research and evaluation in education and psychology: Integrating diversity with quantitative, qualitative, and mixed methods (2nd ed.). London: Sage.
Miguel, E., & Kremer, M. (2004). Worms: Identifying impacts on education and health in the presence of treatment externalities. Econometrica, 72(1), 159–217.
Novella, S. (2016). P value under fire. Science-Based Medicine. Retrieved from https://sciencebasedmedicine.org/p-value-under-fire/
Nuzzo, R. (2015). Scientists perturbed by loss of stat tools to sift research fudge from fact. Scientific American. Retrieved from https://www.scientificamerican.com/article/scientists-perturbed-by-loss-of-stat-tools-to-sift-research-fudge-from-fact/
O’Boyle, E., & Aguinis, H. (2012). The best and the rest: Revisiting the norm of normality of individual performance. Personnel Psychology, 65(1), 79–119.
Peck, J., & Theodore, N. (2015). Fast policy: Experimental statecraft at the thresholds of neoliberalism. Minneapolis: University of Minnesota.
Pogrow, S. (2017). The failure of the U.S. education research establishment to identify effective practices: Beware effective practices policies. Education Policy Analysis Archives, 25(5), 1–19. Retrieved from https://epaa.asu.edu/ojs/article/view/2517
Pritchett, L. (n.d.). “The evidence” about “what works” in education: Graphs to illustrate external validity and construct validity. Research on Improving Systems of Education. Retrieved from https://www.cgdev.org/publication/evidence-about-what-works-education-graphs-illustrate-external-validity
Pritchett, L. (2002). It pays to be ignorant: A simple political economy of rigorous program evaluation. The Journal of Policy Reform, 5(4), 251–269.
Rodrik, D. (2008). The new development economics: We shall experiment, but how shall we learn? Faculty Research Working Papers Series. RWP08-055. John F. Kennedy School of Government. Harvard University. Retrieved from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1296115
Romero, M., Sandefur, J., & Sandholtz, W. (2017). Can outsourcing improve Liberia’s schools? Preliminary results from year one of a three-year randomized evaluation of partnership schools for Liberia. Washington, DC: Center for Global Development. Retrieved from https://www.cgdev.org/sites/default/files/partnership-schools-for-liberia.pdf
Rust, V., Soumare, A., Pescador, O., & Shibuya, M. (1999). Research strategies in comparative education. Comparative Education Review, 43(1), 86–109.
Sanson-Fisher, R., Bonevski, B., Green, L., & D’Este, C. (2007). Limitations of the randomized controlled trial in evaluating population-based health interventions. American Journal of Preventative Medicine, 33(2), 155–161.
Schlotter, M., Schwerdt, G., & Woessmann, L. (2011). Econometric methods for causal evaluation of education policies and practices: A non-technical guide. Education Economics, 19(2), 109–137.
Schanzenbach, D. (2012). Limitations of experiments in education research. Education Finance and Policy, 7(2), 219–232.
Steidl, R., Hayes, J., & Schauber, E. (1997). Statistical power analysis in wildlife research. The Journal of Wildlife Management, 61(2), 270–279.
Uriel, E. (2013). Hypothesis testing in the multiple regression model. Retrieved from https://www.uv.es/uriel/4%20Hypothesis%20testing%20in%20the%20multiple%20regression%20model.pdf
Vivalt, E. (2015). How much can we generalize from impact evaluations? New York University. Retrieved from https://pdfs.semanticscholar.org/6545/a87feaec7d6d0ba462860b3d1bb721d9da39.pdf
Wang, L., & Guo, K. (2018). Shadow education of mathematics in China. In Y. Cao & F. Leung (Eds.), The 21st century mathematics education in China (pp. 93–106). Berlin: Springer.
Weiss, R., & Rein, M. (1970). The evaluation of broad-aim programs: Experimental design, its difficulties, and an alternative. Administrative Science Quarterly, 15(1), 97–109.
Williams, W., & Evans, J. (1969). The politics of evaluation: The case of Head Start. Annals of the American Academy of Political and Social Sciences, 385, 118–132.
Copyright information
© 2018 The Author(s)
About this chapter
Cite this chapter
Edwards, D.B. (2018). Critically Understanding Impact Evaluations: Technical, Methodological, Organizational, and Political Issues. In: Global Education Policy, Impact Evaluations, and Alternatives. Palgrave Macmillan, Cham. https://doi.org/10.1007/978-3-319-75142-9_2
DOI: https://doi.org/10.1007/978-3-319-75142-9_2
Publisher Name: Palgrave Macmillan, Cham
Print ISBN: 978-3-319-75141-2
Online ISBN: 978-3-319-75142-9
eBook Packages: Political Science and International Studies (R0)