
Critically Understanding Impact Evaluations: Technical, Methodological, Organizational, and Political Issues


Abstract

It is crucial that scholars, but also policymakers and practitioners, have a critical understanding of the conceptual and technical limitations of impact evaluations, as well as the ways that they are necessarily affected by organizational and political dynamics. To achieve such a critical understanding, this chapter directs its attention both to their most common form—regression analysis—and to the form that is seen to be more robust, namely randomized controlled trials (RCTs). The methodological assumptions of each are discussed first in conceptual terms before moving on to a review of some more technical issues. The final section of the chapter considers how the production of impact evaluations is affected by organizational and political dynamics. In all, this chapter advocates a critical understanding of impact evaluations in five senses: conceptually, technically, contextually, organizationally, and politically.

I am grateful to Steve Klees, Adrián Zancajo, and Charles Blake for their helpful feedback on an earlier version of this chapter.


Notes

  1.

    It should also be acknowledged that cost and the difficulty of executing such studies are deterrents as well.

  2.

    This problem is known as endogeneity.

  3.

    Researchers do not always assume linear relationships among the variables, as when they include squared terms or use logarithms. However, as discussed below, even when nonlinear relationships are modeled, there are still inconsistencies across studies, and in no case can we be sure that all of the interrelationships have been modeled correctly to reflect their relationships in the real world.
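
    As a minimal illustration of the specification problem described here, the sketch below fits the same synthetic outcome with and without a squared term; the variable names and data-generating process are invented for illustration only, not drawn from the chapter.

```python
import numpy as np

# Synthetic data (invented for illustration): an outcome that depends
# nonlinearly on years of schooling.
rng = np.random.default_rng(0)
schooling = rng.uniform(0, 16, size=500)
outcome = 2.0 * schooling - 0.08 * schooling**2 + rng.normal(0, 1, size=500)

# Two candidate specifications: linear only vs. linear plus a squared term.
X_linear = np.column_stack([np.ones_like(schooling), schooling])
X_quad = np.column_stack([np.ones_like(schooling), schooling, schooling**2])

# Ordinary least squares fits via np.linalg.lstsq.
beta_linear, *_ = np.linalg.lstsq(X_linear, outcome, rcond=None)
beta_quad, *_ = np.linalg.lstsq(X_quad, outcome, rcond=None)

print("Linear specification:   ", beta_linear)
print("Quadratic specification:", beta_quad)
# The estimated coefficient on schooling differs across the two specifications,
# even though the data are identical: the "effect" depends on the model chosen.
```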

  4.

    To be sure, accounting for all possible causes of social phenomena is a challenge faced by all methodologies. The point here is that a pretense of regression analysis is that all variables are included if we are to get accurate estimates of regression coefficients.

  5.

    Although we do not deal here with variations of regression analysis, the critiques outlined here apply to these variations and the methods that are built upon them. These variations and other methods include regression discontinuity, difference-in-differences, path analysis, hierarchical linear modeling, and structural equation modeling, among others (Klees, 2016; Schlotter, Schwerdt, & Woessmann, 2011).

  6.

    Interestingly, Rodrik (2008) provides a personal anecdote about how standards have changed: “When I was a young assistant professor [in the 1980s], one could still publish econometric results in top journals with nary a word on the endogeneity of regressors. If one went so far as to instrument for patently endogenous variables, it was often enough to state that you were doing [instrumental variables], with the list of instruments tacked into a footnote at the bottom of a table. No more. A large chunk of the typical empirical—but non-experimental—paper today is devoted to discussing issues having to do with endogeneity, omitted variables, and measurement error. The identification strategy is made explicit, and is often at the core of the paper. Robustness issues take a whole separate section. Possible objections are anticipated, and counter-arguments are advanced. In other words, considerable effort is devoted to convincing the reader of the internal validity of the study” (p. 20).

  7.

    This quote also mentions sample size, an issue that will be addressed later in this chapter, in the section on more technical issues.

  8.

    Selection bias can also occur when participants are allowed to self-select into a treatment, as discussed further in the section on “randomizing an unrepresentative sample.”

  9.

    As evidence on this point, a meta-evaluation of 213 studies on school-based interventions (47% of which employed randomized designs) found that fewer than 1% of treatment and control groups were identical at the outset (Durlak, Weissberg, Dymnicki, Taylor, & Schellinger, 2011).

  10.

    Interestingly, Cartwright (2007, p. 20) suggests that expert judgment is more useful for addressing threats to internal validity than is pursuit of statistical controls.

  11.

    The four threats addressed here are all threats to internal validity. As noted in this section, when these threats—and therefore internal validity—are an issue, one cannot be sure that the effects observed are due to the intervention under study. While there are many more potential threats to internal validity that affect both regression analysis and RCTs (see, e.g., Mertens, 2005, pp. 121–124), these four have been addressed here because they can affect RCTs in spite of randomization.

  12.

    Although it doesn’t solve the problem of outliers, researchers often look at subpopulations within the sample (e.g., according to gender, ethnicity, class, etc.) to see how outcomes vary across them.

  13.

    For additional discussion of threats to internal and external validity see, for example, Mertens (2005). The goal here is not to speak to all possible threats to internal and external validity but rather to address some of the problematic assumptions and practices that are associated with RCTs in practice.

  14.

    In a point that connects with the next subsection, Deaton and Cartwright (2016) underscore that “without a clear idea of how to characterize the population of individuals in the trial, whether we are looking for an [average treatment effect] or to identify causality, and for which groups enrolled in the trial the results are supposed to hold, we have no basis for thinking about how to use the trial results in other contexts” (p. 27).

  15.

    See the concluding chapter for a discussion of alternative methods, including realist evaluation, which focuses on understanding the mechanisms and contexts in which interventions are likely to work, as opposed to focusing only on the outcomes of an intervention, as is the case with impact evaluations.

  16.

    To be clear, Klees (2016) argues that quantitative data are still important but that we can only rely on simple correlations and cross tabulations.

  17.

    The language of rejecting the null hypothesis might sound odd, but as Steidl, Hayes, and Schauber (1997) remind us: “In the framework of the hypothetico-deductive method, research hypotheses can never be proven; rather, they can only be disproved (rejected) with the tools of statistical inference” (p. 271).

  18.

    Another way of stating this definition would be mean difference/standard error of mean difference.
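
    Assuming the statistic being defined is the familiar two-sample t-statistic comparing treatment and control group means (an assumption on our part, since the statistic itself is introduced in the chapter text rather than in this note), the definition can be written as

    $$ t = \frac{\bar{x}_T - \bar{x}_C}{\operatorname{SE}(\bar{x}_T - \bar{x}_C)} $$

    where $\bar{x}_T$ and $\bar{x}_C$ are the treatment and control group means and the denominator is the standard error of their difference.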

  19.

    Carver (1978) makes a further and particularly interesting point on researcher discretion when it comes to sample size: “Since the researcher often has complete control over the number of subjects sampled, one of the most important variables affecting the results of the research, the subjective judgment of the experimenter in choosing sample size, is usually not controlled. Controlling experimenter bias is a much discussed problem, but not enough is said about the experimenter’s ability to increase the odds of getting statistically significant results simply by increasing the number of subjects in an experiment” (p. 385).
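
    A small simulation makes Carver’s point concrete. In the sketch below (the effect size and sample sizes are invented for illustration), a substantively trivial mean difference is held fixed while the number of subjects grows, and the p-value from a two-sample t-test shrinks accordingly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_difference = 0.05  # a substantively trivial effect, in standard deviation units

for n in (50, 500, 5000, 50000):
    control = rng.normal(0.0, 1.0, size=n)
    treatment = rng.normal(true_difference, 1.0, size=n)
    t_stat, p_value = stats.ttest_ind(treatment, control)
    print(f"n per group = {n:>6}: t = {t_stat:6.2f}, p = {p_value:.4f}")

# As n grows, even this tiny difference tends to become "statistically
# significant," which says nothing about whether it matters in practice.
```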

  20.

    For additional discussion of the shortcomings of hypothesis testing, see Levine et al. (2008).

  21.

    See Steidl et al. (1997) for a good discussion of power analysis.
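
    For readers new to the idea, the sketch below shows one common power calculation: the approximate number of subjects needed per group to detect a given standardized effect in a two-group comparison of means, using the standard normal-approximation formula (the example effect sizes are Cohen’s conventional benchmarks, not values taken from Steidl et al.).

```python
from scipy.stats import norm

def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate n per group for a two-sided, two-sample comparison of means:
    n ~ 2 * ((z_{1 - alpha/2} + z_{power}) / d)**2 (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

for d in (0.2, 0.5, 0.8):  # conventional small, medium, and large effects
    print(f"effect size d = {d}: about {n_per_group(d):.0f} subjects per group")
```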

  22.

    O’Boyle and Aguinis (2012) suggest instead the use of the Paretian (power law) distribution.

  23.

    On test results and normality, Goertzel (n.d.) explains: “If a large number of people fill out a typical multiple choice test such as the Scholastic Aptitude Test (or a typical sociological questionnaire with precoded responses such as ‘strongly agree, agree’) at random using a perfect die, the scores are very likely to be normally distributed. This is true because many more combinations of response give a sum that is close to the theoretical mean than give a score that is close to either extreme” (p. 6).
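
    Goertzel’s point can be reproduced in a few lines. If every respondent guesses at random on a multiple-choice test, each total score is a sum of many independent guesses, so the distribution of scores across respondents is pushed toward a bell shape; the test length and number of respondents below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n_items, n_choices, n_respondents = 100, 4, 100_000

# Each respondent guesses every item, so a total score is a Binomial(100, 0.25) draw.
scores = rng.binomial(n_items, 1 / n_choices, size=n_respondents)

# Crude text histogram: counts pile up symmetrically around the mean score of 25,
# while scores near either extreme are vanishingly rare.
values, counts = np.unique(scores, return_counts=True)
for value, count in zip(values, counts):
    if count > 500:
        print(f"score {value:3d}: {'#' * int(count // 500)}")
```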

  24.

    This is done using the analysis of covariance (ANCOVA) statistical procedure. Sometimes the data are further transformed by converting outcomes to normalized Z scores in order to compare outcomes from different tests. See Pogrow (2017) for more.
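
    As a simple illustration of the Z-score conversion mentioned here (the test names and score values below are invented), each outcome is centered on its own test’s mean and divided by that test’s standard deviation, which places results from different tests on a common scale:

```python
import numpy as np

# Invented scores from two tests that use very different scales.
math_scores = np.array([480, 510, 530, 560, 620], dtype=float)
reading_scores = np.array([18, 21, 23, 26, 31], dtype=float)

def to_z_scores(scores: np.ndarray) -> np.ndarray:
    """Standardize: subtract the mean, divide by the (sample) standard deviation."""
    return (scores - scores.mean()) / scores.std(ddof=1)

print(to_z_scores(math_scores))
print(to_z_scores(reading_scores))
# After conversion, +1.0 means "one standard deviation above the mean of that
# particular test," so outcomes from different tests can be compared directly.
```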

  25.

    See Pogrow (2017) for a discussion of examples in which small effect sizes have been highlighted by researchers and the media in ways that contradict how effect sizes should be interpreted.

  26.

    Because journals are more willing to publish evaluations with positive or negative results than null results, one must also exercise caution when considering the findings of meta-analyses or literature reviews, a point that has been made by Glewwe (2014).

References

  • Anderson, D., Burnham, K., Gould, W., & Cherry, S. (2001). Concerns about finding effects that are actually spurious. Wildlife Society Bulletin, 29(1), 311–316.


  • Baker, J. (2000). Evaluating the impact of development projects on poverty: A handbook for practitioners. Washington, DC: World Bank.


  • Banerjee, A., Banerji, R., Berry, J., Duflo, E., Kannan, H., Mukerji, S., Shotland, M., & Walton, M. (2017). From proof of concept to scalable policies: Challenges and solutions, with an application. NBER Working Paper No. 22931. Retrieved from https://economics.mit.edu/files/12359

  • Banerjee, A., & He, R. (2008). Making aid work. In W. Easterly (Ed.), Reinventing foreign aid (pp. 47–92). Cambridge, MA: MIT.


  • Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association, 33(203), 526–536.


  • Biglan, A., Ary, D., & Wagenaar, A. (2000). The value of interrupted time-series experiments for community intervention research. Prevention Science, 1(1), 31–49.


  • Boruch, R., Rindskopf, D., Anderson, P., Amidjaya, I., & Jansson, D. (1979). Randomized experiments for evaluating and planning local programs: A summary on appropriateness and feasibility. Public Administration Review, 39(1), 36–40.


  • Braun, A., Ball, S., Maguire, M., & Hoskins, K. (2011). Taking context seriously: Towards explaining policy enactments in the secondary school. Discourse: Studies in the Cultural Politics of Education, 32(4), 585–596.


  • Burde, D. (2012). Assessing impact and bridging methodological divides: Randomized trials in countries affected by conflict. Comparative Education Review, 56(3), 448–473.


  • Cartwright, N. (2007). Are RCTs the gold standard? Centre for Philosophy of Natural and Social Science. Technical Report 01/07. London: London School of Economics. Retrieved from http://www.lse.ac.uk/CPNSS/research/concludedResearchProjects/ContingencyDissentInScience/DP/Cartwright.pdf

  • Carver, R. (1978). The case against statistical significance testing. Harvard Educational Review, 48(3), 378–399.


  • Carver, R. (1993). The case against statistical significance testing, revisited. The Journal of Experimental Education, 61(4), 287–292.


  • Castillo, N., & Wagner, D. (2013). Gold standard? The use of randomized controlled trials for international educational policy. Comparative Education Review, 58(1), 166–173.


  • Clay, R. (2010). More than one way to measure: Randomized clinical trials have their place, but critics argue that researchers would get better results if they also embraced other methodologies. Monitor on Psychology, 41 (8), 52. Retrieved from http://www.apa.org/monitor/2010/09/trials.aspx

  • Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum.


  • Concato, J., Shah, N., & Horwitz, R. (2000). Randomized, controlled trials, observational studies, and the hierarchy of research designs. The New England Journal of Medicine, 342(25), 1887–1892.


  • Cook, T. (2001). Science phobia: Why education researchers reject randomized experiments. Education Next, (Fall), 63–68. Retrieved from http://www.indiana.edu/~educy520/readings/cook01_ed_research.pdf

  • Cook, T. (2004). Why have educational evaluators chosen not to do randomized experiments? Annals of the American Academy of Political and Social Science, 589(Sep.), 114–149.


  • Deaton, A. (2010). Instruments, randomization, and learning about development. Journal of Economic Literature, 48(2), 424–455.


  • Deaton, A., & Cartwright, N. (2016a). The limitations of randomised controlled trials. Vox. Retrieved from http://voxeu.org/article/limitations-randomised-controlled-trials

  • Deaton, A., & Cartwright, N. (2016b). Understanding and misunderstanding randomized controlled trials. NBER Working Paper No. 22595. Retrieved from http://www.nber.org/papers/w22595

  • De Boer, M., Waterlander, W., Kuijper, L., Steenhuis, I., & Twisk, J. (2015). Testing for baseline differences in randomized controlled trials: An unhealthy research behavior that is hard to eradicate. International Journal of Behavioral Nutrition and Physical Activity, 12(4). Retrieved from https://ijbnpa.biomedcentral.com/articles/10.1186/s12966-015-0162-z

  • Duflo, E., & Kremer, M. (2003, July 15–16). Use of randomization in the evaluation of development effectiveness. Paper prepared for the World Bank Operations Evaluation Department Conference on Evaluation and Development Effectiveness, Washington, DC. Retrieved from https://economics.mit.edu/files/765

  • Durlak, J. (2009). How to select, calculate, and interpret effect sizes. Journal of Pediatric Psychology, 34(9), 917–928.


  • Durlak, J., Weissberg, R., Dymnicki, A., Taylor, R., & Schellinger, K. (2011). The impact of enhancing students’ social and emotional learning: A meta-analysis of school-based universal interventions. Child Development, 82(1), 405–432.


  • Everett, B., Rehkopf, D., & Rogers, R. (2013). The nonlinear relationship between education and mortality: An examination of cohort, race/ethnic, and gender differences. Population Research and Policy Review, 32(6). Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3839428/

  • Feldman, A. & Haskins, R. (2016). Low-cost randomized controlled trials. Evidence-Based Policymaking Collaborative. Retrieved from http://www.evidencecollaborative.org/toolkits/low-cost-randomized-controlled-trials

  • Fendler, L., & Muzzafar, I. (2008). The history of the bell curve: Sorting and the idea of normal. Educational Theory, 58(1), 63–82.


  • Flack, V., & Chang, P. (1987). Frequency of selecting noise variables in subset regression analysis: A simulation study. The American Statistician, 41(1), 84–86.


  • Freedman, D. (1983). A note on screening regression equations. The American Statistician, 37(2), 152–155.


  • Ganimian, A. (2017). Not drawn to scale? RCTs and education reform in developing countries. Research on improving systems of education. Retrieved from http://www.riseprogramme.org/content/not-drawn-scale-rcts-and-education-reform-developing-countries

  • Garbarino, S., & Holland, J. (2009). Quantitative and qualitative methods in impact evaluation and measuring results. Governance and Social Development Resource Centre. UK Department for International Development. Retrieved from http://www.gsdrc.org/docs/open/eirs4.pdf

  • Gertler, P., Martinez, S., Premand, P., Rawlings, L., & Vermeersch, C. (2016). Impact evaluation in practice (2nd ed.). Washington, DC: World Bank.


  • Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Krüger, L. (1989). The empire of chance: How probability changed science and everyday life. New York: Cambridge University Press.


  • Ginsburg, A., & Smith, M. (2016). Do randomized controlled trials meet the “gold standard”? A study of the usefulness of RCTs in the What Works Clearinghouse. American Enterprise Institute. Retrieved from https://www.carnegiefoundation.org/wp-content/uploads/2016/03/Do-randomized-controlled-trials-meet-the-gold-standard.pdf

  • Glewwe, P. (Ed.). (2014). Education policy in developing countries. Chicago: University of Chicago.


  • Goertzel, T. (n.d.). The myth of the bell curve. Retrieved from http://crab.rutgers.edu/~goertzel/normalcurve.htm

  • Gorard, S., & Taylor, C. (2004). Combining methods in educational and social research. New York: Open University.


  • Greenland, S., Senn, S., Rothman, K., Carlin, J., Poole, C., Goodman, S., & Altman, D. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350.


  • Johnson, D. (1999). The insignificance of statistical significance testing. Journal of Wildlife Management, 63(3), 763–772.


  • Klees, S. (2016). Inferences from regression analysis: Are they valid? Real-World Economics Review, 74, 85–97. Retrieved from http://www.paecon.net/PAEReview/issue74/Klees74.pdf

  • Klees, S., & Edwards, D. B., Jr. (2014). Knowledge production and technologies of governance. In T. Fenwick, E. Mangez, & J. Ozga (Eds.), World yearbook of education 2014: Governing knowledge: Comparison, knowledge-based technologies and expertise in the regulation of education (pp. 31–43). New York: Routledge.


  • Komatsu, H., & Rappleye, J. (2017). A new global policy regime founded on invalid statistics? Hanushek, Woessmann, PISA, and economic growth. Comparative Education, 53(2), 166–191.


  • Kremer, M. (2003). Randomized evaluations of educational programs in developing countries: Some lessons. The American Economic Review, 93(2), 102–106.


  • Lareau, A. (2009). Narrow questions, narrow answers: The limited value of randomized controlled trials for education research. In P. Walters, A. Lareau, & S. Ranis (Eds.), Education research on trial: Policy reform and the call for scientific rigor (pp. 145–162). New York: Routledge.


  • Leamer, E. (1983). Let’s take the con out of econometrics. The American Economic Review, 73(1), 31–43.


  • Leamer, E. (2010). Tantalus on the road to asymptopia. The Journal of Economic Perspectives, 24(2), 31–46.


  • Levine, T., Weber, R., Hullett, C., Park, H., & Lindsey, L. (2008). A critical assessment of null hypothesis significance testing in quantitative communication research. Human Communication Research, 34(2), 171–187.


  • Levy, S. (2006). Progress against poverty: Sustaining Mexico’s Progresa-Oportunidades program. Washington, DC: Brookings Institution Press.


  • Luecke, D., & McGinn, N. (1975). Regression analyses and education production functions: Can they be trusted? Harvard Educational Review, 45(3), 325–350.


  • Lykken, D. (1968). Statistical significance in psychological research. Psychological Bulletin, 70(3), 151–159.


  • McLaughlin, M. (1987). Learning from experience: Lessons from policy implementation. Educational Evaluation and Policy Analysis, 9(2), 171–178.


  • McLaughlin, M. (1990). The Rand change agent study revisited: Macro perspectives and micro realities. Educational Researcher, 19(9), 11–16.


  • Meehl, P. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46(4), 806–834.


  • Meehl, P. (1986). What social scientists don’t understand. In D. Fiske & R. Shweder (Eds.), Metatheory in social science: Pluralisms and subjectivities (pp. 315–338). Chicago: University of Chicago.


  • Meehl, P. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In L. Harlow, S. Mulaik, & J. Steiger (Eds.), What if there were no significance tests? (pp. 393–425). Mahwah, NJ: Erlbaum.


  • Mertens, D. (2005). Research and evaluation in education and psychology: Integrating diversity with quantitative, qualitative, and mixed methods (2nd ed.). London: Sage.


  • Miguel, E., & Kremer, M. (2004). Worms: Identifying impacts on education and health in the presence of treatment externalities. Econometrica, 72(1), 159–217.


  • Novella, S. (2016). P value under fire. Science-Based Medicine. Retrieved from https://sciencebasedmedicine.org/p-value-under-fire/

  • Nuzzo, R. (2015). Scientists perturbed by loss of stat tools to sift research fudge from fact. Scientific American. Retrieved from https://www.scientificamerican.com/article/scientists-perturbed-by-loss-of-stat-tools-to-sift-research-fudge-from-fact/

  • O’Boyle, E., & Aguinis, H. (2012). The best and the rest: Revisiting the norm of normality of individual performance. Personnel Psychology, 65(1), 79–119.


  • Peck, J., & Theodore, N. (2015). Fast policy: Experimental statecraft at the thresholds of neoliberalism. Minneapolis: University of Minnesota.


  • Pogrow, S. (2017). The failure of the U.S. education research establishment to identify effective practices: Beware effective practices policies. Education Policy Analysis Archives, 25(5), 1–19. Retrieved from https://epaa.asu.edu/ojs/article/view/2517


  • Pritchett, L. (n.d.). “The evidence” about “what works” in education: Graphs to illustrate external validity and construct validity. Research on Improving Systems of Education. Retrieved from https://www.cgdev.org/publication/evidence-about-what-works-education-graphs-illustrate-external-validity

  • Pritchett, L. (2002). It pays to be ignorant: A simple political economy of rigorous program evaluation. The Journal of Policy Reform, 5(4), 251–269.


  • Rodrik, D. (2008). The new development economics: We shall experiment, but how shall we learn? Faculty Research Working Papers Series. RWP08-055. John F. Kennedy School of Government. Harvard University. Retrieved from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1296115

  • Romero, M., Sandefur, J., & Sandholtz, W. (2017). Can outsourcing improve Liberia’s schools? Preliminary results from year one of a three-year randomized evaluation of partnership schools for Liberia. Washington, DC: Center for Global Development. Retrieved from https://www.cgdev.org/sites/default/files/partnership-schools-for-liberia.pdf

  • Rust, V., Soumare, A., Pescador, O., & Shibuya, M. (1999). Research strategies in comparative education. Comparative Education Review, 43(1), 86–109.


  • Sanson-Fisher, R., Bonevski, B., Green, L., & D’Este, C. (2007). Limitations of the randomized controlled trial in evaluating population-based health interventions. American Journal of Preventive Medicine, 33(2), 155–161.


  • Schlotter, M., Schwerdt, G., & Woessmann, L. (2011). Econometric methods for causal evaluation of education policies and practices: A non-technical guide. Education Economics, 19(2), 109–137.


  • Schanzenbach, D. (2012). Limitations of experiments in education research. Education Finance and Policy, 7(2), 219–232.


  • Steidl, R., Hayes, J., & Schauber, E. (1997). Statistical power analysis in wildlife research. The Journal of Wildlife Management, 61(2), 270–279.


  • Uriel, E. (2013). Hypothesis testing in the multiple regression model. Retrieved from https://www.uv.es/uriel/4%20Hypothesis%20testing%20in%20the%20multiple%20regression%20model.pdf

  • Vivalt, E. (2015). How much can we generalize from impact evaluations? New York University. Retrieved from https://pdfs.semanticscholar.org/6545/a87feaec7d6d0ba462860b3d1bb721d9da39.pdf

  • Wang, L., & Guo, K. (2018). Shadow education of mathematics in China. In Y. Cao & F. Leung (Eds.), The 21st century mathematics education in China (pp. 93–106). Berlin: Springer.


  • Weiss, R., & Rein, M. (1970). The evaluation of broad-aim programs: Experimental design, its difficulties, and an alternative. Administrative Science Quarterly, 15(1), 97–109.


  • Williams, W., & Evans, J. (1969). The politics of evaluation: The case of Head Start. Annals of the American Academy of Political and Social Science, 385, 118–132.



Copyright information

© 2018 The Author(s)


Cite this chapter

Edwards, D.B. (2018). Critically Understanding Impact Evaluations: Technical, Methodological, Organizational, and Political Issues. In: Global Education Policy, Impact Evaluations, and Alternatives. Palgrave Macmillan, Cham. https://doi.org/10.1007/978-3-319-75142-9_2
