Multiple hypothesis testing in experimental economics

  • John A. List
  • Azeem M. Shaikh
  • Yang XuEmail author
Original Paper


The analysis of data from experiments in economics routinely involves testing multiple null hypotheses simultaneously. These different null hypotheses arise naturally in this setting for at least three different reasons: when there are multiple outcomes of interest and it is desired to determine on which of these outcomes a treatment has an effect; when the effect of a treatment may be heterogeneous in that it varies across subgroups defined by observed characteristics and it is desired to determine for which of these subgroups a treatment has an effect; and finally when there are multiple treatments of interest and it is desired to determine which treatments have an effect relative to either the control or relative to each of the other treatments. In this paper, we provide a bootstrap-based procedure for testing these null hypotheses simultaneously using experimental data in which simple random sampling is used to assign treatment status to units. Using the general results in Romano and Wolf (Ann Stat 38:598–633, 2010), we show under weak assumptions that our procedure (1) asymptotically controls the familywise error rate—the probability of one or more false rejections—and (2) is asymptotically balanced in that the marginal probability of rejecting any true null hypothesis is approximately equal in large samples. Importantly, by incorporating information about dependence ignored in classical multiple testing procedures, such as the Bonferroni and Holm corrections, our procedure has much greater ability to detect truly false null hypotheses. In the presence of multiple treatments, we additionally show how to exploit logical restrictions across null hypotheses to further improve power. We illustrate our methodology by revisiting the study by Karlan and List (Am Econ Rev 97(5):1774–1793, 2007) of why people give to charitable causes.


Experiments Multiple hypothesis testing Multiple treatments Multiple outcomes Multiple subgroups Randomized controlled trial Bootstrap Balance 

JEL Classification

C12 C14 



We would like to thank Joseph P. Romano for helpful comments on this paper. We also thank Joseph Seidel for his excellent research assistance. The research of the second author was supported by National Science Foundation Grants DMS-1308260, SES-1227091, and SES-1530661.


  1. Anderson, M. (2008). Multiple inference and gender differences in the effects of early intervention: A re-evaluation of the abecedarian, perry preschool, and early training projects. Journal of the American Statistical Association, 103(484), 1481–1495.Google Scholar
  2. Bettis, R. A. (2012). The search for asterisks: Compromised statistical tests and flawed theories. Strategic Management Journal, 33(1), 108–113.Google Scholar
  3. Bhattacharya, J., Shaikh, A. M., & Vytlacil, E. (2012). Treatment effect bounds: An application to swan-ganz catheterization. Journal of Econometrics, 168(2), 223–243.Google Scholar
  4. Bonferroni, C. E. (1935). Il calcolo delle assicurazioni su gruppi di teste. Rome: Tipografia del Senato.Google Scholar
  5. Bugni, F., Canay, I., & Shaikh, A. (2015). Inference under covariate-adaptive randomization. Technical report, cemmap working paper, Centre for Microdata Methods and Practice.Google Scholar
  6. Camerer, C. F., Dreber, A., Forsell, E., Ho, T.-H., Huber, J., Johannesson, M., et al. (2016). Evaluating replicability of laboratory experiments in economics. Science, 351(6280), 1433–1436.Google Scholar
  7. Fink, G., McConnell, M., & Vollmer, S. (2014). Testing for heterogeneous treatment effects in experimental data: False discovery risks and correction procedures. Journal of Development Effectiveness, 6(1), 44–57.Google Scholar
  8. Flory, J. A., Gneezy, U., Leonard, K. L., & List, J. A. (2015a). Gender, age, and competition: The disappearing gap. Unpublished Manuscript.Google Scholar
  9. Flory, J. A., Leibbrandt, A., & List, J. A. (2015b). Do competitive workplaces deter female workers? A large-scale natural field experiment on job-entry decisions. The Review of Economic Studies, 82(1), 122–155.Google Scholar
  10. Gneezy, U., Niederle, M., & Rustichini, A. (2003). Performance in competitive environments: Gender differences. The Quarterly Journal of Economics, 118(3), 1049–1074.Google Scholar
  11. Heckman, J., Moon, S. H., Pinto, R., Savelyev, P., & Yavitz, A. (2010). Analyzing social experiments as implemented: A reexamination of the evidence from the highscope perry preschool program. Quantitative Economics, 1(1), 1–46.Google Scholar
  12. Heckman, J. J., Pinto, R., Shaikh, A. M., & Yavitz, A. (2011). Inference with imperfect randomization: The case of the perry preschool program. National Bureau of Economic Research Working Paper w16935.Google Scholar
  13. Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.Google Scholar
  14. Hossain, T., & List, J. A. (2012). The behavioralist visits the factory: Increasing productivity using simple framing manipulations. Management Science, 58(12), 2151–2167.Google Scholar
  15. Ioannidis, J. (2005). Why most published research findings are false. PLoS Med, 2(8), e124.Google Scholar
  16. Jennions, M. D., & Moller, A. P. (2002). Publication bias in ecology and evolution: An empirical assessment using the ‘trim and fill’ method. Biological Reviews of the Cambridge Philosophical Society, 77(02), 211–222.Google Scholar
  17. Karlan, D., & List, J. A. (2007). Does price matter in charitable giving? Evidence from a large-scale natural field experiment. The American Economic Review, 97(5), 1774–1793.Google Scholar
  18. Kling, J., Liebman, J., & Katz, L. (2007). Experimental analysis of neighborhood effects. Econometrica, 75(1), 83–119.Google Scholar
  19. Lee, S., & Shaikh, A. M. (2014). Multiple testing and heterogeneous treatment effects: Re-evaluating the effect of progresa on school enrollment. Journal of Applied Econometrics, 29(4), 612–626.Google Scholar
  20. Lehmann, E., & Romano, J. (2005). Generalizations of the familywise error rate. The Annals of Statistics, 33(3), 1138–1154.Google Scholar
  21. Lehmann, E. L., & Romano, J. P. (2006). Testing statistical hypotheses. Berlin: Springer.Google Scholar
  22. Levitt, S. D., List, J. A., Neckermann, S., & Sadoff, S. (2012). The behavioralist goes to school: Leveraging behavioral economics to improve educational performance. National Bureau of Economic Research w18165.Google Scholar
  23. List, J. A., & Samek, A. S. (2015). The behavioralist as nutritionist: Leveraging behavioral economics to improve child food choice and consumption. Journal of Health Economics, 39, 135–146.Google Scholar
  24. Machado, C., Shaikh, A., Vytlacil, E., & Lunch, C. (2013). Instrumental variables, and the sign of the average treatment effect. Unpublished Manuscript, Getulio Vargas Foundation, University of Chicago, and New York University. [2049].Google Scholar
  25. Maniadis, Z., Tufano, F., & List, J. A. (2014). One swallow doesn’t make a summer: New evidence on anchoring effects. The American Economic Review, 104(1), 277–290.Google Scholar
  26. Niederle, M., & Vesterlund, L. (2007). Do women shy away from competition? Do men compete too much? The Quarterly Journal of Economics, 122(3), 1067–1101.Google Scholar
  27. Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia ii. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7(6), 615–631.Google Scholar
  28. Romano, J. P., & Shaikh, A. M. (2006a). On stepdown control of the false discovery proportion. In Lecture Notes-Monograph Series (pp. 33–50).Google Scholar
  29. Romano, J. P., & Shaikh, A. M. (2006b). Stepup procedures for control of generalizations of the familywise error rate. The Annals of Statistics, 34, 1850–1873.Google Scholar
  30. Romano, J. P., & Shaikh, A. M. (2012). On the uniform asymptotic validity of subsampling and the bootstrap. The Annals of Statistics, 40(6), 2798–2822.Google Scholar
  31. Romano, J. P., Shaikh, A. M., & Wolf, M. (2008a). Control of the false discovery rate under dependence using the bootstrap and subsampling. Test, 17(3), 417–442.Google Scholar
  32. Romano, J. P., Shaikh, A. M., & Wolf, M. (2008b). Formalized data snooping based on generalized error rates. Econometric Theory, 24(02), 404–447.Google Scholar
  33. Romano, J. P., & Wolf, M. (2005). Stepwise multiple testing as formalized data snooping. Econometrica, 73(4), 1237–1282.Google Scholar
  34. Romano, J. P., & Wolf, M. (2010). Balanced control of generalized error rates. The Annals of Statistics, 38, 598–633.Google Scholar
  35. Sutter, M., & Glätzle-Rützler, D. (2014). Gender differences in the willingness to compete emerge early in life and persist. Management Science, 61(10), 2339–23354.Google Scholar
  36. Westfall, P. H., & Young, S. S. (1993). Resampling-based multiple testing: Examples and methods for p value adjustment (Vol. 279). New York: Wiley.Google Scholar

Copyright information

© Economic Science Association 2019

Authors and Affiliations

  1. 1.Department of EconomicsUniversity of ChicagoChicagoUSA

Personalised recommendations