# Multiple hypothesis testing in experimental economics

## Abstract

The analysis of data from experiments in economics routinely involves testing multiple null hypotheses simultaneously. These null hypotheses arise naturally for at least three reasons: there may be multiple outcomes of interest, and it is desired to determine on which of these outcomes a treatment has an effect; the effect of a treatment may be heterogeneous in that it varies across subgroups defined by observed characteristics, and it is desired to determine for which of these subgroups a treatment has an effect; and there may be multiple treatments of interest, and it is desired to determine which treatments have an effect relative either to the control or to each of the other treatments. In this paper, we provide a bootstrap-based procedure for testing these null hypotheses simultaneously using experimental data in which simple random sampling is used to assign treatment status to units. Using the general results in Romano and Wolf (Ann Stat 38:598–633, 2010), we show under weak assumptions that our procedure (1) asymptotically controls the familywise error rate (the probability of one or more false rejections) and (2) is asymptotically balanced, in that the marginal probability of rejecting any true null hypothesis is approximately equal in large samples. Importantly, by incorporating information about dependence that is ignored in classical multiple testing procedures, such as the Bonferroni and Holm corrections, our procedure has much greater ability to detect truly false null hypotheses. In the presence of multiple treatments, we additionally show how to exploit logical restrictions across null hypotheses to further improve power. We illustrate our methodology by revisiting the study by Karlan and List (Am Econ Rev 97(5):1774–1793, 2007) of why people give to charitable causes.
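The bootstrap step-down idea summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: it uses simple per-outcome t-statistics for a two-sample comparison of means, a nonparametric bootstrap of the max-t statistic in the spirit of Romano and Wolf (2005, 2010), and it omits the balancing transformation that equalizes marginal rejection probabilities across hypotheses. The function name and the simulated data are hypothetical.

```python
import numpy as np

def stepdown_maxt(treat, control, n_boot=2000, alpha=0.05, rng=None):
    """Bootstrap step-down max-t test of H_j: no mean treatment effect on
    outcome j, for each of k outcomes, controlling the familywise error rate.

    treat, control: arrays of shape (n_t, k) and (n_c, k).
    Returns a boolean array of length k: which nulls are rejected.
    """
    rng = np.random.default_rng(rng)
    n_t, k = treat.shape
    n_c = control.shape[0]

    def abs_t(t, c, center):
        # |t-statistic| for the difference in means, recentered by `center`
        diff = t.mean(axis=0) - c.mean(axis=0) - center
        se = np.sqrt(t.var(axis=0, ddof=1) / len(t) +
                     c.var(axis=0, ddof=1) / len(c))
        return np.abs(diff / se)

    diff_obs = treat.mean(axis=0) - control.mean(axis=0)
    t_obs = abs_t(treat, control, 0.0)

    # Bootstrap the joint distribution of the (centered) t-statistics,
    # preserving the dependence across outcomes that Bonferroni/Holm ignore.
    boot = np.empty((n_boot, k))
    for b in range(n_boot):
        tb = treat[rng.integers(0, n_t, size=n_t)]
        cb = control[rng.integers(0, n_c, size=n_c)]
        boot[b] = abs_t(tb, cb, diff_obs)

    # Step-down: reject every hypothesis whose statistic exceeds the
    # 1 - alpha quantile of the max over the still-active set, then repeat
    # with the surviving hypotheses until no further rejections occur.
    active = np.ones(k, dtype=bool)
    rejected = np.zeros(k, dtype=bool)
    while active.any():
        crit = np.quantile(boot[:, active].max(axis=1), 1 - alpha)
        new = active & (t_obs > crit)
        if not new.any():
            break
        rejected |= new
        active &= ~new
    return rejected

# Hypothetical example: three outcomes, a true effect only on the first.
data_rng = np.random.default_rng(0)
control = data_rng.standard_normal((200, 3))
treat = data_rng.standard_normal((200, 3))
treat[:, 0] += 1.0
rej = stepdown_maxt(treat, control, n_boot=500, alpha=0.05, rng=1)
```

Because the critical value comes from the bootstrapped joint distribution of the test statistics, positively correlated outcomes yield a smaller critical value than a Bonferroni or Holm correction would, which is the source of the power gain the abstract describes.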

## Keywords

Experiments · Multiple hypothesis testing · Multiple treatments · Multiple outcomes · Multiple subgroups · Randomized controlled trial · Bootstrap · Balance

## JEL Classification

C12 · C14

## Notes

### Acknowledgements

We would like to thank Joseph P. Romano for helpful comments on this paper. We also thank Joseph Seidel for his excellent research assistance. The research of the second author was supported by National Science Foundation Grants DMS-1308260, SES-1227091, and SES-1530661.

## References

- Anderson, M. (2008). Multiple inference and gender differences in the effects of early intervention: A re-evaluation of the Abecedarian, Perry Preschool, and Early Training projects. *Journal of the American Statistical Association*, *103*(484), 1481–1495.
- Bettis, R. A. (2012). The search for asterisks: Compromised statistical tests and flawed theories. *Strategic Management Journal*, *33*(1), 108–113.
- Bhattacharya, J., Shaikh, A. M., & Vytlacil, E. (2012). Treatment effect bounds: An application to Swan–Ganz catheterization. *Journal of Econometrics*, *168*(2), 223–243.
- Bonferroni, C. E. (1935). *Il calcolo delle assicurazioni su gruppi di teste*. Rome: Tipografia del Senato.
- Bugni, F., Canay, I., & Shaikh, A. (2015). Inference under covariate-adaptive randomization. Technical report, cemmap working paper, Centre for Microdata Methods and Practice.
- Camerer, C. F., Dreber, A., Forsell, E., Ho, T.-H., Huber, J., Johannesson, M., et al. (2016). Evaluating replicability of laboratory experiments in economics. *Science*, *351*(6280), 1433–1436.
- Fink, G., McConnell, M., & Vollmer, S. (2014). Testing for heterogeneous treatment effects in experimental data: False discovery risks and correction procedures. *Journal of Development Effectiveness*, *6*(1), 44–57.
- Flory, J. A., Gneezy, U., Leonard, K. L., & List, J. A. (2015a). Gender, age, and competition: The disappearing gap. Unpublished manuscript.
- Flory, J. A., Leibbrandt, A., & List, J. A. (2015b). Do competitive workplaces deter female workers? A large-scale natural field experiment on job-entry decisions. *The Review of Economic Studies*, *82*(1), 122–155.
- Gneezy, U., Niederle, M., & Rustichini, A. (2003). Performance in competitive environments: Gender differences. *The Quarterly Journal of Economics*, *118*(3), 1049–1074.
- Heckman, J., Moon, S. H., Pinto, R., Savelyev, P., & Yavitz, A. (2010). Analyzing social experiments as implemented: A reexamination of the evidence from the HighScope Perry Preschool Program. *Quantitative Economics*, *1*(1), 1–46.
- Heckman, J. J., Pinto, R., Shaikh, A. M., & Yavitz, A. (2011). Inference with imperfect randomization: The case of the Perry Preschool Program. National Bureau of Economic Research Working Paper w16935.
- Holm, S. (1979). A simple sequentially rejective multiple test procedure. *Scandinavian Journal of Statistics*, *6*(2), 65–70.
- Hossain, T., & List, J. A. (2012). The behavioralist visits the factory: Increasing productivity using simple framing manipulations. *Management Science*, *58*(12), 2151–2167.
- Ioannidis, J. (2005). Why most published research findings are false. *PLoS Medicine*, *2*(8), e124.
- Jennions, M. D., & Møller, A. P. (2002). Publication bias in ecology and evolution: An empirical assessment using the 'trim and fill' method. *Biological Reviews of the Cambridge Philosophical Society*, *77*(2), 211–222.
- Karlan, D., & List, J. A. (2007). Does price matter in charitable giving? Evidence from a large-scale natural field experiment. *The American Economic Review*, *97*(5), 1774–1793.
- Kling, J., Liebman, J., & Katz, L. (2007). Experimental analysis of neighborhood effects. *Econometrica*, *75*(1), 83–119.
- Lee, S., & Shaikh, A. M. (2014). Multiple testing and heterogeneous treatment effects: Re-evaluating the effect of Progresa on school enrollment. *Journal of Applied Econometrics*, *29*(4), 612–626.
- Lehmann, E., & Romano, J. (2005). Generalizations of the familywise error rate. *The Annals of Statistics*, *33*(3), 1138–1154.
- Lehmann, E. L., & Romano, J. P. (2006). *Testing statistical hypotheses*. Berlin: Springer.
- Levitt, S. D., List, J. A., Neckermann, S., & Sadoff, S. (2012). The behavioralist goes to school: Leveraging behavioral economics to improve educational performance. National Bureau of Economic Research Working Paper w18165.
- List, J. A., & Samek, A. S. (2015). The behavioralist as nutritionist: Leveraging behavioral economics to improve child food choice and consumption. *Journal of Health Economics*, *39*, 135–146.
- Machado, C., Shaikh, A., Vytlacil, E., & Lunch, C. (2013). Instrumental variables and the sign of the average treatment effect. Unpublished manuscript, Getulio Vargas Foundation, University of Chicago, and New York University.
- Maniadis, Z., Tufano, F., & List, J. A. (2014). One swallow doesn't make a summer: New evidence on anchoring effects. *The American Economic Review*, *104*(1), 277–290.
- Niederle, M., & Vesterlund, L. (2007). Do women shy away from competition? Do men compete too much? *The Quarterly Journal of Economics*, *122*(3), 1067–1101.
- Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia II: Restructuring incentives and practices to promote truth over publishability. *Perspectives on Psychological Science*, *7*(6), 615–631.
- Romano, J. P., & Shaikh, A. M. (2006a). On stepdown control of the false discovery proportion. In *Lecture Notes–Monograph Series* (pp. 33–50).
- Romano, J. P., & Shaikh, A. M. (2006b). Stepup procedures for control of generalizations of the familywise error rate. *The Annals of Statistics*, *34*, 1850–1873.
- Romano, J. P., & Shaikh, A. M. (2012). On the uniform asymptotic validity of subsampling and the bootstrap. *The Annals of Statistics*, *40*(6), 2798–2822.
- Romano, J. P., Shaikh, A. M., & Wolf, M. (2008a). Control of the false discovery rate under dependence using the bootstrap and subsampling. *Test*, *17*(3), 417–442.
- Romano, J. P., Shaikh, A. M., & Wolf, M. (2008b). Formalized data snooping based on generalized error rates. *Econometric Theory*, *24*(2), 404–447.
- Romano, J. P., & Wolf, M. (2005). Stepwise multiple testing as formalized data snooping. *Econometrica*, *73*(4), 1237–1282.
- Romano, J. P., & Wolf, M. (2010). Balanced control of generalized error rates. *The Annals of Statistics*, *38*, 598–633.
- Sutter, M., & Glätzle-Rützler, D. (2014). Gender differences in the willingness to compete emerge early in life and persist. *Management Science*, *61*(10), 2339–2354.
- Westfall, P. H., & Young, S. S. (1993). *Resampling-based multiple testing: Examples and methods for p-value adjustment* (Vol. 279). New York: Wiley.