Multiple hypothesis testing in experimental economics


The analysis of data from experiments in economics routinely involves testing multiple null hypotheses simultaneously. These different null hypotheses arise naturally in this setting for at least three different reasons: when there are multiple outcomes of interest and it is desired to determine on which of these outcomes a treatment has an effect; when the effect of a treatment may be heterogeneous in that it varies across subgroups defined by observed characteristics and it is desired to determine for which of these subgroups a treatment has an effect; and finally when there are multiple treatments of interest and it is desired to determine which treatments have an effect relative to either the control or relative to each of the other treatments. In this paper, we provide a bootstrap-based procedure for testing these null hypotheses simultaneously using experimental data in which simple random sampling is used to assign treatment status to units. Using the general results in Romano and Wolf (Ann Stat 38:598–633, 2010), we show under weak assumptions that our procedure (1) asymptotically controls the familywise error rate—the probability of one or more false rejections—and (2) is asymptotically balanced in that the marginal probability of rejecting any true null hypothesis is approximately equal in large samples. Importantly, by incorporating information about dependence ignored in classical multiple testing procedures, such as the Bonferroni and Holm corrections, our procedure has much greater ability to detect truly false null hypotheses. In the presence of multiple treatments, we additionally show how to exploit logical restrictions across null hypotheses to further improve power. We illustrate our methodology by revisiting the study by Karlan and List (Am Econ Rev 97(5):1774–1793, 2007) of why people give to charitable causes.
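As an informal illustration of why accounting for dependence can help, the following minimal sketch compares Holm-adjusted p-values with adjusted p-values from a single-step bootstrap max-|t| procedure for several outcomes in a two-arm experiment. It is a simplified stand-in and not the balanced stepdown procedure developed in the paper; the function names (`holm_adjust`, `t_stats`, `maxt_adjust`), the normal approximation used for the unadjusted p-values, and the simulated data are all assumptions made for the example.

```python
import math
import numpy as np

def holm_adjust(pvals):
    """Holm step-down adjusted p-values (uses no dependence information)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # multiply the j-th smallest p-value by (m - j), then enforce monotonicity
    adj_sorted = np.maximum.accumulate((m - np.arange(m)) * p[order])
    adj = np.empty(m)
    adj[order] = np.minimum(adj_sorted, 1.0)
    return adj

def t_stats(y, d):
    """Difference in means and its standard error for each outcome column."""
    y1, y0 = y[d == 1], y[d == 0]
    diff = y1.mean(axis=0) - y0.mean(axis=0)
    se = np.sqrt(y1.var(axis=0, ddof=1) / len(y1) + y0.var(axis=0, ddof=1) / len(y0))
    return diff, se

def maxt_adjust(y, d, n_boot=2000, seed=0):
    """Single-step bootstrap max-|t| adjusted p-values (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = len(d)
    diff, se = t_stats(y, d)
    t_obs = np.abs(diff / se)
    max_boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample units with replacement
        diff_b, se_b = t_stats(y[idx], d[idx])
        # recentre at the observed difference so the draws mimic the null
        max_boot[b] = np.max(np.abs((diff_b - diff) / se_b))
    # adjusted p-value: share of bootstrap maxima at least as large as |t|
    return np.array([(max_boot >= t).mean() for t in t_obs])

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d = np.repeat([0, 1], 200)
    # three outcomes sharing a common shock, so the test statistics are correlated
    y = rng.normal(size=(400, 1)) + rng.normal(size=(400, 3))
    y[:, 0] += 0.5 * d  # only the first outcome has a treatment effect
    diff, se = t_stats(y, d)
    # two-sided normal-approximation p-values for the unadjusted tests
    raw = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in diff / se])
    print("Holm :", holm_adjust(raw))
    print("max-t:", maxt_adjust(y, d))
```

Because the bootstrap maximum is computed jointly across outcomes, positively correlated statistics tend to yield a less conservative adjustment than corrections that only assume a worst-case dependence structure.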



  1. Anderson, M. (2008). Multiple inference and gender differences in the effects of early intervention: A re-evaluation of the Abecedarian, Perry Preschool, and Early Training Projects. Journal of the American Statistical Association, 103(484), 1481–1495.

  2. Bettis, R. A. (2012). The search for asterisks: Compromised statistical tests and flawed theories. Strategic Management Journal, 33(1), 108–113.

  3. Bhattacharya, J., Shaikh, A. M., & Vytlacil, E. (2012). Treatment effect bounds: An application to Swan-Ganz catheterization. Journal of Econometrics, 168(2), 223–243.

  4. Bonferroni, C. E. (1935). Il calcolo delle assicurazioni su gruppi di teste. Rome: Tipografia del Senato.

  5. Bugni, F., Canay, I., & Shaikh, A. (2015). Inference under covariate-adaptive randomization. Technical report, cemmap working paper, Centre for Microdata Methods and Practice.

  6. Camerer, C. F., Dreber, A., Forsell, E., Ho, T.-H., Huber, J., Johannesson, M., et al. (2016). Evaluating replicability of laboratory experiments in economics. Science, 351(6280), 1433–1436.

  7. Fink, G., McConnell, M., & Vollmer, S. (2014). Testing for heterogeneous treatment effects in experimental data: False discovery risks and correction procedures. Journal of Development Effectiveness, 6(1), 44–57.

  8. Flory, J. A., Gneezy, U., Leonard, K. L., & List, J. A. (2015a). Gender, age, and competition: The disappearing gap. Unpublished Manuscript.

  9. Flory, J. A., Leibbrandt, A., & List, J. A. (2015b). Do competitive workplaces deter female workers? A large-scale natural field experiment on job-entry decisions. The Review of Economic Studies, 82(1), 122–155.

  10. Gneezy, U., Niederle, M., & Rustichini, A. (2003). Performance in competitive environments: Gender differences. The Quarterly Journal of Economics, 118(3), 1049–1074.

  11. Heckman, J., Moon, S. H., Pinto, R., Savelyev, P., & Yavitz, A. (2010). Analyzing social experiments as implemented: A reexamination of the evidence from the HighScope Perry Preschool Program. Quantitative Economics, 1(1), 1–46.

  12. Heckman, J. J., Pinto, R., Shaikh, A. M., & Yavitz, A. (2011). Inference with imperfect randomization: The case of the Perry Preschool Program. National Bureau of Economic Research Working Paper w16935.

  13. Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.

  14. Hossain, T., & List, J. A. (2012). The behavioralist visits the factory: Increasing productivity using simple framing manipulations. Management Science, 58(12), 2151–2167.

  15. Ioannidis, J. (2005). Why most published research findings are false. PLoS Med, 2(8), e124.

  16. Jennions, M. D., & Moller, A. P. (2002). Publication bias in ecology and evolution: An empirical assessment using the ‘trim and fill’ method. Biological Reviews of the Cambridge Philosophical Society, 77(2), 211–222.

  17. Karlan, D., & List, J. A. (2007). Does price matter in charitable giving? Evidence from a large-scale natural field experiment. The American Economic Review, 97(5), 1774–1793.

  18. Kling, J., Liebman, J., & Katz, L. (2007). Experimental analysis of neighborhood effects. Econometrica, 75(1), 83–119.

  19. Lee, S., & Shaikh, A. M. (2014). Multiple testing and heterogeneous treatment effects: Re-evaluating the effect of Progresa on school enrollment. Journal of Applied Econometrics, 29(4), 612–626.

  20. Lehmann, E., & Romano, J. (2005). Generalizations of the familywise error rate. The Annals of Statistics, 33(3), 1138–1154.

  21. Lehmann, E. L., & Romano, J. P. (2006). Testing statistical hypotheses. Berlin: Springer.

  22. Levitt, S. D., List, J. A., Neckermann, S., & Sadoff, S. (2012). The behavioralist goes to school: Leveraging behavioral economics to improve educational performance. National Bureau of Economic Research w18165.

  23. List, J. A., & Samek, A. S. (2015). The behavioralist as nutritionist: Leveraging behavioral economics to improve child food choice and consumption. Journal of Health Economics, 39, 135–146.

  24. Machado, C., Shaikh, A., Vytlacil, E., & Lunch, C. (2013). Instrumental variables and the sign of the average treatment effect. Unpublished Manuscript, Getulio Vargas Foundation, University of Chicago, and New York University.

  25. Maniadis, Z., Tufano, F., & List, J. A. (2014). One swallow doesn’t make a summer: New evidence on anchoring effects. The American Economic Review, 104(1), 277–290.

  26. Niederle, M., & Vesterlund, L. (2007). Do women shy away from competition? Do men compete too much? The Quarterly Journal of Economics, 122(3), 1067–1101.

  27. Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7(6), 615–631.

  28. Romano, J. P., & Shaikh, A. M. (2006a). On stepdown control of the false discovery proportion. In Lecture Notes-Monograph Series (pp. 33–50).

  29. Romano, J. P., & Shaikh, A. M. (2006b). Stepup procedures for control of generalizations of the familywise error rate. The Annals of Statistics, 34, 1850–1873.

  30. Romano, J. P., & Shaikh, A. M. (2012). On the uniform asymptotic validity of subsampling and the bootstrap. The Annals of Statistics, 40(6), 2798–2822.

  31. Romano, J. P., Shaikh, A. M., & Wolf, M. (2008a). Control of the false discovery rate under dependence using the bootstrap and subsampling. Test, 17(3), 417–442.

  32. Romano, J. P., Shaikh, A. M., & Wolf, M. (2008b). Formalized data snooping based on generalized error rates. Econometric Theory, 24(2), 404–447.

  33. Romano, J. P., & Wolf, M. (2005). Stepwise multiple testing as formalized data snooping. Econometrica, 73(4), 1237–1282.

  34. Romano, J. P., & Wolf, M. (2010). Balanced control of generalized error rates. The Annals of Statistics, 38, 598–633.

  35. Sutter, M., & Glätzle-Rützler, D. (2014). Gender differences in the willingness to compete emerge early in life and persist. Management Science, 61(10), 2339–2354.

  36. Westfall, P. H., & Young, S. S. (1993). Resampling-based multiple testing: Examples and methods for p value adjustment (Vol. 279). New York: Wiley.



We would like to thank Joseph P. Romano for helpful comments on this paper. We also thank Joseph Seidel for his excellent research assistance. The research of the second author was supported by National Science Foundation Grants DMS-1308260, SES-1227091, and SES-1530661.

Author information



Corresponding author

Correspondence to Yang Xu.

Additional information

Documentation of our procedures and our Stata and Matlab code can be found at



Proof of Theorem 3.1

First note that under Assumption 2.1, \(Q\in \omega _{s}\) if and only if \(P\in {\tilde{\omega }}_{s}\), where

$${\tilde{\omega }}_{s}=\{P(Q):Q\in \varOmega ,E_{P}[Y_{i,k}|D_{i}=d,Z_{i}=z]=E_{P}[Y_{i,k}|D_{i}=d',Z_{i}=z]\}.$$

The proof of this result now follows by verifying the conditions of Corollary 5.1 in Romano and Wolf (2010). In particular, we verify Assumptions B.1–B.4 in Romano and Wolf (2010).

In order to verify Assumption B.1 in Romano and Wolf (2010), let

$$T_{s,n}^{*}(P)=\sqrt{n}\left( \frac{1}{n_{d,z}}\sum _{1\le i\le n:D_{i}=d,Z_{i}=z}(Y_{i,k}-{\tilde{\mu }}_{k|d,z}(P))-\frac{1}{n_{d',z}}\sum _{1\le i\le n:D_{i}=d',Z_{i}=z}(Y_{i,k}-{\tilde{\mu }}_{k|d',z}(P))\right),$$

and note that

$$T_{n}^{*}(P)=(T_{s,n}^{*}(P):s\in {\mathcal {S}})=f(A_{n}(P),B_{n}),$$

where

$$A_{n}(P)=\frac{1}{\sqrt{n}}\sum _{1\le i\le n}A_{n,i}(P),$$

with \(A_{n,i}(P)\) equal to the \(2|{\mathcal {S}}|\)-dimensional vector formed by stacking vertically for \(s\in {\mathcal {S}}\) the terms

$$\left( \begin{array}{c} (Y_{i,k}-{\tilde{\mu }}_{k|d,z}(P))I\{D_{i}=d,Z_{i}=z\}\\ (Y_{i,k}-{\tilde{\mu }}_{k|d',z}(P))I\{D_{i}=d',Z_{i}=z\} \end{array}\right), \tag{10}$$

and \(B_{n}\) is the \(2|{\mathcal {S}}|\)-dimensional vector formed by stacking vertically for \(s\in {\mathcal {S}}\) the terms

$$\left( \begin{array}{c} \frac{1}{\frac{1}{n}\sum _{1\le i\le n}I\{D_{i}=d,Z_{i}=z\}}\\ -\frac{1}{\frac{1}{n}\sum _{1\le i\le n}I\{D_{i}=d',Z_{i}=z\}} \end{array}\right), \tag{11}$$

and \(f:{\mathbf {R}}^{2|{\mathcal {S}}|}\times {\mathbf {R}}^{2|{\mathcal {S}}|}\rightarrow {\mathbf {R}}^{|{\mathcal {S}}|}\) is the function of \(A_{n}(P)\) and \(B_{n}\) whose sth component for \(s\in {\mathcal {S}}\) is given by the inner product of the sth pair of terms in \(A_{n}(P)\) and the sth pair of terms in \(B_{n}\), i.e., the inner product of (10) and (11). The weak law of large numbers and continuous mapping theorem imply that

$$B_{n}{\mathop {\rightarrow }\limits ^{P}}B(P),$$

where B(P) is the \(2|{\mathcal {S}}|\)-dimensional vector formed by stacking vertically for \(s\in {\mathcal {S}}\) the terms

$$\left( \begin{array}{c} \frac{1}{P\{D_{i}=d,Z_{i}=z\}}\\ -\frac{1}{P\{D_{i}=d',Z_{i}=z\}} \end{array}\right).$$
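To make the representation \(T_{n}^{*}(P)=f(A_{n}(P),B_{n})\) concrete, it may help to write out a single component. Using the shorthand \({\hat{p}}_{d,z}=\frac{1}{n}\sum _{1\le i\le n}I\{D_{i}=d,Z_{i}=z\}=n_{d,z}/n\), which is notation introduced here only for illustration, multiplying and dividing each sample average by n gives

$$T_{s,n}^{*}(P)=\frac{1}{{\hat{p}}_{d,z}}\cdot \frac{1}{\sqrt{n}}\sum _{1\le i\le n}(Y_{i,k}-{\tilde{\mu }}_{k|d,z}(P))I\{D_{i}=d,Z_{i}=z\}-\frac{1}{{\hat{p}}_{d',z}}\cdot \frac{1}{\sqrt{n}}\sum _{1\le i\le n}(Y_{i,k}-{\tilde{\mu }}_{k|d',z}(P))I\{D_{i}=d',Z_{i}=z\}.$$

The factors \(1/{\hat{p}}_{d,z}\) and \(-1/{\hat{p}}_{d',z}\) are the sth pair of terms in \(B_{n}\), and the two normalized sums are the sth pair of terms in \(A_{n}(P)\), so the limiting behavior of \(T_{n}^{*}(P)\) is driven by that of \(A_{n}(P)\) once \(B_{n}\) is replaced by its probability limit.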

Next, note that \(E_{P}[A_{n,i}(P)]=0\). Assumption 2.3 and the central limit theorem therefore imply that

$$A_{n}(P){\mathop {\rightarrow }\limits ^{d}}N(0,V_{A}(P))$$

for an appropriate choice of \(V_{A}(P)\). In particular, the diagonal elements of \(V_{A}(P)\) are of the form

$${\tilde{\sigma }}_{k|d,z}^{2}(P)P\{D_{i}=d,Z_{i}=z\}.$$

The continuous mapping theorem thus implies that

$$T_{n}^{*}(P){\mathop {\rightarrow }\limits ^{d}}N(0,V(P))$$

for an appropriate variance matrix V(P). In particular, the sth diagonal element of V(P) is given by

$$\frac{{\tilde{\sigma }}_{k|d,z}^{2}(P)}{P\{D_{i}=d,Z_{i}=z\}}+\frac{{\tilde{\sigma }}_{k|d',z}^{2}(P)}{P\{D_{i}=d',Z_{i}=z\}}. \tag{12}$$
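As a brief check on this expression, write \(p_{d,z}=P\{D_{i}=d,Z_{i}=z\}\), a shorthand introduced here. For \(d\ne d'\) the two stacked terms of \(A_{n,i}(P)\) are uncorrelated, since their indicators cannot both equal one and each term has mean zero. The variance of the limiting inner product is therefore the sum of each diagonal element of \(V_{A}(P)\) multiplied by the square of the corresponding element of B(P):

$$\frac{{\tilde{\sigma }}_{k|d,z}^{2}(P)\,p_{d,z}}{p_{d,z}^{2}}+\frac{{\tilde{\sigma }}_{k|d',z}^{2}(P)\,p_{d',z}}{p_{d',z}^{2}}=\frac{{\tilde{\sigma }}_{k|d,z}^{2}(P)}{p_{d,z}}+\frac{{\tilde{\sigma }}_{k|d',z}^{2}(P)}{p_{d',z}}.$$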

In order to verify Assumptions B.2–B.3 in Romano and Wolf (2010), it suffices to note that (12) is strictly greater than zero under our assumptions. Note that it is not required that V(P) be non-singular for these assumptions to be satisfied.
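The form of the diagonal elements of V(P) can also be checked numerically. The following is a minimal Monte Carlo sketch for a single outcome, a single subgroup, and two treatment arms; the function name `simulate_t_star`, the normal outcome distributions, and the parameter values are assumptions chosen for illustration and do not come from the paper.

```python
import numpy as np

def simulate_t_star(n, p_treat, sd_treat, sd_control, n_rep=4000, seed=0):
    """Draws of sqrt(n) * (treated mean - control mean) when both true means are zero."""
    rng = np.random.default_rng(seed)
    draws = np.empty(n_rep)
    for r in range(n_rep):
        # treatment status assigned independently with probability p_treat
        d = rng.random(n) < p_treat
        # outcomes have mean zero in both arms, so no recentring is needed
        y = np.where(d, rng.normal(0.0, sd_treat, n), rng.normal(0.0, sd_control, n))
        draws[r] = np.sqrt(n) * (y[d].mean() - y[~d].mean())
    return draws

if __name__ == "__main__":
    n, p_treat, sd_treat, sd_control = 500, 0.3, 2.0, 1.0
    draws = simulate_t_star(n, p_treat, sd_treat, sd_control)
    # asymptotic variance: sigma_d^2 / P{D=d} + sigma_d'^2 / P{D=d'}
    v_formula = sd_treat**2 / p_treat + sd_control**2 / (1 - p_treat)
    print("simulated variance:", draws.var())
    print("formula           :", v_formula)
```

With a moderately large n the two printed values should be close, though not identical, because the formula is an asymptotic approximation and the simulated variance carries Monte Carlo error.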

In order to verify Assumption B.4 in Romano and Wolf (2010), we first argue that

$$T_{n}^{*}(P_{n}){\mathop {\rightarrow }\limits ^{d}}N(0,V(P)) \tag{13}$$

under \(P_{n}\) for an appropriate sequence of distributions \(P_{n}\) for \((Y_{i},D_{i},Z_{i})\). To this end, assume that

  (a) \(P_{n}{\mathop {\rightarrow }\limits ^{d}}P\).

  (b) \({\tilde{\mu }}_{k|d,z}(P_{n})\rightarrow {\tilde{\mu }}_{k|d,z}(P)\).

  (c) \(B_{n}{\mathop {\rightarrow }\limits ^{P_{n}}}B(P)\).

  (d) \(\text {Var}_{P_{n}}[A_{n,i}(P_{n})]\rightarrow \text {Var}_{P}[A_{n,i}(P)]\).

Under (a) and (b), it follows that \(A_{n,i}(P_{n}){\mathop {\rightarrow }\limits ^{d}}A_{n,i}(P)\) under \(P_{n}\). By arguing as in Theorem 15.4.3 in Lehmann and Romano (2006) and using (d), it follows from the Lindeberg–Feller central limit theorem that

$$A_{n}(P_{n}){\mathop {\rightarrow }\limits ^{d}}N(0,V_{A}(P))$$

under \(P_{n}\). It thus follows from (c) and the continuous mapping theorem that (13) holds under \(P_{n}\). Assumption B.4 in Romano and Wolf (2010) now follows simply by noting that the Glivenko–Cantelli theorem, the strong law of large numbers, and the continuous mapping theorem ensure that \({\hat{P}}_{n}\) satisfies (a)–(d) with probability one under P.

Table 1 Multiple outcomes
Table 2 Multiple subgroups
Table 3 Multiple treatments (Comparing multiple treatments with a control)
Table 4 Multiple treatments (All pairwise comparisons across multiple treatments and a control)
Table 5 Multiple outcomes, subgroups, and treatments

Cite this article

List, J.A., Shaikh, A.M. & Xu, Y. Multiple hypothesis testing in experimental economics. Exp Econ 22, 773–793 (2019).


Keywords

  • Experiments
  • Multiple hypothesis testing
  • Multiple treatments
  • Multiple outcomes
  • Multiple subgroups
  • Randomized controlled trial
  • Bootstrap
  • Balance

JEL Classification

  • C12
  • C14