A simulation study investigating power estimates in phenome-wide association studies
Phenome-wide association studies (PheWAS) are a high-throughput approach to evaluate comprehensive associations between genetic variants and a wide range of phenotypic measures. PheWAS has varying sample sizes for quantitative traits, and variable numbers of cases and controls for binary traits across the many phenotypes of interest, which can affect the statistical power to detect associations. The motivation of this study is to investigate the various parameters which affect the estimation of statistical power in PheWAS, including sample size, case-control ratio, minor allele frequency, and disease penetrance.
We performed a PheWAS simulation study, where we investigated variations in statistical power based on different parameters, such as overall sample size, number of cases, case-control ratio, minor allele frequency, and disease penetrance. The simulation was performed on both binary and quantitative phenotypic measures. Our simulation on binary traits suggests that the number of cases has more impact on statistical power than the case to control ratio; also, we found that a sample size of 200 cases or more maintains the statistical power to identify associations for common variants. For quantitative traits, a sample size of 1000 or more individuals performed best in the power calculations. We focused on common genetic variants (MAF > 0.01) in this study; however, in future studies, we will be extending this effort to perform similar simulations on rare variants.
This study provides a series of PheWAS simulation analyses that can be used to estimate statistical power for some potential scenarios. These results can be used to provide guidelines for appropriate study design for future PheWAS analyses.
Keywords: PheWAS, EHR, ICD-9 codes, Power analysis, Simulation study
AIDS clinical trial group
Electronic health record
Geisinger health system
International classification of diseases, ninth revision
Minor allele frequency
Phenome-wide association studies
Phenome-wide association studies (PheWAS) have been implemented in a variety of settings, for example within the eMERGE network [1, 2, 3, 4, 5], using electronic health record (EHR) information that includes International Classification of Diseases, Ninth Revision (ICD-9) code-based diagnoses, laboratory test measurements, and demographic information [6, 7, 8, 9, 10, 11, 12]. Other PheWAS have used data from epidemiological studies [13, 14], as well as clinical trials [8, 9] such as the AIDS clinical trial group (ACTG), which consists of measurements for different clinical domains like pharmacology, metabolism, virology, and immunology [15, 16]. Cohorts like these, with a large number of measurements for every individual, have made PheWAS a practical approach for scanning hundreds to thousands of phenotypes in a high-throughput way. PheWAS generates genetic association hypotheses for further study and provides insights through cross-phenotype associations.
Unlike genome-wide association studies (GWAS), where one phenotype is investigated in a study population, PheWAS uses a wide range of phenotypes collected for a variety of reasons for each dataset, often with minimal curation. Thus, in PheWAS, the data collected for different measurements can vary considerably in sample size, including the numbers of cases for diagnoses, depending on the rarity of the diagnosis. This makes the estimation of statistical power for PheWAS a challenge. For example, in EHR data, ICD-9 codes are one of the most commonly used data types for defining case-control status; these codes provide information on disease diagnosis, procedures, and medications in the form of three- to five-digit codes. The longitudinal ICD-9 data collected over many years vary between patients due to multiple factors, such as differences in the frequency of patient visits, differences in length of records due to different start and end dates, and incomplete patient medical history. These factors generate sparseness and missing information in the data and, hence, variability in the number of cases, the case-control ratio, and the overall sample size in case-control study designs. These factors can then affect downstream association testing. Three issues exist for measures with low sample sizes: 1) low statistical power to identify or replicate genetic associations, 2) potentially biased estimates in analyses with low sample size, and 3) an increased multiple hypothesis testing burden from including low-powered phenotypes that may not provide insights but do increase the number of statistical tests.
Distribution of samples in published PheWAS. This table highlights the range of cases and controls for PheWAS on binary phenotypes, and the range of sample sizes for quantitative phenotypes, used in a few PheWAS analyses in the literature
Min/Max Case Count Range
Min/Max Control Count Range
Binary Phenotypes (EHR-based ICD-9 codes)
Denny et al. 
Hebbring et al. 
Namjou et al. 
Simonti et al. 
Shameer et al. 
Karnes et al. 
Sample Size Range
Karaca et al. 
Hall et al. 
Moore and Verma et al. 
Pendergrass et al. 
For binary phenotypes, we generated the simulated datasets by varying the following parameter settings: number of cases, case-control ratio, SNP minor allele frequency, and disease penetrance. In this simulation study, we only investigated study populations with unbalanced and unmatched cases and controls. For example, we simulated a dataset with a random set of 30 individuals. The parameter settings for this simulated dataset were as follows: case-control ratio = 1:2, cases = 10, controls = 20, disease penetrance = 0.15, and SNP MAF = 0.01. The simulated dataset was generated for four SNPs and 10 phenotypes, including one SNP-phenotype model with signal; all other models were simulated as noise. The noise was added to evaluate any systematic bias in the power estimates. The noise SNPs were generated by randomly assigning the genotypes in the study population while keeping the MAF the same as that of the signal SNP. We randomly assigned the cases for the noise phenotypes. Under each parameter setting, we generated 1000 datasets and then calculated associations using logistic regression. Please refer to Fig. 1 for all the different combinations of parameter values used for simulation.
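As a rough illustration of the dataset layout described above, the following Python sketch derives the control count from the case count and case-control ratio, draws noise genotypes at a fixed MAF, and randomly assigns case status for the noise phenotypes. All function names are hypothetical, and drawing genotypes under Hardy-Weinberg proportions is an assumption rather than the procedure used in the study. The signal SNP-phenotype model is left out because the text does not specify how penetrance was mapped onto genotypes.

```python
import random


def controls_from_ratio(n_cases, cases_part, controls_part):
    """Number of controls implied by a cases:controls ratio (e.g. 1:2)."""
    return n_cases * controls_part // cases_part


def noise_genotypes(n_individuals, maf, rng):
    """Draw 0/1/2 minor-allele counts for each individual.

    Assumption: each of the two alleles is minor with probability `maf`
    (Hardy-Weinberg proportions), so the realized MAF matches the target
    only in expectation.
    """
    return [int(rng.random() < maf) + int(rng.random() < maf)
            for _ in range(n_individuals)]


def noise_phenotype(n_individuals, n_cases, rng):
    """Randomly assign exactly n_cases cases (1); everyone else is a control (0)."""
    status = [1] * n_cases + [0] * (n_individuals - n_cases)
    rng.shuffle(status)
    return status


rng = random.Random(2018)  # arbitrary seed, for reproducibility only

# Example setting from the text: ratio 1:2, 10 cases, MAF 0.01.
n_cases = 10
n_controls = controls_from_ratio(n_cases, 1, 2)
n_individuals = n_cases + n_controls
maf = 0.01

# Four SNPs and 10 phenotypes with one signal model leave
# three noise SNPs and nine noise phenotypes.
noise_snps = [noise_genotypes(n_individuals, maf, rng) for _ in range(3)]
noise_phenotypes = [noise_phenotype(n_individuals, n_cases, rng) for _ in range(9)]

# Every SNP is tested against every phenotype.
n_tests = 4 * 10
```

With these settings the sketch yields 20 controls, 30 individuals in total, and 40 SNP-phenotype tests per simulated dataset.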
For the continuous or quantitative trait simulations, we investigated the power estimates similarly by varying the sample size, minor allele frequencies, and disease penetrance. The simulated dataset was generated for four SNPs and one phenotype, with one signal SNP-phenotype model; the rest was noise data. We generated 3 noise SNPs as in the binary phenotype simulations. Again, we generated 1000 datasets for each parameter setting and then used linear regression to calculate associations with the quantitative trait. Please refer to Fig. 1 for all the different combinations of parameter settings used for the quantitative trait simulations. All the association testing for binary and quantitative phenotypes was performed using PLATO.
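The general shape of such a quantitative trait power simulation can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the PLATO workflow used in the study: the additive `effect_size` parameter is a hypothetical stand-in for the signal strength, the trait noise is standard normal, genotypes follow Hardy-Weinberg proportions, and the slope test uses a normal approximation to the usual t-test (reasonable only for large samples). All function names are illustrative.

```python
import random
from statistics import NormalDist


def simulate_genotypes(n_individuals, maf, rng):
    """Additive genotype codes (0, 1, 2 minor alleles), assuming Hardy-Weinberg proportions."""
    return [int(rng.random() < maf) + int(rng.random() < maf)
            for _ in range(n_individuals)]


def slope_p_value(x, y):
    """Two-sided p-value for the slope of y ~ x (simple linear regression).

    Uses a normal approximation to the t distribution.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:  # monomorphic SNP: no test is possible
        return 1.0
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = (rss / (n - 2) / sxx) ** 0.5
    if std_err == 0:  # perfect fit
        return 0.0
    z = slope / std_err
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def estimate_power(n_individuals, maf, effect_size, alpha, n_datasets, seed):
    """Fraction of simulated datasets in which the signal SNP reaches p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_datasets):
        genotypes = simulate_genotypes(n_individuals, maf, rng)
        trait = [effect_size * g + rng.gauss(0.0, 1.0) for g in genotypes]
        if slope_p_value(genotypes, trait) < alpha:
            hits += 1
    return hits / n_datasets


# Hypothetical setting: 1000 individuals, MAF 0.3, threshold 0.01 split over four tests.
power = estimate_power(n_individuals=1000, maf=0.3, effect_size=0.5,
                       alpha=0.01 / 4, n_datasets=200, seed=1)
```

Setting `effect_size=0.0` turns the same loop into an estimate of the false positive rate for a noise SNP, which is one way to check for systematic bias.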
We calculated the power estimates by counting the number of associations below an alpha value based on the total number of tests within each set of 1000 simulated datasets for all parameter settings. For the binary traits, we used α = 0.00025 (0.01/40), and for the quantitative trait, we used α = 0.0025 (0.01/4).
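The thresholding step can be written out in a few lines of Python. This is a sketch of one reading of the procedure, not the authors' code: the function names are hypothetical and the p-values are made up purely for illustration. The alpha value is a Bonferroni-style threshold, 0.01 divided by the number of tests, and the estimate is the fraction of simulated datasets whose p-value falls below it.

```python
def bonferroni_alpha(family_alpha, n_tests):
    """Per-test threshold: overall alpha divided by the number of tests."""
    return family_alpha / n_tests


def fraction_significant(p_values, alpha):
    """Share of simulated datasets whose p-value is below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


# Binary trait setting from the text: 4 SNPs x 10 phenotypes = 40 tests.
alpha_binary = bonferroni_alpha(0.01, 4 * 10)

# Made-up p-values for the signal model in four simulated datasets.
signal_p_values = [0.00001, 0.2, 0.0002, 0.03]
power = fraction_significant(signal_p_values, alpha_binary)
```

Applied to the p-values of the signal model, the fraction is the power estimate; applied to the p-values of a noise model, it is an empirical type 1 error rate.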
Binary trait simulations
We designed a simulation approach with different combinations of genotype and phenotype parameters and then performed association testing so as to investigate the factors that could influence the power to detect the signal.
Quantitative trait simulations
Using the findings from these simulations, we addressed three issues related to low sample size and its impact on the PheWAS approach. First, the impact of low sample size is evident in the quantitative trait simulations, which suggest that a sample size of 1000 or more individuals for each phenotype is important to consider in the study design. In the binary trait simulations, however, we observed that overall sample size does not affect the power; instead, the number of cases drives the power estimates. Second, the low type 1 error across all parameter settings (Figs. 3 and 5, Additional file 1: Figure S1) shows no systematic bias in the regression method. However, analyses with low sample size or low case numbers will not have enough statistical power to detect the associations. Lastly, we demonstrate that using the above-suggested thresholds of case numbers for binary traits and sample size for quantitative traits can help with the selection of phenotypes and reduce the number of tests; hence, this can reduce the multiple hypothesis testing burden.
Using the simulation approach, we were able to identify the parameters impacting the power to detect genetic associations, and we provided recommendations for PheWAS analysis design. However, other factors can also influence the power of a PheWAS analysis. We ran all the simulations based on a regression model (linear or logistic regression), but there are now many other statistical methods for phenome-wide association analysis. Further extensions of these simulation studies to explore other statistical methods will be important. We limited our investigation in this study to an additive effect of genotypes. However, it will be important to investigate other genetic effects as well, such as dominant, recessive, weighted, and interaction effects. Further factors that can influence the power estimates include environmental exposure and confounding covariates (age, sex, and ancestry). In the future, we plan to include such factors in our simulation design.
PheWAS have become a common tool to explore the genotype-phenotype landscape of large biobanks linked to comprehensive phenotype/trait data collections, as in EHRs, clinical trials, or epidemiological cohort studies. This high-throughput analysis approach has been met with much success in recent years [4, 6, 14]. However, the community has been lacking guidance for making study design decisions regarding sample size, case-to-control ratios, and minor allele frequency thresholds. At present, no PheWAS power calculator is available to researchers. Thus, we implemented a large-scale simulation study to provide some guidelines for understanding the statistical power of PheWAS analyses under different scenarios. We believe these simulation results provide the needed power estimates for future PheWAS analysis decisions.
We would like to thank Dr. Tooraj Mirshahi and Dr. Janet Robishaw for helpful discussions during this study.
This work was funded in part by the following: AI077505, GM111913, HG008679, and SAP 4100070267. This project is funded, in part, by a grant from the Pennsylvania Department of Health. The Department specifically disclaims responsibility for any analyses, interpretations or conclusions.
Availability of data and materials
The summary results to generate the Figures are provided in Additional file 2.
AV and MDR conceptualized and led the project. AV contributed to designing the analysis workflow and manuscript writing. YB and AML assisted with performing the simulation analysis. SD assisted with the computer programming requirements of the project. SSV and SAP assisted with analysis design and provided important feedback on the manuscript. All the authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interest.
2. Gottesman O, Kuivaniemi H, Tromp G, Faucett WA, Li R, Manolio TA, et al. The electronic medical records and genomics (eMERGE) network: past, present, and future. Genet Med Off J Am Coll Med Genet. 2013;15:761–71.
11. Karnes JH, Bastarache L, Shaffer CM, Gaudieri S, Xu Y, Glazer AM, et al. Phenome-wide scanning identifies multiple diseases and disease severity phenotypes associated with HLA variants. Sci Transl Med. 2017;9(389).
12. Liao KP, Kurreeman F, Li G, Duclos G, Murphy S, Guzman R, et al. Associations of autoantibodies, autoimmune risk alleles, and clinical diagnoses from the electronic medical records in rheumatoid arthritis cases and non-rheumatoid arthritis controls. Arthritis Rheum. 2013;65:571–81.
13. Pendergrass SA, Brown-Gentry K, Dudek S, Frase A, Torstenson ES, Goodloe R, et al. Phenome-wide association study (PheWAS) for detection of pleiotropy within the population architecture using genomics and epidemiology (PAGE) network. PLoS Genet. 2013;9(1):e1003087.
14. Hall MA, Verma A, Brown-Gentry KD, Goodloe R, Boston J, Wilson S, et al. Detection of pleiotropy through a phenome-wide association study (PheWAS) of epidemiologic data as part of the environmental architecture for genes linked to environment (EAGLE) study. Gibson G, editor. PLoS Genet. 2014;10:e1004678.
18. R Core Team. R: a language and environment for statistical computing [internet]. Vienna, Austria. R Found Stat Comput. 2013; Available from: http://www.R-project.org.
21. Andersen JW. AIDS Clinical Trials Group (ACTG). Encycl Stat Sci. Hoboken: Wiley; 2005. p. 1–11. Available from: http://doi.wiley.com/10.1002/0471667196.ess7279.
24. Namjou B, Marsolo K, Caroll RJ, Denny JC, Ritchie MD, Verma SS, et al. Phenome-wide association study (PheWAS) in EMR-linked pediatric cohorts, genetically links PLCL1 to speech language development and IL5–IL13 to eosinophilic esophagitis. Front Genet. 2014;5:401.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.