Testing gene-environment interactions in gene-based association studies
Gene-based and single-nucleotide polymorphism (SNP) set association studies provide an important complement to SNP analysis. Kernel-based nonparametric regression has recently emerged as a powerful and flexible tool for this purpose. Our goal is to explore whether this approach can be extended to incorporate and test for interaction effects, especially for genes containing rare variant SNPs. Here, we construct nonparametric regression models that can be used to include a gene-environment interaction effect under the framework of the least-squares kernel machine and examine the performance of the proposed method on the Genetic Analysis Workshop 17 unrelated individuals data set. Two hundred simulated replicates were used to explore the power for detecting interaction. We demonstrate through a genome scan of the quantitative phenotype Q1 that the simulated gene-environment interaction effect in the data can be detected with reasonable power by using the least-squares kernel machine method.
Keywords: Kernel Function, Nonparametric Regression, Genetic Analysis Workshop, Kernel Machine, Gene FLT1
where y_i is the quantitative trait outcome of the ith individual, x_ji are binary indicator variables of genotypes or exposures, β1 and β2 are regression coefficients of the main effects of genotypes or exposures, and β3 is an interaction effect term. In genetic association studies, we usually wish to achieve two purposes by incorporating such an interaction term: first, improving the power to detect a causal gene with interaction effects; and, second, detecting an interaction effect per se, which hopefully will allow us to elucidate biological interaction. Testing for the first purpose (i.e., testing for association with genotypes at a locus while allowing for an interaction effect, either with genotypes at another locus or with an exposure) corresponds to the test H0: β1 = β3 = 0 or H0: β2 = β3 = 0 (with two degrees of freedom), whereas testing for the second purpose corresponds to testing whether β3 = 0 (with one degree of freedom). It is our purpose here to investigate whether similar procedures can be applied in the setting of nonparametric regression. Given the complex nature of interaction effects, it may be necessary to consider a more flexible parameterization of statistical interaction (which nonparametric regression allows) than just the product of first-order terms.
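For reference, the definitions above are consistent with the standard two-factor linear model with a product interaction term. The following is a hedged reconstruction, not the authors' typeset equation: the intercept β0, the error term e_i, and the subscript order in x_{1i} and x_{2i} are assumptions made for illustration.

```latex
% Assumed form of the parametric interaction model (reconstruction)
y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{1i} x_{2i} + e_i

% Joint test of a main effect together with the interaction (2 df)
H_0 : \beta_1 = \beta_3 = 0 \quad \text{or} \quad H_0 : \beta_2 = \beta_3 = 0

% Test of the interaction per se (1 df)
H_0 : \beta_3 = 0
```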
Our analysis is also motivated by gene-based association studies. Like the Genetic Analysis Workshop 17 (GAW17) data, many current studies provide both single-nucleotide polymorphisms (SNPs) and their affiliated gene information. Gene-centric tests that consider association between a trait and all markers within a gene region have become an important complement to traditional single-locus tests. Chatterjee et al. proposed a logistic regression model that includes all pairwise interactions between SNPs across two genes or between all SNPs in one gene and an environment exposure. The estimation and inference were made feasible by using Tukey’s parsimonious one-degree-of-freedom model of interaction. Two inherent limitations of using Tukey’s model are (1) that nonremovable interactions and interactions involving factors with small marginal effects are not detected and (2) that the method may be more suitable for a candidate gene study, given that the evaluation of the test statistic is computationally demanding because the standard score test is not applicable. To allow the investigation of more interaction models, we propose a different solution that is computationally attractive and based on a least-squares kernel machine (LSKM).
The kernel machine (such as the well-known support vector machine) originated from machine learning techniques and has attracted considerable interest in recent years. It is being increasingly applied to genetics. The key idea behind kernel machines is to implicitly transform the original input data to a higher-dimension nonlinear space that allows a more efficient exploration of data patterns for classification and model fitting. Nonparametric regression implemented by an LSKM has also been proposed as a promising tool in SNP-set gene- and pathway-based association studies [3, 4, 5]. An LSKM-based regression can test for the overall association of a gene to a disease by using genetic information from multiple SNPs simultaneously, thus providing a test statistic with an adaptively estimated number of degrees of freedom. By specifying a flexible kernel function, this method also allows for modeling interaction effects in many forms other than the product form. In this report we focus on the analysis of quantitative phenotype Q1 in the GAW17 data set with an LSKM-based method that shows the greatest promise.
where β is a q × 1 vector of covariate coefficients and h(·) is a nonparametric smoothing function that allows a flexible modeling of the influence of the genotype information g_i on the trait value or disease risk (for which the outcome is replaced by logit[P(y_i = 1)]). Our primary interest is to test whether the overall effect of a gene or SNP set is 0, that is, whether h(g_i) = 0.
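The semiparametric model that these definitions describe can be sketched as follows. This is a hedged reconstruction: the covariate vector symbol x_i and the error term e_i are assumptions chosen to agree with the covariate matrix X and the error vector e used in the mixed-model formulation of the same problem.

```latex
% Assumed form of the semiparametric model for a quantitative trait
y_i = \mathbf{x}_i^{T} \boldsymbol{\beta} + h(\mathbf{g}_i) + e_i

% Assumed analogue for a binary trait
\operatorname{logit}\left[ P(y_i = 1) \right] = \mathbf{x}_i^{T} \boldsymbol{\beta} + h(\mathbf{g}_i)

% Null hypothesis of no overall gene or SNP-set effect
H_0 : h(\mathbf{g}_i) = 0
```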
where the function φ(g) projects the data (g_{i,1}, g_{i,2})^T into a higher-dimensional feature space. Therefore kernel functions can implicitly map input data to a higher-dimensional inner product space (the kernel trick).
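The kernel trick can be checked numerically for two SNPs. The sketch below assumes the commonly used quadratic kernel K(g, g') = (1 + g·g')², which may differ in detail from the form used in the paper; the function names `quadratic_kernel`, `phi`, and `dot` are hypothetical and exist only for this illustration.

```python
import math


def quadratic_kernel(g, g_prime):
    """Assumed quadratic kernel: (1 + <g, g'>)^2."""
    inner = sum(a * b for a, b in zip(g, g_prime))
    return (1.0 + inner) ** 2


def phi(g):
    """Explicit feature map for a two-SNP genotype vector (g1, g2).

    The six coordinates are chosen so that <phi(g), phi(g')>
    equals (1 + g1*g1' + g2*g2')^2.
    """
    g1, g2 = g
    r2 = math.sqrt(2.0)
    return [1.0, r2 * g1, r2 * g2, g1 * g1, g2 * g2, r2 * g1 * g2]


def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Genotypes coded as minor-allele counts (0, 1, or 2) at two SNPs.
a = (0, 1)
b = (2, 1)
implicit = quadratic_kernel(a, b)   # computed in the original 2-D space
explicit = dot(phi(a), phi(b))      # computed in the 6-D feature space
```

The kernel evaluates the inner product in the six-dimensional space without ever forming φ(g), and the coordinate √2·g1·g2 is the SNP-by-SNP product term that the quadratic kernel includes implicitly.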
where X is the matrix of covariates, h is a vector of random effects resulting from all SNPs in the region, following a distribution with mean 0 and variance τK, and e ~ N(0, σ²I). It has been shown that the best linear unbiased estimates of the fixed effects β and random effects h under restricted maximum likelihood (REML) share a common mathematical form with the LSKM estimates. It follows that the test of H0: h = 0 is equivalent to testing H0: τ = 0. The score statistic for this purpose is distributed as a sum of weighted chi-square variables and can be approximated by a scaled chi-square distribution using Satterthwaite’s procedure [4, 6], which matches the first two moments. These steps share many features with variance component methods.
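The moment-matching step can be illustrated in isolation. The sketch below assumes the statistic is distributed as Q = Σ_j w_j χ²_1 with known nonnegative weights, and finds the scale κ and degrees of freedom ν for which κχ²_ν has the same mean and variance. The function name `satterthwaite` and the example weights are hypothetical. In the LSKM setting the weights would come from the kernel matrix and the null-model projection, and adjustments for estimating σ² are not shown here.

```python
def satterthwaite(weights):
    """Match the first two moments of Q = sum_j w_j * chi2_1.

    Returns (scale, df) such that scale * chi2_df has the same
    mean and variance as Q.
    """
    mean = sum(weights)                        # E[Q]   = sum of weights
    var = 2.0 * sum(w * w for w in weights)    # Var[Q] = 2 * sum of squared weights
    scale = var / (2.0 * mean)                 # kappa
    df = 2.0 * mean ** 2 / var                 # nu, generally not an integer
    return scale, df


# Equal weights recover an ordinary chi-square distribution.
scale_eq, df_eq = satterthwaite([1.0, 1.0, 1.0])

# Unequal weights give fewer effective degrees of freedom than terms.
scale_uneq, df_uneq = satterthwaite([3.0, 1.0])
```

When the weights are unequal, as happens when markers are correlated, ν falls below the number of terms, which is the sense in which the test adapts its degrees of freedom to the data.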
where γ is the vector or scalar regression coefficient measuring the main effects of g_i or c_i, respectively, and t_i is composed of the product term(s) between smoking status and the genotypes g_i or between smoking status and the genotype sum c_i. The main effect of smoking is included in the fixed effect vector β in models (11) and (12).
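As a hedged illustration of how the product terms t_i might be assembled, the sketch below builds either one product per SNP (smoking status times each genotype in g_i) or a single product with the collapsed genotype sum c_i. The function name `interaction_terms`, its `collapse` flag, and the toy data are hypothetical, and the 0/1/2 genotype coding is an assumption.

```python
def interaction_terms(smoking, genotypes, collapse=False):
    """Build product terms between smoking status and genotypes.

    smoking   : list of 0/1 smoking indicators, one per individual
    genotypes : list of per-individual genotype lists (assumed 0/1/2 coding)
    collapse  : if True, multiply smoking by the genotype sum c_i
                instead of by each SNP genotype in g_i
    """
    terms = []
    for s, g in zip(smoking, genotypes):
        if collapse:
            terms.append([s * sum(g)])            # smoking x c_i
        else:
            terms.append([s * g_j for g_j in g])  # smoking x each SNP in g_i
    return terms


# Two individuals, three SNPs in the gene.
smoking = [1, 0]
genotypes = [[0, 1, 2], [1, 1, 0]]
per_snp = interaction_terms(smoking, genotypes)
collapsed = interaction_terms(smoking, genotypes, collapse=True)
```

The collapsed version yields a single term per individual, which may be preferable when most variants in the gene are rare and the per-SNP products would be sparse.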
In the initial stage of our analysis, we tested three kernel functions on a subset of genes (one gene at a time) and found that the quadratic and Gaussian kernels produced consistent results but that the quadratic kernel was computationally much faster. Therefore, using a quadratic kernel, we performed a genome-wide scan using each of the 200 simulated replicates. Note that there was no need to put the product terms into the nonparametric function in models (13) and (14) when the Gaussian kernel was used because the Gaussian kernel automatically allows searching through a more inclusive space. Through this analysis, we answer the two separate questions asked in the introduction: (1) What is the power of the LSKM-based method to detect a gene-environment interaction effect per se, based on models (11) and (12); and (2) does incorporating interaction terms into the LSKM improve the power of detecting a true gene with interaction effects, based on models (13) and (14)?
Similarly, we can explore the improvement in power of a joint test (models (13) and (14)) versus a main effect model (model (10)) by comparing the resulting two curves in Q-Q plots. We found that both curves for the KDR gene lay above the 95% confidence band and were visually separated. However, the same pattern was found for other genes that have no interaction effects, for example, FLT1. Therefore the deviation of these curves cannot be directly attributed to an increase in power.
The study of interaction in human genetic association studies faces many challenges that are well known in the field, such as issues of computational burden, model dependency, and multiple testing [8, 9, 10]. A few additional issues arise in the analysis of gene-environment interaction using the GAW17 simulated data. First, as a major theme of GAW17, a large proportion of rare variant SNPs are contained in the data. This considerably reduces the power of SNP-based association tests that test only main effects—not to mention the interaction, which suffers more from a sparsity issue. A simple but practical solution is to combine genotypes within a gene, as we demonstrated in our analysis. Other genotype collapsing or aggregating methods, such as adaptive and weighted-sum methods, may also be applied. The analysis of interaction has been largely restricted by the simulation scheme used in generating the GAW17 data: Only one gene is simulated with a gene-environment interaction. The GAW17 data thus do not enable a systematic comparison of different methods or models. The confounding factor of population structure (present though not planned) has further complicated the analysis and interpretation of our results. Depending on the interaction model, any hidden population structure may yield false-positive results in a joint analysis of main and interaction effects, as shown in our results.
Despite all these restrictions, through our analysis we have demonstrated the advantages of the LSKM-based method. First, the method provides a flexible modeling and testing framework for multilocus and gene-based association studies, which allows the analysis of both quantitative and binary traits and the easy incorporation of covariates; the method can automatically reduce the degrees of freedom of the test by properly accounting for the correlation structure among markers. Second, various interaction models and nonlinear effects can be implicitly defined by specifying different kernel functions. Third, the score-based statistic makes the method’s implementation computationally efficient and thus suitable for both candidate genes and a genome-wide scan. The procedure described in this paper can be readily applied to gene-gene interaction. More simulation scenarios will be required in a future study to explore the performance of different gene collapsing methods and kernels. For example, a weighted version of the IIS kernel can be considered to emphasize the similarity between rare-variant SNPs [4, 5]. One possible extension would be to include a polygenic control term in the model (similar to a variance component method) so that information from family and unrelated case-control data can be combined. It would also be of interest to test whether the LSKM-based interaction model can be adapted for use in other classes of genomic similarity methods [11, 12].
By incorporating interaction terms, explicitly or implicitly, and using LSKM-based regression methods, we were able to detect signals for the interaction effects simulated in forming the quantitative trait. We were able to gain some power by jointly testing the main effects and interactions, but the results were confounded by the population structure that exists in the GAW17 data.
The Genetic Analysis Workshop is supported by National Institutes of Health grant R01 GM031575. This work was supported in part by the following U.S. Public Health Service grants: Resource grant P41 RR03655 from the National Center for Research Resources; Cancer Center Support grant P30 CAD43703 from the National Cancer Institute; Research grants HL074166 and HL086718 from the National Heart, Lung and Blood Institute; and Research grant HG003054 from the National Human Genome Research Institute. In addition, a grant from the Merck Foundation supported XW.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.