A hidden twolocus disease association pattern in genomewide association studies
 3.9k Downloads
 2 Citations
Abstract
Background
Recent association analyses in genomewide association studies (GWAS) mainly focus on singlelocus association tests (marginal tests) and twolocus interaction detections. These analysis methods have provided strong evidence of associations between genetics variances and complex diseases. However, there exists a type of association pattern, which often occurs within local regions in the genome and is unlikely to be detected by either marginal tests or interaction tests. This association pattern involves a group of correlated singlenucleotide polymorphisms (SNPs). The correlation among SNPs can lead to weak marginal effects and the interaction does not play a role in this association pattern. This phenomenon is due to the existence of unfaithfulness: the marginal effects of correlated SNPs do not express their significant joint effects faithfully due to the correlation cancelation.
Results
In this paper, we develop a computational method to detect this association pattern masked by unfaithfulness. We have applied our method to analyze seven data sets from the Wellcome Trust Case Control Consortium (WTCCC). The analysis for each data set takes about one week to finish the examination of all pairs of SNPs. Based on the empirical result of these real data, we show that this type of association masked by unfaithfulness widely exists in GWAS.
Conclusions
These newly identified associations enrich the discoveries of GWAS, which may provide new insights both in the analysis of tagSNPs and in the experiment design of GWAS. Since these associations may be easily missed by existing analysis tools, we can only connect some of them to publicly available findings from other association studies. As independent data set is limited at this moment, we also have difficulties to replicate these findings. More biological implications need further investigation.
Availability
The software is freely available at http://bioinformatics.ust.hk/hidden_pattern_finder.zip.
Keywords
Bipolar Disorder Lasso Multifactor Dimensionality Reduction Association Pattern Major Histocompatibility Complex RegionBackground
The development of DNA microchip technology has allowed the analysis of single nucleotide polymorphism (SNPs) on a genomewide scale to identify genetic variants associated with diseases. Researchers have proposed many methods to investigate association patterns of complex diseases. Two recent reviews [1, 2] presented detailed analyses on many popular methods and tools, such as multifactor dimensionality reduction (MDR) [3], Random Jungle [4], Bayesian epistasis association mapping (BEAM) [5] and PLINK [6]. MDR is a popular nonparametric approach for detecting all possible kway combinations of SNPs that interact to influence disease traits. Random Jungle (i.e., Random Forest [7]), is to solve classification and regression problems. In random forest, decision trees are combined to produce accurate predication. Its ability to handle the high dimensional problems in GWAS has been shown in [8, 9]. BEAM designs a Bayesian marker partition model which classifies SNP markers into three types: SNPs unassociated with the disease, SNPs contributing to the disease susceptibility independently, and SNPs influencing the disease risk jointly with each other. In this model, a first order Markov chain is designed for the accommodation of correlation between adjacent SNPs. Markov Chain Monte Carlo (MCMC) sampling is used to optimize the posterior probability of the model. In addition, the "Bstatistic" designed in BEAM can be used in the frequentist hypothesistesting framework. PLINK provides a toolkit for flexible analyses, in which various statistical tests for singlelocus analysis, haplotype analysis and allelicbased interaction analysis are implemented. Recently, a new method named "BOOST" [10] allows examination of all pairwise interactions in genomewide casecontrol studies. As a result, many genetic susceptibility determinants have been mapped.
However, there is another type of association pattern, which often occurs within local regions in the genome and may not be detected by these methods. This association pattern involves multiple correlated SNPs with neither strong marginal effects nor strong interaction effects. But they can jointly display strong associations. Here we use some simple regression models to explain this association pattern. Suppose we have two dependent variables, X_{1} and X_{2}, and one independent variable Y. We can fit two regression models (or logistic regressions for casecontrol data), Open image in new window and Open image in new window , to test the association significance of these two variables. Here Open image in new window and Open image in new window are named as marginal coefficients. If these two marginal coefficients are very small, single variable analysis methods will consider them statistically insignificant and ignore them.
where Open image in new window is the expectation of the marginal coefficient Open image in new window , ρ(X_{1}, X_{2}) is the population correlation between X_{1} and X_{2}. The marginal coefficients depend on their bivariate regression coefficients as well as the variable correlation, as we illustrated in Figure 1. We will give more explanation on this relationship in the high dimensional setting in the discussion section.
From the statistical point of view, different correlation patterns could cause the marginal coefficients Open image in new window and Open image in new window significantly different from the bivariate regression coeffcients β_{1} and β_{2} (shown in Figure 1). In fact, the issue of unfaithfulness has been discussed in the causality literature [12]. In GWAS, the correlation among SNPs arises due to the linkage disequilibrium pattern of the genome. A natural question arises: Does the issue of unfaithfulness occur in GWAS?
To answer this question, a computational method for detecting associations masked by unfaithfulness is necessary. In this paper, we propose a simple method to detect such associations involving two SNPs. It can evaluate each SNP pair in genomewide casecontrol studies in a fast manner. We have applied our method to analyze seven data sets from the Wellcome Trust Case Control Consortium (WTCCC). The experimental results show that these associations widely exist in GWAS. In this work, we only handle the unfaithfulness issue involving two SNPs while the unfaithfulness can exist among a large number of markers. The detection of associations involving three or more SNPs is too timeconsuming and beyond the scope of this work.
Results
Experiment on simulation study
The simulation study is designed to compare our proposed method with other three methods for detecting associations in the presence of unfaithfulness. These three methods include the marginal association test (singlelocus analysis), Lasso [11, 13] and BEAM [5]. The reasons that we choose these methods for comparison are as follows:

Marginal association test is used in almost every GWAS due to its simplicity and effectiveness.

Lasso is a shrinkage and selection method for (generalized) linear regression. It imposes a sparsity constraint (i.e., only a small fraction of variables are relevant) and uses L_{1} penalty to eliminate irrelevant variables. Fast algorithms are available for Lasso. Thus, it can simultaneously analyze a huge number of variables. It is very popular in genetics [11, 14, 15, 16].

BEAM has the capability of detecting both marginal associations and interactions in largescale data sets. It uses first order Markov chain to accommodate the correlation between adjacent SNPs.
The details about the parameter settings in simulation are provided in the method section. In our simulation study, we only handle the unfaithfulness involving two associated variables X_{1} and X_{2} by using β_{1} > 0, β_{2} < 0 and ρ(X_{1}, X_{2}) > 0 as illustrated in Figure 1(d). The marginal coefficients Open image in new window and Open image in new window will be small due to the cancelation given by Equation (1). When β_{1} > 0, β_{2} > 0 and ρ(X_{1}, X_{2}) < 0 the unfaithfulness also happens. This corresponds to a situation that the minor alleles of both X_{1} and X_{2} increase the diseases risk but X_{1} and X_{2} are negatively correlated, as illustrated in Figure 1(c).
Another interesting point is that the statistical power of existing methods decreases as the linkage disequilibrium (LD) r^{2} increases. Although our proposed method also degrades its performance when LD increases, it maintains a relatively high power for strong LD (r^{2} = 0.7).
Experiment on seven data sets from WTCCC
We have applied our method to analyze the data sets (14,000 cases in total and 3,000 shared controls) from the WTCCC [17]. WTCCC studies seven common human diseases, including bipolar disorder (BD), coronary artery disease (CAD), Crohn's disease (CD), hypertension (HT), rheumatoid arthritis (RA), type 1 diabetes (T1D) and type 2 diabetes (T2D). These data sets are generated using the affymetrix 500 K chip. We first apply a similar quality control procedure as suggested in [17] to preprocess the data. The numbers of remaining SNPs for seven data sets are around 360,000. In current stage, BEAM cannot directly handle these data sets [2].
The number of twolocus unfaithfulness associations identified from seven diseases data sets under different constraints.
BD  CAD  CD  HT  RA  T1D  T2D  

T ^{1}  48  31  25  46  132  153  67 
T ^{2}  52  35  28  51  153  204  80 
T ^{3}  60  36  29  52  165  252  84 
T^{1} & Dist  0  0  0  1  1  1  0 
T^{2} & Dist  0  0  0  3  17  17  0 
T ^{3} & Dist  0  0  0  0  4  35  0 
Some associations masked by unfaithfulness from the WTCCC data set.
Disease  SNP X_{ p }  Singlelocus Pvalue  SNP X_{ q }  Singlelocus Pvalue  Chr  Gene  Unfaithfulness Pvalue 

BD  rs668860  0.053  rs10873672  0.245  1  MCOLN2  4.885 × 10^{15} 
rs668860  0.053  rs6691970  0.216  1  MCOLN2  6.217 × 10^{15}  
CAD  rs7162070  0.867  rs16969478  0.160  15  FSIP1  5.551 × 10^{15} 
rs1876853  0.903  rs16969478  0.160  15  FSIP1  2.310 × 10^{13}  
rs8029602  0.853  rs16969478  0.160  15  FSIP1  5.274 × 10^{14}  
rs16969475  0.823  rs16969478  0.160  15  FSIP1  1.259 × 10^{13}  
T1D  rs1058318  0.074  rs2252745  0.840  6  GNL1, PPP1R10  1.326 × 10^{12} 
HT  rs2300390  0.460  rs12482676  0.061  21  RCAN1  2.442 × 10^{15} 
Regression coefficients of those associations listed in Table 2.
Disease  SNP X_{ p }  Open image in new window (z Value)  SNP X_{ q }  Open image in new window (z Value)  r ^{2}  β_{1}(z Value)  β_{2}(z Value) 

BD  rs668860  0.0162 (0.392)  rs10873672  0.0679 (1.662)  0.961  1.379 (5.823)  1.402 (5.989) 
rs668860  0.0162 (0.392)  rs6691970  0.0706 (1.728)  0.958  1.397 (5.934)  1.421 (6.107)  
CAD  rs7162070  0.0301 (0.482)  rs16969478  0.108 (1.732)  0.913  2.621 (5.671)  2.637 (5.720) 
rs1876853  0.0208 (0.331)  rs16969478  0.108 (1.732)  0.913  2.354 (5.538)  2.372 (5.603)  
rs8029602  0.0175 (0.281)  rs16969478  0.108 (1.732)  0.914  2.214 (5.594)  2.238 (5.676)  
rs16969475  0.0184 (0.294)  rs16969478  0.108 (1.732)  0.913  2.110 (5.660)  2.132 (5.746)  
T1D  rs1058318  0.0992 (2.269)  rs2252745  0.0167 (0.375)  0.887  1.0292 (7.530)  1.005 (7.219) 
HT  rs2300390  0.0600 (1.182)  rs12482676  0.0531 (1.037)  0.903  1.215 (6.567)  1.224 (6.577) 
Bipolar disorder (BD)
Among associated SNP pairs identified from the BD data set, we find two suspicious SNP pairs, (rs668860, rs10873672) and (rs668860, rs6691970). The unadjusted Pvalues for these two SNP pairs are 4.885 × 10^{15} and 6.217 × 10^{15}, respectively. They are still significant after Bonferroni correction. However, none of these three SNPs (rs668860 is involved in both pairs) was reported in [17] because their marginal effects are too weak to be detected by the singlelocus association test. The unadjusted Pvalues for these three SNPs based on the singlelocus association test are 0.053, 0.245 and 0.216, respectively. All three SNPs reside in the intron of gene MCOLN2. The protein Mucolipin2, encoded by gene MCOLN2 and also known as TRPML2 (transient receptor potential cation channel, mucolipin subfamily, member 2), has been confirmed to have strong associations with bipolar disorder in a familybased association study [19]. To our knowledge, this is the first identified association between the MCOLN2 gene and the bipolar disorder risk in a populationbased association study.

Although this region is in strong LD (see Figure 4(b)), association masked by unfaithfulness does not happen across the entire area. This shows that this type of asssociation not only depends on the correlation structure but also depends on the effects of the SNPs, as we illustrated in Figure 1 (also see Equation 1).

From Figure 4(c), the marginal effects of the imputed SNPs are very weak. This indicate that this type of association is not caused by some ungenotyped causative SNPs. Instead, it is a genuine effect.
Coronary artery disease (CAD)
We identify four suspicious associations involving five SNPs. The unadjusted Pvalues for these four associations range from 2.310 × 10^{13} to 5.551 × 10^{15}. The unadjusted singlelocus Pvalues for five SNPs involved in these five associations indicate that they do not have noticeable marginal effects. All five SNPs reside in the intron of gene FSIP1 (fibrous sheath interacting protein 1). We have not found evidence to directly connect gene FSIP1 with the coronary artery disease. However, the LD analysis identifies a well studied gene THBS1 (thrombospondin1), which is centromeric to gene FSIP1 and has been confirmed to increase the risk of coronary artery disease in many studies [21, 22, 23]. It would be of great interest to investigate gene FSIP1 in determining genetic susceptibility to coronary artery disease.
Type 1 diabetes (T1D)
Most identified associations from the T1D data set are linked with the MHC region. The MHC region at chromosomal position 6p21 encodes many genes (such as HLADQB1 and HLADRB1) that have been associated with type 1 diabetes [17, 24] by using the singlelocus test. However, it is still unclear which and how many loci within the MHC region determine T1D susceptibility because of the functional complexity of this small human genome segment. The MHC region has been connected with more than 100 diseases, such as diabetes, rheumatoid arthritis, psoriasis, asthma and various autoimmune disorders. Our results provide additional information to locate diseaseassociated loci. Concretely, one suspicious association involves SNP rs1058318 and SNP rs2252745. The unadjusted Pvalue of this association is 1.326 × 10^{12}. The unadjusted singlelocus Pvalues of rs1058318 and rs2252745 are 0.074 and 0.840, respectively. SNP rs1058318 resides in the intron region of gene GNL1 and SNP rs2252745 resides in the intron region of gene PPP1R10. Both genes are located in the MHC region and adjacent to each other. Gene GNL1 belongs to the HLAE class. The locus in HLAE has been strongly associated with type 1 diabetes [25]. The detailed examination of the relationship between gene GNL1 and gene PPP1R10 may provide some new insights in studying the causes of type 1 diabetes.
Hypertension (HT)
Among associations identified from the HT data set, we find one suspicious pair involving SNP rs2300390 and SNP rs12482676. The unadjusted Pvalue is 2.442 × 10^{15}. The unadjusted singlelocus Pvalues for rs2300390 and rs12482676 are 0.460 and 0.061, respectively. Both SNPs reside in the intron of gene RCAN1. Gene RCAN1 mainly functions as a regulator of calcineurin. Calcineurin participates in many cellular and tissue functions. Its abnormal expression is associated with many diseases including hypertension [26].
Crohn's disease (CD), rheumatoid arthritis (RA) and type 2 diabetes (T2D)
Currently, we have difficulties to connect the identified associations of CD, RA and T2D to publicly available findings from other association studies. Their biological implications need to be further explored.
Experiment on the Illumina data sets from other independent studies
We further analyze the Crohn's Disease data set [27], in which 308,332 autosomal SNPs were assayed on the Illumina HumanHap300 chip. After a standard quality control (the proportion of miss values ≤ 10%, the minor allele frequency ≥ 5% and the Pvalue of HardyWeinberg equilibrium ≥ 0.0001), the number of remaining SNPs is 291,964.
We apply our method to this data set and do not find any significant associations masked by unfaithfulness. Our explanation is that Illumina chip uses the tagSNP design and the correlation among SNPs is less than that of Affymetrix 500 K chip used by WTCCC. This result indicates that it is unlikely to detect associations masked by unfaithfulness using the tagSNP design.
In order to check if imputation helps in identifying significant association masked by unfaithfulness, we focus on the SNP regions in which we have identified associations from the WTCCC CD data set (Additional file 1: Table S3), impute the corresponding SNP data from [27], and rerun our analysis. Unfortunately, we fail to replicate those findings in the WTCCC CD data set. We have examined the imputation result carefully. At those local regions we are interested in, few SNPs are directly genotyped. In the hapmap data, hundreds of SNPs locate in those areas. This implies hundreds of SNPs need to been imputed using the information coming from the reference panel. In fact, the frequencies of those imputed haplotypes are almost the same in cases and controls. This is probably the reason that we cannot replicate those findings. Hopefully, nextgeneration sequencing will provide high resolution SNP data to resolve this issue. Another important reason may be that these two CD data sets are from different populations (one comes from Europe, another comes from north America).
Similarly, we have analyzed another RA data set [28] from Genetic Analysis Workshop 16. This data set is acquired from North American population. The SNPs are genotyped by the Illumina chip. We also have difficulties to replicate the findings of the RA data set from WTCCC. We hope we can get access to more data sets to verify our results in the future.
Discussion
The unfaithfulness issue in the high dimensional feature space
where Open image in new window is the expectation of marginal coefficient, ρ(X_{ q } , X_{ p } ) is the population correlation between X_{ q }and X_{ p } . If Open image in new window , then Open image in new window can be close to zero no matter how large β_{ p } is. In addition, the number of involved variables could be very big. To exclude the effect of unfaithfulness in feature selection, Fan and Lv [29] had to make an assumption that there is a C > 0 such that Open image in new window for p = 1,..., s, and then proved that the truly associated variables will be among those having the highest marginal coefficients.
In our simulation study, we only handle the unfaithfulness involving two associated variables X_{1} and X_{2} by using β_{1} > 0, β_{2} < 0 and ρ(X_{1}, X_{2}) > 0 as illustrated in Figure 1(d). The marginal coefficients Open image in new window and Open image in new window will be small due to the cancelation given by Equation (2). When β_{1} > 0, β_{2} > 0 and ρ(X_{1}, X_{2}) < 0, the unfaithfulness also happens. This corresponds to a situation that the minor alleles of both X_{1} and X_{2} increase the diseases risk but X_{1} and X_{2} are negatively correlated, as illustrated in Figure 1(c). The simulation result shows that the marginal test and Lasso perform poorly. The better performance of BEAM should be attributed to its first order Markov chain designed for the accommodation of correlation. Although our exhaustive method performs reasonably well, the direct extension of our method to deal with three or more loci is computationally expensive. Therefore, solving the unfaithfulness issue is still challenging.
The Relationship between interaction models and unfaithfulness
In this work, we only deal with a twolocus association pattern involving the unfaithfulness. There are many works [3, 5, 6, 30] discussing twolocus associations. Most of them belong to the category of interaction analysis (see details in [1, 2]). The SNP interaction is also referred to as "epistasis". The most common statistical definition of interactions is the statistical deviation from the additive effects of two loci on the phenotype [2]. Using the same example we discussed in the introduction section, testing interactions between X_{1} and X_{2} is to first fit the regression model (or logistic regressions for casecontrol data) Y ~ β_{1}X_{1} + β_{2}X_{2} + β_{12}X_{1}X_{2} and then test the significance of β_{12}. There is no direct connection between β_{1} (or β_{2}) and β_{12} In the analysis of unfaithfulness, the relationship between marginal coefficients ( Open image in new window , Open image in new window ) and joint coefficients (β_{1}, β_{2}) is given in Equation (2). The interaction term plays no role here. Therefore, it is not appropriate to use interaction models to detect associations masked by unfaithfulness.
The Relationship between unfaithfulness and confounding
Suppose we are studying the relationship between two variables X and Y using model Open image in new window . Confounding arises when there is another observed variable Z which is independently associated with X and Y. Specifically, we have Open image in new window for model Open image in new window and Open image in new window fot model Open image in new window . When studying the relationship between X and Y, it is necessary to account for the confounding effect by using model Y ~ β_{ yz }Z + β_{ yx }X. In other words, confounding is more like the situation illustrated in Figure 1 (b). Readers are referred to [31] for the detailed explanation of confounding.
The unfaithfulness is different. For model Open image in new window and Open image in new window , both Open image in new window and Open image in new window are close to zero. For joint model Y ~ β_{1}X_{1} + β_{2}X_{2}, both β_{1} and β_{2} are not zero, as illustrated in Figure 1(c) and 1(d).
Biological interpretations
There are two possible biological interpretations. The first interpretation is illustrated in Figure 1(d). Consider two loci X_{ p } and X_{ q } which are positively correlated. When X_{ q } increases the disease risk (β_{ q } > 0) and X_{ p } acts as a protective locus (β_{ p } < 0), unfaithfulness happens. The identified associations and their coefficients listed in Table 3 indicate that these associations indeed exist.
The second interpretation is illustrated in Figure 1(c). Consider two loci X_{ p } and X_{ q } which are negatively correlated. When both X_{ p } and X_{ q } increase the disease risk (β_{ p } > 0 and β_{ q } > 0), unfaithfulness also happens. This case may be particularly interesting when analyzing SNPs with low allele frequencies [32]. Suppose the allele frequencies of both X_{ p } and X_{ q } are low and thus the mutations happening at these two loci are relatively recent. We can further assume the haplotype a  a does not exist (because the probability of both two mutations happen in a short period is very small). This implies these two loci are negatively correlated. Unfortunately, we do not identify this type of associations. Possible reasons include: (1) The current genotyping chip is designed based on the "common disease/common variant" model [33, 34], the low frequency SNPs are not directly assayed. (2) The statistical power of current testing strategy is relatively low to handle rare variants.
The unfaithfulness implications on tagSNPs
GWAS is considered as a powerful approach to detecting genetic susceptibility of common diseases. Such studies require the genotypes of a huge number of SNPs across the genome, each of which is tested for association with the phenotype of interest. This is generally referred to as the direct test of association, in which the functional mutation is presumed to be genotyped. An alternative approach is to take advantage of the correlation between SNPs. This approach genotypes a subset of SNPs, called tagSNPs, which are in high linkage disequilibrium with other SNPs [33]. The association tests of tagSNPs are used to indirectly infer the association of other correlated SNPs. This approach is widely used to save genotyping costs in GWAS. Many tagging methods [33, 35, 36] have been developed to reduce the number of markers to be genotyped. One key assumption in these methods is that the association analysis of a set of highly correlated SNPs is equivalent with the association analysis of tagSNPs of this set. However, the existence of unfaithfulness poses a challenge for these methods. The weak marginal association of a tagSNP does not imply the weak association of the corresponding genome region in which this tagSNP is located. The reason is that some nongenotyped SNPs correlating with the tagSNPs may jointly display strong associations in the presence of unfaithfulness.
In this work, we analyzed the WTCCC data generated by the Affymetrix 500 K chip and a Crohn's disease data set generated by the Illumina chip. The Affymetrix 500 K chip spaces SNPs along the genome and the Illumina chip uses the tagSNP design. LD becomes less apparent in the Illumina data set and we did not find any association masked by unfaithfulness. This result suggests that it is very difficult to detect these associations by using the tagSNP design. If more SNPs could be genotyped in the future GWAS, we would detect more unknown associations.
Conclusion
The phenomenon named "unfaithfulness" has been discussed as a mathematical concept in the literature. In this work, we answered the question whether there exist associations masked by unfaithfulness in genomewide association studies. We developed a simple and fast method to examine all SNP pairs and demonstrated that our method is applicable to analyze genomewide SNP data sets. We conducted experiments on both simulated data and seven real data sets from WTCCC and identify many associations masked by unfaithfulness. As expected, these identified associations only occur in local area. This implies that only the local search is needed to find such associations.
To date, we can only connect some identified associations to publicly available results from other association studies. As independent data set is limited as this moment, we have difficulties to replicate these findings. The biological interpretation of many findings remains unclear. It would be of great interest if their biological functions could be investigated. In addition, we only handle the twolocus associations in the presence of unfaithfulness. Detecting such associations for three or more loci is still an open problem.
Methods
Given a data set with ℒ SNPs and n samples, we use X_{ l } to denote the lth SNP, l = 1,···, ℒ and Y to denote the class label (0 for control and 1 for case). SNPs are biallelic genetic markers. Capital letters (e.g. A, B,...) and lowercase letters (e.g. a, b,...) are often used to denote major and minor alleles, respectively. For simplicity, we use {0, 1, 2} to represent the three genotypes {AA, Aa, aa}, respectively.
Definition of the association masked by unfaithfulness
Please note that the superscripts X_{ p } and X_{ q } in Equation (8), Equation (9) and Equation (10) are merely the labels and do not represent the exponents. The term Open image in new window represents the coefficient of X_{ p } at category i. Throughout this paper, we use ℳ to denote logistic regression models. We will use M to denote loglinear models. The loglikelihood function of a logistic model ℳ is denoted as L_{ℳ}, and its maximum likelihood estimation (MLE) is denoted as Open image in new window . The loglikelihood function of a loglinear model M is denoted as L_{ M } , and its maximum likelihood estimation (MLE) is denoted as Open image in new window . For example, ℳ_{1} is a logistic regression model whose loglikelihood function and MLE are denoted by L_{ℳ1} and Open image in new window .
 1.
Fit a logistic regression model defined in Equation (10) and obtain the loglikelihood Open image in new window .
 2.
Compute the loglikelihood Open image in new window of the null logistic regression model defined in Equation (7).
 3.
Calculate Pvalue using the χ ^{2} test on the value 2 ( Open image in new window ) with degree of freedom df = 2.
 1.
Fit a logistic regression model defined in Equation (8) (or Equation (9)) to measure the main effect of X_{ p } (or X_{ q } ) and obtain the loglikelihood Open image in new window (or Open image in new window ).
 2.
Compute the loglikelihood Open image in new window of the null logistic regression model defined in Equation (7).
 3.
Calculate Pvalue using the χ ^{2} test on the value 2 ( Open image in new window ) (or 2 ( Open image in new window )) with degree of freedom df = 2.
Directly using regression methods for testing all pairs of SNPs in GWAS would be very timeconsuming. Often the parallel computation was recommended [38]. Here, we propose to use loglinear models [37] instead of logistical regression models in GWAS. We show that this makes the likelihood ratio test computationally more efficient in genomewide SNP data analysis. In the following, we briefly summarize the key components. The details are explained in the supplementary document (Additional file 1).
Likelihood ratio tests using loglinear models
The genotype counts in cases (Y = 1) and controls (Y = 2).
Y= 1  Xq= 1  Xq= 2  Xq= 3  Y= 2  Xq= 1  Xq = 2  Xq = 3 

Xp = 1  n _{111}  n _{121}  n _{131}  Xp = 1  n _{112}  n _{122}  n _{132} 
Xp = 2  n _{211}  n _{221}  n _{231}  Xp = 2  n _{212}  n _{222}  n _{232} 
Xp = 3  n _{311}  n _{321}  n _{331}  Xp = 3  n _{312}  n _{322}  n _{332} 
We use the dot convention to indicate summation over a subscript, e.g., Open image in new window is the number of observations with X_{ p } = i. Similarly, we have Open image in new window and Open image in new window . We also have Open image in new window and Open image in new window . Throughout this paper, we use Open image in new window to denote the expectation of N_{ ijk } under loglinear model M, and use Open image in new window to denote the MLE of Open image in new window .
Equivalence between loglinear models and logistic models for a threeway table with binary response variable Y (M_{ B }: Block independence model.
Loglinear model  Logistic model 

ℳ_{0} : β_{0}  
Running time of the proposed method for data sets of different sizes.
Data size  Running time 

n = 5, 000, ℒ = 1, 000  3s 
n = 5, 000, ℒ = 5, 000  76s 
n = 5, 000, ℒ = 10, 000  303s 
An exhaustive approach to detecting the twolocus associations masked by unfaithfulness in GWAS
This approach involves the following steps:

Step 1. For all of ℒ SNP markers, we first filter out those SNPs with significant main effects using Equation (12) since we are only interested in those markers without significant main effects. The ℒ Pvalues can be adjusted by either the classic BenjaminiHochberg method for controlling false discovery rate (FDR) or the Bonferroni correction for controlling family wise error rate (FWER).

Step 2. For the remaining ℒ' SNPs without significant main effects, we check every pair using the Equation (11). Again, the Pvalues can be adjusted for controlling either FDR or FWER.
The Pvalue thresholds in both Step 1 and Step 2 are specified by users. The default setting of the threshold is 0.1. The multiple factor for Bonferroni correction is ℒ'(ℒ'  1)/2, where ℒ' is the number of SNPs after removing those SNPs with significant marginal effects. Since the number of SNPs with significant marginal effects only accounts for a small fraction of the entire SNP set, we have ℒ'(ℒ'  1)/2 ≈ ℒ(ℒ  1)/2.
Simulation design
The odds table of the simulation model.
Model  BB  Bb  bb  Model  BB  Bb  bb 

AA  α  αθ _{21}  αθ _{21} θ _{22}  AA  α  αθ _{ b }  
Aa  αθ _{11}  αθ _{11} θ _{21}  αθ _{11} θ _{21} θ _{22}  Aa  αθ _{ a }  αθ _{ a } θ _{ b }  
aa  αθ _{11} θ _{12}  αθ _{11} θ _{12} θ _{21}  αθ _{11} θ _{12} θ _{21} θ _{22}  aa 
For simplicity, we restrict θ_{11} = θ_{12} = θ_{ a } and θ_{21} = θ_{22} = θ_{ b } . The parameter θ_{ a } > 1 means that the minor allele "a" increases the disease risk. This corresponds to the bivariate regression coefficient β_{1} > 0. Similarly, θ_{ b } < 1 implies β_{2} < 0. In the presence of linkage disequilibrium (linkage disequilibrium measure Δ > 0), unfaithfulness arises. To simulate this situation, we further set θ_{ a } = θ and θ_{ b } = 1/θ. In the simulation, the prevalence p(D) and the heritability h^{2} are controlled by the parameters α and θ. We first specify the disease prevalence p(D) and genetic heritability h^{2}. Then we numerically solve the parameters (α and θ) based on the above equations. We set p(D) = 0.1 and h^{2} = 0.02. The details are given in the supplementary document (Additional file 1).
Notes
Acknowledgements
This work was partially supported with the Grant GRF621707 from the Hong Kong Research Grant Council, grant RPC06/07.EG09, RPC07/08.EG25, RPC10EG04 and a grant from Sir Michael and Lady Kadoorie Funded Research Into Cancer Genetics. We thank the two anonymous reviewers for their constructive comments, which greatly help us improve our manuscript.
Supplementary material
References
 1.Balding D: A tutorial on statistical methods for population association studies. Nature Reviews Genetics 2006, 7: 781–791. 10.1038/nrg1916CrossRefPubMedGoogle Scholar
 2.Cordell H: Detecting genegene interactions that underlie human diseases. Nature Reviews Genetics 2009, 10: 392–404.PubMedCentralCrossRefPubMedGoogle Scholar
 3.Ritchie M, Hahn L, Roodi N, Bailey L, Dupont W, Parl F, Moore J: Multifactordimensionality reduction reveals highorder interactions among estrogenmetabolism genes in sporadic breast cancer. The American Journal of Human Genetics 2001, 69: 138–147. 10.1086/321276CrossRefPubMedGoogle Scholar
 4.Schwarz D, Kónig I, Ziegler A: On Safari to Random Jungle: A fast implementation of Random Forests for high dimensional data. Bioinformatics 2010, in press.Google Scholar
 5.Zhang Y, Liu J: Bayesian inference of epistatic interactions in casecontrol studies. Nature Genetics 2007, 39: 1167–1173. 10.1038/ng2110CrossRefPubMedGoogle Scholar
 6.Purcell S, Neale B, ToddBrown K, Thomas L, Ferreira M, Bender D, Maller J, Sklar P, de Bakker P, Daly M, Sham P: PLINK: a tool set for wholegenome association and populationbased linkage analyses. The American Journal of Human Genetics 2007, 81: 559–575. 10.1086/519795CrossRefPubMedGoogle Scholar
 7.Breiman L: Random Forests. Machine Learning 2001, 45: 5–32. 10.1023/A:1010933404324CrossRefGoogle Scholar
 8.Lunetta K, Hayward L, Eerdewegh PV: Screening largescale association study data: exploiting interactions using random forests. BMC Genetics 2004, 5: 32–44. 10.1186/14712156532PubMedCentralCrossRefPubMedGoogle Scholar
 9.Bureau A, Dupuis J, Falls K, Lunetta K, Hayward B, Keith T, Van Eerdewegh P: Identifying SNPs predictive of phenotype using random forests. Genetic Epidemiology 2005, 28(2):171–182. 10.1002/gepi.20041CrossRefPubMedGoogle Scholar
 10.Wan X, Yang C, Yang Q, Xue H, Fan X, Tang N, Yu W: BOOST: A fast approach to detecting genegene interactions in genomewide casecontrol studies. The American Journal of Human Genetics 2010, 87(3):325–340. 10.1016/j.ajhg.2010.07.021CrossRefPubMedGoogle Scholar
 11.Wasserman L, Roeder K: Highdimensional variable selection. The Annals of Statistics 2009, 37(5A):2178–2201. 10.1214/08AOS646CrossRefPubMedGoogle Scholar
 12.Spirtes P, Glymour C, Scheines R: Causation, Prediction, and Search. MIT Press; 2001.Google Scholar
 13.Tibshirani R: Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, series B 1996, 58: 267–288.Google Scholar
 14.Wu T, Chen Y, Hastie T, Sobel E, Lange K: Genomewide Association Analysis by Lasso Penalized Logistic Regression. Bioinformatics 2009, 25(6):714–721. 10.1093/bioinformatics/btp041PubMedCentralCrossRefPubMedGoogle Scholar
 15.Hoggart C, Whittatker J, Iorio M, Balding D: Simultaneous Analysis of All SNPs in Genomewide and ReSequencing Association Studies. PLoS Genetics 2008, 4(7):e1000130. 10.1371/journal.pgen.1000130PubMedCentralCrossRefPubMedGoogle Scholar
 16.Yang C, Wan X, Yang Q, Xue H, Yu W: Identifying main effects and epistatic interactions from largescale SNP data via adaptive group Lasso. BMC Bioinformatics 2010, 11(Suppl 1):S18. 10.1186/1471210511S1S18PubMedCentralCrossRefPubMedGoogle Scholar
 17.WTCCC: Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007, 447: 661–678. 10.1038/nature05911CrossRefGoogle Scholar
 18.Guo Z, Hood L, Malkki M, Petersdorf E: Longrange multilocus haplotype phasing of the MHC. PNAS 2006, 103(18):6964–6969. 10.1073/pnas.0602286103PubMedCentralCrossRefPubMedGoogle Scholar
 19.Xu C, Li P, Cooke R, Parikh S, Wang K, Kennedy J, Warsh J: TRPM2 variants and bipolar disorder risk: confirmation in a familybased association study. Bipolar Disorder 2009, 11: 1–10.CrossRefGoogle Scholar
 20.Browning B, Browning S: A unified approach to genotype imputation and haplotypephase inference for large data sets of trios and unrelated individuals. The American Journal of Human Genetics 2009, 84(2):210–223. 10.1016/j.ajhg.2009.01.005CrossRefPubMedGoogle Scholar
 21.Topol E, McCarthy J, Gabriel S, Moliterno D, Rogers W, Newby L, Freedman M, Metivier J, Cannata R, O'Donnell C, KottkeMarchant K, Murugesan G, Plow E, Stenina O, Daley G: Single nucleotide polymorphisms in multiple novel thrombospondin genes may be associated with familial premature myocardial infarction. Circulation 2001, 104: 2641–2644. 10.1161/hc4701.100910CrossRefPubMedGoogle Scholar
 22.Zwicker J, Peyvandi F, Palla R, Lombardi R, Canciani M, Cairo A, Ardissino D, Bernardinelli L, Bauer K, Lawler J, Mannucci P: The thrombospondin1 N700S polymorphism is associated with early myocardial infarction without altering von Willebrand factor multimer size. Blood 2006, 118(4):1280–1283.CrossRefGoogle Scholar
 23.McCarthy J, Meyer J, Moliterno D, Newby L, Rogers W, Topol E: Evidence for substantial effect modification by gender in a largescale genetic association study of the metabolic syndrome among coronary heart disease patients. Human Genetics 2003, 114: 87–98. 10.1007/s0043900310261CrossRefPubMedGoogle Scholar
 24.Nejentsev S, Howson J, Walker N, Szeszko J, et al.: Localization of type 1 diabetes susceptibility to the MHC class I genes HLAB and HLAA. Nature 2007, 450(6):887–892.PubMedCentralCrossRefPubMedGoogle Scholar
 25.Hodgkinson A, Millward B, Demaine A: The HLAE locus is associated with age at onset and susceptibility to type 1 diabetes mellitus. Human Immunology 2000, 61(3):290–295. 10.1016/S01988859(99)001160CrossRefPubMedGoogle Scholar
 26.Riper D, Jayakumar L, Latchana N, Bhoiwala D, Mitchell A, Valenti J, Crawford D: Regulation of vascular function by RCAN1 (ADAPT78). Archives of Biochemistry and Biophysics 2008, 472: 43–50. 10.1016/j.abb.2008.01.029CrossRefPubMedGoogle Scholar
 27.Duerr R, Taylor K, Brant S, Rioux J, Silverberg M, Daly M, Steinhart A, Abraham C, Regueiro M, Griffths A, Dassopoulos T, Bitton A, Yang H, Targan S, Datta L, Kistner E, Schumm L, Lee A, Gregersen P, Barmada M, Rotter J, DL N, Cho J: A genomewide association study identifies IL23R as an inflammatory bowel disease gene. Science 2006, 314: 1461–1463. 10.1126/science.1135245PubMedCentralCrossRefPubMedGoogle Scholar
 28.Amos C, Chen W, Seldin M, Remmers E, Taylor K, Criswell L, Lee A, Plenge R, Kastner D, Gregersen P: Data for Genetic Analysis Workshop 16 Problem 1, association analysis of rheumatoid arthritis data. In BMC proceedings. Volume 3. BioMed Central Ltd; 2009:S2.Google Scholar
 29.Fan J, Lv J: Sure independence screening for ultrahighdimensional feature space. Journal of the American Statistical Association: Series B 2008, 70: 849–911. 10.1111/j.14679868.2008.00674.xGoogle Scholar
 30.Moore J, White B: Tuning ReliefF for genomewide genetic analysis. Lecture Notes in Computer Science 2007, 4447: 166–175. 10.1007/9783540717836_16CrossRefGoogle Scholar
 31.Wasserman L: All of statistics: a concise course in statistical inference. Springer Verlag; 2004.CrossRefGoogle Scholar
 32.Wang K, Dickson S, Stolle C, Krantz I, DB G, H H: Interpretation of association signals and identification of causal variants from genomewide association studies. The American Journal of Human Genetics 2010.Google Scholar
 33.Hirschhorn J, Daly M: Genomewide association studies for common diseases and complex traits. Nature Reviews Genetics 2005, 6(2):95–108.CrossRefPubMedGoogle Scholar
 34.Schork N, Murray S, Frazer K, Topol E: Common vs. rare allele hypotheses for complex diseases. Current opinion in genetics & development 2009, 19(3):212–219. 10.1016/j.gde.2009.04.010CrossRefGoogle Scholar
 35.Weale M, Depondt C, Macdonald S, Smith A, Lai P, Shorvon S, Wood N, Goldstein D: Selection and evaluation of tagging SNPs in the neuronalsodiumchannel gene SCN1A: implications for linkagedisequilibrium gene mapping. The American Journal of Human Genetics 2003, 73: 551–565. 10.1086/378098CrossRefPubMedGoogle Scholar
 36.Carlson C, Eberle M, Rieder M, Yi Q, Kruglyak L, Nickerson D: Selecting a maximally informative set of singlenucleotide polymorphisms for association analyses using linkage disequilibrium. The American Journal of Human Genetics 2004, 74: 106–120. 10.1086/381000CrossRefPubMedGoogle Scholar
 37.Agresti A: Categorical Data Analysis. second edition. Wiley Series in Probability and Statistics, Wiley and Sons INC; 2002.CrossRefGoogle Scholar
 38.Ma L, Runesha H, Dvorkin D, Garbe J, Da Y: Parallel and serial computing tools for testing singlelocus and epistatic SNP effects of quantitative traits in genomewide association studies. BMC Bioinformatics 2009, 9: 315.CrossRefGoogle Scholar
Copyright information
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.