Background

High-throughput genotyping advances have generated large amounts of genetic data. At the same time, numerous related phenotypes are often collected in genetic studies of complex traits. To understand how genetic variants influence multiple traits, it is necessary to consider correlation among variants and multiple traits jointly [1, 2]. However, most genetic studies have focused on single variant-single trait association. Analysis performed in this fashion faces at least 3 issues. First, it leads to concerns regarding false discoveries. Bonferroni andother types of correction may limit false discoveries, but implementing these methods on correlated traits limits the power to detect true associations. Second, same genetic variants may influence multiple related traits [3]. Thus, modeling traits separately may misrepresent the underlying biology. Lastly, failure to integrate correlation among variants because of linkage disequilibrium (LD) dilutes the association. Longitudinal studies provide an additional challenge because traits measured over time are likely to be correlated.

We propose a Bayesian joint modeling of longitudinally measured multiple traits in family genetic studies. Explicitly considering the correlation structure between multiple traits using multivariate random effects [4] and seemingly unrelated regression techniques [5], this model studies simultaneously multiple single-nucleotide variations (SNVs) and their associations with multiple traits measured longitudinally. Bayesian multiplicity [6] correction controls false positives. Our method uniquely demonstrates the ability of Bayesian methods to account for the complexity of multivariate, longitudinal data in the context of genetic family data while controlling for false positives. Simulated data from Genetic Analysis Workshop 18 (GAW18) illustrate the application of the proposed method. The analysis was performed with knowledge of the simulation model.

Methods

Bayesian bivariate model

We consider SNV association models with a longitudinal bivariate quantitative outcome in family studies. Let y i j t k be the k t h outcome of subject j of pedigree i at time t . Here k = 1 or 2 for bivariate outcomes; i = 1 , 2 , , I indexes I families in the study; j = 1 , 2 , , n i indexes n i subjects in pedigree i ; and t = 1 , 2 , m i j indexes time points where outcomes are measured for subject j of pedigree i . Here, the number of measurement m i j does not need to be equal for different subjects, but we assume the 2 outcomes are measured at the same set of time points for the same individual. We jointly model the outcomes as follows:

y i j t k = β 0 k + X i j T β x k + Z i j t T β z k + Σ g = 1 G γ g k β g k S N V i j g + p i k + s i j k + e i j t k

Here, X is a vector of subject-specific time-invariant covariates and Z is a vector of time-varying covariates. S N V i j g , g = 1 , 2 , G , is the g t h SNV of subject j of pedigree i , coded as 0, 1, or 2 indicating the number of minor alleles and G is the number of SNVs in the model. β s are regression coefficients (including an intercept) for the covariates or genetic variants. γ s are SNV-specific indicator variables, which take values 1 or 0, depending on whether the SNV has an effect on the given outcome or not. Let p i = { p i 1 , p i 2 } , a pedigree-specific random effect common to all the individuals in the same family and accounts for correlation within families. Let s i j = { s i j 1 , s i j 2 } , a subject-specific random effect accounting for the correlation between measures taken repeatedly on the same subject. Let e i j t = { e i j t 1 , e i j t 2 } , the residual error. Additionally, p i , s i j and e i j t are assumed to be independent, and

Random family : p i ~ N 2 0 , Σ p , Σ p = σ p 1 2 ρ p σ p 1 σ p 2 ρ p σ p 1 σ p 2 σ p 2 2 ;
(1)
Random subject : s i j ~ N 2 0 , Σ s , Σ s = σ s 1 2 ρ s σ s 1 σ s 2 ρ s σ s 1 σ s 2 σ s 2 2 ;
(2)
Residual error : e i j t ~ N 2 0 , Σ e , Σ e = σ e 1 2 ρ e σ e 1 σ e 2 ρ e σ e 1 σ e 2 σ e 2 2 .
(3)

This Bayesian bivariate model uses bivariate normal random family and subject effects, and a bivariate normal distribution for the residual errors, as in seemingly unrelated regressions, to jointly model the 2 traits and allow them to be correlated.

Bayesian multiplicity adjustment

Bayesian multiplicity adjustment is a Bayesian solution to multiplicity problems [6]. Through appropriate choices of priors on model probabilities, this Bayesian solution will not inflate type I error in the face of multiple comparisons. In our model, we use prior distributions on SNV inclusion to provide correction for multiplicity. For each SNV in the model, 2 parameters are present--the inclusion indicator and the regression coefficient. We assume that the inclusion indicators follow a Bernoulli distribution with an unknown inclusion probability q , while q follows a Beta distribution, that is, γ g k ~ B e r n o u l l i 1 , q and q ~ B e t a a , b . Here, we set a = 1 , b = G , which represents a sceptical prior such that the marginal prior odds of an association is 1 / G [7]. This Beta-Bernoulli type of prior provides an intrinsic multiplicity correction in the face of multiple comparisons. The significance of the Bayesian analysis is the posterior inclusion probability. Unlike traditional p values, a higher posterior inclusion probability indicates greater confidence of a true result. For each SNV, we estimate posterior inclusion probability as

p ^ γ g k = 1 | D = Σ M γ k S I γ g k = 1 p ^ M γ k | D , S

Here, S is the model space being visited by the Markov chain Monte Carlo (MCMC); p ^ M γ k | D , S is the estimated posterior probability of model M γ k , where γ k = { γ 1 k , γ 2 k , , γ G k } is a vector of 0's and 1's, indicating which SNVs are in the model or not. I γ g k = 1 = 1 if γ g k = 1 and 0 otherwise. True- and false-positive rates are based on the median probability model, which is the model that includes variants with a posterior inclusion probability larger or equal to 0.5. We use noninformative, independent,univariate normal priors for all regression coefficients β 0 k , β x k , β z k , and β g k , and noninformative inverse Wishart priors for variance-covariance matrices Σ p , Σ s , and Σ e .

Analysis of GAW18 data

Two outcomes, diastolic (DBP) and systolic blood pressure (SBP), were evaluated. Covariates included sex, age, and smoking status. To adjust for antihypertensive medications use, we calculated the mean difference of blood pressure (BP) between observations with medication use (med = 1) and those without medication (med = 0) among observations with hypertension (htn = 1);that is, ( B P ¯ | h t n = 1 , m e d = 1 ) - ( B P ¯ | h t n = 1 , m e d = 0 ) . Observed BP values were adjusted using mean differences for observations with htn = 1 and med = 1 to impute DBP and SBP measures without medication use. These adjusted outcomes were used in the analysis [8].

Included were 849 individuals from 20 pedigrees (average number of subjects per pedigree was 42; range: 21-75). Each individual has 3 observations. Our focus is on SNVs in MAP4 on chromosome 3. There are 894 SNVs in the gene; 14 influence both DBP and SBP, and 1 influences SBP. Among the 15 causal SNVs, three are common (minor allele frequency [MAF]>0.05); all others are rare. All analyses were based on replication set 1. To investigate effectiveness of Bayesian multiplicity adjustment, a random set of noise variants of size 1, 15, 30, 60, and 90 were added from MAP4. Selection of noise variants was truly random and did not account for LD with causal SNVs. After burn-in and thinning every 10th iteration, 50,000 samples were drawn for MCMC simulations.

We compared our method with two other methods. One method is a Bayesian univariate model where the off diagonal of the variance and covariance matrices in equations (1), (2), and (3) are zero, thus the 2 traits are modeled independently. The univariate model can handle longitudinal data and multi-variants, and uses Bayesian multiplicity techniques to adjust for multiple comparisons. The other is the family-based measured genotype approach (MGA), which is a standard approach to analyze family genetic studies and compares polygenic models with or without each SNV as a covariate [9]. MGA models a single outcome and single SNV, and cannot handle longitudinal data from multiple outcomes jointly. Thus, only the first pair of DBP and SBP measures, adjusted for medication use, of each individual was used and modeled separately. Bonferroni correction was used for multiple comparisons after accounting for 105 tests (90 noises and 15 causal).

Results

Using a threshold of 0.5 for the posterior inclusion probability, the Bayesian bivariate model detected 5 of the 15 SNVs between SBP and DBP with 1 noise variant (Table 1). As the number of noise variants increased, the number of true positives identified was reduced, such that by 90 noise variants, only two causal variants were identified. True positives had relatively low MAFs and large effect sizes (Table 1). False negatives either were rare, had small effect size, or both. Importantly, Bayesian multiplicity adjustment yielded no false positives. This suggests that a lower posterior probability threshold may be appropriate to yield a family-wise error rate of 0.05.

Table 1 Posterior inclusion probabilities of the causal SNVs

Using the same set of 90 noise variants, MGA detected five causal SNVs (4 for SBP and 1 additional for DBP). However, MGA identified 8 and 6 false positives for DBP and SBP, respectively. The Bayesian univariate model identified 2 causal SNVs and 1 false positive for both DBP and SBP.

As the 90 noise variants were randomly selected, examining the LD structure is also important. Among the true positives, two pairs (SNVs 5 and 8 and SNVs 7 and 10) have a high LD ( r 2 0 . 8 ) . SNVs 7 and 10 were both identified by the Bayesian bivariate model and MGA, but only SNV 10 was identified by the Bayesian univariate model. SNVs 5 and 8 were identified by MGA, but none was identified by either the Bayesian bivariate or univariate approach. For MGA, six of the false positives of DBP and four of the false positives of SBP had relatively high LD ( r 2 0 . 8 ) with identified true positives. After discounting false positives caused by indirect effects, MGA had 2 false positives for both DBP and SBP.

Overall the Bayesian bivariate model resulted in posterior estimates of beta values very similar to the reported effect size from Genetic Analysis Workshop (GAW) data generators (Figure 1). However, estimates of effect sizes were inflated when only 1 of the 2 variants in high LD and having the same effect direction was included in the model during MCMC simulation (e.g. SNVs 7 and 10). The estimated effect size for SNV 8 (true effect size ≈ 0) was also inflated because it is in high LD with SNV 5 (true effect size <0). In addition, for many causal SNVs there was a high posterior exclusion probability (the proportion of times the variant was not included in the model), suggesting that some beta estimates were based on a small number of MCMC runs.

Figure 1
figure 1

Posterior exclusion probability and posterior density of regression coefficients of the causal SNVs. For the Bayesian bivariate model with 90 noise variants, the plot shows, for the 15 causal SNVs, (a) estimated posterior exclusion probabilities (red for DBP and blue for SDP), and (b) posterior density of regression coefficients (red for DBP and blue for SBP) when the SNVs were included in the model. Dashed reference lines indicate simulated effect sizes.

Discussion

We developed a novel Bayesian model for analysis of multiple longitudinal traits and multi-variant models in family genetic studies. This is a significant advance because previous studies have looked at bivariate models, longitudinal data, or multi-SNV models separately. For the first time, we have developed an analytic approach that can incorporate all of these issues jointly. The model considers bivariate random family and random subject effects to account for correlation between outcomes, family members, and repeated measures of the same individual. The seemingly unrelated regression technique is used to correlate the residuals of the 2 outcomes measured at the same time for the same individual. Inherent in this method is Bayesian multiplicity for the control of false positives. Using the GAW18 simulated data set, we demonstrated the feasibility of the proposed model.

Compared to the Bayesian univariate model in which the 2 outcomes are modeled separately, our method had similar power, but fewer false positives. Compared to MGA, our Bayesian bivariate approach had fewer false positives regardless of LD between causal and noise SNVs. While the noise variants in LD with our causal variants did not result in increased false positives using the Bayesian approach, given the multi-variant nature of the analysis it is possible that the LD in these noise parameters may have reduced power to detect causal effects. Although different from our proposed method in many aspects, MGA represents a standard practice for analysis of family-based genetic studies. This comparison of different approaches is important because it demonstrates that given the low false positives, the Bayesian model is potentially better than current standard practice.

Several other studies have shown improved power of joint modeling of multiple traits over univariate analyses [1013]. However, we saw a modest reduction in false positives without much increase in power. This may be a result of the fact that although the correlation between SBP and DBP is modest ( r 2 = 0 . 55 ) , these 2 outcomes depend on almost exactly the same set of variants; efficiency gains by utilizing a multivariate approach in this case are negligible. In contrast, large efficiency gains are expected when the outcomes depend on different covariate sets [11]. Furthermore, our model does not explicitly model the genetic relationship between individuals; instead, we used a random family effect. Although we do not expect the inclusion of a kinship coefficient matrix to influence the results, future studies should examine this further.

It is important to note that the model inference is based on computationally expensive MCMC simulation. Although we didn't use any special algorithms to efficiently sample the model space and the number of iterations was relatively small given the extremely large model space, the posterior inclusion probabilities became stable very early in the MCMC chain. This suggests that posterior inclusion probabilities can provide valid model inference. Sensitivity analyses with more optimistic priors on inclusion probabilities gave more power while controlling false positives. However, prior knowledge or evidence must support the use of this type of prior.

Conclusions

In summary, we introduced a Bayesian model for analysis of bivariate quantitative traits measured longitudinally in family genetic studies. This method extends the previous joint modeling methods and permits simultaneous analysis of multiple traits with longitudinal data. Furthermore, this method incorporates multi-variant effects while effectively controlling the false-positive rate.