Analysis of baseline, average, and longitudinally measured blood pressure data using linear mixed models
- 902 Downloads
This article compares baseline, average, and longitudinal data analysis methods for identifying genetic variants in genome-wide association study using the Genetic Analysis Workshop 18 data. We apply methods that include (a) linear mixed models with baseline measures, (b) random intercept linear mixed models with mean measures outcome, and (c) random intercept linear mixed models with longitudinal measurements. In the linear mixed models, covariates are included as fixed effects, whereas relatedness among individuals is incorporated as the variance-covariance structure of the random effect for the individuals. The overall strategy of applying linear mixed models decorrelate the data is based on Aulchenko et al.'s GRAMMAR. By analyzing systolic and diastolic blood pressure, which are used separately as outcomes, we compare the 3 methods in identifying a known genetic variant that is associated with blood pressure from chromosome 3 and simulated phenotype data. We also analyze the real phenotype data to illustrate the methods. We conclude that the linear mixed model with longitudinal measurements of diastolic blood pressure is the most accurate at identifying the known single-nucleotide polymorphism among the methods, but linear mixed models with baseline measures perform best with systolic blood pressure as the outcome.
KeywordsMinor Allele Frequency Longitudinal Measurement GWAS Data Genetic Analysis Workshop Vary Coefficient Model
Hypertension is a major morbidity and mortality risk factor for stroke, myocardial infarction, heart failure, and end-stage renal disease . It is a multifactorial disorder resulting from inheritance of several susceptibility genes, as well as multiple environmental determinants, including weight control, dietary intake, physical activity, and alcohol consumption . To date, several variants have been identified by genome-wide association studies (GWAS) as being associated with blood pressure and hypertension [1, 3, 4]. Various statistical, data mining, and machine learning strategies have shown some promise for identifying genetic variants, but are not scalable to large-scale GWAS [5, 6]. Linear mixed models (LMMs) are widely used in controlling for phenotypes and relatedness within GWAS . In the application of LMMs for GWAS data the covariates are included as fixed effects, whereas kinship among individuals is incorporated as a variance-covariance structure of the random effect for the individuals. We followed Aulchenko et al's  residual approach, which is based on a 2-step strategy in the application of the LMM. The first step optimizes a reduced LMM with the genetic marker effect excluded. In the second step, the residual from the reduced LMM is fitted as the dependent variable to test each marker in a linear model.
We give an overview of 3 LMMs for the analysis of Genetic Analysis Workshop 18 (GAW18) data, paying attention to the power of selecting an associated single-nucleotide polymorphism (SNP) from chromosome 3 and simulated phenotype data. In particular, we apply 3 types of LMMs for statistical analysis of baseline measurements, mean measurements, and longitudinal data. We compare the LMMs through simulations and illustrate them using the real phenotype data.
Data and quality control
We use 3 models to analyze the GWAS data set from chromosome 3 of the GAW18: a Diabetes-GENES Project, which consists of whole genome sequence data in a pedigree-based sample, longitudinal phenotype data for hypertension and related traits, and selected covariates. In this GWAS data, 65,519 SNPs have been genotyped for chromosome 3. In the simulated phenotype data, 849 subjects were measured at 3 time points for age, medication use, smoking status, and blood pressure. As is standard practice, SNPs with minor allele frequency (MAF) <1% were excluded from data analysis. We also filtered out SNPs with low call rates (<90%) and deviation from Hardy-Weinberg equilibrium (p value ≤ 1e−6). The quality controls were implemented using the R package SNPassoc . In addition, we excluded 4 individuals because more than half of their SNP values were missing. After filtering, a total of 27,313 SNPs and 845 samples met our quality-control criteria and were used for analysis. The family relationships among these individuals were copied from the pedigree on the real data.
Statistical analysis to evaluate the effect of SNPs
We consider 3 LMMs to evaluate the effect of SNPs on systolic blood pressure (SBP) and diastolic blood pressure (DBP) separately.
Model 1: GRAMMAR approach for baseline measures analysis
Aulchenko et al.  proposed a genome-wide rapid association using mixed model and regression (GRAMMAR) to assess significance of the effect of a polymorphism. The method first obtains residuals adjusted for family effects and then analyzes the association between these residuals and genetic polymorphisms using least-squares methods. The model is expressed as follows:
Initial model. The initial model is , where y ij is the value of phenotype corresponding to the jth individual in the ith pedigree, x ijk is the value of the kth covariate or fixed effect, is an estimate of the kth fixed effect or covariate, and e ij is the vector of residual effects. G is the random polygenic effect that follows a multivariate normal distribution with mean 0 and variance where is the relationship matrix (kinship matrix) and the additive genetic variance as a result of polygenes. The vector of estimated residuals is given by .
SNP model. The residuals are used as the dependent trait in a simple linear regression for each SNP, , where is the coefficient of the lth SNP from the model 1 scenario. The method adjusts for familial relationship and is computationally fast, but the model only considers the time 1 information from the GAW18 data. The first stage model is implemented using the polygenic() function of the R package GenABEL, and the kinship matrix is estimated using the R package kinship2. Next, the lm() function is used for fitting the linear model with residuals obtained from the first-stage model.
Model 2: Two-stage LMM for mean measured outcome analysis
We considered the measurement of the mean across the 3 time points as the outcome and followed the 2-stage approach. The model formula for the first stage is , where denotes the mean phenotype across the time points for the jth individual in the ith pedigree. is the coefficient for unknown fixed effects representing nongenetic effects (mean age across time points, sex, smoking status at time 1, and medication use at time 1), and G is the random polygenic effect that follows a multivariate normal distribution with mean 0 and variance where K is the kinship matrix with elements calculated from pedigrees, and is an unknown genetic variance; e is a vector of random residual effects that are normally distributed with zero mean and variance-covariance where I is the identity matrix and is the unknown residual variance.
In the second stage we consider the residuals as the outcome and fit the linear model, where is the coefficient of the lth SNP from the model 2 scenario. We implemented the model using the R package kinship (R 2.10.1). The lmekin() function is used to obtain the residuals from the first-stage model.
Model 3: Two-stage LMM for longitudinal analysis
We also evaluate a 2-stage LMM that takes longitudinal measurements into account. We consider 2 models, without and with time effects, in the application of the longitudinal analysis. In the first stage, we fit a random intercept LMM as follows:
Next, we extend the first-stage model allowing time points:
where y ijt denotes the phenotype (SBP or DBP) for the jth individual in the ith pedigree at time t. x ijt is the k fixed effect time-dependent covariate, v is the slope coefficient for the time points z t ; t = 1, 2, 3, and G is the random polygenic effect as in model 2; e is a vector of random residual effects that are normally distributed with zero mean and covariance , and is the unknown residual variance. Then we consider the mean of the residuals across the time points as the outcome and fit the model where is the vector of mean residuals across the time points and is the coefficient of the lth SNP from the model 3 scenario. We applied the model using the R packages pedigreemm and kinship.
Simulated data analysis
We investigated the performance of the 3 LMMs for selecting a known associated SNP from simulation studies. The 3 models are employed after adjusting for covariates and pedigree information, and the p values for each SNP are used to rank the SNPs.
The simulated phenotype data in GAW18 has 10 known SNPs from chromosome 3 that are associated with blood pressure. Among these SNPs, 2 have MAF >0.05. These 2 variants are rs6442089 (gene symbol: MAP4, position: 47956424, and MAF: 0.367) and rs1131356 (gene symbol: FLNB, position: 58109162, and MAF: 0.488). We investigated the 3 LMMs in terms of selection performances of rs6442089. We selected rs6442089 because it is a well-known SNP from the gene MAP4 that affects blood pressure.
Application to real data
Top 5 SNPs by the 3 LMMs considering the outcome SBP
Rs10511379 (1.651 e-03)
Rs2366104 (1.810 e-03)
In this article we applied 3 LMMs to the study of GAW18 in family data and in settings of relevance to baseline measures, mean measures, and longitudinally measured data. The statistical analysis of GWAS for GAW18 data using LMMs with longitudinal DBP measurements is capable of revealing the dynamic pattern of genetic control over chromosome 3 but did not perform competitively with other models for longitudinal SBP measurements. Exploratory/graphical analysis for the trajectories of SBP and DBP measurements also supported the conclusion that DBP had more subject-specific variability in slopes than SBP. However, the GRAMMAR approach with single-measure SBP data at baseline can be used on the development of SNP selection.
A general consideration applicable to all the methods discussed here concerns the issue of whether the outcome is linear or nonlinear. An alternative approach could be to relax the conditions imposed on linear models and explore the hidden structure by using a varying coefficient model . Consequently, it will be interesting to apply another method assuming the effects of SNPs are smooth functions of time.
We showed that a linear mixed modeling framework was most accurate at identifying known single-nucleotide polymorphism compared to other competing methods we considered in this manuscript for the analysis of longitudinal measurements of diastolic blood pressure. In contrast, baseline measures performed best with systolic blood pressure highlighting that, depending on the trajectory profile of the quantitative trait of interest, either just baseline values or serially measured values can be useful in genetic association studies.
AH acknowledges post-doctoral fellowship funding from the Drug Safety and Effectiveness Cross-Disciplinary Training (DSECT) Program. JB would like to acknowledge Discovery Grant funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) (grant number 293295-2009) and Canadian Institutes of Health Research (CIHR) (grant number 84392). JB holds the John D. Cameron Endowed Chair in the Genetic Determinants of Chronic Diseases, Department of Clinical Epidemiology and Biostatistics, McMaster University. The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575. We would like to thank two anonymous reviewers and the editor for insightful comments that improved the presentation and clarity of our manuscript.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
- 2.Carnethon MR, Evans NS, Church TS, Lewis CE, Schreiner PJ, Jacobs DR, Sternfeld B, Sidney S: Joint associations of physical activity and aerobic fitness on the development of incident hypertension: coronary artery risk development in young adults. Hypertension. 2010, 56: 49-55. 10.1161/HYPERTENSIONAHA.109.147603.PubMedCentralCrossRefPubMedGoogle Scholar
- 4.Tabara Y, Kohara K, Kita Y, Hirawa N, Katsuya T, Ohkubo T, Hiura Y, Tajima A, Morisaki T, Miyata T, et al: Common variants in the ATP2B1 gene are associated with susceptibility to hypertension: the Japanese Millennium Genome Project. Hypertension. 2010, 56: 973-980. 10.1161/HYPERTENSIONAHA.110.153429.CrossRefPubMedGoogle Scholar
- 8.Aulchenko YS, de Koning DJ, Haley C: Genome wide rapid association using mixed model and regression: a fast and simple method for genomewide pedigree-based quantitative trait loci association analysis. Genetics. 2007, 177: 577-585. 10.1534/genetics.107.075614.PubMedCentralCrossRefPubMedGoogle Scholar
- 10.Fan J, Zhang JT: Functional linear model for longitudinal data. J R Stat Soc Series B Stat Methodol. 2000, 55: 757-796.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.