Joint analysis of multiple phenotypes: summary of results and discussions from the Genetic Analysis Workshop 19
- 1.3k Downloads
For Genetic Analysis Workshop 19, 2 extensive data sets were provided, including whole genome and whole exome sequence data, gene expression data, and longitudinal blood pressure outcomes, together with nongenetic covariates. These data sets gave researchers the chance to investigate different aspects of more complex relationships within the data, and the contributions in our working group focused on statistical methods for the joint analysis of multiple phenotypes, which is part of the research field of data integration. The analysis of data from different sources poses challenges to researchers but provides the opportunity to model the real-life situation more realistically.
Our 4 contributions all used the provided real data to identify genetic predictors for blood pressure. In the contributions, novel multivariate rare variant tests, copula models, structural equation models and a sparse matrix representation variable selection approach were applied. Each of these statistical models can be used to investigate specific hypothesized relationships, which are described together with their biological assumptions.
The results showed that all methods are ready for application on a genome-wide scale and can be used or extended to include multiple omics data sets. The results provide potentially interesting genetic targets for future investigation and replication. Furthermore, all contributions demonstrated that the analysis of complex data sets could benefit from modeling correlated phenotypes jointly as well as by adding further bioinformatics information.
KeywordsJoint Model Gene Expression Measure Copula Function Copula Model Latent Growth Curve Model
For Genetic Analysis Workshop (GAW) 19, a large collection of different types of data were provided . Researchers were able to use both systolic (SBP) and diastolic blood pressure (DBP) phenotypes, measured at multiple time points, gene expression measures, and sequencing data, as well as single nucleotide polymorphisms (SNPs) from families and unrelated individuals. This enabled participating researchers to investigate a multitude of complex questions, which often involved combining data from various sources.
Blood pressure and gene expression as multiple phenotypes
Building on the above arguments, it is intuitive that a joint analysis of multiple BP and gene expression phenotypes together with the underlying genotypes may allow building a biologically more meaningful model. This can help to not only increase the face validity and confidence in findings from the analyses, but can also allow estimating the genetic effects more efficiently with higher power.
Investigated phenotypes, transformations, and adjustments
Overview of the analyzed sample and data in the contributions
GE and genetic data
Cao et al. 
n = 397 individuals in 46 families, from family data set
Real data: SBP at time point 3
GE and SNP data: k = 11,522 transcripts, l = 354,893 SNPs
Of top 1000 variables associated with BP, 575 are SNPs and 425 are GE, 302 have plausible relevance for BP, 173 are associated with body weight, and 84 associated with left ventricular contractility
Konigorski et al. 
n = 81 unrelated individuals, from family data set
Real data: SBP at time point 1
GE and WGS data on chromosome 19: k = 848 transcripts, l = 68,727 SNVs
R functions, available upon request
Higher power of bivariate copula models compared to univariate regression and univariate SKAT, SKAT-O
Identification of 5 SNVs in CEACAM5 gene relevant for SBP, and 1075 cis-eQTLs relevant for GE
Song et al. 
n = 1389 individuals from family data set
Real data: SBP and DBP at time points 1–3
SNP data: l = 460,359 SNPs
The 2 tested models (autoregressive and latent growth curve) show similar ranking of relevant SNPs
Identification of 10 SNPs related to both SBP and DBP, mostly on chromosome 1
Sun et al. 
n = 1851 unrelated individuals, from unrelated data set
real data: SBP and DBP
WES data: l = 152,337 SNVs
R functions, available upon request
Multivariate tests tend to give smaller p values than the univariate SKAT, and can improve power
Identification of 2 SNPs in CYP4A22 and near APOC4, which were previously reported to be associated with BP
Dealing with the high dimensionality of the data
For genome-wide association studies (GWAS) in general, a paramount question is how to approach the high dimensionality of SNVs, with solutions of either testing SNVs or sets of SNVs sequentially, or by using some dimension reduction approach or penalized model to analyze all the data simultaneously. Whether a statistical method is applicable to high dimensional data, is dependent on whether it is computationally fast enough to fit a large number of models, sequentially testing each SNV, or whether it is both computationally fast enough and able to fit 1 very high-dimensional model including all SNVs (in the genome or in a chromosome, for example). This challenge is amplified when multiple phenotypes are considered, which increases the dimensionality problem further. Three of the contributions – approached this issue by restricting the analysis to prespecified models (see Fig. 2) with assumed relationships between phenotypes and SNVs. These models were then fitted sequentially for different SNVs or sets of SNVs with copula models , structural equation models (SEM) , or multivariate linear mixed effect regression . As another novel approach to the high dimensionality, Cao et al.  used a sparse representation based variable selection (SRVS) to extract relevant genes based on signatures from the entire data, including all SNVs and all gene expressions. In this regard, all approaches could be described as different forms of meta-dimensional approaches, with either using a sparse statistical model and analyzing all the SNP and gene expression data simultaneously, or by posing hypothesis-based restrictions and testing triplets of SBP, DBP, and SNVs or of SBP, gene expression, and SNVs sequentially. In more detail, Cao et al.  analyzed 11,522 gene expression measures, 354,893 SNPs and 6 nongenetic variables simultaneously. Konigorski et al.  and Song et al. , each tested models incorporating multiple phenotypes (SBP and 1 gene expression measure , or 3 longitudinal SBP and 3 DBP measures ) together with 1 SNV/SNP, sequentially for 68,727 SNVs  and 460,359 SNPs . Finally, Sun et al.  tested SBP and DBP conditional on the nongenetic covariates for the association with 152,337 SNVs sequentially, and with multimarker tests allocating these SNVs into 10,886 genes or 13,094 windows and testing them sequentially.
As another view on the approaches, Konigorski et al. , Song et al. , and Sun et al.  applied new methodologies and implementations for models of multiple dependent outcome measures to integrate BP and gene expression as phenotypes in the analysis. On the other hand, Cao et al.  looked at gene expression at the same level as genotypes to derive joint signatures relevant for BP, and applied a new approach to extend the literature on how to integrate and mine multiple covariates for interesting structures. Hence, the approaches focus on slightly different aspects of transcriptional and translational processes.
Additional hypothesis-based restrictions were sometimes used to decrease the number of tested hypotheses and the number of estimated parameters; for example, the assumption that the effects of the same variant on different phenotypes have the same correlation  or the restriction to analyze cis-acting SNVs with respect to gene expression and disregard trans-acting SNVs .
Classical multiple phenotype analysis
A multivariate method in the classical sense was applied by Sun et al. . Motivated by the search for pleiotropic genetic variants, which might help to shed more light on the genetic architecture of complex traits, they used their recently developed multivariate rare-variant association test (MURAT) to test for associations between BP and genetic variants. The underlying model of MURAT relates multiple phenotypes to a group of genotypes and covariates with a multivariate linear mixed effect model, assuming that the phenotypes are randomly distributed yet correlated and that effect of the genetic variants is normally distributed. Restrictions to the correlation structure involve the assumption that effects of different variants are uncorrelated but that the effect of the same variant on different phenotypes may be correlated. In the analysis, Sun et al. used SBP and DBP as highly correlated phenotypes to be associated with the sequencing data of unrelated individuals. They applied MURAT to the sequence data of the 1943 unrelated individuals and log-transformed BP measurements, which showed a correlation of r = 0.542. In their analysis, the authors compared the results with the frequently used sequence kernel association test (SKAT) . Both tests were first used as single-variant tests, restricted to exomic variants on odd-numbered chromosomes with at least 4 carriers. For the application as region-based tests, they used 10,866 gene regions based on the hg19 annotations and also 13,094 nonoverlapping windows of 30 kb. Additionally, the authors looked at 15 candidate genes known to be associated with hypertension. A score type statistic with analytically computable p values allows the application in genome-wide analyses. Besides identifying potentially interesting candidate SNPs for hypertension, the authors found evidence for an increased power when applying the multivariate test, compared to the univariate SKAT method.
Song et al.  used a SEM method for pedigrees  allowing for measurement error of the observed BP values, to search for SNPs which affect the BP changes over time. In the measurement part of the model, SBP and DBP are assumed to be linearly related to the SNP and the latent variable, as illustrated in Fig. 2, assuming multivariate normality . In the structural part of the model, the relationship between the 3 latent variables was modeled in 2 ways, with a first-order autoregressive model as well as a latent growth curve model. More specifically, in the first-order autoregressive model, the genotype affects only the first latent variable and all following variables depend only on their respective predecessor, and in the latent growth curve model, the effect of the SNP on the latent variables is constant. With a score test developed from their original R package strum , the authors tested the effect of each SNP separately on the BP traits. Because of a vast amount of missing data, Song et al. excluded the fourth time point from their analysis. In their analysis, the authors found weak associations signals for 10 SNPs in both models. The SNPs have not been reported in previous GWAS and could be investigated further in future studies. These results indicate that with the computationally efficient score test, SEMs of complex relationships between multiple phenotypes can be investigated with strum on a genome-wide scale.
Data integration methods of multilevel biological data
Two of our groups incorporated data from different omics technologies in the form of genotypes and gene expression to make use of more information. Both used the provided family data set as it contained both genotype and gene expression data.
Konigorski et al. followed up on their work from GAW18  and extended their analysis to gene expression data  for a more biologically meaningful model, and to the analysis of rare variants. In their model, the dependence between the gene expression of a transcript with BP is used for building joint models of both phenotypes conditional on genetic variants. The relationship between gene expression and BP is assumed to be undirected, because it is not clear a priori whether for a particular transcript, the effect is from gene expression on BP or vice versa. For example, the mRNA expression of some genes can be expected to influence BP levels through complex translational processes, but also, BP could be thought of as an indicator of, for example, the stress level of an individual, which can be characterized by changes in the blood level of different hormones and proteins. These hormone levels could in turn interact with mRNA transcripts and modulate gene expression, resulting in an “effect” of BP on the gene expression levels. The authors used copula functions to model the joint distribution of SBP and gene expression conditional on SNVs, separately for all SNVs within the gene boundaries, to investigate genetic effects on both phenotypes while considering their dependence structure. In the marginal models, the SNV is linearly related to the phenotype. They restricted their analysis to chromosome 19, as it contains the transcript/gene with the highest association with SBP and hence a joint model could benefit the most compared to a univariate model. Genetic variants had to have at least 3 copies and must be located within 5 kb of a gene to be considered. Hence, for the effect on gene expression, only cis-acting effects were considered. They compared their results with results from single-marker univariate models of SBP and gene expression and also the gene-based SKAT  and SKAT-O (optimal sequence kernel association test) . While there was no indication for inflated type 1 errors under the proposed copula approach, the results indicate that joint models using copula functions can estimate the genetic effects of both common and rare variants more efficiently and with higher power compared to standard univariate regression models as well as the popular multimarker tests of SKAT and SKAT-O.
Cao et al. applied their previously developed SRVS  to identify signatures from SNPs and mRNA expression which are associated with BP . After regressing out the nongenetic covariates from the observed BP, the resulting residuals are related to all SNPs and transcripts in a linear regression model. This high-dimensional model is solved with their SRVS algorithm to select SNPs or transcripts. In the approach, combinations of SNPs and transcripts are repeatedly randomly selected and used for a least-squares optimization for minimal regression coefficients. Variables with nonzero regression coefficients are considered as selected. For the top 1000 selected variables, they performed a bioinformatics analysis to investigate the potential biological relevance of these variables. For each selected SNP or gene they looked for associations with BP (logarithm of odds [LOD] score >3) using the rat genome database tool  based on the Human Genome Assembly GRCh37. Approximately 56 % of the SNPs were identified as BP related and could be interesting targets for follow up in future studies. These results show the applicability of the SRVS algorithm to a large data set including multiple data sources with the potential to identify interesting biomarkers.
When summarizing insights from the 4 contributions, it was demonstrated that substantially different joint statistical models of multiple phenotypes are available for analysis, that their implementation is computationally efficient enough for an analysis on a genome-wide scale, and can be applied to multiple omics data sets without modification  or by posing further hypothesis-based constraints or through extensions of the approaches –. Complex molecular relationships of interest can be hypothesized from previous knowledge and investigated, or also, the entire data can be explored as a starting point to deriving meaningful signatures relevant for BP. All approaches analyzed the real data with the focus on applying new methodologies, showing their potential usefulness, and comparing the results to established approaches. In all comparisons to standard univariate approaches, promising results were described which suggest that using the information from multiple phenotypes can increase the power for identifying relevant genetic variants, even when considering a power decrease in joint models because of a smaller sample size with complete information on all phenotypes. For example, when using gene expression data from unrelated individuals of the San Antonio family study, the sample size available for analysis was n = 81 . Future work in this area could investigate adapted imputation models and extend the approaches to model the family structure, to increase the available sample size for analysis. Because the simulated phenotypes weren’t used in the analyses, future studies are also needed to evaluate the empirical type 1 error and power of the proposed test statistics.
As Cao et al.  demonstrated, subsequent bioinformatics analysis can help in the interpretation of the obtained results, and complement both model-based statistical approaches and unsupervised methods. Starting from identified associations of genetic variants, using causal models or developing appropriate goodness-of-fit tests for effects in different directions could each be very interesting roads for future extensions and future work for the fine-mapping of effects in the complex biological systems.
Finally, in addition to the methodological development of multivariate statistical approaches, an inherent premise in all contributions was that a careful selection, adjustment, and statistical modeling of each of the multiple phenotypes is essential for obtaining reliable results. With this, what can be more intuitive than to recall the classical genetic definition of a phenotype, and model the multiple observed traits of a phenotype jointly, which each contribute and explain parts of the entire organism.
We thank the Group 5 meeting participants Celia Greenwood, Gengxin Li, Angel Martinez-Perez, Yeunjoo Song, and Yin Yao for a lively and productive discussion.
We thank Celia Greenwood, Yeunjoo Song, Yin Yao, and Yildiz Yilmaz for valuable comments, which helped to improve this manuscript. The GAW19 was supported by NIH grant R01 GM031575.
This article has been published as part of BMC Genetics Volume 17 Supplement 2, 2016: Genetic Analysis Workshop 19: Sequence, Blood Pressure and Expression Data. Summary articles. The full contents of the supplement are available online at www.biomedcentral.com/bmcgenet/supplements/17/S2. Publication of the proceedings of Genetic Analysis Workshop 19 was supported by National Institutes of Health grant R01 GM031575.
- 1.Blangero J, Teslovich TM, Sim X, Almeida MA, Jun G, Dyer TD, et al: Omics-squared: human genomic, transcriptomic and phenotypic data for Genetic Analysis Workshop 19. BMC Proc. 2015, 9 (Suppl 8): S2-Google Scholar
- 8.Cao H, Guo W, Qin H, Xu M, Tao Y, Lehrman B, et al: Integrating multiple genomic data: sparse representation based biomarker selection for blood pressure. BMC Proc. 2015, 9 (Suppl 8): S39-Google Scholar
- 9.Konigorski S, Yilmaz YE, Pischon T: Genetic association analysis based on a joint model of gene expression and blood pressure. BMC Proc. 2015, 9 (Suppl 8): S39-Google Scholar
- 10.Song YE, Morris NJ, Stein CM: Structural equation modeling with latent variables for longitudinal blood pressure traits using general pedigrees. BMC Proc. 2015, 9 (Suppl 8): S41-Google Scholar
- 11.Sun J, Bhatnagar S, Oualkacha K, Ciampi A, Greenwood C: Joint analysis of multiple blood pressure phenotypes in GAW19 data by using a multivariate rare-variant association test. BMC Proc. 2015, 9 (Suppl 8): S42-Google Scholar
- 13.Huan T, Esko T, Peters MJ, Pilling LC, Schramm K, Schurmann C, et al: A meta-analysis of gene expression signatures of blood pressure and hypertension. PLoS Genet. 2015, 11 (3): 10.1371/journal.pgen.1005035. Article ID e1005035Google Scholar
This article is published under license to BioMed Central Ltd. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.