Identifying rare-variant associations in parent-child trios using a Gaussian support vector machine
Abstract
As the availability of cost-effective high-throughput sequencing technology increases, genetic research is beginning to focus on identifying the contributions of rare variants (RVs) to complex traits. Using RVs to detect associated genes requires statistical approaches that mitigate the lack of power with the analysis of single RVs. Here we report the development and application of an approach that aggregates and evaluates the transmissions of RVs in parent-child trios. An initial score that incorporates the distortion in transmission of the observed RVs from the parents to their offspring is calculated for each variant. The scores are analyzed using a support vector machine that handles these data by mapping the transmission distortion of the multiple RVs into a one-dimensional score in a nonlinear fashion when parent-child trios with affected and nonaffected children are contrasted. We refer to this approach as Trio-SVM. A total of 275 trios were available in the Genetic Analysis Workshop 18 data for analysis. Because of their nonindependence and the extended linkage disequilibrium (LD) within pedigrees, Trio-SVM was vulnerable to type I errors in detecting association. Using the GAW18 data with simulated trait values, Trio-SVM has an appropriate type I error, but it lacks power with a sample of 267 trios. Larger samples of 500 to 1000 trios, derived from combining the simulated data, provided sufficient power. Two chromosome 3 candidate genes were tested in the real GAW18 data with Trio-SVM, and they showed marginal associations with hypertension.
Keywords
Support Vector Machine, Genetic Analysis Workshop, GAW18 Data, Transmission Distortion, Simulated Trait
Background
Genome-wide association studies (GWAS) of common variants have not explained the heritability estimates of common complex disorders [1]. In response, exome sequencing, which is designed to reveal rare variants (RVs) with a frequency below a threshold in the range of 1% to 5%, is being applied to pursue additional risk genes. Interpretation of RVs is best done for Mendelian disorders within pedigrees to identify significant loci and avoid artifacts of the sequencing process. However, for complex disorders and quantitative traits, RVs that segregate only within a few pedigrees do not provide adequate statistical power to implicate a particular gene when they are analyzed alone. Approaches to solve this problem involve the aggregation of RVs within genes and regions. We developed an approach, called Trio-SVM, using the support vector machine (SVM) method that aggregates and tests the RVs of a gene for a dichotomized trait in parent-child trios [2]. Parent-child trios are used to test association through distortions in transmission from the parents to their children. The advantages of this approach are that (a) the transmission of RVs can be aggregated across genes and compared with their aggregation in controls by the SVM, and (b) population stratification is mitigated because only parents with the RV provide information in the analysis. That is, differences in the frequencies of RVs among ethnic groups will have no effect on the test statistic because only opportunities for transmission in parents heterozygous for RVs contribute to the transmission distortion data used by the SVM.
With Trio-SVM, all members of the trios are sequenced for RVs, and the observed RV transmissions are compared with what is expected, given the parental RV genotypes. Transmission distortions in a gene are combined using SVM. The area under the receiver operating characteristic curve (AUC), generated by the SVM when contrasting transmission between affected and unaffected children, is estimated for each gene under analysis and used as the test statistic. The strength of Trio-SVM is that it allows each RV either to confer risk or to be protective while still contributing to an overall score in which the direction of the effect for each RV is not a factor.
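The AUC used as the test statistic can be illustrated with a minimal sketch. It assumes the standard pairwise (Mann-Whitney) definition of the AUC, which the text does not spell out; the function name `auc` and the toy scores are hypothetical and are not taken from the authors' MATLAB code.

```python
def auc(group_a_scores, group_b_scores):
    """Pairwise AUC: the fraction of (a, b) pairs in which the group-a score
    exceeds the group-b score, with ties counted as one half.
    Assumption: this is the usual Mann-Whitney form of the AUC."""
    if not group_a_scores or not group_b_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for a in group_a_scores:
        for b in group_b_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(group_a_scores) * len(group_b_scores))


# Toy composite scores for two groups of children (hypothetical values).
theta_hat = auc([0.9, 0.4, 0.7], [0.1, 0.5])
```

A value near 0.5 indicates that the composite scores do not separate the two groups; values above 0.5 indicate separation in the direction tested.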
One potential concern is the availability and choice of control groups for the SVM. First, the control sample should be beyond the age of risk for the disorder under analysis and have appropriate environmental exposures when those are known to be important. Second, to provide an opportunity for transmission of RVs from parents to their children, ethnic matching, although not necessary, may be helpful. An interesting choice might be the unaffected siblings from the trios in the study because they would have the same opportunities to inherit the RVs that are transmitted to their affected siblings.
Methods
An overview of support vector machine
The purpose of the SVM is to discriminate between two groups using a set of variables. It is particularly useful when the number of variables is greater than the number of individuals in the data set. SVM is based on a model with N ordered pairs $\left({y}_{i},{x}_{i}\right)$, where ${y}_{i}$ is a binary outcome with a vertex −1 assigned to one group and +1 to the other, and ${x}_{i}=\left({x}_{ij}\right),j=1,2,\ldots,M$, is a vector with M predictors.
If "." denotes the dot product and "^" the parameter estimate, SVM constructs two hyperplanes in space, ${H}_{1}:{x}_{i}.w+b=-1$ and ${H}_{2}:{x}_{i}.w+b=+1$, in which the weights w and the offset b are estimated to maximize the separation of $\frac{2}{\|w\|}$ between ${H}_{1}$ and ${H}_{2}$, with the constraint ${y}_{i}\left({x}_{i}.w+b\right)-1\ge 0,\ \forall i$ (i.e., all of the observations of two groups are separated by the two hyperplanes). The optimization is equivalent to minimizing ${L}_{p}=\frac{1}{2}{\|w\|}^{2}+\sum_{i=1}^{N}{\alpha}_{i}\left(1-{y}_{i}\left({x}_{i}.w+b\right)\right)$ with respect to w and b, where ${\alpha}_{i}\ge 0$ are Lagrange multipliers. Geometrically, $\hat{w}$ is a function of ${N}_{s}$ support vectors, with nonzero ${\alpha}_{i}$, that are located on the margins of ${H}_{1}$ and ${H}_{2}$, and the solution of $\hat{w}.{x}_{i}$ is given by $\sum_{s=1}^{{N}_{s}}{\alpha}_{s}{y}_{s}{x}_{s}.{x}_{i}$.
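As a short worked step (a standard derivation added for clarity, not taken from the source), setting the derivatives of ${L}_{p}$ with respect to w and b to zero shows where the support-vector form of the solution comes from:

$$\frac{\partial {L}_{p}}{\partial w}=w-\sum_{i=1}^{N}{\alpha}_{i}{y}_{i}{x}_{i}=0\ \Rightarrow\ \hat{w}=\sum_{i=1}^{N}{\alpha}_{i}{y}_{i}{x}_{i},\qquad \frac{\partial {L}_{p}}{\partial b}=-\sum_{i=1}^{N}{\alpha}_{i}{y}_{i}=0.$$

Substituting these back into ${L}_{p}$ gives the dual objective, which is maximized over ${\alpha}_{i}\ge 0$:

$${L}_{D}=\sum_{i=1}^{N}{\alpha}_{i}-\frac{1}{2}\sum_{i=1}^{N}\sum_{k=1}^{N}{\alpha}_{i}{\alpha}_{k}{y}_{i}{y}_{k}\,{x}_{i}.{x}_{k}.$$

Because ${\alpha}_{i}=0$ for every observation that is not a support vector, the sum for $\hat{w}$ reduces to the ${N}_{s}$ support vectors, and the data enter ${L}_{D}$ only through dot products.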
SVM provides the advantage of allowing $M>N$ because the solution that estimates w is based on the support vectors. An additional advantage is the relaxation of linear mapping by using a kernel function K that corresponds to a nonlinear function $\phi$ such that ${x}_{i}$ is replaced by $\phi \left({x}_{i}\right)$ and ${x}_{s}.{x}_{i}$ is replaced by $K\left({x}_{s},{x}_{i}\right)=\phi \left({x}_{s}\right).\phi \left({x}_{i}\right)$. Using a Gaussian kernel with a scale, ${\sigma}_{g}^{2}$, the dot product $\hat{w}.\phi \left({x}_{i}\right)$ is expressed as $\sum_{s=1}^{{N}_{s}}{\alpha}_{s}{y}_{s}\exp \left(-{\|{x}_{s}-{x}_{i}\|}^{2}/2{\sigma}_{g}^{2}\right)$. A penalty term (in general denoted by C) is added for a generalization of the optimal hyperplane when the data do not allow the two groups to be completely separated, which limits the Lagrange multipliers to range between 0 and C.
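The Gaussian-kernel expression above can be sketched in a few lines of Python. This is an illustration only: the function names, argument names, and toy values are hypothetical, and the multipliers and offset are assumed to have been estimated already by some SVM solver.

```python
import math


def gaussian_kernel(u, v, sigma2=1.0):
    """exp(-||u - v||^2 / (2 * sigma2)) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma2))


def decision_score(x, support_vectors, alphas, labels, b=0.0, sigma2=1.0):
    """Sum over support vectors of alpha_s * y_s * K(x_s, x), plus the offset b.
    alphas, labels, and b are assumed to come from an already fitted SVM."""
    total = b
    for x_s, alpha_s, y_s in zip(support_vectors, alphas, labels):
        total += alpha_s * y_s * gaussian_kernel(x_s, x, sigma2)
    return total


# Toy example with two support vectors, one from each group (hypothetical values).
svs = [[0.0, 0.0], [1.0, 1.0]]
score_near_first = decision_score([0.0, 0.0], svs, alphas=[1.0, 1.0], labels=[-1, +1])
score_near_second = decision_score([1.0, 1.0], svs, alphas=[1.0, 1.0], labels=[-1, +1])
```

In this toy case a point that coincides with the −1 support vector receives a negative score and a point that coincides with the +1 support vector receives a positive one, which is the nonlinear mapping to a one-dimensional score that the text describes.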
Adapting support vector machine for parent-child trios: Trio-SVM
Trio-SVM analyzes a set of N parent-child trios, in which each child is described by a coordinate $\left({y}_{i},{x}_{i}\right)$, where ${x}_{i}$ covers the M RVs observed for that child and his or her parents and ${y}_{i}$ is −1 when the child is affected and +1 otherwise. Here ${x}_{i}$ incorporates the conditional distribution of RVs on parental genotypes using the framework of the family-based association test (FBAT) [3]. At each RV site j, ${x}_{ij}$ is the difference between the observed and expected transmission of an RV to the child given the two parental genotypes for that RV. Each child then gets a composite score for the test gene, ${y}_{i,\mathrm{score}}$, which is modeled by $\hat{w}.\phi \left({x}_{i}\right)+\hat{b}$ and aggregates the RVs. The AUC (denoted by $\theta$) of ${y}_{\mathrm{score}}$ summarizes the composite scores for distorted transmission within a gene over the sample of trios, comparing those who are affected with those who are not, and the hypotheses tested are ${H}_{0}:\theta \le 0.5$ versus ${H}_{a}:\theta >0.5$. Concluding that $\theta$ is greater than 0.5 indicates that the combined RVs in the gene are transmitted with greater distortion from what is expected in the cases than in the control participants. The test is one-sided because each group is assigned to a fixed vertex; the statistic $\left(\hat{\theta}-0.5\right)/SE\left(\hat{\theta}\right)$ is asymptotically Gaussian under ${H}_{0}$. For case-control analyses, one would let ${x}_{ij}$ count the number of RVs at each site.
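The per-site score ${x}_{ij}$ can be illustrated with a minimal sketch. It assumes an additive coding (genotypes given as the number of RV alleles, 0, 1, or 2) and Mendelian transmission, so that the expected count in the child is half the sum of the parental counts. The source states only that ${x}_{ij}$ is observed minus expected transmission given the parental genotypes, so the coding, the function names, and the toy genotypes here are assumptions.

```python
def transmission_score(child, father, mother):
    """Observed minus expected RV allele count in the child at one site.
    Assumption: additive coding (0, 1, or 2 RV alleles) and Mendelian
    transmission, so each parent passes on half of its RV alleles on average."""
    for genotype in (child, father, mother):
        if genotype not in (0, 1, 2):
            raise ValueError("genotypes must be 0, 1, or 2 RV alleles")
    expected = (father + mother) / 2.0
    return child - expected


def trio_feature_vector(child_genos, father_genos, mother_genos):
    """Vector x_i of per-site scores for one trio across the M RV sites."""
    return [
        transmission_score(c, f, m)
        for c, f, m in zip(child_genos, father_genos, mother_genos)
    ]


# One trio, three RV sites (hypothetical genotypes): the RV is transmitted
# at site 1, not transmitted at site 2, and absent from both parents at site 3.
x_i = trio_feature_vector([1, 0, 0], [1, 1, 0], [0, 0, 0])
```

Sites where neither parent carries the RV, or where a parent is homozygous, contribute a score of zero from that parent, which reflects the point made earlier that only heterozygous parents are informative.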
Applying Trio-SVM to Genetic Analysis Workshop 18 pedigree data
For the GAW18 data, Trio-SVM was used to combine all observed RVs for a given gene, by selecting all affected and unaffected individuals having both parents in each pedigree and treating them as independent. A total of 275 such trios were derived from the 959 individuals in 20 GAW18 pedigrees ascertained for type 2 diabetes (T2D). The pedigree members were genotyped at 472,049 SNPs on GWAS platforms. Half of the sample (n = 464) was sequenced at 8,348,674 sites, and imputation of nonsequenced individuals was performed using the GWAS data, thus providing each individual with a constellation of RVs. Blood pressures were taken longitudinally at 1 to 4 exams for 932 participants. Hypertension was assigned based on systolic blood pressure (SBP) greater than 140 mm Hg, diastolic blood pressure (DBP) greater than 90 mm Hg, or use of antihypertensive medications.
Trio-SVM accepted the input of the GAW18 pedigree data in linkage format, and the noninformative sites in which no RVs were observed were removed. All trios with two parents available were gleaned from the pedigrees and treated as if they were independent for these analyses. However, because they are not independent and LD reaches much greater distances in pedigrees than in independent samples, significant results with Trio-SVM may lead to false positives in such pedigrees. Specifically, for a disorder, if there is a causal common variant, its haplotype will segregate with the disorder throughout the pedigree. Any RVs that are on the haplotype in the pedigrees will be carried along with it, and genes that happen to have many RVs on that haplotype will be implicated. If the RVs are not in the causal gene, a type I error regarding association will occur.
Analyses were focused on chromosome 3, as suggested by the GAW18 organizers. Two T2D GWAS candidate genes on different arms of chromosome 3, ADCY5 (3q21.1) [4] and UBE2E2 (3p24.2) [5], were tested using Trio-SVM. For comparison, SVM without transmission information was used to analyze 108 founders consisting of 67 cases and 41 control participants.
Trio-SVM type I and type II error rates using the simulated pedigree data
Two hundred replicates of the genotyped data in the GAW18 pedigrees with the trait simulated under specific genetic models were available for assessments of type I and II statistical errors. The genes on chromosome 3 that were predisposing and nonpredisposing in the simulated models were tested. RVs were included in the analysis when their frequencies were less than 0.01 and less than 0.03 in two separate assessments. These analyses were performed on the simulated trait, hypertension, defined in two ways: (a) adjusted by age, age × gender, gender, and use of antihypertensive medications and (b) not adjusted. Covariates were included using linear mixed models with a random effect to account for the intrapedigree correlation. The traits were adjusted to age 38, no medications, and male gender. Power analyses were based on evaluating the predisposing gene MAP4, and the type I errors were assessed for the nonpredisposing gene ARL13B (93.8 Mb), located between RYBP (72.5 Mb) and B4GALT4 (118.9 Mb), both of which influenced DBP or SBP. To evaluate the power in a larger sample size, 500 and 1000 trios were drawn from the 200 replicates using bootstrap sampling.
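The bootstrap step can be sketched as sampling trios with replacement from the pooled replicates. The source does not give the details of its resampling scheme, so this is only one plausible reading; the function name `bootstrap_trios`, the pooled list, and the seed are hypothetical.

```python
import random


def bootstrap_trios(pooled_trios, n_draw, seed=None):
    """Draw n_draw trios with replacement from a pooled list of trios.
    Assumption: simple resampling with replacement from the pool; the
    exact scheme used in the original analysis is not stated."""
    if not pooled_trios:
        raise ValueError("pooled_trios must be non-empty")
    rng = random.Random(seed)
    return [rng.choice(pooled_trios) for _ in range(n_draw)]


# Placeholder pool: integer identifiers standing in for trio records.
pool = list(range(2000))
sample_500 = bootstrap_trios(pool, 500, seed=18)
sample_1000 = bootstrap_trios(pool, 1000, seed=18)
```

Power at a given threshold would then be estimated as the proportion of such resampled data sets in which the test p-value falls below that threshold.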
Trio-SVM was evaluated using a Gaussian kernel (${\sigma}_{g}^{2}$ fixed at 1) and 5-fold cross-validation for model selection across values of the penalty C from 1 to 10.
Results and discussion
Trio-SVM analysis of type 2 diabetes candidate genes
Trio-SVM analyses of candidate genes in GAW18 trios and founders
Trios: n = 275 (66 case trios, 209 control trios). Founders: n = 108 (50 case founders, 58 control founders).

Gene (bp) | #RV sites | Trios: AUC (SE) | Trios: p-Value | Founders: AUC (SE) | Founders: p-Value |
---|---|---|---|---|---|
ADCY5 (166,249) | 426 | 0.637 (0.040) | 3.2E-04 | 0.554 (0.056) | 0.17 |
UBE2E2 (387,512) | 917 | 0.575 (0.041) | 0.035 | 0.539 (0.057) | 0.25 |
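As a worked check of the table, the following sketch converts an AUC and its standard error into a one-sided p-value. It assumes that the test statistic is the AUC estimate minus 0.5, divided by its SE, and referred to a standard normal distribution; under that assumption the values agree with the reported p-values up to rounding of the inputs. The function name `one_sided_p` is hypothetical.

```python
import math


def one_sided_p(auc_hat, se, null_value=0.5):
    """Upper-tail normal p-value for H0: theta <= null_value.
    Assumption: z = (auc_hat - null_value) / se is standard normal under H0."""
    z = (auc_hat - null_value) / se
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# AUC and SE values taken from the table above.
p_adcy5_trios = one_sided_p(0.637, 0.040)
p_ube2e2_trios = one_sided_p(0.575, 0.041)
p_adcy5_founders = one_sided_p(0.554, 0.056)
p_ube2e2_founders = one_sided_p(0.539, 0.057)
```

For ADCY5 in the trios, z = 0.137 / 0.040 ≈ 3.4, which gives a p-value of roughly 3 × 10⁻⁴, in line with the tabulated 3.2E-04.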
Trio-SVM: power and type I error
Trio-SVM type I error and power in GAW18 simulated data (267 trios)
RV frequency | Trait adjusted | #RV sites (MAP4) | Power | #RV sites (ARL13B) | Type I error (for p-values <0.05) |
---|---|---|---|---|---|
<0.03 | No | 405 | 0.15 | 115 | 0.040 |
<0.03 | Yes | 405 | 0.19 | 115 | 0.055 |
<0.01 | No | 314 | 0.11 | 91 | 0.065 |
<0.01 | Yes | 314 | 0.14 | 91 | 0.065 |
Trio-SVM power in multiple replicates
Significance threshold (α) | 500 trios | 1000 trios |
---|---|---|
p-Value <0.05 | 0.755 | 0.995 |
p-Value <0.01 | 0.610 | 0.980 |
p-Value <0.001 | 0.405 | 0.930 |
p-Value <0.0001 | 0.215 | 0.870 |
Conclusions
Applications of machine learning methods to genomic data are just beginning [7, 8, 9]. Using SVM, we developed a novel approach for the analysis of RVs that handles high-dimensional genomic data, relaxes the assumption of a linear relationship between $\left({y}_{i},{x}_{i}\right)$, and controls for population stratification. One disadvantage is that the magnitude of $\hat{w}$ cannot be expressed explicitly when a nonlinear kernel is used. Importantly, we can detect the association between RVs and a test trait when applying Trio-SVM to a sample composed of nuclear families. In future work, we aim to increase the power by considering other newly defined kernel functions, such as those based on the wavelet transform, and to make that extension a viable option in our code. The MATLAB code of Trio-SVM can be obtained from the authors.
Notes
Acknowledgements
This work is supported by the statistical core of National Institutes of Health (NIH) grant HL028481.
The GAW18 whole genome sequence data were provided by the T2D-GENES (Type 2 Diabetes Genetic Exploration by Next-generation sequencing in Ethnic Samples) Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The GAW is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.
References
- 1. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al: Finding the missing heritability of complex diseases. Nature. 2009, 461: 747-753. 10.1038/nature08494.
- 2. Vapnik V: The Nature of Statistical Learning Theory. 1995, New York, Springer-Verlag.
- 3. Horvath S, Xu X, Laird NM: The family based association test method: strategies for studying general genotype-phenotype associations. Eur J Hum Genet. 2001, 9: 301-306. 10.1038/sj.ejhg.5200625.
- 4. Dupuis J, Langenberg C, Prokopenko I, Saxena R, Soranzo N, Jackson AU, Wheeler E, Glazer NL, Bouatia-Naji N, Gloyn AL, et al: New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nat Genet. 2010, 42: 105-116. 10.1038/ng.520.
- 5. Yamauchi T, Hara K, Maeda S, Yasuda K, Takahashi A, Horikoshi M, Nakamura M, Fujita H, Grarup N, Cauchi S, et al: A genome-wide association study in the Japanese population identifies susceptibility loci for type 2 diabetes at UBE2E2 and C2CD4A-C2CD4B. Nat Genet. 2010, 42: 864-868. 10.1038/ng.660.
- 6. Walsh T, McClellan JM, McCarthy SE, Addington AM, Pierce SB, Cooper GM, Nord AS, Kusenda M, Malhotra D, Bhandari A, et al: Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science. 2008, 320: 539-543. 10.1126/science.1155174.
- 7. Lu AT, Bakker S, Janson E, Cichon S, Cantor RM, Ophoff RA: Prediction of serotonin transporter promoter polymorphism genotypes from single nucleotide polymorphism arrays using machine learning methods. Psychiatr Genet. 2012, 22: 182-188. 10.1097/YPG.0b013e328353ae23.
- 8. Dasgupta A, Sun YV, Konig IR, Bailey-Wilson JE, Malley JD: Brief review of regression-based and machine learning methods in genetic epidemiology: the Genetic Analysis Workshop 17 experience. Genet Epidemiol. 2011, 35 (Suppl 1): S5-S11.
- 9. Guo Y, Hastie T, Tibshirani R: Regularized linear discriminant analysis and its application in microarrays. Biostatistics. 2007, 8: 86-100. 10.1093/biostatistics/kxj035.
Copyright information
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.