An Efficient Nonlinear Regression Approach for Genome-Wide Detection of Marginal and Interacting Genetic Variations

Lee, Seunghak; Lozano, Aurélie; Kambadur, Prabhanjan; Xing, Eric P.

doi:10.1007/978-3-319-16706-0_17

Seunghak Lee⁵,
Aurélie Lozano⁶,
Prabhanjan Kambadur⁷ &
…
Eric P. Xing⁵

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 9029))

Included in the following conference series:

International Conference on Research in Computational Molecular Biology

2840 Accesses

Abstract

Genome-wide association studies have revealed individual genetic variants associated with phenotypic traits such as disease risk and gene expressions. However, detecting pairwise interaction effects of genetic variants on traits still remains a challenge due to a large number of combinations of variants (\(\sim 10^{11}\) SNP pairs in the human genome), and relatively small sample sizes (typically \(< 10^{4}\)). Despite recent breakthroughs in detecting interaction effects, there are still several open problems, including: (1) how to quickly process a large number of SNP pairs, (2) how to distinguish between true signals and SNPs/SNP pairs merely correlated with true signals, (3) how to detect non-linear associations between SNP pairs and traits given small sample sizes, and (4) how to control false positives? In this paper, we present a unified framework, called SPHINX, which addresses the aforementioned challenges. We first propose a piecewise linear model for interaction detection because it is simple enough to estimate model parameters given small sample sizes but complex enough to capture non-linear interaction effects. Then, based on the piecewise linear model, we introduce randomized group lasso under stability selection, and a screening algorithm to address the statistical and computational challenges mentioned above. In our experiments, we first demonstrate that SPHINX achieves better power than existing methods for interaction detection under false positive control. We further applied SPHINX to late-onset Alzheimer’s disease dataset, and report 16 SNPs and 17 SNP pairs associated with gene traits. We also present a highly scalable implementation of our screening algorithm which can screen \(\sim \) 118 billion candidates of associations on a 60-node cluster in \(<{}5.5\) hours. SPHINX is available at http://www.cs.cmu.edu/\(\sim \)seunghak/SPHINX/.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Bach, F.R.: Consistency of the group lasso and multiple kernel learning. The Journal of Machine Learning Research 9, 1179–1225 (2008)
MATH MathSciNet Google Scholar
Becker, K.G., Barnes, K.C., Bright, T.J., Wang, S.A.: The genetic association database. Nature Genetics 36(5), 431–432 (2004)
Article Google Scholar
Bien, J., Taylor, J., Tibshirani, R.: A lasso for hierarchical interactions. The Annals of Statistics 41(3), 1111–1141 (2013)
Article MATH MathSciNet Google Scholar
Bodmer, W.F., Bodmer, J.G.: Evolution and function of the hla system. British Medical Bulletin 34(3), 309–316 (1978)
Google Scholar
Bretscher, O.: Linear algebra with applications. Prentice Hall Eaglewood Cliffs, NJ (1997)
MATH Google Scholar
Bühlmann, P., Rütimann, P., van de Geer, S., Zhang, C.: Correlated variables in regression: clustering and sparse estimation. Journal of Statistical Planning and Inference (2013)
Google Scholar
Cagniard, B., Balsam, P.D., Brunner, D., Zhuang, X.: Mice with chronically elevated dopamine exhibit enhanced motivation, but not learning, for a food reward. Neuropsychopharmacology 31(7), 1362–1370 (2005)
Article Google Scholar
Evans, D.M., Marchini, J., Morris, A.P., Cardon, L.R.: Two-stage two-locus models in genome-wide association. PLoS Genetics 2(9), e157 (2006)
Article Google Scholar
Fan, J., Feng, Y., Song, R.: Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association 106(494), 544–557 (2011)
Article MATH MathSciNet Google Scholar
Fan, J., Lv, J.: Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(5), 849–911 (2008)
Article MathSciNet Google Scholar
Foradori, C.D., Goodman, R.L., Adams, V.L., Valent, M., Lehman, M.N.: Progesterone increases dynorphin a concentrations in cerebrospinal fluid and preprodynorphin messenger ribonucleic acid levels in a subset of dynorphin neurons in the sheep. Endocrinology 146(4), 1835–1842 (2005)
Article Google Scholar
Friedman, J., Hastie, T., Höfling, H., Tibshirani, R.: Pathwise coordinate optimization. The Annals of Applied Statistics 1(2), 302–332 (2007)
Article MATH MathSciNet Google Scholar
Gerfen, C.R., Engber, T.M., Mahan, L.C., Susel, Z., Chase, T.N., Monsma, F.J., Sibley, D.R., Sibley, D.R.: D1 and d2 dopamine receptor-regulated gene expression of striatonigral and striatopallidal neurons. Science 250(4986), 1429–1432 (1990)
Google Scholar
Golub, G.H., Reinsch, C.: Singular value decomposition and least squares solutions. Numerische Mathematik 14(5), 403–420 (1970)
Article MATH MathSciNet Google Scholar
Guerini, F.R., Tinelli, C., Calabrese, E., Agliardi, C., Zanzottera, M., De Silvestri, A., Franceschi, M., Grimaldi, L.M., Nemni, R., Clerici, M.: HLA-A*01 is associated with late onset of Alzheimer’s disease in italian patients. International Journal of Immunopathology and Pharmacology 22, 991–999 (2009)
Google Scholar
Hoffman, G.E., Logsdon, B.A., Mezey, J.G.: PUMA: A unified framework for penalized multiple regression analysis of gwas data. PLoS Computational Biology 9(6), e1003101 (2013)
Article Google Scholar
Kambadur, P., Gupta, A., Ghoting, A., Avron, H., Lumsdaine, A.: PFunc: modern task parallelism for modern high performance computing. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, p. 43. ACM (2009)
Google Scholar
Kim, S., Xing, E.P.: Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genetics 5(8), e1000587 (2009)
Article Google Scholar
Lee, S., Xing, E.P.: Leveraging input and output structures for joint mapping of epistatic and marginal eqtls. Bioinformatics 28(12), i137–i146 (2012)
Article Google Scholar
Lehmann, D.J., Barnardo, M.C., Fuggle, S., Quiroga, I., Sutherland, A., Warden, D.R., Barnetson, L., Horton, R., Beck, S., Smith, A.D.: Replication of the association of HLA-B7 with Alzheimer’s disease: a role for homozygosity? Journal of Neuroinflammation 3(1), 33 (2006)
Article Google Scholar
Lehmann, D.J., et al.: HLA class I, II & III genes in confirmed late-onset Alzheimer’s disease. Neurobiology of Aging 22(1), 71–77 (2001)
Article Google Scholar
Li, C., Li, M.: GWAsimulator: a rapid whole-genome simulation program. Bioinformatics 24(1), 140–142 (2008)
Article Google Scholar
Li, J., Zhu, M., Manning-Bog, A.B., Di Monte, D.A., Fink, A.L.: Dopamine and l-dopa disaggregate amyloid fibrils: implications for parkinson’s and Alzheimer’s disease. The FASEB Journal 18(9), 962–964 (2004)
Google Scholar
Liu, J., Ji, S., Ye, J.: SLEP: Sparse Learning with Efficient Projections. Arizona State University (2009)
Google Scholar
Liu, J., Ye, J.: Moreau-yosida regularization for grouped tree structure learning. Advances in Neural Information Processing Systems 187, 195–207 (2010)
Google Scholar
Maggioli, E., Boiocchi, C., Zorzetto, M., Sinforiani, E., Cereda, C., Ricevuti, G., Cuccia, M.: The human leukocyte antigen class III haplotype approach: new insight in Alzheimer’s disease inflammation hypothesis. Current Alzheimer Research 10(10), 1047–1056 (2013)
Article Google Scholar
Meinshausen, N., Bühlmann, P.: Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72(4), 417–473 (2010)
Article MathSciNet Google Scholar
Meinshausen, N., Meier, L., Bühlmann, P.: P-values for high-dimensional regression. Journal of the American Statistical Association 104(488), 1671–1681 (2009)
Article MATH MathSciNet Google Scholar
Message Passing Interface Forum. MPI (June 1995). http://www.mpi-forum.org/
Message Passing Interface Forum. MPI-2 (July 1997). http://www.mpi-forum.org/
Moore, J.H., Asselbergs, F.W., Williams, S.M.: Bioinformatics challenges for genome-wide association studies. Bioinformatics 26(4), 445–455 (2010)
Article Google Scholar
Nyholt, D.R.: A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. The American Journal of Human Genetics 74(4), 765–769 (2004)
Article Google Scholar
Park, M., Hastie, T.: Penalized logistic regression for detecting gene interactions. Biostatistics 9(1), 30–50 (2008)
Article MATH Google Scholar
Payami, H., et al.: Evidence for association of HLA-A2 allele with onset age of Alzheimer’s disease. Neurology 49(2), 512–518 (1997)
Article Google Scholar
Purcell, S., et al.: PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics 81(3), 559–575 (2007)
Article Google Scholar
Rakitsch, B., Lippert, C., Stegle, O., Borgwardt, K.: A lasso multi-marker mixed model for association mapping with population structure correction. Bioinformatics 29(2), 206–214 (2013)
Article Google Scholar
Wan, X., Yang, C., Yang, Q., Xue, H., Fan, X., Tang, N.L.S., Yu, W.: BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies. American Journal of Human Genetics 87(3), 325 (2010)
Article Google Scholar
Wasserman, L., Roeder, K.: High dimensional variable selection. Annals of Statistics 37(5A), 2178 (2009)
Article MATH MathSciNet Google Scholar
Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(1), 49–67 (2005)
Article MathSciNet Google Scholar
Zhang, B., et al.: Integrated systems approach identifies genetic nodes and networks in late-onset Alzheimer disease. Cell 153(3), 707–720 (2013)
Google Scholar
X. Zhang, F. Zou, and W. Wang. FastANOVA: an efficient algorithm for genome-wide association study. In Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 821–829. ACM (2008)
Google Scholar

Download references

Author information

Authors and Affiliations

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Seunghak Lee & Eric P. Xing
IBM T. J. Watson Research Center, New York, NY, USA
Aurélie Lozano
Bloomberg L.P., New York, NY, USA
Prabhanjan Kambadur

Authors

Seunghak Lee
View author publications
You can also search for this author in PubMed Google Scholar
Aurélie Lozano
View author publications
You can also search for this author in PubMed Google Scholar
Prabhanjan Kambadur
View author publications
You can also search for this author in PubMed Google Scholar
Eric P. Xing
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Eric P. Xing .

Editor information

Editors and Affiliations

National Center of Biotechnology Information, Bethesda, Maryland, USA
Teresa M. Przytycka

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Lee, S., Lozano, A., Kambadur, P., Xing, E.P. (2015). An Efficient Nonlinear Regression Approach for Genome-Wide Detection of Marginal and Interacting Genetic Variations. In: Przytycka, T. (eds) Research in Computational Molecular Biology. RECOMB 2015. Lecture Notes in Computer Science(), vol 9029. Springer, Cham. https://doi.org/10.1007/978-3-319-16706-0_17

Download citation

DOI: https://doi.org/10.1007/978-3-319-16706-0_17
Published: 26 March 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16705-3
Online ISBN: 978-3-319-16706-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics