An Efficient Nonlinear Regression Approach for Genome-Wide Detection of Marginal and Interacting Genetic Variations

  • Seunghak Lee
  • Aurélie Lozano
  • Prabhanjan Kambadur
  • Eric P. XingEmail author
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9029)


Genome-wide association studies have revealed individual genetic variants associated with phenotypic traits such as disease risk and gene expressions. However, detecting pairwise interaction effects of genetic variants on traits still remains a challenge due to a large number of combinations of variants (\(\sim 10^{11}\) SNP pairs in the human genome), and relatively small sample sizes (typically \(< 10^{4}\)). Despite recent breakthroughs in detecting interaction effects, there are still several open problems, including: (1) how to quickly process a large number of SNP pairs, (2) how to distinguish between true signals and SNPs/SNP pairs merely correlated with true signals, (3) how to detect non-linear associations between SNP pairs and traits given small sample sizes, and (4) how to control false positives? In this paper, we present a unified framework, called SPHINX, which addresses the aforementioned challenges. We first propose a piecewise linear model for interaction detection because it is simple enough to estimate model parameters given small sample sizes but complex enough to capture non-linear interaction effects. Then, based on the piecewise linear model, we introduce randomized group lasso under stability selection, and a screening algorithm to address the statistical and computational challenges mentioned above. In our experiments, we first demonstrate that SPHINX achieves better power than existing methods for interaction detection under false positive control. We further applied SPHINX to late-onset Alzheimer’s disease dataset, and report 16 SNPs and 17 SNP pairs associated with gene traits. We also present a highly scalable implementation of our screening algorithm which can screen \(\sim \) 118 billion candidates of associations on a 60-node cluster in \(<{}5.5\) hours. SPHINX is available at\(\sim \)seunghak/SPHINX/.


True Positive Rate Association Strength Group Lasso Stability Selection Piecewise Linear Model 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Bach, F.R.: Consistency of the group lasso and multiple kernel learning. The Journal of Machine Learning Research 9, 1179–1225 (2008)zbMATHMathSciNetGoogle Scholar
  2. 2.
    Becker, K.G., Barnes, K.C., Bright, T.J., Wang, S.A.: The genetic association database. Nature Genetics 36(5), 431–432 (2004)CrossRefGoogle Scholar
  3. 3.
    Bien, J., Taylor, J., Tibshirani, R.: A lasso for hierarchical interactions. The Annals of Statistics 41(3), 1111–1141 (2013)CrossRefzbMATHMathSciNetGoogle Scholar
  4. 4.
    Bodmer, W.F., Bodmer, J.G.: Evolution and function of the hla system. British Medical Bulletin 34(3), 309–316 (1978)Google Scholar
  5. 5.
    Bretscher, O.: Linear algebra with applications. Prentice Hall Eaglewood Cliffs, NJ (1997)zbMATHGoogle Scholar
  6. 6.
    Bühlmann, P., Rütimann, P., van de Geer, S., Zhang, C.: Correlated variables in regression: clustering and sparse estimation. Journal of Statistical Planning and Inference (2013)Google Scholar
  7. 7.
    Cagniard, B., Balsam, P.D., Brunner, D., Zhuang, X.: Mice with chronically elevated dopamine exhibit enhanced motivation, but not learning, for a food reward. Neuropsychopharmacology 31(7), 1362–1370 (2005)CrossRefGoogle Scholar
  8. 8.
    Evans, D.M., Marchini, J., Morris, A.P., Cardon, L.R.: Two-stage two-locus models in genome-wide association. PLoS Genetics 2(9), e157 (2006)CrossRefGoogle Scholar
  9. 9.
    Fan, J., Feng, Y., Song, R.: Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association 106(494), 544–557 (2011)CrossRefzbMATHMathSciNetGoogle Scholar
  10. 10.
    Fan, J., Lv, J.: Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(5), 849–911 (2008)CrossRefMathSciNetGoogle Scholar
  11. 11.
    Foradori, C.D., Goodman, R.L., Adams, V.L., Valent, M., Lehman, M.N.: Progesterone increases dynorphin a concentrations in cerebrospinal fluid and preprodynorphin messenger ribonucleic acid levels in a subset of dynorphin neurons in the sheep. Endocrinology 146(4), 1835–1842 (2005)CrossRefGoogle Scholar
  12. 12.
    Friedman, J., Hastie, T., Höfling, H., Tibshirani, R.: Pathwise coordinate optimization. The Annals of Applied Statistics 1(2), 302–332 (2007)CrossRefzbMATHMathSciNetGoogle Scholar
  13. 13.
    Gerfen, C.R., Engber, T.M., Mahan, L.C., Susel, Z., Chase, T.N., Monsma, F.J., Sibley, D.R., Sibley, D.R.: D1 and d2 dopamine receptor-regulated gene expression of striatonigral and striatopallidal neurons. Science 250(4986), 1429–1432 (1990)Google Scholar
  14. 14.
    Golub, G.H., Reinsch, C.: Singular value decomposition and least squares solutions. Numerische Mathematik 14(5), 403–420 (1970)CrossRefzbMATHMathSciNetGoogle Scholar
  15. 15.
    Guerini, F.R., Tinelli, C., Calabrese, E., Agliardi, C., Zanzottera, M., De Silvestri, A., Franceschi, M., Grimaldi, L.M., Nemni, R., Clerici, M.: HLA-A*01 is associated with late onset of Alzheimer’s disease in italian patients. International Journal of Immunopathology and Pharmacology 22, 991–999 (2009)Google Scholar
  16. 16.
    Hoffman, G.E., Logsdon, B.A., Mezey, J.G.: PUMA: A unified framework for penalized multiple regression analysis of gwas data. PLoS Computational Biology 9(6), e1003101 (2013)CrossRefGoogle Scholar
  17. 17.
    Kambadur, P., Gupta, A., Ghoting, A., Avron, H., Lumsdaine, A.: PFunc: modern task parallelism for modern high performance computing. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, p. 43. ACM (2009)Google Scholar
  18. 18.
    Kim, S., Xing, E.P.: Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genetics 5(8), e1000587 (2009)CrossRefGoogle Scholar
  19. 19.
    Lee, S., Xing, E.P.: Leveraging input and output structures for joint mapping of epistatic and marginal eqtls. Bioinformatics 28(12), i137–i146 (2012)CrossRefGoogle Scholar
  20. 20.
    Lehmann, D.J., Barnardo, M.C., Fuggle, S., Quiroga, I., Sutherland, A., Warden, D.R., Barnetson, L., Horton, R., Beck, S., Smith, A.D.: Replication of the association of HLA-B7 with Alzheimer’s disease: a role for homozygosity? Journal of Neuroinflammation 3(1), 33 (2006)CrossRefGoogle Scholar
  21. 21.
    Lehmann, D.J., et al.: HLA class I, II & III genes in confirmed late-onset Alzheimer’s disease. Neurobiology of Aging 22(1), 71–77 (2001)CrossRefGoogle Scholar
  22. 22.
    Li, C., Li, M.: GWAsimulator: a rapid whole-genome simulation program. Bioinformatics 24(1), 140–142 (2008)CrossRefGoogle Scholar
  23. 23.
    Li, J., Zhu, M., Manning-Bog, A.B., Di Monte, D.A., Fink, A.L.: Dopamine and l-dopa disaggregate amyloid fibrils: implications for parkinson’s and Alzheimer’s disease. The FASEB Journal 18(9), 962–964 (2004)Google Scholar
  24. 24.
    Liu, J., Ji, S., Ye, J.: SLEP: Sparse Learning with Efficient Projections. Arizona State University (2009)Google Scholar
  25. 25.
    Liu, J., Ye, J.: Moreau-yosida regularization for grouped tree structure learning. Advances in Neural Information Processing Systems 187, 195–207 (2010)Google Scholar
  26. 26.
    Maggioli, E., Boiocchi, C., Zorzetto, M., Sinforiani, E., Cereda, C., Ricevuti, G., Cuccia, M.: The human leukocyte antigen class III haplotype approach: new insight in Alzheimer’s disease inflammation hypothesis. Current Alzheimer Research 10(10), 1047–1056 (2013)CrossRefGoogle Scholar
  27. 27.
    Meinshausen, N., Bühlmann, P.: Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72(4), 417–473 (2010)CrossRefMathSciNetGoogle Scholar
  28. 28.
    Meinshausen, N., Meier, L., Bühlmann, P.: P-values for high-dimensional regression. Journal of the American Statistical Association 104(488), 1671–1681 (2009)CrossRefzbMATHMathSciNetGoogle Scholar
  29. 29.
    Message Passing Interface Forum. MPI (June 1995).
  30. 30.
    Message Passing Interface Forum. MPI-2 (July 1997).
  31. 31.
    Moore, J.H., Asselbergs, F.W., Williams, S.M.: Bioinformatics challenges for genome-wide association studies. Bioinformatics 26(4), 445–455 (2010)CrossRefGoogle Scholar
  32. 32.
    Nyholt, D.R.: A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. The American Journal of Human Genetics 74(4), 765–769 (2004)CrossRefGoogle Scholar
  33. 33.
    Park, M., Hastie, T.: Penalized logistic regression for detecting gene interactions. Biostatistics 9(1), 30–50 (2008)CrossRefzbMATHGoogle Scholar
  34. 34.
    Payami, H., et al.: Evidence for association of HLA-A2 allele with onset age of Alzheimer’s disease. Neurology 49(2), 512–518 (1997)CrossRefGoogle Scholar
  35. 35.
    Purcell, S., et al.: PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics 81(3), 559–575 (2007)CrossRefGoogle Scholar
  36. 36.
    Rakitsch, B., Lippert, C., Stegle, O., Borgwardt, K.: A lasso multi-marker mixed model for association mapping with population structure correction. Bioinformatics 29(2), 206–214 (2013)CrossRefGoogle Scholar
  37. 37.
    Wan, X., Yang, C., Yang, Q., Xue, H., Fan, X., Tang, N.L.S., Yu, W.: BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies. American Journal of Human Genetics 87(3), 325 (2010)CrossRefGoogle Scholar
  38. 38.
    Wasserman, L., Roeder, K.: High dimensional variable selection. Annals of Statistics 37(5A), 2178 (2009)CrossRefzbMATHMathSciNetGoogle Scholar
  39. 39.
    Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(1), 49–67 (2005)CrossRefMathSciNetGoogle Scholar
  40. 40.
    Zhang, B., et al.: Integrated systems approach identifies genetic nodes and networks in late-onset Alzheimer disease. Cell 153(3), 707–720 (2013)Google Scholar
  41. 41.
    X. Zhang, F. Zou, and W. Wang. FastANOVA: an efficient algorithm for genome-wide association study. In Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 821–829. ACM (2008)Google Scholar

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Seunghak Lee
    • 1
  • Aurélie Lozano
    • 2
  • Prabhanjan Kambadur
    • 3
  • Eric P. Xing
    • 1
    Email author
  1. 1.School of Computer ScienceCarnegie Mellon UniversityPittsburghUSA
  2. 2.IBM T. J. Watson Research CenterNew YorkUSA
  3. 3.Bloomberg L.P.New YorkUSA

Personalised recommendations