Abstract
Identifying measurable genetic indicators (or biomarkers) of a specific condition of a biological system is a key element of precision medicine. Indeed, it makes it possible to tailor diagnosis, prognosis, and treatment choice to the individual characteristics of a patient. In machine learning terms, biomarker discovery can be framed as a feature selection problem on whole-genome data sets. However, classical feature selection methods usually lack the statistical power to handle these data sets, which contain orders of magnitude more features than samples. This can be addressed by assuming that genetic features that are linked on a biological network are more likely to work jointly towards explaining the phenotype of interest. We review here three families of feature selection methods that integrate prior knowledge in the form of networks.
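The network assumption above can be made concrete with a small sketch. One common way to encode it, used here purely as an illustration and not necessarily the formulation adopted in the chapter, is a graph Laplacian penalty: for a weight vector beta over the features, beta^T L beta sums (beta_i - beta_j)^2 over all network edges, so it is small when linked features receive similar weights. The toy network, the weight vectors, and the helper names graph_laplacian and network_penalty are all hypothetical.

```python
import numpy as np

def graph_laplacian(n_features, edges):
    """Return the unnormalized Laplacian L = D - A of an undirected graph."""
    A = np.zeros((n_features, n_features))
    for i, j in edges:
        A[i, j] = 1.0
        A[j, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

def network_penalty(beta, L):
    """Compute beta^T L beta, i.e. the sum over edges (i, j) of (beta_i - beta_j)^2."""
    return float(beta @ L @ beta)

# Hypothetical toy network over 5 features: a chain 0-1-2 and a separate edge 3-4.
edges = [(0, 1), (1, 2), (3, 4)]
L = graph_laplacian(5, edges)

# Two candidate selections with the same number of selected features.
connected = np.array([1.0, 1.0, 1.0, 0.0, 0.0])  # selected features form a connected module
scattered = np.array([1.0, 0.0, 1.0, 0.0, 1.0])  # selected features are spread over the network

penalty_connected = network_penalty(connected, L)
penalty_scattered = network_penalty(scattered, L)
print(penalty_connected, penalty_scattered)
```

In this toy setting the connected selection incurs a lower penalty than the scattered one, which is the behaviour a network-guided method would reward when both selections explain the phenotype equally well.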
© 2016 Springer International Publishing AG
About this chapter
Cite this chapter
Azencott, CA. (2016). Network-Guided Biomarker Discovery. In: Holzinger, A. (ed.) Machine Learning for Health Informatics. Lecture Notes in Computer Science, vol 9605. Springer, Cham. https://doi.org/10.1007/978-3-319-50478-0_16
DOI: https://doi.org/10.1007/978-3-319-50478-0_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50477-3
Online ISBN: 978-3-319-50478-0
eBook Packages: Computer Science, Computer Science (R0)