Genetic variants and their interactions in disease risk prediction – machine learning and network perspectives
A central challenge in systems biology and medical genetics is to understand how interactions among genetic loci contribute to complex phenotypic traits and human diseases. While most studies have so far relied on statistical modeling and association testing procedures, machine learning and predictive modeling approaches are increasingly being applied to mining genotype-phenotype relationships, also among those associations that do not necessarily meet statistical significance at the level of individual variants, yet still contributing to the combined predictive power at the level of variant panels. Network-based analysis of genetic variants and their interaction partners is another emerging trend by which to explore how sub-network level features contribute to complex disease processes and related phenotypes. In this review, we describe the basic concepts and algorithms behind machine learning-based genetic feature selection approaches, their potential benefits and limitations in genome-wide setting, and how physical or genetic interaction networks could be used as a priori information for providing improved predictive power and mechanistic insights into the disease networks. These developments are geared toward explaining a part of the missing heritability, and when combined with individual genomic profiling, such systems medicine approaches may also provide a principled means for tailoring personalized treatment strategies in the future.
KeywordsFeature Selection Genetic Interaction Feature Subset Feature Selection Method Multifactor Dimensionality Reduction
Most disease phenotypes are genetically complex, with contributions from combinations of genetic variation in different loci. A major challenge of medical genetics is to determine a set of genetic markers, which when combined together with conventional risk factors could be used in predicting an individual's susceptibility to developing various complex disorders. The recent advances and wide availability of genetic technologies, such as those based on genome-wide association (GWA) and next-generation sequencing (NGS), have allowed for the in-depth analysis of the variation contained in the human genome. In particular, these technologies are enabling the investigation of the genetic architecture of complex diseases, with the aim of constructing more accurate disease risk prediction models that would eventually facilitate effective approaches to personalized prevention and treatment alternatives for many diseases [1, 2]. While GWA studies have successfully identified hundreds of genetic variants that are associated with complex human diseases and other traits [3, 4, 5, 6], most variants identified so far using mainly statistical association testing approaches only capture a small portion of the heritability and even an aggregate of these effects is often not predictive enough for clinical utility, leaving open the question of what may explain the remaining or ‘missing heritability’. Suggested explanations include, for instance, contributions from rare and structural variants, genotype–environment and gene-gene interactions and sample stratification, or simply that complex traits truly are affected by thousands of variants of small effect size [8, 9]. The relative contributions of these and other factors remain poorly understood, which is hindering the development of improved models for disease risk assessment.
Given the multi-factorial nature of complex diseases, many authors have reiterated the concept of interactions among genetic loci, so-called epistatic interactions, as one of the major factors contributing to the missing heritability [9, 10]. Epistatic genetic interactions between or within genes are thought to be profoundly important in the development of many complex diseases, but these interactions are often beyond the reach of the conventional single-variant association testing procedures [11, 12, 13, 14]. There exist also increasingly complex interactions between genetic variants and environmental factors that may contribute to the disease risk on an individualized basis. Consequently, it has been argued that we should move away from the traditional ’one variant at a time’ approach toward a more holistic, network-centric approaches, which take into account the complexity of the genotype-phenotype relationships characterized by multiple gene-gene and gene-environment interactions [15, 16]. Although the conventional statistical significance testing procedures have successfully identified several susceptibility loci, it has become clear that many of the true disease associations may be much lower down on the ranked list of hits, compared to the top hits with the most statistical support [4, 17, 18]. Ignoring the potential risk variants in this ‘gray zone’ of genetic information is likely to result in models that are missing an important proportion of the quantitative variation in heritability. Therefore, it may be that most of the heritability is hidden rather than missing, but has not previously been detected because the individual effects are too small to pass the stringent significance filters used in many studies, yet still having significant contribution to the predictive power at the level of variant or subject subsets, or when combined with non-genetic risk factors.
Here, we discuss how computational machine learning approaches can utilize hidden interactions among panels of the genetic and other risk factors, predictive of the individual disease risk by means of implementing genetic feature selection procedures and network-guided predictive models. In contrast to the conventional population-level association testing, which often detect only a few variants with statistical support beyond the genome-wide significance level (e.g. p < 10-8), machine learning algorithms place special emphasis on maximizing the predictive accuracy at the level of individual subjects. The goal of feature selection is to identify such a panel of genetic and other risk factors, which result in a model that optimally predicts the phenotypic response variables, either the class labels in case-control classification (e.g. disease vs. healthy), or quantitative phenotypes in regression problems (e.g. height prediction). While epistatic genetic interactions may easily end up being averaged out in statistical association models, machine learning-based predictive modeling can also take into account those individual effects that are dependent on interactions with other variants or environmental exposures, making these models convenient for developing predictive strategies for multi-factorial diseases. Indeed, it has been shown that single-locus p-value-based selection strategies for constructing prediction models may lead to sub-optimal prediction accuracies . In another example, hundreds of genetic markers, many of which did not originally meet the genome-wide level of statistical significance, were combined into a predictive model of type 1 diabetes risk . Even though diabetes is known to involve many biological pathways, the large number of variants required may partly be attributed also to the selection of variants based solely on their individual p-values, which does not take into account any gene-gene interactions.
While machine learning-based computational approaches may provide a convenient framework for making use of the whole spectrum of genetic information when predicting an individual’s risk of developing a disease, these developments are still in their very early stages. Implementation of highly scalable computational algorithms for genetic feature selection is a key for making these frameworks effective enough for mining data from current GWA studies, in which more than a million genetic variants are assayed in thousands of individuals, not to mention the emerging data from NGS studies, such as the 1000 Genomes project . Recent improvements in constructing accurate and scalable machine learning-based predictive models will be discussed in Section 2. Another pressing problem inherent in every machine learning application is the challenge of how to evaluate the predictive capability of the constructed models, in order to avoid stating over-optimistic prediction results . Model validation approaches are described in Section 3. One approach to reducing the massive search spaces and computational complexities is to use additional biological information in the model construction process. There are already several successful examples of how to make use of physical protein interaction networks when mining data from GWA studies in the search of, for instance, regulatory models , epistatic interactions , or disease genes . In Section 4, we take the next step of network level analysis of genetic variants and review recent data mining solutions capable of systematically utilizing functional information from the interaction networks as a priori information when building disease prediction models. Finally, in Section 5, we will list some current challenges and possibilities as future directions toward improved understanding of individual predisposition to genetically complex diseases such as cancers.
Selection of genetic risk factors for machine learning-based prediction models
Rather surprisingly, the use of machine learning method in the context of genome-wide data on genetic variants has yielded a relatively limited number of studies until the very recent years (for a systematic literature review, see ), compared to the large number of machine learning studies on other types of genomic datasets, especially genome-wide gene expression profiles. Further, the combination of predictive modeling and advanced feature selection algorithms have been implemented in an even more restricted set of studies, even though these have generally yielded quite positive results [15, 23]. Indeed, many studies have demonstrated that the use of feature selection approaches are capable of improving the prediction results beyond that when the same model is implemented on features selected solely through prior knowledge of the disease or on those genetic variants which reach genome-wide statistical significance [18, 23, 24, 25]. However, it is relatively challenging to extract the predictive signal from the high-dimensional datasets originating from GWA or NGS studies, due to a number of experimental and computational issues, many of which are different from those faced when using data from microarray gene expression profiling. Further, in order to construct accurate and reliable predictive models of complex phenotypes based on genome-wide profiles of genetic variants, it is essential to have an understanding of how to identify predictive features both individually and in groups of variant subsets, and how different feature selection approaches can deal with issues such as epistatic interactions and high-dimensional datasets . Feature selection methods in machine learning can broadly be divided into filter, wrapper and embedded methods. This categorization is not strict, and each of the approaches has its own advantages and disadvantages which are, in turn, very problem dependent. Next, we briefly describe each feature selection category and consider some representative examples of each.
Filter methods for genetic feature selection are the most common in GWA studies due to the simplicity of their implementation, low computational complexity, and the human interpretability of the results. In their simplest form, filter methods calculate a univariate test statistic separately for each genetic feature, and the features are then ranked based on the observed statistic values. The highest ranked features form the final set of selected features, on which a predictive model may be subsequently trained. The number of features to be selected is either decided in advance or determined by a pre-defined significance threshold for the test statistic. Several well-known statistical tests have been used in GWA studies, including the Fisher’s exact test and Armitage trend test [26, 27, 28], and an increasing number of statistical approaches are being developed for rare variants and the NGS data [29, 30, 31]. Since this feature selection approach requires only a single pass through the whole data, single-locus filters can be straightforwardly applied to even the largest genome sequencing datasets. Along with the multiple testing problem, the primary drawback of the single-locus filter methods is that they do not take account of the interactions between the features, which may lead to selection of both false positives, such as redundant loci, and false negatives due to epistasis interactions between or within loci [12, 13, 15]. More advanced filter methods can also select specific risk variant combinations that are associated with a disease risk. For instance, multifactor dimensionality reduction (MDR) is a non-parametric method that can detect statistically significant genetic interactions among two or more loci in the absence of any marginal effects, even in relatively small sample sizes . While proved to be useful for association testing, however, it has been argued that the statistics being used to identify variants or their combinations, typically p-values for disease risk association, are perhaps not the most appropriate means for evaluating the predictive or clinical value of the genetic profiles .
Wrapper methods consist of three components: a search algorithm for systematically traversing through the space of all possible feature subsets, a scoring function for evaluating the predictive accuracy of the feature subsets, and the learning algorithm around which the feature selection procedure is wrapped . Since the size of the power set of the features grows exponentially with the number of genetic variants screened (say n), testing all the feature subsets (2 n ) is computationally infeasible (n is on the order of a million in a typical GWA study and much larger in NGS studies). Therefore, one must resort in practice to local search methods that do not guarantee finding the optimal subset but, nevertheless, usually lead to good local optima. For example, the greedy forward selection adds one feature at a time to the set of selected features after checking which of the remaining features would improve the value of the scoring function the most. Thus, the whole data set is traversed through once for each selected feature. To avoid getting trapped in poor local optima during the search in the complex and high-dimensional genetic landscapes, modified local search strategies can be utilized, including the backtracking option or several variations of evolutionary algorithms. The most popular scoring functions used with wrapper methods are the prediction error on the training set, a separate validation set, or cross-validation error. The feature selection can be in principle wrapped around any learning method, but it is beneficial if the method can be efficiently trained or if the already learned model can be efficiently updated. Indeed, for some learning methods, such as regularized least-squares (RLS), the search process can be considerably accelerated with computational short-cuts for scoring function evaluation . These inbuilt short-cuts bring the methods closer to the next category of the selection methods, namely the embedded ones.
Embedded methods have the feature selection mechanism built into the training algorithm itself , that is, the predictive models they produce tend to depend only on a subset of the original features. Perhaps the most well-known embedded method is LASSO (least absolute shrinkage and selection operator), which is also recently being applied to a larger number of GWA studies [20, 25, 36, 37, 38]. While only a few machine learning approaches, in fact, allow for scaling-up to the genome-wide level, this has been made possible in LASSO by the recently developed model training algorithms, such as those based on the coordinate descent methods, which are computationally very efficient. The problem setup resembles the wrapper approach in the sense that there is an objective function for which one performs a stochastic search, such as cyclic or stochastic coordinate descent, in order to find a global optimum. Basically, the search algorithm goes through each feature at a time, and updates the corresponding coefficient in the linear model under construction. The objective function consists of a scoring metric such as the mean squared error (MSE) on the training data and a regularization term that favors sparse linear models, that is, it tends to push the search algorithm towards such models that have only a few nonzero parameters. Typically, coordinate descent passes through the whole data set only a couple of times before convergence, but the number of passes depends on the properties of the data, the desired sparsity level, and the other possible hyperparameters. Wrappers and embedded methods are known to have the ability to produce better results than filter methods in many applications [23, 25, 39], but if not implemented correctly, they can easily lead to the models failing to generalize beyond the training data, underscoring the importance of rigid evaluation of the prediction models.
The importance of evaluation of the predictive models for complex phenotypes
The evaluation of the predictive power is important also when considering predictive models constructed on the basis on statistical significant variants. For instance, there are numerous observations showing that the increases in the proportion of variance explained by significant variants does not go hand in hand with improved genetic prediction of disease risk. For instance, when using statistical modeling on the single training sample only, a panel of thousands of non-significant variants collectively could capture over one-third of the heritability for schizophrenia, but the same panel only explained a few percent of disease susceptibility in another replication cohort . Similarly, while the statistical explanation power of the genetic variation in human height could be substantially increased by considering increasing number of common variants in a single population sample , the proportion of variance accounted for in other independent samples was much smaller . These examples underscore the importance of rigid validation of the predictive accuracy of the models based on genetic profiles. While external cross-validation is a valid option, it is not free of any study-specific factors. For example, if there is a problem during the genotyping phase, it will appear also in any training and test data splits. These errors, stemming from problems during the experimental design and/or quality control have led for the need to re-evaluate the established methods and use caution when claiming replication . The recommended option for truly validating the generalizability of predictive risk models is to make use of a large enough set of independent samples in which there is no overlap between the examined cohorts . However, here one should consider whether the aim is to validate the predictive model itself (e.g. using external cross-validation or independent validation samples), or the predictive variants selected by the model (replication of the model construction or its application to separate cohorts) .
Through the development of better model validation techniques and unbiased examination of all feature subsets in genome-wide scale, we are likely to continuously improve the accuracy of the predictive models and increase their reproducibility on independent population samples. A challenge here is that differences in the population genetic structure, attributable to confounding factors such as the ethnicity or ancestry of the subjects, may result in highly heterogeneous datasets with a number of hidden subject sub-groups, which may associate with divergent disease phenotypes and therefore cause an increased false-positive rates . Related to this, while there are comparisons among various feature selection methods and predictive modeling frameworks on individual cohorts [23, 24, 27], there is not yet any definitive results whether one method will universally lead to optimal results in other subject cohorts or populations. Such confounding variability should also be taken into account in the model construction and evaluation, perhaps in some form of population stratified cross-validation. Failure to replicate a genetic association should not only be considered as a negative result, as it may also provide important clues about genetic architecture among study populations or genetic interactions among risk variants . When epistasis interactions are involved, then it is likely that simple methods, such as single-locus filters, will not alone be able to provide most optimal results, while in extremely large datasets, wrapper methods may pose computational limitations if combined with complex prediction models. Finally, even though the improvements obtained by the machine learning wrappers, compared to those from the traditional p-value based filters, may seem quite modest (e.g. Figure 1), it may turn out that even slight improvements in the predictive accuracy can result in significant clinical benefits. Moreover, it is argued that the modest predictive improvements may be further aggregated through pathway and network-level analyses of the selected variants.
Molecular networks as a prior information for constructing predictive models
Even in the absence of significant single-locus marginal effects, multiple genetic loci from a number of molecular pathways may act synergistically and lead to disease phenotype when combined. Therefore, it has become popular to map the genetic loci identified in GWA or NGS studies to established biological pathways in order to elucidate the potential cellular mechanisms behind the observed genetic and phenotypic variation. There exist a wide variety of tools and guidelines on how to implement such pathway analyses in the context of genetic association studies [52, 53, 54, 55, 56]. Building on approaches originally developed in the context of microarray gene expression experiments, the common theme in the pathway analysis approaches is that they examine whether a group of related loci in the same biological pathway are jointly associated with a trait of interest. In line with the observations in microarray gene expression studies, it has been shown that in those cases where there is only a modest overlap in the variant or gene-level findings between different studies, due to factors such as differences in the genetic structure, the pathway-level associations may be much more reproducible even between different study populations [57, 58, 59, 60]. These findings support the concept that individuals with the same disease phenotype may have marked inter-individual genetic heterogeneity in the sense that their disease predisposing variants may lie in distinct loci within the same or related pathways . Machine learning-based predictive models constructed upon gene expression profiling have already shown the benefits of using pathway activities as features in terms of improved classification accuracy, compared to those models that consider merely individual gene expression levels . It has also been demonstrated in the context of GWA datasets that pathway analysis can provide not only mechanistic insights but also improved discrimination power using tailored statistical data mining techniques, such as HyperLasso  or so-called pathways of distinction analysis (PoDA) .
Toward more systematic network-centric analysis of genetic variants on a genome-wide scale, molecular interaction networks can be used as a priori information in the predictive models, in the form e.g. filters or integrators, with the aim of either reducing the massive size of the search space in the variant selection process or boosting the signal-to-noise ratio through external knowledge incorporated in terms of physical or functional molecular networks [69, 70]. Network graphs provide a convenient mathematical framework for modeling, integrating and mining high-dimensional genomic datasets, in which to present the relationships among genetic loci, genes and diseases [64, 69, 70, 71, 72]. Successful examples of combining individual-level gene expression measurements with background networks of physical interactions between proteins and transcription factor targets have demonstrated that it is possible to identify and make use of disease-specific sub-networks, so-called modules, in order to reduce both the number of false positives and negatives, caused by factors such as technical variability and genetic heterogeneity, respectively, as well as to improve individual-level prediction of clinical outcomes, such as cancer metastasis or survival time [64, 73, 74, 75]. There are also studies in the context of GWA datasets, which motivate the use of network connectivity structures, such as sub-network modules or highly-connected network hubs [22, 64, 76, 77, 78], as aggregate features in the disease prediction models. However, what has been largely missing is a systematic approach that could combine network topology as a priori information when constructing predictive models. Recently, a particularly interesting approach was introduced as a principled method that uses genetic algorithms guided by the structure of a given gene interaction network to discover small groups of connected variants, which are jointly associated with a disease outcome on a genome-wide scale . Combined with more efficient, wrapper-type of search algorithms, such network-guides feature selection approaches could be scaled-up in the future to enable extracting also larger sub-networks with improved predictive capability.
Future directions: lessons from model organisms and individualized medicine
Given the rather modest progress made so far in pursuing the expensive and suboptimal route of current drug discovery, there has been much interest lately in moving towards personalized medicine strategies [80, 81]. Another major paradigm shift in disease treatment is moving away from the traditional 'one target, one drug' strategy towards the so-called network pharmacology, a novel paradigm which provides more global understanding of the mechanisms behind disease processed and drug action by considering drug targets in their context of biological networks and pathways . These emerging paradigms can offer holistic information on disease networks and drug responses, with the aim of identifying more effective drug targets and their combinations tailored for individualized treatment strategies. A prime challenge in developing such strategies is to understand how genes function as interaction networks to carry out and regulate cellular processes, and how perturbations in these cellular networks cause certain phenotypes, such as human diseases, in some individuals, but not in the others. There has been active research in model organisms addressing the question why disease causing mutations do not cause the disease in all individuals . Recent studies in yeast Saccharomyces cerevisiae, worm Caenorhabditis elegans, and fly Drosophila melanogaster have demonstrated the importance of incorporating functional genetic interaction partners of the mutated genes in the prediction of phenotypic variation and mutational outcomes at an individual level [83, 84, 85]. Pilot studies in human trials have also suggested that personal genomic approaches, such as those based on GWA or NGS studies, may indeed yield useful and clinically relevant information for individual patients [1, 2]. However, a number of experimental, modeling and computational challenges have to be solved before the promises of personalized medicine can be translated into routine clinical practice [5, 81, 86].
From the experimental point of view, the whole-genome sequencing efforts will enable us to delve deeper into the individual genomes by elucidating the role of low-frequency variants in the genetic architecture of complex diseases. The sequencing efforts, such as the 1000 Genomes project , are also being used to subsequently extend the coverage of the existing GWA datasets by means of imputation methods and population-specific reference haplotypes [87, 88]. However, while the emerging shift from population-level common variants toward individual-level rare or even personal variants holds great promise for medical research, it also represents with unique modeling challenges; in particular, the traditional statistical modeling frameworks that were developed under settings where the number of study samples greatly exceeds the number of study variables may not to be ideally suited for the personalized medicine settings, in which the individuals and disease subtypes are stratified into increasingly smaller subgroups . Although machine learning methods are better targeted at individual-level prediction making, the feature selection methods would also benefit from more stratified options, for instance, in terms of enabling phenotype-specific genetic features, rather than assuming that all subjects share the same panel of predictive genotypes. Also, since the binary disease outcomes, typically in the form of case or control dichotomy, may not provide the most reliable study phenotypes, the predictive modeling frameworks might become more successful for predicting quantitative phenotypic traits [90, 91, 92]. This also raises related modeling questions, such as how to encode imputed variants (e.g. expected or most likely genotype), how to treat missing data (exclude or impute), or how to model the variants and their interactions (multiplicative, additive, recessive or dominant models) [90, 91, 92, 93, 94]; these all may have an important effect on the prediction performance, especially in the presence of epistatic interactions at an individual level.
From the computational perspective, the ever increasing sizes of the raw NGS and imputed GWA datasets pose great challenges to the computational algorithms. For instance, while systematic genetic mappings in model organisms have revealed widespread genetic interactions within individual species [85, 95, 96, 97], epistasis interactions have remained extremely difficult to identify on a global scale in human populations. This can be attributed to the vast number of potential interaction partners, along with complex genotype-phenotype relationships and their individual-level differences. Improvements in computational performance have recently been obtained through effective usage of computer hardware, for instance, through graphics processing units, Cloud-based computing environments, or multithread parallelization, when exploring genetic variants or their interactions in GWA studies [98, 99, 100, 101]. Furthermore, since the memory consumption in the high-dimensional NGS applications can form even a tighter bottleneck than the running time, there is also a need to develop space-efficient implementations, which trade running time for decreased memory consumption . Lessons from model organisms, such as yeast, have also demonstrated that data integration between complementary screening approaches, either functional or physical assays, can reveal novel genetic interactions and their modular organization which have gone undetected by any of the individual approaches alone [95, 96, 102]. Also, integrating diverse phenotypic readouts facilitates genetic interaction screens , and Bayesian models have been shown especially useful for making use of multiple traits, gene-gene or gene-environment interactions in disease risk prediction . Finally, visualization algorithms that can capture the hierarchical modularity of the physical and functional interaction networks may help reveal interesting biological patterns and relationships within the data, such as pathway components and biological processes, which can be further investigated by follow-up computational and/or experimental analyses .
Better understanding of the general design principles underlying genetic interaction networks in model organisms can provide important insights into the relationships between genotype and phenotype, toward better understanding and treating also complex human diseases, such as cancers. Cancer phenotypes are known to arise and develop from various genetic alterations, and therefore the same therapy often results in different treatment responses. Moreover, the underlying genetic heterogeneity results in alterations within multiple molecular pathways, which lead to various cancer phenotypes and make most tumors resistant to single agents. Cancer sequencing efforts, such as The Cancer Genome Atlas (TCGA), are systematically characterizing the structural basis of cancer, by identifying the genomic mutations associated with each cancer type. These efforts have revealed tremendous inter-individual mutational and phenotypic heterogeneity, which renders it difficult to translate the genetic information into clinically actionable individualized treatment strategies [106, 107, 108]. Therefore, integrating the structural genomic information with systematic functional assessment of genes for their contribution to genetic dependencies and cancer vulnerabilities, such as oncogenic addictions or synthetic lethalities [109, 110], is likely needed for providing more comprehensive insight into the molecular mechanisms and pathways behind specific cancer types and for improving their prevention, diagnosis and treatment [106, 111]. Machine learning-based predictive modeling approaches are well-powered to make the most of the exciting functional and genetic screens toward revealing hidden genetic variants and their interactions behind cancer and other complex phenotypes. When combined with network analyses, these integrated systems medicine approaches may offer the possibility to identify key players and their relationships responsible for multi-factorial behavior in disease networks, with many diagnostic, prognostic and pharmaceutical applications.
The authors thank Dr. Antti Airola for his help with the RLS algorithm, the two expert reviewers, Prof. Greg Gibson and Prof. Jason Moore, for their constructive comments, and CSC, the Finnish IT center for science, for providing us with extensive computational resources. The work was supported by the Academy of Finland (grants 120 569, 133 227 and 140 880 to T.A. and 134020 to T.P.), the Turku Centre for Computer Science (TUCS), Turun Yliopistosäätiö and the Finnish Cultural Foundation. This study makes use of data generated by the Wellcome Trust Case-Control Consortium (WTCCC). A full list of the investigators who contributed to the generation of the WTCCC data is available from http://www.wtccc.org.uk. Funding for the WTCCC project was provided by the Wellcome Trust under award 076113.
- 3.Wellcome Trust Case Control Consortium: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447 (7145): 661-678. 10.1038/nature05911.Google Scholar
- 19.1000 Genomes Project: A map of genome variation from population-scale sequencing. Nature. 2010, 467 (7319): 1061-1073. 10.1038/nature09534.Google Scholar
- 22.Barrenäs F, Chavali S, Alves AC, Coin L, Jarvelin MR, Jörnsten R, Langston MA, Ramasamy A, Rogers G, Wang H, Benson M: Highly interconnected genes in disease-specific networks are enriched for disease-associated polymorphisms. Genome Biol. 2012, 13 (6): R46. 10.1186/gb-2012-13-6-r46.PubMedPubMedCentralGoogle Scholar
- 23.Pahikkala T, Okser S, Airola A, Salakoski T, Aittokallio T: Wrapper-based selection of genetic features in genome-wide association studies through fast matrix operations. Algorithm Mol Biol. 2012, 7 (1): 11. 10.1186/1748-7188-7-11.Google Scholar
- 24.Okser S, Lehtimäki T, Elo LL, Mononen N, Peltonen N: Genetic Variants and Their Interactions in the Prediction of Increased Pre-Clinical Carotid Atherosclerosis: The Cardiovascular Risk in Young Finns Study. PLoS Genet. 2010, 6 (9): e1001146. 10.1371/journal.pgen.1001146.PubMedPubMedCentralGoogle Scholar
- 31.Lee S, Emond MJ, Bamshad MJ, Barnes KC, Rieder MJ, Nickerson DA, Christiani DC, Wurfel MM, Lin X, NHLBI GO Exome Sequencing Project—ESP Lung Project Team: Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am J Hum Genet. 2012, 91 (2): 224-237. 10.1016/j.ajhg.2012.06.007.PubMedPubMedCentralGoogle Scholar
- 35.Guyon I, Elisseeff A: An introduction to variable and feature selection. J Mach Learn Res. 2003, 3: 1157-1182.Google Scholar
- 39.Aha DW, Bankert RL: A comparative evaluation of sequential feature selection algorithms. Learning from Data: Artificial Intelligence and Statistics V, Lecture Notes in Statistics. Edited by: Fisher DH, Lenz HJ. 1996, New York: Springer-Verlag, 199-206.Google Scholar
- 57.Srinivasan BS, Doostzadeh J, Absalan F, Mohandessi S, Jalili R, Bigdeli S, Wang J, Mahadevan J, Lee CL, Davis RW, William Langston J, Ronaghi M: Whole genome survey of coding SNPs reveals a reproducible pathway determinant of Parkinson disease. Hum Mutat. 2009, 30 (2): 228-238. 10.1002/humu.20840.PubMedGoogle Scholar
- 66.Lavender NA, Rogers EN, Yeyeodu S, Rudd J, Hu T, Zhang J, Brock GN, Kimbro KS, Moore JH, Hein DW, Kidd LC: Interaction among apoptosis-associated sequence variants and joint effects on aggressive prostate cancer. BMC Med Genomics. 2012, 5: 11. 10.1186/1755-8794-5-11.PubMedPubMedCentralGoogle Scholar
- 74.Winter C, Kristiansen G, Kersting S, Roy J, Aust D, Knösel T, Rümmele P, Jahnke B, Hentrich V, Rückert F, Niedergethmann M, Weichert W, Bahra M, Schlitt HJ, Settmacher U, Friess H, Büchler M, Saeger HD, Schroeder M, Pilarsky C, Grützmann R: Google goes cancer: improving outcome prediction for cancer patients by network-based ranking of marker genes. PLoS Comput Biol. 2012, 8 (5): e1002511. 10.1371/journal.pcbi.1002511.PubMedPubMedCentralGoogle Scholar
- 77.Baranzini SE, Galwey NW, Wang J, Khankhanian P, Lindberg R, Pelletier D, Wu W, Uitdehaag BM, Kappos L, Polman CH, Matthews PM, Hauser SL, Gibson RA, Oksenberg JR, Barnes MR, GeneMSA Consortium: Pathway and network-based analysis of genome-wide association studies in multiple sclerosis. Hum Mol Genet. 2009, 18 (11): 2078-2090. 10.1093/hmg/ddp120.PubMedPubMedCentralGoogle Scholar
- 85.Huang W, Richards S, Carbone MA, Zhu D, Anholt RR, Ayroles JF, Duncan L, Jordan KW, Lawrence F, Magwire MM, Warner CB, Blankenburg K, Han Y, Javaid M, Jayaseelan J, Jhangiani SN, Muzny D, Ongeri F, Perales L, Wu YQ, Zhang Y, Zou X, Stone EA, Gibbs RA, Mackay TF: Epistasis dominates the genetic architecture of Drosophila quantitative traits. Proc Natl Acad Sci USA. 2012, 109 (39): 15553-15559. 10.1073/pnas.1213423109.PubMedPubMedCentralGoogle Scholar
- 86.Corander J, Aittokallio T, Ripatti S, Kaski S: The rocky road to personalized medicine: computational and statistical challenges. Personalized Med. 2012, 9 (2): 109-114. 10.2217/pme.12.1.Google Scholar
- 87.Surakka I, Kristiansson K, Anttila V, Inouye M, Barnes C, Moutsianas L, Salomaa V, Daly M, Palotie A, Peltonen L, Ripatti S: Founder population-specific HapMap panel increases power in GWA studies through improved imputation accuracy and CNV tagging. Genome Res. 2010, 20 (10): 1344-1351. 10.1101/gr.106534.110.PubMedPubMedCentralGoogle Scholar
- 88.Holm H, Gudbjartsson DF, Sulem P, Masson G, Helgadottir HT, Zanon C, Magnusson OT, Helgason A, Saemundsdottir J, Gylfason A, Stefansdottir H, Gretarsdottir S, Matthiasson SE, Thorgeirsson GM, Jonasdottir A, Sigurdsson A, Stefansson H, Werge T, Rafnar T, Kiemeney LA, Parvez B, Muhammad R, Roden DM, Darbar D, Thorleifsson G, Walters GB, Kong A, Thorsteinsdottir U, Arnar DO, Stefansson K: A rare variant in MYH6 is associated with high risk of sick sinus syndrome. Nat Genet. 2011, 43 (4): 316-320. 10.1038/ng.781.PubMedPubMedCentralGoogle Scholar
- 91.Sebastiani P, Solovieff N, Dewan AT, Walsh KM, Puca A, Hartley SW, Melista E, Andersen S, Dworkis DA, Wilk JB, Myers RH, Steinberg MH, Montano M, Baldwin CT, Hoh J, Perls TT: Genetic signatures of exceptional longevity in humans. PLoS One. 2012, 7 (1): e29848. 10.1371/journal.pone.0029848.PubMedPubMedCentralGoogle Scholar
- 92.Ober U, Ayroles JF, Stone EA, Richards S, Zhu D, Gibbs RA, Stricker C, Gianola D, Schlather M, Mackay TF, Simianer H: Using whole-genome sequence data to predict quantitative trait phenotypes in Drosophila melanogaster. PLoS Genet. 2012, 8 (5): e1002685. 10.1371/journal.pgen.1002685.PubMedPubMedCentralGoogle Scholar
- 105.Tuikkala J, Vähämaa H, Salmela P, Nevalainen OS, Aittokallio T: A multilevel layout algorithm for visualizing physical and genetic interaction networks, with emphasis on their modular organization. BioData Min. 2012, 26 (5): 2-Google Scholar
- 111.Heiskanen MA, Aittokallio T: Mining high-throughput screens for cancer drug targets—lessons from yeast chemical-genomic profiling and synthetic lethality. Wiley Interdisciplinary Rev: Data Min Knowl Discov. 2012, 2 (3): 263-272. 10.1002/widm.1055.Google Scholar
- 112.Huang DW, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nat Protocol. 2009, 4 (1): 44-57.Google Scholar
- 113.Huang DW, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009, 37 (1): 1-13. 10.1093/nar/gkn923.Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.