Multiscale modeling of the causal functional roles of nsSNPs in a genome-wide association study: application to hypoxia
It is a great challenge of modern biology to determine the functional roles of non-synonymous Single Nucleotide Polymorphisms (nsSNPs) on complex phenotypes. Statistical and machine learning techniques establish correlations between genotype and phenotype, but may fail to infer the biologically relevant mechanisms. The emerging paradigm of Network-based Association Studies aims to address this problem of statistical analysis. However, a mechanistic understanding of how individual molecular components work together in a system requires knowledge of molecular structures, and their interactions.
To address the challenge of understanding the genetic, molecular, and cellular basis of complex phenotypes, we have, for the first time, developed a structural systems biology approach for genome-wide multiscale modeling of nsSNPs - from the atomic details of molecular interactions to the emergent properties of biological networks. We apply our approach to determine the functional roles of nsSNPs associated with hypoxia tolerance in Drosophila melanogaster. The integrated view of the functional roles of nsSNP at both molecular and network levels allows us to identify driver mutations and their interactions (epistasis) in H, Rad51D, Ulp1, Wnt5, HDAC4, Sol, Dys, GalNAc-T2, and CG33714 genes, all of which are involved in the up-regulation of Notch and Gurken/EGFR signaling pathways. Moreover, we find that a large fraction of the driver mutations are neither located in conserved functional sites, nor responsible for structural stability, but rather regulate protein activity through allosteric transitions, protein-protein interactions, or protein-nucleic acid interactions. This finding should impact future Genome-Wide Association Studies.
Our studies demonstrate that the consolidation of statistical, structural, and network views of biomolecules and their interactions can provide new insight into the functional role of nsSNPs in Genome-Wide Association Studies, in a way that neither the knowledge of molecular structures nor biological networks alone could achieve. Thus, multiscale modeling of nsSNPs may prove to be a powerful tool for establishing the functional roles of sequence variants in a wide array of applications.
Keywords: Notch Signaling; Driver Mutation; Allosteric Regulation; Hypoxia Tolerance; Pfam Family
List of abbreviations
EGFR: Epidermal Growth Factor Receptor
RTK: Receptor Tyrosine Kinase
Mutation Seeded Subnetwork
Recent advances in next-generation sequencing have generated abundant genetic variant and "omics" data. Together, these extremely large, multidimensional datasets present an exciting opportunity to identify genes and to predict pathways likely to be involved in diseases and traits. However, the complexity of these data sources, together with the broad spectrum of phenotypes, challenges the quest to uncover the genetic, molecular, and cellular mechanisms that underlie phenotypes [1, 2, 3]. A major challenge in deciphering the genetic basis of multigenic diseases or traits is to distinguish driver mutations that impact the survival or reproduction of a particular phenotype (e.g., cancer) from passenger mutations that do not confer a selective advantage. Standard genome sequence analysis cannot detect all driver mutations because of difficulties in estimating the background mutation rate and the underlying genetic heterogeneity of adaptive phenotypes [4, 5]. Statistical machine learning techniques (e.g., SNAP) provide an alternative approach by learning from annotated mutation data. However, the "black-box" nature of machine learning makes it difficult to interpret the novel functional roles of mutations. In parallel with the development of new genotyping and phenotyping techniques, a number of novel computational tools have been developed to integrate and analyze genetic and omics data with the aim of establishing statistical causal relationships between genetic markers, genome-wide molecular signatures, and organismal phenotypes [7, 8, 9, 10, 11, 12, 13]. For example, co-expression and Bayesian network models derived from DNA variants and genome-wide transcriptional profiles have been applied to identify causal disease genes, cancer drivers [10, 15], and master regulators of cancer [16, 17, 18].
Although great efforts have been made to address the n << p problem, where the number of observations n (e.g., gene expression measurements under different conditions) is much smaller than the number of variables or parameters p (e.g., all measured genes), the power of these statistics-based techniques remains limited when sample sizes are small. Moreover, a complex phenotype is often associated with interactions among multiple causal genes (epistasis), none of which alone is sufficient to drive phenotypic change. It is challenging for statistical methods to identify epistasis given the large number of possible interactions. Fundamentally, the "causal" relationships inferred from these methods are mathematical correlations. They may not provide biological insight into the underlying molecular and cellular mechanisms that associate genotypes with phenotypes.
To demonstrate the feasibility of our approach, we apply multiscale modeling to reveal the genetic, molecular, and cellular basis of hypoxia, a physiological condition in which the cell is deprived of an adequate oxygen supply. The hypoxia-induced phenotype has been related to multiple pathological conditions, including cancer. Cells, tissues, and organisms have developed different strategies to survive low oxygen levels; however, the underlying molecular mechanisms contributing to hypoxia tolerance remain unclear. With the aim of rendering mammalian cells and tissues resistant to a low-O2 environment, Drosophila melanogaster (D. melanogaster) has been used as a model system to investigate the mechanisms underlying hypoxia tolerance. Through long-term laboratory selection, Zhou et al. have generated D. melanogaster populations that tolerate severe, normally lethal, levels of hypoxia. Microarray analysis identified several adaptive changes in the hypoxia-selected flies. Comparison between the genome sequences of hypoxia-selected flies and those of controls identified 107 amino acid mutations in 52 genes. These data provide us with an unparalleled opportunity to understand the genetic, molecular, and cellular basis of the hypoxia tolerance phenotype and to develop new computational tools to establish causal genotype-phenotype associations, which can be validated through controlled experiments. Because gene expression profiles were measured for only one condition in the hypoxia tolerance phenotype, conventional co-expression approaches are not applicable to this study. Although the hypotheses generated from this study have been experimentally validated by us and are consistent with experimental results from others, the sensitivity and specificity of the method have not been fully evaluated.
In the future, we will extensively test our method using large case-control datasets from public databases such as the NCBI database of Genotypes and Phenotypes (dbGaP) and the Wellcome Trust Case Control Consortium (WTCCC).
Knowledge-driven network inference of driver mutations responsible for hypoxia tolerance
Table 1. Predicted driver mutations and core pathways for hypoxia tolerance in Drosophila melanogaster, based on multiple lines of evidence.
Columns: Mutated gene (annotation symbol); FDR-corrected p-value for the overrepresentation of signaling pathways; Shortest-path distance (z-score), up/down; Functional role of nsSNP inferred from structural modeling; Expected accuracy (%) of non-neutral mutation from SNAP; Human ortholog and hypoxia association
Possible DNA binding
histone deacetylase 4
AR of catalytic activity
calcium-dependent cysteine-type endopeptidase
AR of substrate binding
AR of substrate binding
Structural analysis of functional roles of nsSNPs
Structural modeling of nsSNPs
Structural roles of putative driver mutations
Machine learning based prediction of non-neutral nsSNPs
The functional importance of the nsSNPs is further supported by SNAP, software that predicts whether a given nsSNP is neutral or non-neutral and reports an expected accuracy. In a benchmark study, SNAP outperformed most similar methods. Of the 107 nsSNPs, 23, located on 18 genes, are predicted to be non-neutral with an expected accuracy higher than 58% (SNAP reliability index 0) (Additional File 1 Table S4). Five predicted non-neutral mutations are hypothesized to be putative drivers. Two of them (in H and CG33714) have an expected accuracy of over 80%. The remaining predictions have lower expected accuracies. This could imply that while the functional impact of each individual mutation is limited, collectively they may mediate signaling pathway activity through epistasis.
Several mutations in CG31220 (Additional File 1 Table S4), a serine-type peptidase, are predicted to be non-neutral by SNAP. These mutations map to the substrate binding sites or other functionally important regions in the structure (Additional File 1 Figure S1). However, no enriched biological pathways associated with this gene were detected. More studies are required to understand how these non-neutral mutations impact the biological network.
Experimental and literature support
As discussed above, a complex phenotype arises from re-regulated biological pathways that themselves result from the collective effects of multiple genetic mutations (epistasis). Since the down- or up-regulation of core pathways directly impacts the organismal phenotype, experimental validation of a core pathway provides strong evidence for the predicted driver mutations that are responsible for its re-regulation. Indeed, we have experimentally validated that Notch signaling is the core pathway of hypoxia tolerance in D. melanogaster. Reducing the activation of Notch signaling with a specific γ-secretase inhibitor significantly reduces the survival and life span of hypoxia-tolerant D. melanogaster strains. The critical role of Notch signaling in hypoxia tolerance is further supported by UAS-Gal4 over-expression and RNAi knockdown of genes involved in Notch signaling. Other experimental evidence from the literature, as detailed below, also supports our predictions. The top-ranked H gene (also called hairless) is a well-known regulator of Notch signaling in D. melanogaster. Dys encodes the protein dystrophin. Genetic interaction screens in D. melanogaster have shown that Dys interacts with components of the Notch signaling pathway. Furthermore, mutation of the Dys homolog in a mouse model is related to the up-regulation of the Notch-beta pathway. For other genes, although little direct experimental evidence supports an association with hypoxia in D. melanogaster, their functional roles in hypoxia have been demonstrated in cancer and other human diseases. HDAC4 regulates hypoxia-inducible factor 1α (HIF1α) and the cancer cell response to hypoxia. GalNAc-T2 is an N-acetylgalactosaminyl transferase that catalyzes the synthesis of glycosphingolipid (GSL). A recent study has shown that GSL may directly regulate the activity of Notch signaling.
Wnt5 is a ligand for a family of frizzled receptors and acts as a regulator of Wnt signaling. An increasing body of evidence suggests that Wnt and Notch signaling cooperatively determine cell fate in humans [36, 37, 38, 39, 40, 41, 42]. The association between Rad51D and hypoxia has been demonstrated in cancer. Ulp1 is a SUMO-specific protease that is essential for the stabilization of HIF1α during hypoxia, which it achieves by removing SUMO, and it participates in the regulation of hypoxia-responsive genes.
The important functional role of allosteric regulation, protein-protein interactions, and protein-nucleic acid interactions in sequence variants
In this study, none of the driver mutations associated with hypoxia are conserved functional site residues, nor are they responsible for structural stability. The driver mutations are hypothesized to be involved in either protein-protein interactions (in the case of Rad51D), protein-nucleic acid interactions (e.g., in CG33714), or allosteric regulation (e.g., in HDAC4). A recent survey of the structural basis of in-frame mutations in protein-protein interactions has suggested that changes in specific interactions play a critical role in pathogenesis. From a network point of view, the modification of protein-protein interactions, rather than the proteins themselves, may have a significant impact on network properties. Recent progress in the ENCODE and modENCODE projects highlights the critical functional roles of non-coding DNAs in the regulation of biological processes [46, 47]. As a large number of non-coding DNAs perform their functions through specific protein-nucleic acid interactions, the mutations that impact protein-nucleic acid binding could be directly associated with phenotype changes. The dysregulation of allosteric interactions is considered to be another major determinant of disease. During evolution, organisms need to survive and reproduce in a changed environment. As such, certain genes need to gain functions and activate critical pathways. Allosteric regulation is an efficient way for driver mutations to act since the change of activity is not constrained to a single molecule, but can be propagated to a whole network. New computational methods that are able to identify "hot spots" in protein-protein interactions, protein-nucleic acid recognition, and allosteric regulation, in which a mutation may cause the dysregulation of biological pathways, may have a significant impact on the interpretation of Genome-Wide Association Studies.
The relevance of D. melanogaster driver mutations to human hypoxia adaptation
Recently, several studies of hypoxia adaptation in humans have been performed on Tibetans [49, 50], Andeans, and Ethiopians. However, all human studies to date have adopted limited, sampling-based approaches, such as genotyping or exome sequencing. The relatively sparse sampling of the genome makes it harder to identify large-scale shifts in the allele frequency spectrum associated with natural selection. Consequently, these studies restricted subsequent analysis to variants in candidate genes that are mainly involved in the canonical hypoxia response (HIF pathway) and related pathways. Identifying the functional roles of sequence variants in human orthologs of Drosophila genes may provide critical insight for prioritizing candidate genes in humans that conventional statistical techniques may miss. Indeed, the majority of the genes carrying driver mutations identified in this study have human orthologs that are associated with the hypoxia cellular phenotype, as shown in Table 1.
Based upon multiscale modeling, we propose that the up-regulation of Notch and Gurken/EGFR and the down-regulation of Toll and Torso/RTK pathways are responsible for hypoxia tolerance. Using integrated structural and network analysis, we hypothesize that nsSNPs in H, Rad51D, Ulp1, Sol, Wnt5, CG33714, GalNAc-T2, Dys, and HDAC4, may all lead to the functional modification of these genes via allosteric regulation and protein-protein/DNA/RNA interactions and hence are driver mutations defining the hypoxia tolerance phenotype. Our predictions are supported by experimental evidence [23, 26]. Moreover, multiscale modeling may identify potential epistasis using a very small sample size. This reduces the burden imposed during statistical multiple testing of large epistasis models. It is anticipated that the further extension of this multiscale modeling approach to genome-wide protein-protein interactions, protein-nucleic acid interactions, and microRNA data will provide a powerful tool for uncovering the functional roles of both coding and non-coding sequence variations in GWAS; a role which neither the knowledge of molecular structures nor of biological networks alone can achieve. However, challenges remain in extending multiscale modeling approaches. New algorithms are required to predict emergent properties, at both molecular and network levels, as well as to seamlessly model information flow across scales.
Prediction of non-neutral nsSNPs from sequence
A sequence-based method, SNAP, is used to classify nsSNPs as non-neutral (functional effect) or neutral (no functional effect).
Knowledge-driven network inference of core pathways and driver mutations
The network-based analysis of driver mutations is shown in Figure 2. The mutated genes and differentially regulated genes are mapped to a protein-protein interaction (PPI) network extracted from the STRING database for D. melanogaster. A subnetwork that connects a mutated gene to up- and down-regulated genes is identified using a shortest-path search of the PPI network. The genes identified in each subnetwork are subjected to Gene Set Enrichment Analysis (GSEA). If the genes in the subnetwork are enriched in essential biological processes or pathways, the mutated gene is a potential driver.
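The subnetwork construction described above can be sketched as a breadth-first search from one mutated gene to each differentially regulated gene. This is only an illustration under stated assumptions: the PPI network is treated as an unweighted, undirected adjacency dictionary, whereas the study's treatment of STRING edge confidence is not specified here, and the function names and the toy network (genes `M`, `a`, `b`, `c`, `d`, `z`) are hypothetical.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Return one shortest path (list of nodes) from source to target, or None."""
    if source == target:
        return [source]
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in parent:
                continue
            parent[neighbor] = node
            if neighbor == target:
                # Walk back through the parents to rebuild the path.
                path = [neighbor]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


def mutation_subnetwork(graph, mutated_gene, regulated_genes):
    """Union of shortest paths from a mutated gene to each regulated gene.

    Returns the set of subnetwork nodes and the path length (in edges)
    to every regulated gene that is reachable.
    """
    nodes, lengths = set(), {}
    for gene in regulated_genes:
        path = shortest_path(graph, mutated_gene, gene)
        if path is not None:
            nodes.update(path)
            lengths[gene] = len(path) - 1
    return nodes, lengths


# Hypothetical toy network: M is the mutated gene, b and c are regulated,
# z is regulated but disconnected from M.
toy_ppi = {
    "M": ["a", "d"],
    "a": ["M", "b"],
    "b": ["a", "c"],
    "c": ["b"],
    "d": ["M"],
    "z": [],
}
toy_nodes, toy_lengths = mutation_subnetwork(toy_ppi, "M", ["b", "c", "z"])
```

The path lengths returned here are the quantities that a later step could compare against randomly seeded background networks.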
Analysis of differentially expressed genes
A cDNA microarray analysis of 13,061 known or predicted genes from the D. melanogaster genome is performed using the R package. K-nearest neighbors in the space of genes is used to impute missing expression values. The LOWESS normalization method is used to normalize the raw density data. P-values and fold changes are calculated for each gene, the former using a two-sided, two-class t-test. A Bonferroni-Holm false discovery rate (FDR) controlling procedure [58, 59] is used to adjust the P-values. Genes are considered to be differentially expressed between the two samples when the FDR is smaller than 0.05. Among these, genes with a fold change larger than 1.5 (up-regulated) or smaller than 0.67 (down-regulated) are considered significantly differentially expressed.
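The thresholding step above can be illustrated with a short sketch. It assumes that per-gene P-values and fold changes have already been computed, and it implements the Holm step-down adjustment because that is the procedure named in the text (Holm's method is usually described as controlling the family-wise error rate, so the adjusted values here are a conservative stand-in for an FDR). The function names and the toy records are hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        value = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, value)
        adjusted[idx] = running_max
    return adjusted


def classify_genes(records, alpha=0.05, up=1.5, down=0.67):
    """Label each gene 'up', 'down', or 'unchanged'.

    records: list of (gene, p_value, fold_change) tuples.
    """
    adjusted = holm_adjust([p for _, p, _ in records])
    calls = {}
    for (gene, _, fold_change), adj_p in zip(records, adjusted):
        if adj_p < alpha and fold_change > up:
            calls[gene] = "up"
        elif adj_p < alpha and fold_change < down:
            calls[gene] = "down"
        else:
            calls[gene] = "unchanged"
    return calls


# Hypothetical toy input, not data from the study.
toy_records = [
    ("g1", 0.001, 2.0),
    ("g2", 0.004, 0.5),
    ("g3", 0.03, 1.8),
    ("g4", 0.2, 1.1),
]
toy_calls = classify_genes(toy_records)
```

In this toy input, `g3` passes the fold-change cutoff but not the adjusted significance cutoff, which shows why both conditions are checked together.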
Subnetwork construction by shortest path search
Where s^2 is the unbiased estimator of the variance of the sample and n is the number of participants.
Here the t-value is used to measure the difference between the identified subnetwork (x2) and a background random network (x1). Background random networks are built by randomly selecting one gene as a source node and a set of other genes as destination nodes. A positive t-value means a shorter than average path. The mutations on the genes with statistically significant high t-values are prioritized as driver mutations.
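As an illustration of the comparison described above, the sketch below computes a t-value from two lists of shortest-path lengths. The exact form of the statistic is an assumption: it uses the unequal-variance (Welch) form, t = (mean(x1) - mean(x2)) / sqrt(s1^2/n1 + s2^2/n2), chosen because it matches the stated definitions of s^2 and n and gives a positive value when the subnetwork paths (x2) are shorter than the background paths (x1). The function name and the toy numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def path_t_value(background_lengths, subnetwork_lengths):
    """Welch-style t-value comparing background (x1) with subnetwork (x2) path lengths.

    statistics.variance is the unbiased (n - 1) sample variance.
    A positive result means the subnetwork paths are shorter on average.
    """
    n1, n2 = len(background_lengths), len(subnetwork_lengths)
    s1_sq = variance(background_lengths)
    s2_sq = variance(subnetwork_lengths)
    standard_error = sqrt(s1_sq / n1 + s2_sq / n2)
    return (mean(background_lengths) - mean(subnetwork_lengths)) / standard_error


# Hypothetical path lengths, not values from the study.
toy_background = [4, 5, 6, 5]
toy_subnetwork = [2, 3, 2, 3]
toy_t = path_t_value(toy_background, toy_subnetwork)
```

Swapping the two arguments flips the sign, so the order (background first, subnetwork second) is what ties a positive value to a shorter-than-average path.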
Gene set overrepresentation analysis to identify driver biological pathways and mutations
The Biological Networks Gene Ontology Tool (BiNGO) is applied in Cytoscape's visualization environment to determine which biological processes and molecular functions are significantly overrepresented in the set of genes involved in each subnetwork. Gene Ontology terms are ranked according to the False Discovery Rate (FDR) corrected p-values for each subnetwork. The statistically significant enriched biological pathways (p-value < 0.05) are considered potential core pathways that contribute to the survival or reproduction of a phenotype. These pathways are subject to further validation by experiments and literature searches. If a subnetwork contains a validated core pathway, the mutated gene in this subnetwork is hypothesized to be a causal gene. Correspondingly, the mutations in this gene are candidate driver mutations.
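The overrepresentation test behind this step can be sketched as follows. The sketch assumes a one-sided hypergeometric test followed by a Benjamini-Hochberg correction, which is a common configuration for this kind of analysis; the text does not state which test and correction were selected in BiNGO. It does not call BiNGO or Cytoscape, and all names and numbers are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) for a subnetwork of n genes drawn from N annotated genes,
    K of which carry the term, when k subnetwork genes carry it."""
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # rank of this p-value in ascending order
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical example: 2 of 2 subnetwork genes carry a term that
# annotates 3 of 10 genes in the background.
toy_p = hypergeom_pvalue(k=2, n=2, K=3, N=10)
toy_adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
```

Terms whose adjusted value falls below the chosen cutoff would then be carried forward as candidate core pathways.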
Structure-based analysis of driver mutations
Homology modeling and nsSNP mapping
Homology models of proteins are built using Modeller. Sequence alignments between these proteins and templates of known structures are obtained from a PSI-BLAST sequence search. The functional sites are predicted using SMAP [67, 68, 69]. Mutated residues are mapped onto the model structures and the functional roles of these residues are predicted according to their locations on the model structures.
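The residue mapping mentioned above depends on translating a position in the mutated protein into a residue number in the template structure through the sequence alignment. The sketch below shows that bookkeeping only, under the assumption that the alignment is available as two equal-length gapped strings; it does not use Modeller or PSI-BLAST, and the function name, the `template_start` parameter, and the toy alignment are hypothetical.

```python
def map_residue(aligned_query, aligned_template, query_pos, template_start=1):
    """Map a 1-based query position to a template residue number.

    aligned_query / aligned_template: equal-length strings with '-' for gaps.
    template_start: residue number of the first template residue in the alignment.
    Returns None if the query residue is aligned to a gap or is out of range.
    """
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    query_index = 0
    template_number = template_start - 1
    for q_char, t_char in zip(aligned_query, aligned_template):
        if q_char != "-":
            query_index += 1
        if t_char != "-":
            template_number += 1
        if q_char != "-" and query_index == query_pos:
            return template_number if t_char != "-" else None
    return None


# Hypothetical alignment: the query lacks one template residue (column 3)
# and has one extra residue at the end.
toy_query = "MK-LV"
toy_template = "MKAL-"
toy_mapped = map_residue(toy_query, toy_template, 3)
```

A mutation that maps to `None` falls in a region with no template coverage, so its structural role cannot be read directly from the model.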
Covariance analysis based on multiple sequence alignments of proteins in the same Pfam family as the mutated protein can help identify remote relationships between mutated residues and other residues within the protein sequence. The Pfam family is identified by a whole sequence search. Redundancy of sequences in the Pfam family is removed using CD-hit with a sequence identity threshold of 90%. Multiple sequence alignments among these sequences are built using the MUSCLE software with default parameters. Covariance of mutations with other residues is calculated using five different methods: Statistical Coupling Analysis (SCA); Explicit Likelihood of Subset Co-variation (ELSC); Observed Minus Expected Squared covariance algorithm (OMES); Mutual Information Covariance Algorithm (MI); and Conservation Algorithm (ConservationSum). Residues predicted by at least two methods to be coupled with a mutated residue are considered to have co-evolved with it.
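Two pieces of this procedure lend themselves to a small sketch: the mutual information between two alignment columns, and the rule that a residue counts as co-evolved when at least two methods agree. The code below is a simplified illustration, assuming the alignment is a list of equal-length strings and that each method has already produced a list of coupled positions; it does not reproduce SCA, ELSC, OMES, or ConservationSum, and the names and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    Sequences with a gap in either column are skipped.
    """
    pairs = [(seq[i], seq[j]) for seq in alignment if seq[i] != "-" and seq[j] != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    count_i = Counter(a for a, _ in pairs)
    count_j = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in Counter(pairs).items():
        p_ab = count / n
        # p_ab / (p_a * p_b) written with raw counts.
        mi += p_ab * log2(count * n / (count_i[a] * count_j[b]))
    return mi


def consensus_coupled(calls_by_method, min_methods=2):
    """Positions reported as coupled by at least min_methods methods."""
    votes = Counter(
        pos for positions in calls_by_method.values() for pos in set(positions)
    )
    return sorted(pos for pos, n_votes in votes.items() if n_votes >= min_methods)


# Hypothetical toy alignments: the first has perfectly coupled columns,
# the second has independent columns.
coupled_alignment = ["AC", "AC", "GT", "GT"]
independent_alignment = ["AC", "AT", "GC", "GT"]
toy_mi = mutual_information(coupled_alignment, 0, 1)

# Hypothetical per-method calls of positions coupled to one mutated residue.
toy_consensus = consensus_coupled({"SCA": [5, 9], "MI": [9, 12], "OMES": [12, 9]})
```

Requiring agreement between methods trades sensitivity for robustness, since each covariance measure responds differently to conservation and alignment depth.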
Availability of supporting data
The data sets supporting the results of this article are included within the article.
This work was supported by National Institutes of Health Grant GM63208, the CUNY High Performance Computing Center, the CUNY Research Foundation, and the Hunter President Fund. We thank the reviewers for their constructive comments.
The publication costs for this article were funded by the CUNY Research Foundation.
This article has been published as part of BMC Genomics Volume 14 Supplement 3, 2013: SNP-SIG 2012: Identification and annotation of SNPs in the context of structure, function, and disease. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S3
17. Sumazin P, Yang X, Chiu HS, Chung WJ, Iyer A, Llobet-Navas D, Rajbhandari P, Bansal M, Guarnieri P, Silva J, Califano A: An extensive microRNA-mediated network of RNA-RNA interactions regulates established oncogenic pathways in glioblastoma. Cell. 2011, 147: 370-381. 10.1016/j.cell.2011.09.041.
20. Blois MS: Information and Medicine: The Nature of Medical Descriptions. 1984, University of California Press
25. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447: 661-678. 10.1038/nature05911.
28. Bottomley MJ, Lo Surdo P, Di Giovine P, Cirillo A, Scarpelli R, Ferrigno F, Jones P, Neddermann P, De Francesco R, Steinkuhler C et al: Structural and functional analysis of the human HDAC4 catalytic domain reveals a regulatory structural zinc-binding domain. J Biol Chem. 2008, 283: 26694-26704. 10.1074/jbc.M803514200.
32. Kucherenko MM, Pantoja M, Yatsenko AS, Shcherbata HR, Fischer KA, Maksymiv DV, Chernyk YI, Ruohola-Baker H: Genetic modifier screens reveal new components that interact with the Drosophila dystroglycan-dystrophin complex. PLoS One. 2008, 3: e2418. 10.1371/journal.pone.0002418.
37. Fre S, Pallavi SK, Huyghe M, Lae M, Janssen KP, Robine S, Artavanis-Tsakonas S, Louvard D: Notch and Wnt signals cooperatively control cell proliferation and tumorigenesis in the intestine. Proc Natl Acad Sci USA. 2009, 106: 6309-6314. 10.1073/pnas.0900427106.
38. Boulter L, Govaere O, Bird TG, Radulescu S, Ramachandran P, Pellicoro A, Ridgway RA, Seo SS, Spee B, Van Rooijen N et al: Macrophage-derived Wnt opposes Notch signaling to specify hepatic progenitor cell fate in chronic liver disease. Nat Med. 2012, 18: 572-579. 10.1038/nm.2667.
48. Kowarsch A, Fuchs A, Frishman D, Pagel P: Correlated mutations: a hallmark of phenotypic amino acid substitutions. PLoS Comput Biol. 2010, 6
50. Bigham A, Bauchet M, Pinto D, Mao X, Akey JM, Mei R, Scherer SW, Julian CG, Wilson MJ, Lopez Herraez D et al: Identifying signatures of natural selection in Tibetan and Andean populations using dense genome scan data. PLoS Genet. 2010, 6
53. R Development Core Team: R: A language and environment for statistical computing. 2010, R Foundation for Statistical Computing
56. Rice JA: Mathematical Statistics and Data Analysis. 2006, Belmont, CA: Duxbury Press
58. Lin WY, Lee WC: Improving power of genome-wide association studies with weighted false discovery rate control and prioritized subset analysis. PLoS One. 7: e33716
59. Hu JX, Zhao H, Zhou HH: False Discovery Rate Control With Groups. J Am Stat Assoc. 105: 1215-1227.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.