Pathway networks generated from human disease phenome
Understanding the effect of human genetic variations on disease can provide insight into phenotype-genotype relationships, and has great potential for improving the effectiveness of personalized medicine. While some genetic markers linked to disease susceptibility have been identified, a large number are still unknown. In this paper, we propose a pathway-based approach to extend disease-variant associations and find new molecular connections between genetic mutations and diseases.
We used a compilation of over 80,000 human genetic variants with known disease associations from databases including the Online Mendelian Inheritance in Man (OMIM), Clinical Variance database (ClinVar), Universal Protein Resource (UniProt), and Human Gene Mutation Database (HGMD). Furthermore, we used the Unified Medical Language System (UMLS) to normalize variant phenotype terminologies, mapping 87% of unique genetic variants to phenotypic disorder concepts. Lastly, variants were grouped by UMLS Medical Subject Heading (MeSH) identifiers to determine pathway enrichment in Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways.
By linking KEGG pathways through underlying variant associations, we elucidated connections between the human genetic variant-based disease phenome and metabolic pathways, finding novel disease connections not otherwise detected through gene-level analysis. When looking at broader disease categories, our network analysis showed that large complex diseases, such as cancers, are highly linked by their common pathways. In addition, we found Cardiovascular Diseases and Skin and Connective Tissue Diseases to have the highest number of common pathways, among 35 significant main disease category (MeSH) pairings.
This study contributes to extending disease-variant connections and to identifying new molecular links between diseases. Novel disease connections were made by disease-pathway associations not otherwise detected through single-gene analysis. For instance, we found that mutations in different genes associated with Noonan Syndrome and Essential Hypertension share a common pathway. This analysis also provides the foundation to build novel disease-drug networks through their underlying common metabolic pathways, thus enabling new diagnostic and therapeutic interventions.
Keywords: Networks, Phenome, Disease mutations
Abbreviations
API: Application Program Interface
ClinVar: Clinical Variance database
CUI: Concept Unique Identifier
HGMD: Human Gene Mutation Database
HPO: Human Phenotype Ontology
KEGG: Kyoto Encyclopedia of Genes and Genomes
MeSH: Medical Subject Heading
OMIM: Online Mendelian Inheritance in Man
UMLS: Unified Medical Language System
UniProt: Universal Protein Resource
UTS: UMLS Terminology Server
Current repositories of disease-associated human genetic variants encompass over 4000 genes and 17,000 disease phenotypes, derived mostly from manual extraction of literature. Our team has compiled these variants from sources including the Online Mendelian Inheritance in Man (OMIM), Clinical Variance database (ClinVar), Universal Protein Resource (UniProt), and Human Gene Mutation Database (HGMD) [1, 5]. Our compilation contains over 80,000 human genetic variants and is the largest collection known to date. Yet despite the large number of genetic variants known to influence disease, only a small percentage of an individual genome is expected to map to known variants. To identify novel disease-variant associations, the functional context of known variants must be expanded.
In prior studies, human disease networks (e.g., the human diseasome) have been generated to link genetic disorders with disease genes, supporting disease-specific functional modules. Genes rarely work alone, often participating in complex pathways and synergistic interactions with other genes, proteins, and environmental factors that collectively influence the clinical manifestations of disease [7, 8, 9]. Analyzing known disease-associated variants through a network biology approach can provide insight into the relationship between diseases and underlying metabolic pathways, and can expand the functional context of our variants. This may result in novel disease associations not otherwise discovered through single-gene analysis. Existing research has shown that exploring genetic risk through a modular approach allows for more stable and robust results and improves the interpretation of molecular mechanisms underlying disease.
The first challenge in identifying disease-pathway associations from known variants is to correctly assign each variant phenotype to an ontology. A standardized terminology provides an accurate mapping of diseases to disease-specific categories and facilitates the identification of enriched pathway associations. Thus, the first step in this work was to map variant phenotypes to the Unified Medical Language System (UMLS) to build a disease phenome. The UMLS leverages its Metathesaurus to integrate a wide variety of phenotype synonyms from terminologies like OMIM and the Human Phenotype Ontology (HPO). Once variants were mapped, they were clustered into higher-level categories via Medical Subject Heading (MeSH) identifiers, providing a broader view of the variant relationships.
To identify disease-pathway associations, mapped variants were linked at both the disease and pathway level. Our work builds on the approach taken in Goh et al., expanding gene-based connections to their corresponding networks of human variant-derived Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. Network representations were used, illustrating the power of our approach by providing a more complete view of human disease-variant associations. Our networks produced over 1300 novel disease associations at the pathway level, not otherwise connected at the gene level. This provides potential new insight into the relationship between metabolic pathways and interacting drugs, and may lead to improved drug efficacy and new potential drug targets for repurposing. These associations will constitute an important resource in the future development of computational tools to optimize patient diagnosis and disease treatment.
Mapping human variants to UMLS
We mapped the Peterson et al. compilation of over 80,000 human disease variants, which unifies specific genetic data from databases including OMIM, ClinVar, UniProt, and HGMD, to disorder concepts in the UMLS Metathesaurus. This mapping was performed using exact and normalized string matching functions from the UMLS Terminology Server (UTS) Application Program Interface (API). Variants were categorized by two different classifications: primarily by Concept Unique Identifier (CUI) and secondarily by MeSH. Each CUI groups all disease phenotypes equivalent to the same concept, linking those that may use a different phenotype terminology but ultimately define the same entity. MeSH provides a hierarchical set of descriptors, allowing for a controlled vocabulary and a broader level of specificity when searching for disease terms. Both CUI and MeSH classifications allow human variants to be clustered based on their disease descriptions, providing an important link between genotype and phenotype.
Human variants that could not be mapped automatically to UMLS were manually curated. Curators used the UMLS Metathesaurus browser to search unique phenotypes and variations of their terminology after removing modifiers and other fragments deemed unessential to the concept. Phenotypes were then classified into three types, per the amount of manual manipulation necessary to find a matching CUI (Additional file 1: Figure S1). Variants categorized as Type I were matched with little to no modification; Type II categorization required a moderate amount of manipulation; and Type III represents variant phenotypes unable to be mapped. Once completed, this manual mapping will allow for Type I and Type II variants to be mapped to a higher-order MeSH classification, increasing the number of variants in our further analysis at the pathway level.
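The two-stage lookup described above (exact match first, then normalized match) can be sketched as follows. This is an illustrative stand-in only: it does not call the UTS API, the in-memory `LEXICON` is a hypothetical two-entry dictionary (using the two CUIs quoted later in this paper), and the `normalize` rule (lowercase, strip punctuation, sort words) is an assumed simplification of normalized string matching.

```python
import re

# Hypothetical mini-lexicon: concept name -> CUI. A real run would query
# the UMLS Metathesaurus; these two CUIs are the ones quoted in this paper.
LEXICON = {
    "Noonan Syndrome": "C0028326",
    "Essential Hypertension": "C0085580",
}


def normalize(term: str) -> str:
    """Lowercase, replace punctuation with spaces, and sort the words."""
    words = re.sub(r"[^a-z0-9]+", " ", term.lower()).split()
    return " ".join(sorted(words))


# Built once so that a normalized lookup is a single dictionary access.
NORMALIZED_INDEX = {normalize(name): cui for name, cui in LEXICON.items()}


def map_phenotype(term: str):
    """Return (cui, match_type); match_type is 'exact', 'normalized', or None."""
    if term in LEXICON:
        return LEXICON[term], "exact"
    cui = NORMALIZED_INDEX.get(normalize(term))
    if cui is not None:
        return cui, "normalized"
    return None, None  # left for manual curation


if __name__ == "__main__":
    for phenotype in ("Noonan Syndrome", "hypertension, essential", "unknown"):
        print(phenotype, "->", map_phenotype(phenotype))
```

Phenotypes that fall through both stages correspond to the variants passed on to manual curation.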
Functional enrichment of variants
Mapped human variants were clustered into higher-level disease categories based on MeSH terms. For each grouping associated with at least three genes, the genes carrying variants mapped to that MeSH term were tested for enrichment in KEGG pathways (one-tailed Fisher exact test, p ≤ 0.0001). MeSH categories enriched in at least three pathways were selected for network construction.
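As a rough illustration of the enrichment step, the one-tailed Fisher exact test for over-representation is equivalent to the upper tail of a hypergeometric distribution. The sketch below uses only the standard library; the background gene universe size and the example counts are assumptions made for illustration and are not values reported in this study.

```python
from math import comb

P_CUTOFF = 0.0001      # significance threshold stated in the text
MIN_GROUP_GENES = 3    # groups with fewer genes are not tested


def fisher_one_tailed(k: int, n: int, K: int, N: int) -> float:
    """Upper-tail probability P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: genes in the background universe (assumed; not given in the text)
    K: background genes annotated to the pathway
    n: genes in the MeSH group
    k: MeSH-group genes annotated to the pathway
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def is_enriched(k: int, n: int, K: int, N: int) -> bool:
    """Apply the group-size filter and the p-value cutoff."""
    if n < MIN_GROUP_GENES:
        return False
    return fisher_one_tailed(k, n, K, N) <= P_CUTOFF


if __name__ == "__main__":
    # Hypothetical counts: 5 of 10 group genes fall in a 100-gene pathway,
    # against an assumed background of 20,000 genes.
    print(fisher_one_tailed(5, 10, 100, 20000))
    print(is_enriched(5, 10, 100, 20000))
```

Exact integer arithmetic with `math.comb` avoids floating-point underflow for very small tail probabilities; a statistics library would normally be used for large batches of tests.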
Mapping of variants to a phenotype ontology
Pathway enrichment of variants and construction of networks by MeSH term
Table 1 Simple parameter network analysis of disease-level and pathway-level networks. Parameters reported: number of nodes, number of edges, characteristic path length, and average number of neighbors.
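For readers unfamiliar with these four network parameters, the following minimal sketch computes them for an undirected, unweighted network given as an edge list. It assumes the usual definitions (characteristic path length as the mean shortest-path length over all connected node pairs, average number of neighbors as the mean node degree) and is not the analysis tool used in this study.

```python
from collections import deque


def network_parameters(edges):
    """Compute simple parameters of an undirected graph without self-loops."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    n_nodes = len(adjacency)
    n_edges = sum(len(nbrs) for nbrs in adjacency.values()) // 2

    # Breadth-first search from every node gives all shortest path lengths.
    total_length, connected_pairs = 0, 0
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in adjacency[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total_length += sum(dist.values())
        connected_pairs += len(dist) - 1  # reachable nodes other than source

    return {
        "nodes": n_nodes,
        "edges": n_edges,
        "characteristic_path_length": (
            total_length / connected_pairs if connected_pairs else 0.0
        ),
        "avg_neighbors": 2 * n_edges / n_nodes if n_nodes else 0.0,
    }


if __name__ == "__main__":
    # Toy path graph A-B-C-D, used only to exercise the function.
    print(network_parameters([("A", "B"), ("B", "C"), ("C", "D")]))
```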
Using UMLS string matching algorithms, we mapped the Peterson et al. collection of human variants to the UMLS phenotype ontology, linking almost 70,000 (87%) unique human genetic variants to disorder concepts. UMLS offers wide coverage of variants and higher-level disease categorization through MeSH. This allowed us to cluster phenotypes into broader categories and make inferences about each subgroup. Unique genetic variants associated with phenotypes that could not be mapped automatically to UMLS were run through a manual curation protocol, through which 56% of variants have currently been manually curated. Mapping to UMLS allowed human variants to be grouped together based on the same and/or similar phenotype, alleviating many of the difficulties caused by the lack of a standardized vocabulary.
Disease-level and pathway-level bipartite networks were constructed using KEGG pathway enrichment, linking human genetic variants by common pathways or CUIs. In both networks, one central cluster of nodes was the most highly connected. The nodes in this cluster generally encompassed pathways involved in essential processes (i.e., reproduction, survival), many of which may be altered in and/or directly linked to cancer. Network analysis of the pathway-level network showed that KEGG Pathways in Cancer was highly connected within the central cluster, with the highest node degree of 52 (Table 1).
The disease-level network contained 223 CUIs connected by 2548 unique disease associations through gene- or pathway-level analysis. Of these, 1338 (53%) connections are only observed through disease-pathway associations and not otherwise connected at the gene level. Additionally, 461 (18%) connections are observed through disease-gene associations only, and 741 (29%) connections are observed through both gene- and pathway-level associations. When CUIs are connected through both common genes and common pathways, the resulting disease associations function as confidence builders with a higher level of evidence to support the connection. Hypoglycaemia, hyperinsulinaemic (C1864903) and Diabetes, type 2 (C0011860) were connected through the genes HNF1A, ABCC8, HNF4A, and GCK, as well as the KEGG pathway Type II Diabetes Mellitus. This association is expected, as hypoglycemia is known to affect type II diabetes patients near insulin-deficiency. CUI connections made through pathways but not through genes extend the functional context of variants and provide new potential disease associations. Noonan Syndrome (C0028326) and Essential Hypertension (C0085580) were connected through the KEGG pathway Vascular Smooth Muscle Contraction, despite associated variants not having any common genes in our repository. A common symptom of Noonan syndrome is hypertrophic cardiomyopathy, which in turn is highly related to hypertension and often occurs in conjunction with it in elderly patients, suggesting a logical connection between Noonan Syndrome and Essential Hypertension concepts in our network.
As shown in Fig. 7, comparison of Cardiovascular (C14) and Skin and Connective Tissue (C17) networks shows high overlap in the largest cluster of KEGG pathways, which includes basic cellular functions such as cell signaling, growth, and maintenance. Many of these kinds of pathways are also altered in different types of cancer, as seen by the connections and enrichment of cancerous pathways in the main network cluster. A few examples include Melanoma, MAPK Signaling Pathway, and Pathways in Cancer. The high similarity between C14 and C17 is to be expected, as many cardiac disorders involve the connective tissue within/surrounding the heart, and relationships have been observed between normal development of connective tissue and the cardiovascular system. Comparison of Hemic and Lymphatic (C15) and Immune System (C20) networks shows high overlap in a cluster of immunological KEGG pathways, including Primary Immunodeficiency, Type I Diabetes Mellitus, and Autoimmune Thyroid Disease. This intersection is also expected to be significant, as lymphatic diseases are highly linked to immune surveillance and adaptation.
Our next step is to continue analyzing human genetic variants at different levels of clustering, expanding our classifications and extending functional context to find new disease connections. If a pathway is found to link to multiple diseases, a drug being used to treat one disease could potentially be repurposed to treat another disease connected at the same pathway level. In addition, if a disease is found to link to multiple pathways, a patient with this disease may benefit from a pathway-guided combination therapy. With the addition of patient data, variant-based disease-pathway associations can be compared across individuals and provide a platform for incorporating new variant data into our database. In the future, this will allow us to develop computational tools that facilitate the optimization of personalized diagnosis, prognosis, and disease treatment.
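A bipartite disease-pathway network of this kind can be sketched with two adjacency maps, one per node type, from which node degree follows directly. The four CUI-pathway pairs below are the associations named in this paper; the data structure, function names, and everything else are illustrative assumptions rather than the authors' implementation.

```python
from collections import defaultdict

# (CUI, KEGG pathway) associations mentioned in the text. A full run would
# contain every significant enrichment result.
ASSOCIATIONS = [
    ("C0028326", "Vascular Smooth Muscle Contraction"),  # Noonan Syndrome
    ("C0085580", "Vascular Smooth Muscle Contraction"),  # Essential Hypertension
    ("C0011860", "Type II Diabetes Mellitus"),           # Diabetes, type 2
    ("C1864903", "Type II Diabetes Mellitus"),           # Hypoglycaemia
]


def build_bipartite(associations):
    """Return (cui -> pathways, pathway -> cuis) adjacency maps."""
    cui_to_pathways = defaultdict(set)
    pathway_to_cuis = defaultdict(set)
    for cui, pathway in associations:
        cui_to_pathways[cui].add(pathway)
        pathway_to_cuis[pathway].add(cui)
    return cui_to_pathways, pathway_to_cuis


def node_degrees(neighbors):
    """Degree of each node = number of distinct neighbors on the other side."""
    return {node: len(nbrs) for node, nbrs in neighbors.items()}


def highest_degree(neighbors):
    """Return (node, degree) for the most connected node; ties broken by name."""
    degrees = node_degrees(neighbors)
    return max(degrees.items(), key=lambda item: (item[1], item[0]))


if __name__ == "__main__":
    _, pathway_to_cuis = build_bipartite(ASSOCIATIONS)
    print(node_degrees(pathway_to_cuis))
```

Under this representation, the pathway with the highest degree is the one linked to the most disease concepts.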
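The three-way split described above (pathway only, gene only, both) amounts to two set intersections per disease pair. The sketch below reuses the two examples from the text; the gene labels `GENE_A` and `GENE_B` are hypothetical placeholders, since the genes underlying the Noonan Syndrome and Essential Hypertension variants are not listed here, and the function itself is an assumed reconstruction of the logic.

```python
from itertools import combinations

# Genes per CUI. HNF1A, ABCC8, HNF4A, GCK come from the text; GENE_A and
# GENE_B are placeholders standing in for non-overlapping gene sets.
GENES_BY_CUI = {
    "C1864903": {"HNF1A", "ABCC8", "HNF4A", "GCK"},
    "C0011860": {"HNF1A", "ABCC8", "HNF4A", "GCK"},
    "C0028326": {"GENE_A"},
    "C0085580": {"GENE_B"},
}

# Enriched KEGG pathways per CUI, as named in the text.
PATHWAYS_BY_CUI = {
    "C1864903": {"Type II Diabetes Mellitus"},
    "C0011860": {"Type II Diabetes Mellitus"},
    "C0028326": {"Vascular Smooth Muscle Contraction"},
    "C0085580": {"Vascular Smooth Muscle Contraction"},
}


def classify_pairs(genes_by_cui, pathways_by_cui):
    """Label each connected CUI pair as 'both', 'pathway_only', or 'gene_only'."""
    labels = {}
    cuis = sorted(set(genes_by_cui) | set(pathways_by_cui))
    for a, b in combinations(cuis, 2):
        share_gene = bool(genes_by_cui.get(a, set()) & genes_by_cui.get(b, set()))
        share_pathway = bool(
            pathways_by_cui.get(a, set()) & pathways_by_cui.get(b, set())
        )
        if share_gene and share_pathway:
            labels[(a, b)] = "both"
        elif share_pathway:
            labels[(a, b)] = "pathway_only"
        elif share_gene:
            labels[(a, b)] = "gene_only"
        # Pairs sharing neither a gene nor a pathway are not connected.
    return labels


if __name__ == "__main__":
    for pair, label in classify_pairs(GENES_BY_CUI, PATHWAYS_BY_CUI).items():
        print(pair, label)
```

Pairs labeled `pathway_only` are the candidates for novel associations, while `both` corresponds to the higher-confidence connections.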
Expanding our view of the human diseasome to include human variant-derived KEGG pathway associations allowed for an extended functional view of disease-variant associations. Novel disease connections were made by disease-pathway associations not otherwise detected through single-gene analysis. This shows that seemingly unrelated disease variants can be associated at the pathway level. Additionally, this type of analysis provides new relationships between metabolic pathways and disease-drug networks, potentially enabling novel diagnostic and therapeutic interventions.
A special thanks to Veneeth Antony, Dr. Olivier Bodenreider, Thomas Coard, Ashley Funai, and Dr. Thomas Peterson for helpful discussion and comments. We would also like to thank the 2017 Translational Bioinformatics Conference organizers and reviewers for their helpful feedback.
The National Institutes of Health Grants 1K22CA143148 to M.G.K. (PI) and R01LM009722 to M.G.K. (coPI); NSF (award #1446406, PI: M.G.K.); and the American Cancer Society, American Cancer Society Institutional Research Awards grant to M.G.K. (PI). This investigation was sponsored, in part, by NIH/NIGMS MARC U*STAR T34 08663 National Research Service Award to UMBC (A.G.C.), and through an Undergraduate Research Award from the UMBC Office of Undergraduate Education (A.G.C.). The cost of publication was funded by NIH/NIGMS MARC U*STAR.
Availability of data and materials
Partial datasets generated and/or analyzed during the current study are available in the Human-Disease-Phenome repository (https://anncir1.github.io/Human-Disease-Phenome/). Full datasets are available from the corresponding author on reasonable request.
About this supplement
This article has been published as part of BMC Medical Genomics Volume 11 Supplement 3, 2018: Selected articles from the 7th Translational Bioinformatics Conference (TBC 2017): medical genomics. The full contents of the supplement are available online at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume-11-supplement-3.
Conceived and designed the experiments: AGC, MGK; Performed the experiments: AGC, KLC; Analyzed the data: AGC, MGK; Wrote the paper: AGC, MGK. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 3. About ClinVar. [cited 2013 February 3rd]; Available from: www.ncbi.nlm.nih.gov/clinvar/
- 16. Cryer PE, Davis SN, Shamoon H. Hypoglycemia in diabetes. Diabetes Care. 2003;26:1902–12. Retrieved from http://care.diabetesjournals.org/content/diacare/26/6/1902.full.pdf
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.