An integrative network-based approach to identify novel disease genes and pathways: a case study in the context of inflammatory bowel disease
Associations between genes and diseases are diverse and complex, and finding the causal associations between genes and specific diseases remains challenging. In this work we present a method to predict novel associations of genes and pathways with inflammatory bowel disease (IBD) by integrating information on differential gene expression, protein-protein interactions and known disease genes related to IBD.
We downloaded IBD gene expression data from NCBI’s Gene Expression Omnibus, performed statistical analysis to determine differentially expressed genes, and collected known IBD genes from the DisGeNet database; these were used to construct an IBD-related PPI network with the HIPPIE database. We adapted our graph-based clustering algorithm DPClusO to cluster the disease PPI network. Using Fisher’s exact test, we evaluated the statistical significance of the identified clusters in terms of their enrichment with IBD genes, and we predicted novel genes related to IBD. We showed that 93.8% of our predictions are supported by other databases and the published literature related to IBD.
Finding disease-causing genes is necessary for developing drugs with synergistic effect targeting many genes simultaneously. Here we present an approach to identify novel disease genes and pathways and discuss our approach in the context of IBD. The approach can be generalized to find disease-associated genes for other diseases.
Keywords: Disease gene; Inflammatory bowel disease; Gene expression; Protein-protein interaction
AUC: Area under the curve
CTD: The Comparative Toxicogenomics Database
DEG: Differentially expressed gene
GWAS: Genome-wide association studies
HIPPIE: Human Integrated Protein-Protein Interaction rEference
IBD: Inflammatory bowel disease
IDEG: IBD related differentially expressed gene
MF: Molecular function
ODEG: Only differentially expressed gene
SNPs: Single nucleotide polymorphisms
Inflammatory bowel disease (IBD) causes chronic inflammation of some or all parts of the digestive tract. There are two major subtypes of IBD: ulcerative colitis (UC) and Crohn’s disease (CD). Both types usually involve severe diarrhea, pain, fatigue and weight loss. IBD can become severe and can lead to life-threatening complications. IBD remains incurable because no suitable drugs and targets for curing the disease are available.
IBD is an idiopathic, chronic and often disabling inflammatory disorder of the gastrointestinal tract characterized by a dysregulated mucosal immune response. IBD can result in life-threatening bleeding, sepsis and bowel obstruction. The pathogenesis of IBD is still elusive and needs to be understood in order to develop a cure. Genome-wide association studies (GWAS) have significantly advanced our understanding of the importance of genetic susceptibility in IBD. The GWAS performed to date, together with a meta-analysis of several GWAS, have identified a total of 163 IBD loci. These studies mainly focused on common genetic variants (single nucleotide polymorphisms (SNPs)). These risk loci are associated with a handful of candidate genes that have small contributory effects in IBD.
Significant interest has developed in new methods that integrate omics data to identify disease-causing genes. For example, network-based classification approaches have been developed to integrate gene expression and protein interaction data to predict breast cancer metastasis [2, 3], multiple sclerosis relapse and remission, and autoimmune disease. Other studies have identified subnetwork modules by integrating protein interaction data with GWAS signals for complex diseases.
During the past decade, a large amount of biological data has been generated from various large-scale omics studies, prompting the scientific community to seek deeper insight into the underlying biological mechanisms of different diseases. One topic of interest is finding disease-gene associations. Broadly speaking, a disease-gene association can be a connection reported in the literature, such as a genetic association (i.e., mutations in a given gene may lead to a specific disease), or one inferred from other sources. Similarities between disease symptoms and gene functions can be used to predict disease-causing genes by text mining. The human diseasome was constructed by connecting diseases to shared disease-causing genes. Disease relationships have been explored using different types of omics data such as biological pathways, transcriptome data [11, 12], biomedical ontologies [13, 14], and genome-wide association study (GWAS) data [14, 15, 16, 17]. Recently, large-scale biological data have been analyzed based on networks, and network topology has been utilized to provide insights into diseases and their associations with genes [9, 18, 19, 20]. Because the interactions between bio-molecules play crucial roles in the cell, the topology of biological networks is likely to have various biological and clinical applications [21, 22].
Cellular functions rely on the coordinated actions of multiple genes, proteins, and metabolites. Therefore, organizing biological information in the context of networks is important for a deep understanding of biological systems. Discovery of modules in biological networks helps isolate systems with disease related properties and reduces interactome complexity. Proteins rarely act alone, as their functions tend to be regulated. Many molecular processes within a cell are carried out by molecular machines that are built from a large number of protein components organized by their protein-protein interactions (PPIs). The disease proteins (the products of disease genes) are not scattered randomly in the interactome but tend to interact with each other. Because disease gene and PPI data are incomplete, the known disease genes usually fail to form observable modules in PPI networks. Out of 299 diseases, the respective known disease genes form some type of module for only 20%. To compensate for such gaps to a certain extent, in the present work we focus on finding novel IBD associated genes and pathways by integrating IBD gene expression, PPIs, and known IBD genes, adapting the DPClusO network clustering algorithm we published previously.
Results and Discussion
Construction of a disease relevant PPI network
We initially downloaded 866 genes reported in the DisGeNet database as IBD genes. We found that 318 of the 866 IBD genes are among the 4477 differentially expressed genes (DEGs) we identified from gene expression analysis. We refer to these 318 genes as IBD related differentially expressed genes (IDEGs) and to the remaining 4159 as only differentially expressed genes (ODEGs). In this work we consider these 318 genes as the known IBD genes.
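The split of the DEGs into IDEGs and ODEGs is a simple set operation. The following minimal sketch illustrates it with placeholder gene names; the function name `partition_degs` and the toy identifiers are assumptions for illustration, not part of the original analysis.

```python
def partition_degs(degs, known_disease_genes):
    """Split DEGs into IDEGs (DEGs that are known disease genes)
    and ODEGs (DEGs that are not known disease genes)."""
    degs = set(degs)
    known = set(known_disease_genes)
    idegs = degs & known   # differentially expressed AND known disease gene
    odegs = degs - known   # differentially expressed only
    return idegs, odegs


# Toy example with placeholder identifiers (not real data).
toy_degs = {"geneA", "geneB", "geneC", "geneD"}
toy_known = {"geneA", "geneB", "geneX"}  # geneX is known but not a DEG
toy_idegs, toy_odegs = partition_degs(toy_degs, toy_known)
```

The two sets are disjoint and together cover all DEGs, which matches the counts reported above: 318 IDEGs plus 4159 ODEGs give 4477 DEGs.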
Clustering of the PPI network
After creating the disease related PPI network, we determined clusters in the network using the DPClusO algorithm. DPClusO generates overlapping clusters and ensures coverage; that is, each node belongs to at least one cluster. We hypothesize that clustering a disease relevant PPI network helps isolate systems with disease related properties, and therefore that statistically significant clusters enriched with known IBD genes can be used to predict novel IBD genes and pathways based on the associations determined by the combined information of IBD gene expression and protein-protein interactions.
Characteristics of the clusters generated with different input densities using the DPClusO algorithm based on the IBD related PPI network
Prediction and validation
We predicted 909 genes (with adjusted p-value < 0.05) included in the clusters selected from the set corresponding to the highest AUC as our predicted IBD genes. These 909 genes exclude the genes considered as known IBD genes (IDEGs) in this work. The list of the 909 predicted IBD genes and the corresponding adjusted p-values are shown in Additional file 1. To validate our results, we first counted how many of the predicted genes exactly match well curated known IBD genes. We found that 83, 8, 54 and 22 of the predicted genes matched reported IBD genes in (1) HuGENet, (2) CTD, (3) DisGeNet and (4) GWAS results, respectively. After accounting for overlap between these sources, 14.5% of our predicted genes matched good quality known IBD genes. Given that we made predictions based only on one specific gene expression data set and a limited set of known IBD genes, the 14.5% match with good quality data is significant (p-value < 3.45×10^-12, determined from the hypergeometric distribution assuming a total of 20000 human genes). However, our approach is computational, so it is reasonable to also compare our results with computationally predicted IBD genes. In addition to its curated set, the CTD database contains a large set of genes inferred as IBD genes by various methods. When we compare our results with this larger set, we find that 93.8% of the genes we predicted match reported IBD genes (p-value < 9.8×10^-14). Because we predicted the genes by integrating gene expression and protein-protein interaction information, it is very likely that they are truly related to IBD. One of the predicted genes, IL12B, is supported by all four above-mentioned sources as an IBD related gene. IL12B and IL23R have been identified as susceptibility genes for IBD by recent genome-wide association studies.
Each of the three genes CCR5, IL1R2 and LTA is mentioned as an IBD related gene in three of the above-mentioned sources. High expression of CCR5 has been reported in active IBD. Epithelial IL1R2 takes part in homeostatic regulation during remission of ulcerative colitis. It has been reported that LTA elicits a strong inflammatory reaction controlled by intestinal dendritic cells. In the same way, we found evidence for the IBD relevance of many other predicted genes by literature review. The proposed method, however, is a computational one, and the role of the newly predicted genes in IBD pathogenesis should be clarified by further studies.
The degree of relevance of the 909 genes (shown in Additional file 1) predicted by the proposed approach can be evaluated by the corresponding p-values. The top 20 predicted novel IBD genes (not reported in any of the four sources of Fig. 4) based on p-values are IKBKG, BIRC3, BCL10, RNF31, RBCK1, CCRL1, LAMC3, CARD11, KISS1, THBS2, TRAF2, TRAF1, PYCARD, MIS12, ALB, AR, RIPK1, SHARPIN, SNAPIN and ITGA2B. Many of these top 20 genes have been reported in the literature to be associated with IBD. In humans, the IKBKG gene encodes NF-κB essential modulator (NEMO), also known as inhibitor of nuclear factor κB kinase subunit gamma (IKK-γ). NEMO (IKK-γ) is the regulatory subunit of the IκB kinase (IKK) complex, which activates NF-κB, leading to the activation of genes involved in inflammation, immunity, cell survival, and other pathways. IKBKG has been linked to the development of IBD-like immunopathology. BIRC2 and BIRC3 are important genes in regulating the expression of proinflammatory cytokines, such as TNF-α, through the NF-κB and MAPK pathways. BCL10 is an adaptor protein that is assumed to play a role in the PAF-induced inflammatory pathway in human intestinal epithelial cells. The complex of RNF31 and HOIL-1L functions in linear ubiquitination of proteins in the NF-κB pathway in response to proinflammatory cytokines. CCRL1 acts as a functional receptor for the monocyte chemoattractant protein family of chemokines; elevated chemokine expression is associated with many inflammatory diseases such as IBD, rheumatoid arthritis and asthma [44, 45]. As a component of the LUBAC complex, RBCK1 conjugates linear (Met1-linked) polyubiquitin chains to substrates and thus plays an important role in NF-κB activation and inflammation regulation. RBCK1 deficiency is associated with autoinflammatory syndrome and immunodeficiency. LAMC3 is expressed at significantly different proportions in low- and high-coherence expression profiles of IBD patients.
Elevated levels of the stromal protein thrombospondin-2 (THBS2) have been reported to be part of a fibroblast-specific inflammation signature. It has been shown that TRAFs are important mediators of innate immune receptor signaling. Overexpression of TRAF2 is associated with IBD and IBD recurrence [50, 51, 52]. TRAF1 is reported to be highly expressed in IBD patients. To form the basic inflammasome subunit, the adaptor protein ASC (encoded by the PYCARD gene) links the NLR sensor to caspase-1. TNF-α-induced necroptosis is associated with two members of the receptor-interacting protein (RIP) family of kinases, RIPK1 and RIPK3. Tumor necrosis factor-α (TNF-α) can bind to one of two receptors, TNFR1 or TNFR2; TNFR activation results in the activation of NF-κB, leading to the induction of proinflammatory cytokines.
Comparison with ToppGene
Table: Results of comparison with ToppGene (surviving labels: parameter of comparison, PageRank with priors, HITS with priors, number of matches)
Gene ontology and pathway analysis
As a group, the top 20 predicted genes (listed in the previous section) are enriched in some important BP (Biological Process) related GO terms, such as I-κB kinase/NF-κB signaling, positive regulation of immune response and regulation of tumor necrosis factor-mediated signaling pathway, and in MF (Molecular Function) terms, such as ubiquitin protein ligase binding and identical protein binding. We also performed enrichment analysis for all of the 909 genes. Some significant BP related GO terms enriched in these genes are nitrogen compound metabolic process, response to stimulus, immune system process, cell surface receptor signaling pathway, response to stress, response to lipid and positive regulation of leukocyte cell-cell adhesion; the enriched MF terms are enzyme regulator activity, kinase activity, protein complex binding, histone deacetylase binding, transcription factor activity, protein binding and protein C-terminus binding. The NF-κB pathway mediates events including the activation of genes encoding inflammatory molecules and is found to be chronically active in IBD. All of the above-mentioned GO terms associated with a group of genes were found using the enrichment analysis tool provided on the web page of the Gene Ontology Consortium.
Based on significant p-values, we empirically selected some enriched BP and MF terms for these clusters. Some important BP related GO terms enriched in clusters (a)-(f) are as follows: (a) cell surface receptor signaling pathway, regulation of cellular response to insulin stimulus, cellular response to hormone stimulus, (b) negative regulation of programmed cell death, response to endogenous stimulus, cell differentiation, (c) regulation of cytokine production, intracellular signal transduction, regulation of type I interferon production, (d) toll-like receptor signaling pathway, activation of innate immune response, inflammatory response, (e) regulation of transcription from RNA polymerase II promoter, negative regulation of transcription, DNA-templated, negative regulation of nitrogen compound metabolic process, (f) chemotaxis, inflammatory response, positive regulation of MAPK cascade. The MF related GO terms are as follows: (a) phosphatidylinositol 3-kinase binding, insulin receptor binding, receptor binding, (b) transcription factor binding, regulatory region DNA binding, chromatin binding, (c) transcription factor activity, sequence-specific DNA binding, chromatin binding, (d) signal transducer activity, Toll-like receptor binding, (e) SUMO transferase activity, ubiquitin-like protein ligase binding, (f) G-protein coupled receptor binding, cytokine receptor activity.
MAPK signaling pathways are evolutionarily conserved kinase modules whose function is to transmit extracellular signals to the intracellular machinery that manages fundamental cellular processes such as growth, differentiation, migration, proliferation and apoptosis. Activation of ERK1/2 by growth factors depends on the MAPKKK c-Raf, but other MAPKKKs may activate ERK1/2 in response to pro-inflammatory stimuli. Small chemoattractant peptides called chemokines provide directional cues for cell trafficking and are therefore important for the protective host response. They are soluble factors that play key roles in regulating immune cell recruitment during inflammatory responses and defense against foreign pathogens. Soluble extracellular proteins or glycoproteins known as cytokines are crucial intercellular regulators and mobilizers of cells involved in innate as well as adaptive inflammatory host defenses, cell death, cell growth, angiogenesis, differentiation, and development and repair processes aimed at restoring homeostasis. It has been reported that cytokines and chemokines are engaged not only in the initiation but also in the persistence of pathologic pain by activating nociceptive sensory neurons. Some inflammatory cytokines are engaged in nerve-injury/inflammation-induced central sensitization and are associated with the development of contralateral hyperalgesia/allodynia [71, 72]. Toll-like receptors (TLRs) are a family of pattern recognition receptors that are best known for their role in host defence against infection. It has been reported that TLRs play an important role in maintaining tissue homeostasis by regulating the inflammatory responses to injury. The intracellular NOD-like receptor (NLR) family contains more than 20 members in mammals and plays a pivotal role in the recognition of intracellular ligands. Activated caspase-1 regulates maturation of the pro-inflammatory cytokines IL-1B and IL-18 and drives pyroptosis.
We presented a method for predicting IBD related genes and pathways by integrating IBD gene expression data, protein-protein interactions and a set of known IBD genes from the DisGeNet database. We determined differentially expressed genes (DEGs) based on IBD gene expression data and constructed an IBD relevant PPI network using DEGs and known IBD genes. We extracted high density modules from the PPI network using our graph clustering algorithm DPClusO. We determined the enrichment of modules with known IBD genes by Fisher’s exact test and used the statistically significant modules to predict novel IBD genes and pathways. We compared our results with several other databases and the published literature, and found that 93.8% of our predictions appear in these published results. In particular, our results substantially matched the IBD genes collected in curated databases and high-profile publications.
Furthermore, based on our ranking score, we selected the top 20 predicted novel IBD genes, and by literature survey we observed that most of these genes are substantially related to IBD. As a group these 20 genes are enriched in the GO term I-κB kinase/NF-κB signaling. The NF-κB pathway mediates events including the activation of genes encoding inflammatory molecules and is found to be chronically active in IBD. Also, based on statistically significant clusters, we identified the top 10 IBD related pathways, which include the MAPK signaling pathway, the chemokine signaling pathway and cytokine-cytokine receptor interaction. These pathways play roles in inflammation related diseases including IBD.
Finding disease-causing genes is part of the process of understanding disease mechanisms and developing drugs that can provide synergistic effects by targeting many genes or proteins simultaneously. This study discussed a computational approach to these goals in the context of IBD. The proposed method can also be applied to find disease-causing genes for other diseases.
Data collection and preprocessing
We downloaded the IBD gene expression data from NCBI’s Gene Expression Omnibus (GSE57945). The gene expression data was generated using TopHat. The samples were collected for three biological groups: healthy control, Crohn’s disease and ulcerative colitis. We removed genes with expression values equal to zero across all samples. The final expression data set included 14664 genes and 322 samples, comprising 42 control samples, 218 CD samples, and 62 UC samples. We also downloaded reported IBD genes from several other databases: the Comparative Toxicogenomics Database (CTD), DisGeNet and HuGENet. The protein-protein interaction data was downloaded from the HIPPIE database.
Identifying differentially expressed genes
We performed differential expression analysis using the R package edgeR, which is based on negative binomial models. After applying trimmed mean of M-values (TMM) normalization [77, 78] to the data, we used the edgeR exact test for a difference in mean between two groups of negative binomial random variables. The false discovery rate (FDR) was estimated from unadjusted p-values using the Benjamini-Hochberg multiple testing method [34, 79].
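The Benjamini-Hochberg adjustment mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the R code used in the analysis; the function name `benjamini_hochberg` and the example p-values are assumptions for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value (rank m) down to the smallest (rank 1),
    # keeping a running minimum so adjusted values are monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative p-values only (not taken from the study).
example_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A gene would then be called differentially expressed when its adjusted p-value falls below the chosen FDR threshold.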
Network clustering by DPClusO
Here N_k is the number of nodes in cluster k, and E_nk is the total number of edges between node n and the nodes of cluster k.
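To make the quantities N_k and E_nk concrete, the following sketch computes a cluster density and a node's cluster property. It assumes the definitions commonly given for the DPClus family of algorithms, namely density d_k = 2|E| / (N_k (N_k - 1)) and cluster property cp_nk = E_nk / (d_k N_k); these formulas are not reproduced in the text above and are an assumption here, as are the function names and the toy graph.

```python
def cluster_density(nodes, edges):
    """Density of a cluster: internal edges divided by the maximum possible.
    `edges` is a collection of unique undirected (u, v) pairs."""
    nodes = set(nodes)
    n_k = len(nodes)
    if n_k < 2:
        return 0.0
    internal = sum(1 for u, v in edges if u != v and u in nodes and v in nodes)
    return 2.0 * internal / (n_k * (n_k - 1))


def cluster_property(node, nodes, edges):
    """Assumed DPClus-style cluster property of an outside node `node`
    with respect to the cluster `nodes`: E_nk / (d_k * N_k)."""
    nodes = set(nodes)
    d_k = cluster_density(nodes, edges)
    if d_k == 0.0:
        return 0.0
    # E_nk: number of edges between `node` and members of the cluster.
    e_nk = sum(
        1 for u, v in edges
        if (u == node and v in nodes) or (v == node and u in nodes)
    )
    return e_nk / (d_k * len(nodes))


# Toy graph: a triangle A-B-C plus a neighbour D attached to A and B.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("D", "A"), ("D", "B")]
toy_cluster = {"A", "B", "C"}
toy_density = cluster_density(toy_cluster, toy_edges)
toy_cp = cluster_property("D", toy_cluster, toy_edges)
```

In this toy case the triangle has density 1, and D touches two of its three members, giving a cluster property of 2/3.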
Fisher’s exact test
We evaluated the enrichment of the known IBD genes (referred to as IDEGs in the present work) in the clusters from our PPI analysis using Fisher’s exact test. The test is an alternative statistical significance test used in the analysis of 2×2 contingency tables [82, 83].
Table: 2×2 contingency table of cluster membership (in cluster, not in cluster) against known IBD gene status
Here n is the total number of genes in the network.
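The enrichment test on a 2×2 table can be sketched directly from the hypergeometric distribution. This is a minimal sketch assuming a one-sided (over-representation) test; the text does not state which alternative was used. The cell labels a, b, c, d and the function name are assumptions for illustration.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for enrichment on the table

                      in cluster   not in cluster
        known gene        a              b
        other gene        c              d

    Returns P(X >= a) with all margins fixed."""
    n = a + b + c + d          # total number of genes in the network
    known = a + b              # known disease genes in the network
    in_cluster = a + c         # size of the cluster
    total = comb(n, in_cluster)
    upper = min(known, in_cluster)
    tail = sum(
        comb(known, x) * comb(n - known, in_cluster - x)
        for x in range(a, upper + 1)
    )
    return tail / total


# Toy table (not real data): 3 of 4 cluster genes are known disease genes.
toy_p = fisher_exact_greater(3, 1, 1, 3)
```

Because DPClusO produces many clusters, each cluster's p-value would subsequently need multiple-testing adjustment, as with the adjusted p-values reported in the results.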
We assigned a score called SScore (Significance Score) to each gene as a measure of confidence in the prediction, based on the p-values of the clusters the gene belongs to. By definition, SScore = −log(p-value). As DPClusO generates overlapping clusters, a gene may belong to more than one cluster and thus may correspond to more than one p-value. We used the lowest p-value corresponding to a gene to calculate its SScore.
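The SScore assignment can be sketched as follows. The base of the logarithm is not stated in the text, so the natural logarithm is assumed here; the function name, the dictionary layout and the toy values are likewise assumptions for illustration.

```python
import math


def compute_sscores(cluster_members, cluster_pvalues):
    """Assign each gene SScore = -log(p), using the lowest p-value among
    the (possibly overlapping) clusters that contain the gene.
    Assumes natural log and strictly positive p-values."""
    best_p = {}
    for cluster_id, genes in cluster_members.items():
        p = cluster_pvalues[cluster_id]
        for gene in genes:
            if gene not in best_p or p < best_p[gene]:
                best_p[gene] = p
    return {gene: -math.log(p) for gene, p in best_p.items()}


# Toy overlapping clusters (placeholder names, not real data).
toy_members = {"c1": ["g1", "g2"], "c2": ["g2", "g3"]}
toy_pvalues = {"c1": 0.01, "c2": 0.001}
toy_scores = compute_sscores(toy_members, toy_pvalues)
```

Here g2 belongs to both clusters and receives the score of the more significant one, c2.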
For a given SScore threshold th, true positive (TP), false positive (FP), true negative (TN) and false negative (FN) counts are defined as follows: TP is the number of reported IBD genes having SScore≥th, FP is the number of non-IBD genes having SScore≥th, TN is the number of non-IBD genes having SScore<th, and FN is the number of reported IBD genes having SScore<th.
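These four counts follow directly from the definitions above. The sketch below is a minimal illustration; the function name and the toy scores are assumptions.

```python
def confusion_counts(scores, reported_genes, threshold):
    """Count TP, FP, TN, FN for a given SScore threshold.
    `scores` maps gene -> SScore; `reported_genes` is the set of genes
    reported as disease genes."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for gene, score in scores.items():
        predicted_positive = score >= threshold
        if gene in reported_genes:
            counts["TP" if predicted_positive else "FN"] += 1
        else:
            counts["FP" if predicted_positive else "TN"] += 1
    return counts


# Toy values (not real data).
toy_scores = {"g1": 5.0, "g2": 1.0, "g3": 4.0, "g4": 0.5}
toy_reported = {"g1", "g2"}
toy_counts = confusion_counts(toy_scores, toy_reported, threshold=3.0)
toy_tpr = toy_counts["TP"] / (toy_counts["TP"] + toy_counts["FN"])
toy_fpr = toy_counts["FP"] / (toy_counts["FP"] + toy_counts["TN"])
```

Sweeping the threshold over all observed SScore values and plotting the resulting true positive rate against the false positive rate gives the ROC curve.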
We assessed how well SScore identifies known IBD genes using the Area Under the ROC Curve (AUC). For the AUC analysis, we used the R package ROCR. We considered a prediction ‘True’ if the gene is reported as an IBD gene in any of the following four sources: (1) Human Genome Epidemiology Network (HuGENet), (2) Comparative Toxicogenomics Database (CTD), (3) DisGeNet database and (4) GWAS results [30, 31, 32, 33]. Here, FP, TP, FN and TN were calculated based on known information, i.e. without knowledge of all IBD related and unrelated genes. Therefore, the calculated TPR and FPR values were affected by the unknown nature of the TN and FN genes.
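The analysis above used ROCR in R. As a language-neutral illustration, the AUC can also be computed from its rank interpretation: the probability that a randomly chosen reported gene scores higher than a randomly chosen unreported gene, with ties counted as one half. The function name and toy data below are assumptions, and this quadratic-time sketch is meant for clarity rather than efficiency.

```python
def roc_auc(scores, labels):
    """AUC via pairwise comparison of positive and negative scores.
    `labels` are truthy for reported disease genes."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))


# Toy scores and labels (not real data).
toy_auc = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

In the toy example, three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.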
This work was supported by NAIST Global Collaborative Program 2017 and partially supported by the Ministry of Education, Culture, Sports, Science, and Technology of Japan (16K07223 and 17K00406), NAIST Big Data Project and by Research Manitoba, Health Sciences Centre Foundation and Mitacs of Canada.
Md. A-U-A, PH, RE and MBK designed the research and conducted the experiments. TS, NO and SK guided the research with valuable comments. All authors have read and approved the final manuscript.
Ethics approval and consent to participate
The authors declare that they have no competing interests.
- 8. Lage K, Karlberg EO, Størling ZM, Ólason PÍ, Pedersen AG, Rigina O, Hinsby AM, Tümer Z, Pociot F, Tommerup N, Moreau Y, Brunak S. A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat Biotechnol. 2007; 25(3):309–16. https://doi.org/10.1038/nbt1295.
- 13. Finding disease similarity based on implicit semantic similarity. J Biomed Inform. 2012; 45(2):363–71. https://doi.org/10.1016/j.jbi.2011.11.017.
- 26. Piñero J, Queralt-Rosinach N, Bravo À, Deu-Pons J, Bauer-Mehren A, Baron M, Sanz F, Furlong LI. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database: J Biol Databases Curation. 2015; 2015:028. https://doi.org/10.1093/database/bav028.
- 28. Davis AP, Grondin CJ, Johnson RJ, Sciaky D, King BL, Mcmorran R, Wiegers J, Wiegers TC, Mattingly CJ. The Comparative Toxicogenomics Database: update 2017. Nucleic Acids Res. 2017; 45. https://doi.org/10.1093/nar/gkw838.
- 34. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995; 57(1):289–300.
- 38. Mohamadzadeh M, Pfeiler EA, Brown JB, Zadeh M, Gramarossa M, Managlia E, Bere P, Sarraj B, Khan MW, Pakanati KC, et al. Regulation of induced colonic inflammation by Lactobacillus acidophilus deficient in lipoteichoic acid. Proc Natl Acad Sci. 2011; 108(Supplement 1):4623–30.
- 54. Ringel-Scaia VM, McDaniel DK, Allen IC. The goldilocks conundrum: NLR inflammasome modulation of gastrointestinal inflammation during inflammatory bowel disease. Crit Rev Immunol. 2016; 36(4).
- 68. Manousou P, Kolios G, Valatas V, Drygiannakis I, Bourikas L, Pyrovolaki K, Koutroubakis I, Papadaki H, Kouroumalis E. Increased expression of chemokine receptor CCR3 and its ligands in ulcerative colitis: the role of colonic epithelial cells in in vitro studies. Clin Exp Immunol. 2010; 162(2):337–47.
- 71. Chow MT, Luster AD. Chemokines in Cancer. Cancer Immunol Res. 2014; 2(12):1125–1131. http://cancerimmunolres.aacrjournals.org/content/2/12/1125.
- 74. Tervaniemi MH, Katayama S, Skoog T, Siitonen HA, Vuola J, Nuutila K, Sormunen R, Johnsson A, Linnarsson S, Suomela S, Kankuri E, Kere J, Elomaa O. NOD-like receptor signaling and inflammasome-related pathways are highlighted in psoriatic epidermis. Nat Publ Group. 2016. https://doi.org/10.1038/srep22745.
- 80. Altaf-Ul-Amin M, Wada M, Kanaya S. Partitioning a PPI network into overlapping modules constrained by high-density and periphery tracking. ISRN Biomath. 2012; 2012:11.
- 84. Metz CE. Basic principles of ROC analysis. In: Seminars in Nuclear Medicine. New York: Elsevier: 1978. p. 283–98.
- 85. Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. In: Proceedings of the 23rd international conference on Machine learning. New York: ACM: 2006. p. 233–40.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.