Introduction

Crohn’s disease (CD) and ulcerative colitis (UC) are the two major manifestations of inflammatory bowel disease (IBD). They are chronic conditions characterized by prolonged inflammation of the digestive tract, and their exact cause is unknown. However, genetic factors and problems with the immune system have been associated with IBD. Although no recent epidemiological data exist specifically for Greece, the sample source of this work, an estimated 2.5–3 million people in Europe are affected by IBD, with a direct healthcare cost of 4.6–5.6 bn Euros/year [1]. In recent years, a significant number of trait-associated gene variants have been identified through genome-wide association studies (GWAS) in diverse populations, which has strengthened our understanding of complex diseases such as IBD [2]. In populations of European ancestry, approximately 200 genome-wide significant (GWS) IBD susceptibility loci have been identified [3]; however, IBD shows significant geographic and ethnic differences in incidence and prevalence [4].

Generally, because GWAS test the association of disease with individual SNPs across the genome and describe only the top-ranked SNPs with the strongest statistical evidence for association, they are underpowered to detect loci that have a small marginal effect but act jointly or interact in shaping trait variability [4, 5]. Thus, more sophisticated analyses, such as network-assisted studies that integrate GWAS results, are very promising approaches for discovering functionally related genes, including those that have a small marginal effect but act jointly in disease susceptibility.

Computational approaches have become standard practice over the last decades for managing and analyzing biological data. The accumulating volume of information produced by biological experiments, also known as –omics data, created a need for powerful computational analysis and storage. Biological databases and specialized tools, each targeting specific data types, had to be developed. Contemporary practices and literature [3, 6,7,8] focus on these approaches, producing more and more knowledge to be consumed. Systems bioinformatics [9] implementations try to combine all this newfound and/or newly appreciated knowledge into comprehensible interactions and provide insights into the patient-disease complex.

In the present study, we employed a bioinformatics pipeline to integrate IBD GWAS results with experimental and bibliographic data via two different approaches; one that informs on pathway-pathway networks and one that provides protein–protein association (via their respective genes) networks. These allowed us to perform network analysis and clustering, to identify sets of interconnected genes and functional pathways associated with each of the two IBD forms and their phenotypes.

More specifically, we used the results of our GWAS of an extended cohort of 573 Greek IBD patients (364 CD and 209 UC) and 441 controls, based on 89 single nucleotide polymorphisms (SNPs) that showed moderate or strong association in previous studies [6, 10, 11], to perform various network analyses. The data and analysis of the CD samples are novel, whereas for UC we re-analyzed our previously published data using contemporary bioinformatics approaches. Our results were combined with pathway interaction, gene co-expression, co-localization, co-occurrence and fusion data to reveal biologically meaningful processes that underlie the risk of IBD. This work aims to have a two-fold impact: to provide scientists in the field with new information on the pathogenesis of IBD, and to propose and highlight new methodologies that can be applied to genetic data of different pathological origins.

Materials and methods

Study design

The overall experimental design is illustrated as a flowchart in Fig. 1 and will be explained in detail here.

Fig. 1

Flow chart showcasing the experimental methodology and study design

Samples and DNA isolation

We conducted a GWAS using case–control datasets totaling 573 Greek IBD cases (364 CD and 209 UC) and 445 healthy controls, all unrelated, self-identified Greek individuals, as previously described (Table 1) [12]. Our samples were stratified into disease sub-phenotypes according to the Montreal Classification [13]. More specifically, CD samples were categorized by behavior (B1: non-stricturing, non-penetrating; B2: stricturing; B3: penetrating), whereas UC samples were categorized by extent (E1: ulcerative proctitis; E2: distal UC; E3: pancolitis). None of the patients or controls had a family history of autoimmune disease. The diagnosis of IBD was based on standard clinical, endoscopic, radiological, and histological criteria. Before commencement of the study, the Ethics Committee at the participating centers approved the recruitment protocols. All participants were informed of the study. DNA was isolated from blood with the NucleoSpin blood kit (Macherey–Nagel, Germany).

Table 1 Characteristics of case/control sets used

Genotyping

Genome-wide SNP typing of a discovery panel, using the Affymetrix Genome-Wide Human SNP Array 5.0, was carried out previously at the Institute for Clinical Molecular Biology, Christian-Albrechts-University, Kiel, Germany [6, 10]. Part of this panel has been used in previous studies [12].

SNP quality control and association analysis

The inclusion criteria for the samples in our statistical analysis accounted for SNP missing rate, minor allele frequency and a Hardy–Weinberg equilibrium exact test p value, to rule out genotyping errors. Association analysis was performed on the included samples as pairwise comparisons of the disease phenotype and sub-phenotypes using a 1 df χ2 (chi-square) test. Estimated odds ratios (OR) with 95% confidence intervals (CI) were also calculated for allele 1 (minor) versus allele 2 (major) in our preselected SNPs. Only SNPs with an asymptotic p value ≤ 0.05 were considered for further analyses. Quality control and association tests were performed using PLINK [14] v1.90b4.9. The R package metafor [15] v2.0 was used to create OR plots from our test results, and VENNY [16] was used to identify SNPs common between IBD phenotypes and subphenotypes.
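The allelic test described above can be illustrated with a minimal sketch. It is not the PLINK implementation: it assumes a plain Pearson χ2 on the 2×2 allele-count table without continuity correction, and a Wald interval on the log odds ratio for the 95% CI. The function name and the allele counts in the usage lines are hypothetical.

```python
import math

def allelic_association(case_a1, case_a2, ctrl_a1, ctrl_a2):
    """1 df chi-square test and odds ratio for allele 1 vs allele 2.

    Arguments are allele counts (not genotype counts) in cases and controls.
    Assumes all four counts are non-zero.
    """
    table = [[case_a1, case_a2], [ctrl_a1, ctrl_a2]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_a1 + ctrl_a1, case_a2 + ctrl_a2]
    n = sum(row_totals)

    # Pearson chi-square: sum over cells of (observed - expected)^2 / expected
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom the asymptotic p value is erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))

    # Odds ratio for allele 1, with a Wald 95% CI on the log scale (assumption)
    odds_ratio = (case_a1 * ctrl_a2) / (case_a2 * ctrl_a1)
    se_log_or = math.sqrt(1 / case_a1 + 1 / case_a2 + 1 / ctrl_a1 + 1 / ctrl_a2)
    ci_low = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    ci_high = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

    return {"chi2": chi2, "p_value": p_value, "odds_ratio": odds_ratio,
            "ci_low": ci_low, "ci_high": ci_high}

# Hypothetical allele counts, for illustration only
result = allelic_association(30, 70, 20, 80)
keep_snp = result["p_value"] <= 0.05  # the threshold used in this study
```

Under these assumptions, an OR above 1 means allele 1 is more frequent in cases than in controls, and a CI that spans 1 corresponds to a non-significant test.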

Signaling pathways enrichment and functional associations

Using the genes carrying the SNPs highlighted by our association analyses, gene-set lists were created as input to the PathwayConnector [17] (Method 1 of the flowchart) and the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING), a database of known and predicted protein–protein associations [18] (Method 2 of the flowchart) platforms.

In Method 1, KEGG [19] was selected as the default signaling pathway database, the top ten Enrichr pathways per set were considered as the initial seed pathways used in the complementary network analysis and edge betweenness was selected as the community detection algorithm for clustering on the complementary pathway network.

For Method 2, each gene of our gene-set was converted to a best-matched protein set. The networks were then created using an interaction score of 0.400 (medium confidence) with an enrichment of 30 interactors in total (no more than 20 1st shell and 10 2nd shell interactors), after testing various combinations for the most accurate results based on current knowledge. 1st shell interactors are proteins directly associated with our initial set, while 2nd shell interactors are those associated with the 1st shell interactors. All categories were selected as active interaction sources (Textmining: data extracted from the abstracts of scientific literature; Experiments: data extracted from other PPA databases; Databases: data extracted from curated databases; Co-expression: genes that are co-expressed in the same or in other species (transferred by homology); Neighborhood: genes that occur repeatedly in close neighborhood in (prokaryotic) genomes; Gene Fusion: gene fusion events per species; Co-occurrence: proteins linked across species). The Markov Cluster Algorithm (MCL) [20] with an inflation parameter of 3 was applied to the final network for cluster detection based on domain architecture. Edges were created by confidence levels, and disconnected nodes were hidden. Using Cytoscape [21], as well as the igraph [22] and centiserve [23] packages for R, we calculated various network analysis metrics in order to detect hubs (Degree Centrality), bottlenecks (Betweenness Centrality), shortest path topology (Latora harmonic closeness centrality) and, in general, nodes (proteins) that play an important role in the protein (PPA) networks. We devised a gene ranking score using a weighted function, giving Degree Centrality a factor of 0.2, Latora Closeness Centrality a factor of 0.3 and Betweenness Centrality a factor of 0.5. This score is intended to reflect the knowledge represented in the literature about the actual significance of those metrics in a protein network [24, 25].
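The weighted gene ranking score can be sketched as follows. Only the three weights (0.2, 0.3, 0.5) come from the text; the min-max rescaling of each centrality to the range 0 to 1 before weighting is an assumption made here so that metrics on different scales are comparable, and the function names and node labels are hypothetical. The centralities themselves are taken as precomputed inputs.

```python
# Weights as stated in the text
WEIGHTS = {"degree": 0.2, "closeness": 0.3, "betweenness": 0.5}

def min_max(values):
    """Rescale a {node: value} mapping to [0, 1] (assumed normalization)."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {node: 0.0 for node in values}
    return {node: (v - lo) / (hi - lo) for node, v in values.items()}

def combined_scores(degree, closeness, betweenness, normalize=True):
    """Weighted sum of the three centralities for every node."""
    metrics = {"degree": degree, "closeness": closeness,
               "betweenness": betweenness}
    if normalize:
        metrics = {name: min_max(vals) for name, vals in metrics.items()}
    return {
        node: sum(WEIGHTS[name] * metrics[name][node] for name in WEIGHTS)
        for node in degree
    }

def rank_nodes(scores):
    """Nodes ordered from highest to lowest combined score."""
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical centrality values for three placeholder proteins
scores = combined_scores(
    degree={"P1": 3, "P2": 1, "P3": 2},
    closeness={"P1": 0.9, "P2": 0.5, "P3": 0.7},
    betweenness={"P1": 4.0, "P2": 0.0, "P3": 1.0},
)
ranking = rank_nodes(scores)
```

Changing the entries of `WEIGHTS` is the only step needed to emphasize a different centrality, provided the weights still sum to 1.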
Finally, pathway analysis was performed on the enriched networks of the disease phenotypes and sub-phenotypes, keeping the KEGG database as reference, and the resulting signaling pathway lists were compared using the VENNY online tool to detect and visualize commonalities between them using Venn diagrams. The average combined centrality score of the proteins contributing to a pathway was used to calculate a pathway ranking score.
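The pathway ranking score can be sketched as the arithmetic mean of the combined scores of a pathway's member proteins. The sketch below assumes that proteins without a score are skipped and that pathways with no scored members are left out; neither detail is stated in the text. Pathway and protein names are placeholders.

```python
from statistics import mean

def pathway_scores(pathway_members, protein_scores):
    """Average combined centrality score per pathway.

    pathway_members: {pathway: iterable of protein names}
    protein_scores:  {protein: combined centrality score}
    """
    scores = {}
    for pathway, members in pathway_members.items():
        # Assumption: members missing from protein_scores are ignored
        scored = [protein_scores[p] for p in members if p in protein_scores]
        if scored:
            scores[pathway] = mean(scored)
    return scores

# Placeholder inputs, for illustration only
example = pathway_scores(
    {"pathway_A": ["P1", "P3"], "pathway_B": ["P2", "P3", "PX"],
     "pathway_C": ["PX"]},
    {"P1": 1.0, "P2": 0.0, "P3": 0.5},
)
ranked_pathways = sorted(example, key=example.get, reverse=True)
```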

Results

As described previously, to elucidate the functional links between single nucleotide polymorphisms (SNPs) and IBD, we used the results from our GWAS analysis to investigate signaling pathways involved in IBD using 2 different computational methods.

The PLINK analysis results pointed to 17 statistically significant SNPs specific for CD, 8 for UC and 13 generally for IBD compared to healthy individuals (HC), which were used as input in our pathway and enrichment analyses (Table 2). Figure 2a–c showcases the OR diagrams (Forest plots) of these SNPs versus their association to each disease phenotype and sub-phenotype as endoscopically and clinically categorized. The statistical hypothesis here is versus Allele1 and whether the SNP must be a homozygote or heterozygote to be associated with the disease. Results with an OR score < 1 point to a disease association when the SNP is a homozygote and an OR score > 1 points to a heterozygote SNP related to the disease phenotype.

Table 2 Overview of the SNPs included in the pathway and enrichment analyses
Fig. 2

Forest plots of odds ratios (ORs) for the SNPs highlighted by the SNP analysis performed via PLINK. These refer to a IBD vs HC, b CD vs HC, and c UC vs HC. All depicted SNPs are statistically significantly associated with the corresponding disease phenotype (p value < 0.05; those marked with a star have a p value < 0.01). Furthermore, results with an OR score < 1 point to a disease association where the SNP is a homozygote with the minor allele and an OR score > 1 points to a heterozygote

Regarding CD, our results revealed 15 SNPs for B1, 9 for B2 and 1 for B3. Concerning UC, 7 SNPs were related to E1, 2 to the E2 phenotype and 13 to the E3 phenotype (Table 2). It is worth mentioning that the low count of SNPs associated with the B3 and E2 sub-phenotypes is heavily influenced by the rarity of these cases in our Greek samples and in the worldwide population in general. The Venn diagram in Fig. 3a showcases the SNPs that are common between CD and UC from this initial analysis, Fig. 3b shows the common SNPs between B1 and B2 CD, and Fig. 3c shows that there are no common SNPs in our results between E1 and E3.

Fig. 3

Common SNPs found from the analysis on our datasets, between phenotypes and sub-phenotypes of IBD. a 4 common SNPs were found between CD and UC, b 3 common SNPs were found between B1 and B2, c no common SNPs were found between E1 and E3

Our results, although clearly pointing to a specific and distinct genetic background for the disease phenotypes and sub-phenotypes, highlighted the fact that our datasets contained only a handful of genes, which does not allow us to see the bigger picture. It is well known that gene products exert their functions through interactions with other cellular components, and the impact of a genetic perturbation can spread along the links of any functional network the gene product is involved in [26].

To study the role of specific signaling pathways in IBD pathogenesis, we employed Methods 1 and 2 on the gene sets inferred from these SNPs. Genes associated with the B3 and E2 sub-phenotypes yielded datasets too small to be analyzed, so they were disregarded.

Using Method 1 we identified the top 10 pathways after enrichment for all IBD phenotypes and subphenotypes. Moreover, 23 complementary pathways for CD, 11 for UC, 31 for B1, 15 for B2, 24 for E1 and 11 for E3 were detected as interacting with our original 10. The individual results along with visualizations of the complementary networks are included in Additional file 1.

Using Method 2, we constructed PPA networks and detected signaling pathways. The CD and UC risk genes interaction networks are presented in Fig. 4a, b respectively, whereas Fig. 5a, b showcases the networks created by the B1–B2 and E1–E3 sub-phenotype risk genes as those arose from our previous analyses. Different color groups signify clusters.

Fig. 4

Enriched protein–protein association networks created from the risk genes highlighted from previous analyses for a CD and b UC. STX7, STX8, VTI1B proteins were found to be common between the 2 networks. 4 distinct clusters detected for CD and 2 for UC

Fig. 5

Enriched PPA networks created from the risk genes highlighted from previous analyses for a B1 and B2 CD sub-phenotypes and b E1 and E3 UC sub-phenotypes. Only the protein NKX2-3 was found to be common between the CD sub-phenotypes, whereas, none were found for UC. 4 clusters were detected for B1, 2 for B2, 2 for E1 and 3 for E3

The PPA network constructed for CD has 38 nodes and 220 edges, and the MCL clustering algorithm identified 4 clusters, whereas the UC network has 33 nodes, 164 edges and 2 clusters. In total, using the enriched PPA networks, only 3 proteins were common between UC and CD: STX7, STX8 and VTI1B. The same process was applied to the B1 and B2 CD sub-phenotypes and the E1 and E3 UC sub-phenotypes. For B1, the enriched PPA network consists of 37 nodes, 187 edges and 4 clusters. For B2 the enriched PPA network consists of 34 nodes, edges and 2 clusters. Only the protein NKX2-3 was found to be common between the 2 enriched networks. The E1 PPA network consists of 32 nodes, 261 edges and 2 clusters, while the E3 network consists of 34 nodes, 146 edges and 3 clusters. No proteins were found in common between the 2 networks of the UC sub-phenotypes.

Network analysis using the three different centralities, and their subsequent transformation into a combined score, has provided, for each phenotype and its sub-phenotypes, a ranked list (Additional file 2) highlighting the proteins that are topologically most important in their protein–protein association networks.

The enrichment process via STRING combined with centrality analysis has also enabled us to study, using KEGG, the functional pathways involving the proteins highlighted by the network. In total, for the main IBD phenotypes, 26 signaling pathways were found exclusively for CD, 22 for UC and 27 were shared between them. Regarding CD sub-phenotypes B1 and B2, 13 pathways were found exclusively for B1, 21 exclusively for B2 and 15 in common between them. For the UC sub-phenotypes, 15 pathways were found exclusively for E1, 30 for E3 and 33 in common between them. Additional file 3 showcases the aforementioned group intersections. Finally, Additional file 4 provides a ranked listing of all the pathways for each phenotype and sub-phenotype, based on the previous combined scores for each protein, helping identify pathways that might play a significant role in IBD pathogenesis/functional background.
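The exclusive and shared counts reported above correspond to the three regions of a two-set Venn diagram. A minimal sketch of that partition, using plain Python sets rather than the VENNY tool, with placeholder pathway names that are not taken from our results:

```python
def venn_partition(list_a, list_b):
    """Split two pathway lists into exclusive and shared groups."""
    a, b = set(list_a), set(list_b)
    return {"only_a": a - b, "only_b": b - a, "shared": a & b}

# Placeholder pathway names, for illustration only
groups = venn_partition(
    ["pathway_1", "pathway_2", "pathway_3"],
    ["pathway_3", "pathway_4"],
)
counts = {name: len(members) for name, members in groups.items()}
```

The three counts always sum to the number of distinct pathways across both lists, which gives a quick consistency check on any reported intersection.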

To better understand our findings and arrive at a consensus between our methodologies, we created Fig. 6, which shows the common and individually highlighted pathways between Methods 1 and 2 for the IBD phenotypes and subphenotypes. The common ones are four for CD, seven for B1, four for B2, two for UC, two for E1 and two for E3. Finally, using the data from these merged results, we constructed a Disease–Disease association network, as depicted in Fig. 7. This network allows us to visualize disorders that share molecular mechanisms with our IBD sub-phenotypes.

Fig. 6

a Final merged pathway results from the 2 methods for all CD sub-phenotypes, b final merged pathway results from the 2 methods for all UC sub-phenotypes

Fig. 7

Disease–Disease association network based on molecular background commonalities

Discussion

Recent successes of large GWAS have had a major impact on identifying the variants underlying complex diseases such as IBD [11, 27,28,29]. Here, using a pipeline of methodologies, we integrate GWAS data of a Greek IBD population with curated databases of fundamental human pathways, as well as gene and reaction-based functional networks, in order to obtain novel insights into the potential causal processes of IBD and its sub-phenotypes, hopefully leading to specific diagnostic and therapeutic targets.

A novel stride in our present work was the further examination of the main phenotypes of IBD and their sub-phenotypes using a combination of –omics data and network-based approaches. The specificity of the results regarding SNPs, proteins and signaling pathways involved in IBD allows us to sift through general literature findings and pinpoint those that apply exactly to the population under study. We acknowledge that the two approaches showcased in this paper provide only a few common results (as depicted in Fig. 6). This is to be expected, given the differences in the methodologies of the two approaches and their intermediate steps. It signifies that when employing various omics methods to draw conclusions, especially about the functional role of genes, researchers should consider combined approaches that complement each other rather than relying on a single method. We must also recognize the limitations of the databases, as highlighted by the KEGG pathway results from both methods, in identifying specific disorder pathways when provided with a limited set of genes. Many disorders share common pathophysiological mechanisms, such as inflammation, making it difficult for the database to distinguish the specific disorder under study. This highlights the importance of more specific, mechanism-oriented databases.

The use of pathway network connectivity and centrality analysis of the protein–protein association networks, as well as their rankings, not only allows for more unbiased/unmanaged results of important proteins and their role in IBD but also draws attention to specific pathways to be considered out of all those “discovered” by plain pathway analysis methods. By using a weighted approach to combine centralities as shown here, and by modifying the initial scheme presented according to the weight that is desired to be given each time to each centrality, researchers might find the answers to the questions about which nodes are important to a protein association network according to their biological significance/role.

The current analysis implicates a significant number of core pathways with an important role in IBD, among others the Toll-like receptor signaling, TNF signaling, Jak-STAT signaling, PI3K-Akt signaling, T cell receptor signaling, MAPK signaling and B cell receptor signaling pathways. The NF-kappa B signaling, NOD-like receptor signaling, regulation of autophagy, chemokine signaling and adherens junction pathways were found to be CD specific, whereas the intestinal immune network for IgA production, natural killer cell mediated cytotoxicity, Wnt signaling, cytokine-cytokine receptor interaction, colorectal cancer, VEGF signaling, cGMP-PKG signaling, cell adhesion molecules (CAMs) and Fc epsilon RI signaling pathways seem to be UC specific. When we stratified the cases according to disease sub-phenotypes, we identified distinct pathways for the B1 and B2 sub-phenotypes of CD and for the E1 and E3 sub-phenotypes of UC. Interestingly, the role of most of the identified pathways in IBD pathogenesis and their clinical significance in IBD therapy and diagnostics are well studied [30, 31]. Toll-like receptors are basic mediators of innate host defense in the intestine, involved in maintaining mucosal and commensal homeostasis [32]. Additionally, novel therapies have been developed for IBD that target alternative TNF and IL signaling pathways (i.e. IL-12/23 axis, IL-6), as well as Jak inhibitors [33]. It is also well known that the combination of disease-associated variants of ATG16L1 and NOD2/CARD15 leads to synergistically increased susceptibility for CD, indicating a possible crosstalk between NOD2- and ATG16L1-mediated processes in the pathogenesis of CD [34]. Notably, Kini et al. [35] indicated that changes in signaling through Wnt primarily affected colonic stem cells, whereas Notch affected progenitor function, providing new insights into the development of inflammation and relapse in UC.
As depicted in our results, the central role of all these pathways is highlighted.

In the present study, the protein–protein association network analysis revealed that 3 proteins were common between UC and CD: STX7, STX8 and VTI1B. This is expected, since the role of autophagy in the pathogenesis and progression of IBD is well documented [36]. Furthermore, SNARE complexes and their regulators have a key role during inflammation and may present potential therapeutic targets in a wide range of inflammatory diseases such as IBD [37]. SNAREs have recently been implicated in controlling autophagosome development in mammalian cells [38], and the SNAREs vesicle-associated membrane protein (VAMP)7, syntaxin-7 (STX7), syntaxin-8 (STX8) and VTI1B regulate the homotypic fusion of phagophore precursors [39]. These fusion events allow the growth of these structures into a tubular network, leading to the formation of phagophores and autophagosomes [40].

Our results further indicated that the B1 and B2 CD sub-phenotypes exhibit distinct protein and pathway profiles, and that the NKX2-3 gene is common to these two entities. These findings are in accordance with previous studies, which indicated that NKX2-3 is a susceptibility locus for IBD in Eastern European patients but has not been related to a specific sub-phenotype [41]. However, the B2 network presents two disjointed clusters, which might be attributed to the fact that a limited number of SNPs was used in the GWAS and the possible links remain outside our initial targets. Regarding UC, our results revealed that the E1 and E3 sub-phenotypes have distinct pathways.

Our observations were also confirmed by the combined centralities network analysis. More specifically, for CD the proteins identified as having the strongest involvement with the disease are TLR4, SRC, NOD2, MYD88 and IL6. These results are not surprising, since it is well known that NOD2 is a major genetic risk factor for CD and that the NOD2 signal cascade is enhanced by toll-like receptor (TLR) agonists through NF-κB. NOD2 and TLR signaling collaborate to enhance immune responses [42]. TLR4 engages the adaptor MyD88 in combination with the adaptor TIRAP/Mal. Additionally, the signal transduction pathways involving MyD88 and IRAK induce a number of mediators that could be implicated in CD pathogenesis, such as TNFa and IL6 [43]. The rest of the proteins identified are involved in pathways related to an inappropriate immune response to components of the gut flora, as well as in autophagy signaling pathways [44]. Examining the main implicated proteins in the CD sub-phenotypes, our results revealed some significant observations. The main proteins related to the B1 sub-phenotype are those implicated mainly in the TLR and NOD2 signaling pathways (i.e. TLR4, MyD88, NOD2). Regarding NOD2, a previous study suggested that the L1007fs mutation is associated with fibrostenotic disease in central Europeans [45], but this could not be confirmed by our results, which might be explained by the different ethnic population in our own study. Other proteins correlated mainly with the B1 sub-phenotype are PRPF8, SNRPF and TRAF6. Reduced TRAF6 gene expression was found in IBD patients due to hypermethylation [46]. Regarding SNRPF, Wang et al. [47] recently identified an antibody against SNRPB as an autoantibody marker in CD, but there is no information related to disease sub-phenotypes. For PRPF8, no data are available regarding its implication in CD pathogenesis. For the B2 sub-phenotype, the autophagy-related proteins seem to be more important (ATG12, ATG4B, ATG3 etc.).
Even though there are no data supporting the association of autophagy genes with a specific CD sub-phenotype, autophagy undoubtedly plays an important role in CD pathogenesis [48]. In conclusion, there are distinct protein patterns implicated in these two sub-phenotypes that can probably be used for the prediction of CD progression.

Interestingly, the proteins strongly implicated in UC pathogenesis are distinct from those of CD. IL2, STX3, NFATC2 and JUN seem to have a major role in UC. Regarding IL2, it has been shown that Il2−/− mice develop IBD most reminiscent of UC [49]. Regarding STX3, a novel mechanism regulating the intestinal serotonin transporter (SERT) via PI3K and STX3 was recently reported [50]. Sikander et al. [51] demonstrated that there may be an association between polymorphisms in the SERT gene promoter and UC; thus STX3 seems to be important for UC pathogenesis. NFATC2 is a transcription factor with pleiotropic roles [52]. Remarkably, the existing data suggest an important cell-intrinsic role for NFAT family transcription factors in intrinsic negative T cell regulation, and Weigmann et al. [53] reported that oxazolone-induced ulcerative colitis and progression to colon cancer are attenuated in NFATC2 KO mice due to ineffective production of IL-6. This suggests that NFATC2 can act as a more generalized modulator of inflammation. Regarding the sub-phenotypes of UC, we observed that E1 is mostly related to proteins such as TLR4, TNF, NFKB1, TNFRSF1A and others involved in the NF-kappa B signaling pathway. Interestingly, the E1 sub-phenotype also seems to be strongly associated with the Ras-related C3 botulinum toxin substrate 1 (RAC1) protein. It is known that disruption of Rac1 in macrophages and neutrophils of mice protected them against dextran sulphate sodium (DSS)-induced colitis [54]. On the other hand, the E3 sub-phenotype is mostly related to the IL2 protein and also to autophagosome- and inflammation-related proteins, i.e. syntaxins and NFATC2 [55, 56]. A strong association of the IL2/IL21 locus with UC is well known [49]. STX3 has a crucial role in the trafficking pathways of cytokines in neutrophil granulocytes [57].
Additionally, FASLG seems also to play a basic role in this sub-phenotype and has been documented in the attenuation of apoptosis response to Fas-ligand in active ulcerative colitis [58]. NFATC2 is involved in colitis by controlling mucosal T cell activation in an IL-6-dependent manner and seems to be a potential therapeutic target for UC [56]. Our data indicate that distinct pathways also characterize the UC sub-phenotypes.

Genetic variants and their role in functional changes, though, are important not only for understanding IBD pathophysiology but also for understanding treatment-related enigmas such as patient response. As previous works [59,60,61,62,63] have shown, traditional IBD treatments like glucocorticosteroids and azathioprine, but also newer approaches like anti-TNF, are all susceptible to inefficiency due to specific genetic polymorphisms. The IBD landscape is vast and includes many factors and pitfalls that should be considered when trying to identify “who” is responsible for disease onset, progression and treatment, by making use of various technical approaches, each targeting a different subsystem [64]. Among these factors, the microbiome has become a scientific trend in recent years due to its apparent implication in various diseases, especially IBD. Microbiota dysbiosis appears to either drive or uniquely classify aspects of IBD such as progression [65] and response to treatment [66].

Collectively, our approaches provide important insights into the interplay among IBD risk variants and their related signaling pathways in IBD. All this information is directly relevant to our understanding of the mechanisms underlying IBD and its clinical sequelae. Moreover, by applying these approaches to several disorders and then comparing the results, we might be able to understand how key pathophysiological mechanisms can lead to previously unknown comorbidities.