Background

At present, the reconstruction of molecular genetic networks (gene networks) is one of the most widely used approaches for studying the mechanisms of the functioning of complex biological processes. The use of this approach is often a necessary requirement for solving many problems in the field of biology, medicine, and pharmacology, among others [1,2,3,4,5,6,7].

Around the world, many databases containing molecular genetic networks describing metabolic processes, diseases, phenotypic traits, etc. have been developed – for example, KEGG PATHWAY [8], BioCyc [9], BioGRID [10] and IntAct [11].

There are systems that allow the reconstruction of gene networks for a given set of genes/proteins including FunCoup [12], STRING [13], Pathway Studio [14], Ingenuity Pathway Analysis [15], PINA [16], GeneMANIA [17] and ReactomeFIViz [18]. These systems use various information sources on interactions of molecular genetic objects, including scientific publications and factual databases. FunCoup is one such system containing more than 37 million interactions that include mRNA/protein co-expression, protein–protein interaction, similarity by phylogenetic profile, binding of shared transcription factors, sub-cellular co-localization and others. STRING is another example of such systems, containing information about protein–protein associations, information obtained from curated databases, predictions (gene neighborhood, gene fusions, gene co-occurrence), text-mining, co-expression, etc.

Earlier, we developed the ANDSystem, which has a wide range of tools for the reconstruction of associative gene networks [19]. The knowledge base of ANDSystem contains more than 14 million interactions between proteins, genes, metabolites, microRNAs, diseases, biological processes, etc. Information on interactions was extracted from PubMed abstracts using a text-mining method and was also extracted from various molecular genetic databases. Interactions were subdivided into physical interactions, catalytic reactions, chemical transformations, associations, regulation of expression, activity, transport/release, stability/degradation, etc. The ANDSystem was used to solve a wide range of tasks related to the reconstruction of gene networks – in particular, for the interpretation of data of proteomic experiments [20,21,22], the analysis of the tissue-specific effect of gene knockout [23], the analysis of the hepatitis C virus interaction [24, 25], the identification of genes susceptibility to tuberculosis [26] and analysis of molecular mechanisms of comorbidity of diseases [27, 28].

Another well-known approach to the study of functional linkages in gene sets is analysis of over-representation of the Gene Ontology (GO) biological processes, KEGG pathways and diseases. There are several computer tools aimed at facilitating this task, such as DAVID [29], BINGO [30], GO-function [31] and others. These programs are widely used to interpret the experimental sets of genes obtained in transcriptome analysis, genome-wide association studies, mass spectrometric experiments, etc. [22, 32,33,34,35]. However, such methods do not take into account a structure of the networks, which describe interactions between genes. Due to this, for the last ten years, several methods allowing to perform an analysis of gene networks were developed [36,37,38,39]. One such method is EnrichNet [37], which uses a random walk procedure for the estimation of the distance between experimentally obtained and predefined functional gene sets inside a network. Comparison of gene networks with random networks is an alternative approach for determining functional connectivity in experimental sets of genes/proteins [40,41,42]. In the work of McCormack et al. [43], a stand-alone tool, CrossTalkZ, was developed to assess the statistical significance of inter and intra-connectivity (crosstalk enrichment) between or within gene sets. CrossTalkZ uses the FunCoup database for the reconstruction of the gene networks, while random networks are generated by the permutations of all edges or nodes in a global network [12].

In this paper, we describe a web tool that allows evaluation of the functional relationship between genes using the STRING and ANDSystem databases, which differ from FunCoup by types of interactions between objects as well as information sources. Based on the analysis of the gene sets involved in GO biological processes, it is shown that the sensitivity of the method exceeds 0.8 at a specificity of 0.95 for both STRING and the ANDSystem. This study identified that the significance of the difference between gene networks of biological processes and random networks depends on the type of interactions (protein-protein interaction, co-expression, expression regulation, etc.). In particular, networks constructed for apoptosis (GO), including separate types of links, such as “activity and transport regulation”, “catalysis”, “co-expression” and “interaction”, were statistically significantly different from random networks. However, as a rule, the greatest reliability was observed for networks that included not individual types of links, but a general type of connection – that is, a type of connection in which two objects are considered to be connected if there is a link between them of any particular form. The FunGeneNet web tool allows users to upload a list of human gene/protein identifiers as an input. The output data is an associative gene network built either by the ANDSystem or STRING, as well as the evaluation of network functionality, expressed as the significance of the network enrichment with links of a given type. FunGeneNet is available at URL: http://www-bionet.sscc.ru/fungenenet/.

Implementation

FunGeneNet algorithm

In the first step, the network is automatically reconstructed for the input list of genes/proteins, using the ANDSystem or STRING base of knowledge. The networks used by FunGeneNet are subnetworks obtained from of the global ANDSystem or STRING networks. In the STRING networks, vertices correspond only to the proteins, linked by a generalized type of interaction. In the ANDSystem, genes and proteins are represented by separate objects, which can be linked by various types of interactions, including protein-protein interactions, protein-DNA interactions, regulation of gene expression, activity regulation, etc. In the next step, a filtration of the subnetwork by user-specified interaction type is performed. There are two operation modes in FunGeneNet. The first mode is applied when a user selects “all types” for the interaction. In this case, all interactions presented in the FunGeneNet network are considered as a generalized type of interaction. The second mode is used when a user selects a specified type of interaction (for example, “activity regulation and transport”, “catalysis”, etc.). In this case, the system employs only interactions of the specified type, while any others are removed from the network. It should be noted that in the case of STRING, only the generalized interactions are used.

The method for assessing the functional enrichment consists of comparing the number of links between the analyzed and random networks. For this purpose, the connectivity of 100 random networks is calculated and the parameters of the normal distribution are evaluated for this sample to use a one-sided single-sample t-test (pnorm function of R language). In the absence of connections in both the analyzed and all random networks (edgeless networks), the p-value is taken to be 0.5, since in the case of a small non-zero number of edges in the sample of random networks, the p-value for an edgeless network is close to 0.5.

For the reconstruction of the random networks we used the node permutation approach proposed in [44]. The main difference of our algorithm is that labels of vertices were swapped in the global network, not in the local one. Other randomization methods were not considered because they are significantly inferior in performance to the method of node permutation and do not yield a significant gain in accuracy [43]. Performance in this study was critical because FunGeneNet is a web-application.

Random networks were built according to the following rules: (1) For each protein of the analysed set, the vertex degree in the global network was counted and the set of proteins of the global network with the same vertex degree was determined; (2) One protein was randomly selected from this set, which served as the starting vertex for the reconstruction of the random network; (3) The network reconstruction for the starting vertices was performed as for the network being analyzed.

Thus, each random network contained the same number and type of vertices as the original network, and the link types were also the same, while the number of links in random networks and the original were different due to permutations.

Restriction (1) on the degrees of selectable vertices in the global network is aimed at reducing the study bias described by Jensen et al. [45] as a tendency to study, in various aspects, primarily well-studied molecules. In this connection, we assume that vertices with relatively large degrees (hubs) accumulate more false-positive interactions than vertices with lower degrees. As can be seen from Fig. 1, the vertex degrees in the global gene network can be roughly described by a power law with the coefficient γ = 1.39. Therefore, the probability of choosing at random a vertex with a small degree is significantly higher than the probability of choosing a hub. Thus, if in the studied group of genes/proteins the hubs predominate for some reason, then such a network is likely to be more connected than the networks with randomly selected genes. The presence of well-studied genes in the analyzed sample can lead to a systematic error in random sampling, which was also noted in other works [40, 43].

Fig. 1
figure 1

Distribution of vertices in the ANDSystem global human gene network by vertex degree. Black colour identifies the trend line of the number of vertices f(x), where x is the vertex degree. R2 is the coefficient of determination

FunGeneNet input data

A list of protein IDs for the following databases is supplied to the input: UniProt, Ensemble. The program also understands NCBI gene identifiers. In a case where genes are fed to the input of the tool, the list of encoded proteins is first determined, and then the reconstruction is performed. The user has the opportunity to select the STRING system or the ANDSystem, through which the gene network will be reconstructed. In the case of using STRING, the user can select one of the standard thresholds for the presence of a connection in the global network: 150, 400, 700 and 900. In the case of the ANDSystem, the user can select the type of interaction from the list (activity and transport regulation, catalysis, coexpression, expression regulation, interaction and all types).

FunGeneNet output data

The FunGeneNet output is a file containing an interactive network in ANDSystem/tab-delimited format and the t-test p-value, which characterizes the difference between the analysed network and random networks. The given t-test p-value assumes the normal distribution of the number of links in random networks and can be biased from the true probability values. Therefore, in addition to the network being analyzed, the ROC curve p-value is calculated as the proportion of negative sample networks having a t-test p-value less than or equal to that for the network being analyzed (coords function of the pROC package of R language).

Accuracy estimation of the FunGeneNet method

To analyze the accuracy of the FunGeneNet method, we applied the ROC analysis technique [46]. Networks constructed for GO biological processes were considered as a positive sample. Information on the involvement of proteins in the processes was taken from the UniProt-GOA database (Submission date: 3/16/2016) [47]. GO networks were divided into two groups according to the number of proteins. The first group included processes for which 2 to 50 proteins were annotated, and the second group included processes with more than 50 proteins according to UniProt-GOA (Additional file 1: Table S1).

As a negative sample of networks, four types of random networks were used, for which it was assumed that they include functionally unrelated genes. Networks of the first type (simply random) were constructed by randomly selecting proteins from the whole set of human proteins, each of which had at least one connection in the global ANDSystem network. This restriction, to exclude proteins not participating in the formation of the global network, is also applied to other types of random networks. To build networks of the second type (well-studied), a random selection was made from proteins, mentioned in at least 50 PubMed publications. Thus, this group was represented by the relatively well-studied proteins. This group was created in order to take into account the possible FunGeneNet misclassification bias introduced by the level of scrutiny of proteins [45]. Networks of the third type (GO-based) were built using a random selection of proteins from a variety of proteins annotated in the GOA database (Additional file 2: Table S2). The reconstruction of these networks was carried out in such a way that one network did not contain the proteins involved in the same biological process. Networks of the fourth type (identical degree distribution [IDD]) were constructed with a restriction on the vertex degrees, so that each set of proteins from the positive sample corresponded to a set of the negative sample. The selection procedure consisted of three steps: (1) the vertex degree in the global network is determined for each protein of a positive sample, (2) the list of all proteins with the same degree as for a particular protein of a positive sample is extracted from the global network, (3) the starting protein for IDD network reconstruction is selected at random from this list. This method of reconstruction guaranteed equal vertex degree distributions in positive and negative samples. When considering characteristics of FunGeneNet – depending on the size and completeness of the networks, the STRING score, and the t-test/permutation option – networks of the type “simply random” (Additional file 2: Table S2) were used.

To construct the ROC curves, the number of random networks in a negative sample, as well as the distribution of the number of proteins in the random networks were specified to be equal to those in the positive sample. The same positive and negative samples of proteins were used to reconstruct networks for the ANDSystem and STRING (version 9.1).

The ROC curve classifier score was taken to be equal to 1 − p-value, where p-value characterized the statistical significance of the differences between the analysed networks and random networks, given out in the output data of the program. The area under the ROC curve (AUC) was calculated using the “roc” function of the pROC package of R language. As the “roc” argument “auc”, a “predictor” vector consisting of values of 1 − p-value for functional and random networks was fed. The argument “response” was a vector, with the coordinate values equal to 1 for functional networks and 0 for random networks.

To analyze the performance of the method depending on the type of interactions, the ANDSystem types were combined into larger types: (1) “activity and transport regulation”, which included the following types of interactions: “activity downregulation”, “activity regulation”, “activity upregulation”, and “transport regulation”; (2) “catalysis”, including “catalyze”, “cleavage”, “degradation downregulation”, “degradation regulation”, and “degradation upregulation”; (3) “coexpression”, which was taken as a separate type; (4) “expression regulation”, consisting of “up-”, “down-”, and “expression regulation” itself; (5) “interaction”, which was taken as a separate type; and (6) “all types”, including all of the above types, as well as the type “expression” and the type “association”.

To estimate how the completeness of genes of the studied process, presented in the experimental set of genes, would affect the obtained results, the following analysis was performed. At the first step, all GO biological processes were divided into five main groups according to the number of genes involved in each process: (1) processes, involving 10 genes; (2) from 20 to 22 genes; (3) from 40 to 50 genes; (4) from 100 to 200 genes; (5) from 400 to 1000 genes. Next, for each process, 10 genes were randomly selected from its entire set of genes. Thus, the completeness for the first group was 100% (the experimental set contained all genes of the process), for the second it was 45–50%, for the third it was 5–10%, etc. The selected lists of proteins are given in Additional file 1: Table S1. At the next step, an ROC curve was constructed for each range of the completeness.

The significance of the difference in the AUC of the ROC curves was estimated using the two-sided unpaired DeLong’s test, through the roc.test function of R language.

The p.adjust function of R language was used for the Benjamini Hochberg multiple testing correction.

Results

Method assessment

We consider two method variants, based on 1000 permutations as well as the t-test, using parameters of normal distribution estimated from 100 permutations. To assess any decrease in accuracy in the case of using the t-test instead of permutations, we build ROC curves for these variants (Fig. 2). Figure 2 shows that the AUC for these variants is nearly the same for both the ANDSystem and STRING. Due to this, and based on the fact that the method variant using a t-test reduces the number of calculations by approximately 10 times, below we show ROC curves constructed by the method based on the t-test.

Fig. 2
figure 2

Method accuracy for t-test. GO sets including from 11 to 50 proteins were used as a positive control. “Simply random” sample was used as a negative control

An interesting question about the FunGeneNet applicability is the dependence of the quality of the functional/non-functional network classification on the size of the gene set. Figure 3 shows that FunGeneNet performs non-random classification even in cases of small network sizes.

Fig. 3
figure 3

Dependence of the method accuracy on the size of the networks for ANDSystem (a) and STRING (b). The size of the network (designated as “size”) was defined as the number of proteins annotated with the GO biological process. As a negative sample, the sample “simply random” was taken (see methods). The STRING score was used by default (= 400)

Interactions between the genes contained in the global network have a different degree of reliability. Therefore, in the STRING system, a special score is used, which describes the weight of interactions. The STRING score is the threshold for eliminating noise information. Increasing the score for STRING networks can reduce the share of false interactions and decrease the completeness of networks. For this reason, a decision was made to check how the accuracy of FunGeneNet depends on the STRING score. Figure 4 shows the ROC curves for the standard values of the STRING score.

Fig. 4
figure 4

ROC-curves for different values of STRING score for networks of size > 10 and ≤ 50 (a) and for networks of size = 10 (b). In all variants, as a negative control, all genes with at least one bond in the global network with a score above 400 were taken. Edgeless networks correspond to the linear sections of the ROC curve. Linear segments are due to a fairly large proportion of edgeless networks with the same p-values in positive and negative controls

The use of ANDSystem networks in FunGeneNet allows analysis of different types of interactions, including all types (generalized type), activity and transport regulation, catalysis, co-expression, expression regulation, and interaction. Figure 5 shows the ROC curves for the different interaction types from the ANDSystem according to the different network sizes.

Fig. 5
figure 5

ROC-curves for different types of interactions for ANDSystem networks of size from 11 to 50 genes (a) and over 50 (b)

Another important issue to assess the quality of the method is the appropriate sampling of non-functional networks. We proposed four models of non-functional networks: “simply random” — random selection of a set of proteins, from having at least one connection in the global network; “well-studied” — the choice is the same as in “simply random”, but from proteins found in more than 50 publications; “GO based” — random selection is made from GOA, so that all the proteins in the sample do not have common GO biological processes in the direct GOA annotation; “the same degree of distribution” (IDD) — with this choice of negative control, the vertex degree distributions (vertex degrees are counted using the global network) in negative protein samples are exactly the same for those of positive samples. Figure 6 illustrates the ROC curves for the ANDSystem for various models of negative control.

Fig. 6
figure 6

The quality of the method for different models of negative control. The size of the networks used in the comparison lies in the range: 10 < size ≤ 50

Since, in an experimental gene/protein set that can be analyzed with the help of FunGeneNet, for some reason only a small part of the biological process under investigation may appear, we explored how much the accuracy of the method depends on the completeness of the data on the observed process. Figure 7 shows the ROC curves for different portions of GO biological processes for which the network is built. It can be observed from the figure that, as expected, with a decrease in the proportion of proteins over which the network is built, the area under the ROC curve decreases. For protein sets composed of 5–10% of all proteins assigned to the GO biological process, the classification is weaker, but not yet random, and for sets of 1–2.5%, it is close to random.

Fig. 7
figure 7

ROC-curves for different completeness of networks. The percentages indicate the completeness of the network, which was calculated as the ratio of the random sample size of the GO biological process, chosen to construct the ROC curve to the full size of this process (See methods). Each of the considered networks includes 10 genes

Thyroid cancer network

Papillary thyroid cancer is the most common form of thyroid cancer [48]. In the dbDEPC database, we identified data from three experiments on papillary thyroid cancer: EXP00039 (E39), EXP00050 (E50) and EXP00051 (E51). E39 contained a list of 30 differentially expressed proteins [49]. E50 and E51 were conducted within the same work and gave an identical list of 16 proteins for two different variants of cancer cell types [50]. At the intersection between the E39 and E50 lists, there were five proteins: ANXA1, Beta-actin, Moesin, FTL and Galectin-3.

Using FunGeneNet, we reconstructed networks for E39 (Additional file 3) and E50 (Additional file 4), as well as for intersection (Additional file 5) and union (Additional file 6) of the protein lists. The results of comparing networks E39 and E50 with random networks are listed in Table 2.

Apoptosis network

As an example, we considered a functional network formed by genes/proteins participating in the GO apoptotic process [GO: 0006915]. Apoptosis is known to be necessary for the normal development and functioning of the organism and is also of key importance in mechanisms of many diseases, such as neurodegenerative and cancer diseases [51,52,53]. A wide range of interactions is involved in this process, including the protein–protein interaction and regulatory links that determine the regulation of gene expression, as well as the regulation of protein activity and transport, etc. The identification of the significance of different connection types in the gene network of the apoptotic process can help to better understand the mechanisms of functioning and the role of participants of this complex biological process.

The protein list of apoptosis according to UniProt-GOA included 593 proteins (Additional file 7: Table S4). The network included 591 proteins, 585 genes and 12,529 interactions (Additional file 8: apoptotic process.andz).

FunGeneNet established the apoptosis network as functionally enriched by the types of “activity and transport regulation” (p-value = 3.95e-09), “catalysis” (p-value = 3.06e-06), “coexpression” (p-value = 3.09e-02), “interaction” (p-value = 3.24e-76) and “all types” (ANDSystem p-value = 1.46e-30, STRING p-value = 0). All networks for these types of links generally correspond to the power law of vertex degree distribution (Additional file 7: Table S4). This means that a small fraction of vertices aggregates most of the connections and these vertices can be of considerable interest.

Discussion

Method assessment

From Fig. 2, we can see that CrossTalkZ as compared to FunGeneNet showed a slightly higher AUC value. Nevertheless, due to its low performance, this program was not used in our system. In particular, this is due to the permutation of all the vertex names of the global network, while our algorithm is based on the permutation of only those nodes of the global network that are relevant to the analyzed one.

An increase in the number of correctly predicted functional networks with an increase in their size (Fig. 3) may be explained by an information noise decrease in the networks. The noise is a proportion of false positive links in the analyzed network. To detect a significantly higher number of links in the functional network compared with the random ones, it is required that the signal/noise ratio in the functional network be greater than that in the random networks. Another reason for the slight difference of small networks from random ones may be their incompleteness in comparison to large networks. This is because the meaningful part of the small networks has only been studied recently and not all genes involved and links between them have been revealed.

From Fig. 3 it is seen that for α = 1 − specificity = 0.05 for networks of a size greater than 10 and not greater than 50, the sensitivity exceeds 0.8 for both the ANDSystem and STRING. Further, we use this group of GO biological processes to examine the behaviour of the ROC curves in various conditions.

It is interesting that for the networks with sizes from 2 to 50 genes in Fig. 3a, a sensitivity jump by 0.02 is detected in the range of α [0.33, 0.34]. This jump corresponds to edgeless networks of positive and negative samples, for which all the random networks turned out to be edgeless. This jump is due to there being fewer networks in the negative sample (29 from 3458 edgeless) compared to the positive sample (115 from 1211). This fact is most likely connected with the difference in the distribution of vertex degrees in functional and non-functional networks. This question deserves a separate consideration with the goal of constructing, based on the vertex degrees in the global gene network, a method for classifying functional and non-functional gene sets. For GO biological processes including more than 10 genes, there are only six such edgeless networks. Thus, the contribution of edgeless networks is insignificant and the jump on the chart is invisible.

As can be seen in Fig. 4a, the highest AUC is observed for a STRING score of 150, and the smallest for a score of 900. This indicates that the method works better in complete but noisy networks, compared to networks with a small fraction of false positive interactions, with a high proportion underpredicted. Figure 4b shows the same pattern on networks with a fixed size of 10. This dependence is even more pronounced, apparently because of the small size of the networks and the absence of its variability. In a small-sized network, there are only a few well-established links, and that is why long linear segments corresponding to edgeless networks for scores of 700 and 900 appeared in Fig. 4b. It can be shown that the greater the score, the longer the linear sections and the more edgeless networks.

It appeared that with the increase of network size, the quality of classification grows (Fig. 5). The most accurate among the considered types of interactions appeared to be “all types”. Such a result was expected, since the consideration of only a specific type of interaction (for example, only transport regulation) leads to information loss, while the use of a generalized type avoids this [54, 55]. It should be noted that the quality of classification for some interaction types (for example, “catalysis” or “activity and transport regulation”) can be explained by the smaller number of genes involved in the GO biological process linked by such types, which, in particular, may cause the appearance of edgeless networks.

From 1625 processes involving from 11 to 50 proteins, 1507 differed in at least one type of connection (Additional file 9: Table S3). For each type of link, there were processes that differed only by this type of link (Table 1). This means that, having examined the differences from random networks by the mixed type of links (“all types”), we under-predict some functional networks. We tested if there are any common properties for networks that differ in a certain type. It turned out that for the “expression regulation” type among 21 significant GO biological processes that do not differ from random networks by “all types”, a group of 10 processes are distinguished, which are related to cell proliferation and the cell cycle: negative regulation of B-cell proliferation, positive regulation of phosphorylation, negative regulation of cyclin-dependent protein serine/threonine kinase activity, negative regulation of proteasomal ubiquitin-dependent protein catabolic process, positive regulation of cell cycle, negative regulation of organ growth, homeostasis of number of cells, positive regulation of cellular component movement, regulation of protein kinase B signalling and regulation of actin cytoskeleton reorganization. In addition to this group, it is possible to identify a more specific group of B-cell proliferation (negative regulation of B-cell proliferation, cellular response to interleukin-4, cellular response to interleukin-6). IL-4 (BSF-1) and IL-6 stimulate B-cell proliferation [56, 57]. Another group of processes for the “expression regulation” type refers to cell differentiation (trophectodermal cell differentiation, monocyte differentiation, and endothelial cell differentiation). Thus, it can be assumed that the networks of some functionally related GO biological processes are more different from random networks by a certain type of interaction compared with networks for other GO biological processes.

Table 1 The distribution of the number of networks of GO biological processes statistically significantly different from random ones for different interaction types

The most represented type of links, not counting “all types”, was the “interaction” type. Of 1507 significant networks, 1250 differed by this type. However, if we exclude processes that are significant by the “all types”, then only 77 networks will remain. Increased representation of the interaction type in comparison with other types of interactions can be explained by the appearance of high-performance methods, such as mass spectrometry [58] and yeast two-hybrid analysis [59].

It can be demonstrated from the Fig. 6 that the AUC varies insignificantly with different models of negative control (p-value = 0.105 for comparison of “simply random” and “well-studied”). On the one hand, this shows that imposing a strict vertex limit on IDD does not turn random networks into functional ones, which could be expected, since the pool of vertices for random selection is greatly reduced. On the other hand, the proximity of “simply random”, “GO-based” and “well-studied” curves shows that the proposed increased examination of GOA-annotated genes, compared to random genes, does not significantly affect the quality of the method.

Since it is difficult to determine what is really a non-functional network, we consider random networks as non-functional networks. It is possible that of all the reconstructed random networks, some are functional networks, which can underestimate the sensitivity and specificity, because among these random functional networks there may be those with a connectivity that is higher than the connectivity of the analyzed functional networks. Perhaps, in the presence of such a phenomenon, the addition of a restriction on the vertex degrees in a random network (as in the IDD variant) may lead to an increase in the proportion of such false positive networks in the negative sample and a greater underestimation of the method accuracy. However, as can be seen from Fig. 6, this understatement does not occur. In addition, we showed that for networks larger than 10, the method works well enough. So, even with some portion of the functional networks among random ones, random networks are an acceptable model of negative control.

The result of the dependency analysis between accuracy of the method and completeness of the data on the observed process (Fig. 7) is important for choosing a strategy for analyzing experimental sets of genes/proteins using FunGeneNet, as well as other methods based on the analysis of gene networks constructed from experimental gene sets. The absence of the significant differences from random networks may be related to incompleteness of experimental gene sets with respect to the real number of genes involved in the studied biological process. By taking into account this fact, experiments can be adjusted. Another way to solve this problem can be the extension of an experimental set of genes by gene-prioritization methods [60]. In particular, our analysis of the different levels of completeness of experimental gene sets on the example of GO biological processes showed that in cases where an experimental set of genes was less than 2.5% of the total number of genes of the target process, the absence of significance of functional connectivity in gene sets can be expected.

Thyroid cancer network

We were interested in identifying which genes/proteins in the Papillary thyroid cancer networks and their connections contribute to distinction from random networks (Table 2). The most important was the combined network of E50 and E39, which differed from random networks by the combined type “all types”, and the E50 network, which differed in the type of “catalysis”. Interestingly, the latter difference was due to the presence of two catalytic bonds in the protein transthyretin (TTR), which is a carrier of thyroid hormones. The involvement of TTR in thyroid cancer is consistent with the previously advanced hypothesis of an increased risk of thyroid cancer in the presence of particularly polybrominated diphenyl ethers (PBDEs), metabolites of which compete with thyroid hormones for binding to TTR [61]. Since TTR has catalytic activity [62], it can be assumed that PBDEs, through binding to TTR, change its catalytic activity. Interestingly, the second catalytic TTR link in the analyzed network is aimed at its cleavage by oncogene DJ-1, which, in an unknown way, regulates the phosphatidylinositol-3 kinase signalling pathway through the tumour suppressor PTEN [63]. Mutations in the PTEN gene lead to syndromes accompanied by cancer in various tissues, including the thyroid gland [63]. Thus, it can be assumed that TTR mediates the regulation of PTEN by DJ-1 protein.

Table 2 Enrichment by various types of functional interactions in groups of differentially expressed proteins in thyroid cancer (dbDEPC) estimated with FunGeneNet

Apoptosis network

An analysis of the gene network of GO apoptotic process [GO: 0006915] has revealed, that of the 90 transport regulation links, 30 links regulating the release of cytochrome c attract attention. This is consistent with the key role of cytochrome c in the mitochondria-dependent pathway of apoptosis [64]. The second leader by the number of “regulation of transport” connections is BAX protein. This protein aggregates 12 such bonds, of which nine show the influence of other proteins on BAX translocation from the cytosol into the mitochondria. This translocation is also the central event in the mechanism of apoptosis [65]. Among the links regulating activity, the maximum degrees of the vertices are for NFKB1 and P53. Among the 43 bonds, NFKB1 37 is directed at regulating the activity of this protein. As is known, this protein initiates apoptosis in order to suppress the development of tumours [66]. Of the 39 links of p53, 24 are directed to its regulation, this is consistent with p53’s key role in triggering apoptosis due to DNA damage, oncogenes expression and the effects of other factors [67]. For the “catalysis” type links, the participants with the highest degrees of vertices were the anti-apoptotic kinase AKT1, the proapoptotic CASP3, and the apoptosis-inducing p53. Among the 19 links of AKT1, 12 are phosphorylation of apoptotic proteins with AKT1. Of the 30 CASP3 links, 17 are cleavage of apoptotic proteins by CASP3. Of the 35 bonds of p53, 13 are responsible for regulation of its stability. For the co-expression network, a connected component was found containing four cellular receptors (CD2, IL2RA, TNFRSF18, and PRAME), kinase CDK11B and keratin KRT20. The maximum number of links in the co-expression network was three and was observed as adjacent to the CD2 protein. In the “interaction” network, Polyubiquitin-C (373 bonds), P53 (101 bonds) and CASP3 (67 bonds) were the leading proteins (Additional file 7: Table S4). When protein is ranked according to the specificity rate (SR = [number of connections in this network] / [number of connections in the global network]), the proteasome proteins PSMA8 (SR = 0.929) and PSMB6 (SR = 0.816) are leaders, as well as HIPPI protein (SR = 0.750), inducing apoptosis through activation of caspase-8 expression [68].

Thus, on the basis of analysis of different types of interactions in the gene network, describing the GO process of apoptosis, the connected components were identified, i.e. sets of genes involved in over-represented interactions. It appeared that these components include genes that are key for apoptosis.

Network modularity

The difference between functional networks and random ones is in good agreement with the principle of modular organization of biological systems, which Hartwell et al. (1999) brought into focus [69]. According to their definition, the module is part of a biological system that has a function that can be separated from the function of other such subsystems. The reflection of the principle of the modular organization at the level of gene networks is that the genes belonging to one module are closely located in the network [70, 71]. The work carried out by Ames (2013) [54] showed that the cohesive sub-graphs of global networks constructed from experimental data of different types overlap significantly with each other and with GO. Furthermore, the combination of these networks increases the coverage of GO. Based on the network modularity in the studies of Dutkowski et al. [72], Gligorijevi’c et al. [55] and Kramer et al. [73], gene ontologies were constructed exclusively based on network topology. Such topologies have shown a significant intersection with the existing topology of GO. Thus, the difference between functional networks and those that are random in terms of the number of connections is in agreement with the modular principle of network organization. For example, FunGeneNet showed the significance of the functional connectivity of the set of genes involved in histone deubiquitination [GO:0016578] as being equal to 6.48e-21, calculated by the ANDSystem, which can be a functional module (NeXO:8805) according to the NeXO ontology [72].

Conclusions

At present, using experimental transcriptomic, genomic and proteomic technologies, large arrays of experimental gene sets are generated. Such approaches are widely used to study medical-biological problems related to phenotypic traits, diseases, pathological conditions, etc. Reconstruction and analysis of gene networks that describe the functional interactions between genes in experimental sets of genes is a promising approach for the interpretation of omics data. FunGeneNet is dedicated to the analysis of the functional connectivity in experimental gene sets and identification of the most important links between genes from these sets, including the physical protein–protein interactions, protein–DNA interactions, and regulatory links such as regulation of expression, activity, etc. Reconstruction of gene networks for analyzed gene sets in FunGeneNet is carried out automatically using STRING and the ANDSystem.

The application of FunGeneNet to the analysis of gene sets involved in Gene Ontology biological processes has shown the statistical significance of the difference of networks reconstructed for these processes from random networks, which is in good agreement with the notion that functionally related genes participate in common biological processes. Sensitivity of our method exceeds 0.8, while specificity is 0.95.

The main feature of the method implemented in FunGeneNet is that it allows consideration of specific types of molecular-genetics interactions. An analysis of the connection types showed that the difference of GO biological processes from random networks depends on the types of interactions represented in them. Thus, genes involved in such processes can play an important functional role in analyzed processes. In particular, the analysis of a set of genes involved in apoptosis showed that such genes as NFKB1, P53, AKT1, CASP3 and HIPPI possess significant links, which is in good agreement with the literature data. Analysis of the gene sets associated with thyroid cancer taken from the dbDEPC database showed that these genes are significantly functionally related, and also suggests the molecular mechanisms of the role of genes involved in significant catalytic reactions.

An analysis of the gene sets associated with thyroid cancer taken from the dbDEPC database showed that these genes are significantly functionally related, and also allowed to suggest molecular mechanisms of the role of genes involved in significant catalytic reactions.

Availability and requirements

Project name: FunGeneNet.

Project home page: http://www-bionet.sscc.ru/fungenenet

Operating system: platform independent.

Programming language: PHP, Python.

Other requirements: Any browser with HTML5 support.

License: GPL-3.

Any restrictions to use by non-academics: none