# GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function


## Abstract

### Background:

Most successful computational approaches for protein function prediction integrate multiple genomics and proteomics data sources to make inferences about the function of unknown proteins. The most accurate of these algorithms have long running times, making them unsuitable for real-time protein function prediction in large genomes. As a result, the predictions of these algorithms are stored in static databases that can easily become outdated. We propose a new algorithm, GeneMANIA, that is as accurate as the leading methods, while capable of predicting protein function in real-time.

### Results:

We use a fast heuristic algorithm, derived from ridge regression, to integrate multiple functional association networks and predict gene function from a single process-specific network using label propagation. Our algorithm is efficient enough to be deployed on a modern webserver and is as accurate as, or more so than, the leading methods on the MouseFunc I benchmark and a new yeast function prediction benchmark; it is robust to redundant and irrelevant data and requires, on average, less than ten seconds of computation time on tasks from these benchmarks.

### Conclusion:

GeneMANIA is fast enough to predict gene function on-the-fly while achieving state-of-the-art accuracy. A prototype version of a GeneMANIA-based webserver is available at http://morrislab.med.utoronto.ca/prototype.

### Keywords

Gene Ontology, Additional Data File, Conjugate Gradient Method, Association Network, Label Propagation

### Abbreviations

- AUC: area under the ROC curve
- BP: biological process
- CC: cellular component
- CG: conjugate gradient
- GO: Gene Ontology
- MF: molecular function
- MIPS: Munich Information Center for Protein Sequences
- PGS: probabilistic graph search
- ROC: receiver operating characteristic
- SVM: support vector machine
- TSS: the Tsuda and coworkers [17] algorithm
- ZBLWS: the Zhou and coworkers [22] algorithm

## Introduction

The vast amount and increasing variety of genomic and proteomic data generated for model organisms creates an opportunity for *in silico* prediction of gene function through extrapolation of the functional properties of known genes. Genes with similar patterns of expression [1], synthetic lethality [2], or chemical sensitivity [3] often have similar functions. Additionally, function tends to be shared among genes whose gene products interact physically [4], are part of the same complex [5], or have similar three-dimensional structures [6]. Computational analyses have also revealed shared function among genes with similar phylogenetic profiles [7] or with shared protein domains [8]. More accurate predictions can be made by combining multiple heterogeneous sources of genomic and proteomic data [9]. Collectively, these observations have led to functional categorization of a number of previously uncharacterized genes using the so-called 'guilt-by-association' principle [10, 11, 12].

Algorithms that predict gene function using the guilt-by-association principle do so by extending a 'seed list' of genes known to have the given function by adding other genes highly associated with the seed list in one or more genomic and proteomic data sources. These algorithms typically compute a 'functional association network' to represent each dataset; in this network the nodes correspond to genes or proteins and the undirected links (or edges) are weighted according to evidence of co-functionality implied by the data source. Types of functional association networks include kernels used by support vector machines (SVMs) [9], functional linkage networks [13], and protein-protein linkage maps [14]. Individual functional association networks are often combined to generate a composite functional association network that summarizes all of the evidence of co-functionality. This network is then used as input to an algorithm that scores each gene based on its proximity to the genes in the seed list. When employed on multiple complementary data sources, these algorithms can accurately predict previously annotated gene functions in blind tests [15], suggesting that their predictions for unannotated genes are also quite accurate.

Despite these successes, guilt-by-association algorithms have yet to achieve widespread use in gene annotation or as sources of new hypotheses about gene function; to do so, their predictions need to become more accessible, more accurate, and more regularly updated. In principle, all available data should be used when generating hypotheses about gene function; however, compiling a large number of heterogeneous data sources, generating functional association networks to represent these sources, and then mapping gene identifiers among the networks is a complex and onerous task that is best handled by specialists. Centrally managed web-based 'prediction servers' are an efficient strategy to ensure that casual users have access to the best available predictions.

However, maintaining accurate and up-to-date prediction servers can be computationally prohibitive. Though a large number of algorithms have been developed to predict the function of unannotated genes by combining heterogeneous data sources (see [16] for a recent review), the most accurate of these algorithms have long running times, which can range from minutes [17] to hours [9] on yeast. Larger mammalian genomes increase the run time of these algorithms even more. As such, these algorithms cannot feasibly be run online and instead their predictions are made offline based on sets of pre-defined seed lists derived, for example, from Gene Ontology (GO) annotations [18]. However, because new data and annotations are being generated at a rapid rate, maintaining an up-to-date database of the best available predictions for all possible functions requires substantial and potentially unavailable computational resources.

Due to this limitation, most prediction servers sacrifice accuracy for speed by relying on a single, or a small number of, pre-computed composite functional association networks and using simple heuristics to score genes based on a given seed list (for example, see [13, 14, 19]). While the scoring heuristics are fast enough to provide online predictions for arbitrary seed lists, we will show that their predictions are much less accurate than more advanced methods. Furthermore, by using a single pre-computed network, these servers do not take advantage of the fact that different data sources are more relevant for different categories of gene function [2, 9] and are not extensible to new or user-supplied data sources.

Here we demonstrate that it is not necessary to surrender either accuracy or flexibility when building a prediction server by showing that GeneMANIA (Multiple Association Network Integration Algorithm) can, in seconds, generate genome-wide predictions that achieve state-of-the-art accuracy on arbitrary seed gene lists without relying on a pre-specified association network. We have achieved this goal through a series of algorithmic and technical advances that we have encapsulated in a new software package. With GeneMANIA, it is no longer necessary to maintain lists of *in silico* predictions of gene function because they can be recomputed as needed.

## Background

### Automated methods for predicting gene and protein function

Many algorithms for predicting gene function use a functional association network as a common representation of evidence of co-functionality for heterogeneous datasets. In this representation, nodes represent genes or proteins and the undirected edges (or links) between these nodes are weighted according to the evidence for co-functionality of the genes implied by the particular dataset. There are various techniques for deriving these weights depending on the source of data, and it is still unclear how best to extract evidence of co-functionality (see, for example, [20, 21]); as such, here we use a single method to calculate these weights (see Materials and methods for details). We assume that all of the association weights are positive.

### Predicting gene function from functional association networks

Following Lanckriet and coworkers [9], we pose the problem of predicting gene function as a binary classification problem and attempt to solve the problem of integrating multiple heterogeneous input data sources by assigning each functional association network derived from these data sources a positive weight that reflects its usefulness in predicting a given function of interest. Once these weights are calculated, we construct a function-specific association network by taking the weighted average of the individual association networks. In contrast to prior work [9, 17] in which the weights and discriminant values were optimized simultaneously using the same objective function, we use a separate objective function to fit the weights, leading to a simpler optimization problem and a substantial reduction in run time.

We predict gene function from the composite network using a variation of the Gaussian field label propagation algorithm [22, 23] that we adapted to be more suitable for unbalanced classification problems like gene function prediction. Label propagation algorithms, like most function prediction algorithms, assign a score to each node in the network, called the 'discriminant value', that reflects the computed degree of association between the node and the seed list defining the given function. This value can be thresholded to make predictions. While there is a large variety of label propagation algorithms (for example, those based on functional flow [24, 25] or Markov random fields [26]), our interest in the Gaussian field algorithm stems from the fact that it has a well-defined solution, it is derived from a principled framework, and it has recently been shown to perform as well as SVMs on gene function prediction tasks [17]. SVMs had previously been shown to be among the best algorithms for predicting gene function [27, 28].

## Results

The GeneMANIA algorithm consists of two parts: an algorithm, based on linear regression, for calculating a single, composite functional association network from multiple networks derived from different genomic or proteomic data sources; and a label propagation algorithm for predicting gene function given this composite network. Below we describe the two parts of the GeneMANIA algorithm individually. We evaluate GeneMANIA on yeast benchmarks as well as the MouseFunc I benchmark.

### GeneMANIA label propagation

Gaussian field label propagation algorithms for binary classification take as input an association network, a list of nodes with positive labels, possibly a list of nodes with negative labels, and initial label bias values. For gene function prediction, each gene is associated with a single node; nodes representing genes in the seed list (that is, positive genes) are assigned an initial label bias value of +1, and those representing genes deemed negative examples are assigned an initial bias value of -1. In the Results section, we explore strategies for defining negative examples; here we assume that these examples are provided. GeneMANIA label propagation differs from [22] in how we assign the label bias to unlabeled genes, which are those that appear neither on the seed list nor among the negative examples.

Given this input, label propagation algorithms assign a discriminant value to each node. These discriminant values are determined by letting the initial label bias of nodes propagate through the association network to nearby nodes. To account for potential noise in the initial labelings, GeneMANIA label propagation, like [22], allows the discriminant values assigned to positively and negatively labeled nodes to deviate from their initial biases.

Our label propagation algorithm assigns discriminant values by finding those that minimize a cost function that penalizes both differences between the discriminant values of neighboring nodes in the network as well as differences between the discriminant values of nodes and their label bias (see Materials and methods for details). This cost function allows information about the node labels to propagate through the network to affect the discriminant values of genes that are not directly connected to the seed list. As such, the initial bias values of labeled and unlabeled nodes are important in determining their discriminant values and those of their neighbors; to account for the fact that only a small portion of the genes are expected to be labeled as positive, in the GeneMANIA algorithm we set the initial bias of unlabeled nodes to be the average bias of the labeled nodes: $\frac{{n}^{+}-{n}^{-}}{{n}^{+}+{n}^{-}}$, where *n*^{+} is the number of positive and *n*^{-} is the number of negative examples.
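As a concrete illustration of the bias assignment described above, the following sketch (our own illustration, not the authors' code) computes the initial label bias vector, giving unlabeled genes the average bias of the labeled ones:

```python
import numpy as np

def label_biases(labels):
    """Assign initial label biases: +1 for positives, -1 for negatives,
    and the mean labeled bias (n+ - n-)/(n+ + n-) for unlabeled genes.
    `labels` is a vector with entries +1, -1, or 0 (unlabeled)."""
    n_pos = np.sum(labels == 1)
    n_neg = np.sum(labels == -1)
    y = labels.astype(float).copy()
    y[labels == 0] = (n_pos - n_neg) / (n_pos + n_neg)
    return y

# Two positives, one negative, three unlabeled genes
labels = np.array([1, 1, -1, 0, 0, 0])
y = label_biases(labels)
# unlabeled genes receive bias (2 - 1) / (2 + 1) = 1/3
```

Because a typical seed list is far smaller than the genome, this average bias is usually close to, but not exactly, -1, which reflects the expectation that most unlabeled genes are negatives.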

### Efficient implementation of GeneMANIA label propagation for large genomes

In GeneMANIA, label propagation on the composite association network is the most time-consuming step. Here we describe how we implemented this algorithm in order to make it more efficient.

The discriminant values resulting from GeneMANIA label propagation can be calculated by solving a system of linear equations, **y** = *A***f**, for **f**, the vector of discriminant values, given *A*, the coefficient matrix (whose derivation from the association network weights is described in Materials and methods), and **y**, the vector of node label biases. While in principle this system can be solved by multiplying the inverse of the coefficient matrix by **y**, we instead employ a conjugate gradient (CG) method, which makes efficient use of CPU processing time and computer memory, especially for large genomes.

To show why the CG method is well suited to this problem, we briefly describe the algorithm. The CG method iteratively improves an estimate, **f**^{t}, of the solution to the linear system as follows. At each iteration *t*, the current estimate **f**^{t} is multiplied by the matrix *A*. If the result of this matrix multiplication, **y**^{t} = *A***f**^{t}, is equal to **y**, then **f**^{t} is a correct solution. If **y**^{t} does not equal **y**, the CG method calculates a new estimate, **f**^{t+1}, based on the difference between **y**^{t} and **y**. Calculating **f**^{t+1} requires another matrix multiplication between *A* and a vector with the same number of elements as **f**^{t}. The new estimate, **f**^{t+1}, is guaranteed to be more accurate than **f**^{t} [29] and, starting from a random estimate of **f**, the CG method is guaranteed to converge to a correct solution after *n* iterations, where *n* is the number of nodes in the network. In practice, however, the CG method often converges in many fewer iterations.

Because each CG iteration consists mainly of a matrix-vector multiplication, its runtime is proportional to the number, *m*, of non-zero elements in *A*, which equals the number of edges plus the number of nodes in the functional association network that *A* represents. Many types of functional association networks are sparsely connected, and those that are not, such as networks derived from molecular profiling data, can be made sparse with little resulting impact on the performance of GeneMANIA (Figure 2). In contrast to the CG method, the run times of standard methods for solving linear systems, like Gaussian elimination, depend on the number of potential connections, which can be orders of magnitude larger than the number of actual ones. As a result, the total runtime of the CG method scales with *m*, in contrast to other methods for solving linear systems, whose worst-case runtime depends on the product of *m* and *n*.

In summary, to make our algorithm efficient for large genomes, we use a conjugate gradient-based procedure to solve for the discriminant values and we sparsify functional association networks so that only the most informative associations are retained. These two changes speed up the label propagation algorithm by orders of magnitude for large genomes.
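This strategy can be sketched with SciPy's sparse CG solver. The snippet below is an illustrative stand-in, not the paper's implementation: the toy network and the Laplacian-plus-identity coefficient matrix are our own assumptions, while the exact derivation of *A* is given in Materials and methods.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import cg

# Toy symmetric association network on 5 genes (sparse, non-negative weights).
W = csr_matrix(np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float))

# A simple positive definite coefficient matrix of Laplacian-plus-identity
# form (an assumption for this sketch; GeneMANIA's A is derived differently).
D = np.asarray(W.sum(axis=1)).ravel()
A = csr_matrix(np.diag(1.0 + D)) - W

y = np.array([1.0, 1.0, 0.0, -1.0, -1.0])  # node label biases
f, info = cg(A, y)                          # iteratively solves A f = y
```

Each CG iteration touches only the non-zero entries of `A`, which is why sparsifying the network translates directly into faster label propagation.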

### GeneMANIA network integration

Like two recent methods for combining multiple functional association networks [9, 17], GeneMANIA builds a composite functional association network by taking a weighted average of the individual functional association networks. However, unlike these two methods, GeneMANIA optimizes the network weights and calculates the discriminant values separately. The advantage of this approach is that, unlike the algorithm described in Tsuda and coworkers [17] (hereafter TSS), which also uses Gaussian field label propagation, GeneMANIA needs to run the computationally intensive label propagation only once, whereas TSS runs it multiple times on different networks while computing the network weights. For example, on our five network yeast benchmark (described in Materials and methods), across 400 GO functional categories, TSS runs label propagation, on average, 68.8 times per category.

Our approach to computing network weights was inspired by recent work showing that, when all available functional association networks are relevant for predicting gene function, a composite network generated by weighting each data source equally supports predictions as accurate as those derived using a composite network generated with an optimal choice of weights [17, 30]; when some of the association networks are irrelevant, however, the prediction performance of the equal-weighting scheme is degraded [30]. These observations suggest that heuristic network-weighting methods that can identify and down-weight irrelevant networks may be as accurate as optimal network weightings. As we show later, it is also important to be able to identify and down-weight redundant functional association networks.

The regularized linear regression algorithm that GeneMANIA uses is, by design, robust to the inclusion of irrelevant and redundant networks. This property is especially important when the data sources cannot be carefully controlled - for example, in a webserver that automatically downloads new data from web repositories or allows users to contribute their own data.

To set the network weights, we use ridge regression [31], which corresponds to finding a vector of network weights, **α** = [*α*_{1},..., *α*_{d}]^{T}, that minimizes the cost function:

$$(\mathbf{t}-\Omega\boldsymbol{\alpha})^{T}(\mathbf{t}-\Omega\boldsymbol{\alpha})+(\boldsymbol{\alpha}-\tilde{\boldsymbol{\alpha}})^{T}S(\boldsymbol{\alpha}-\tilde{\boldsymbol{\alpha}})$$

Here *α*_{i} is the weight of the *i*^{th} network, **t** is a vector derived from the initial labels of the labeled nodes, Ω is a matrix with columns corresponding to individual association networks, $\tilde{\boldsymbol{\alpha}}$ is the mean prior weight vector, and *S* is a diagonal precision matrix. When all the diagonal entries of *S* are set to zero, we say that the cost function is unregularized, and solving for **α** becomes equivalent to unregularized linear regression. For versions of the GeneMANIA algorithm designed to be deployed on a webserver, we have limited prior information about the networks, so we set the mean prior weight of all networks to be equal. We call this the 'equal-weight' prior. However, when predicting membership in a GO functional category, we can improve prediction accuracy by setting the mean prior weight of each network to the average of the weight vectors calculated using unregularized linear regression on a large number of functional classes in the same branch of GO. We call this the 'branch-specific' weight prior (see Materials and methods for details).

### Gene function prediction in mouse

To evaluate GeneMANIA, we performed function prediction using the MouseFunc I benchmark data. To do so, we constructed ten association networks from the ten datasets (see Materials and methods for details) and used the GeneMANIA network integration and label propagation algorithms to predict gene function for each functional class separately. When constructing individual association networks, we set the sparsity level to S = 50; that is, we kept only the top 50 association weights (links) for each gene and set the rest to zero. Setting S = 100 results in better performance on the most specific GO biological process (BP) classes, while it has minimal effect on the other evaluation categories (supplementary Figure 2 in Additional data file 1).
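The top-S sparsification step can be sketched as below. This is illustrative numpy code of our own; in particular, the rule for restoring symmetry after filtering (keeping an edge if either endpoint retained it) is one plausible convention, and the paper's exact construction is given in Materials and methods.

```python
import numpy as np

def sparsify(W, S):
    """Keep the S largest association weights per gene (row), zero the rest,
    then re-symmetrize by keeping an edge retained by either endpoint."""
    W_sparse = np.zeros_like(W)
    for i, row in enumerate(W):
        top = np.argsort(row)[-S:]      # indices of the S largest weights
        W_sparse[i, top] = row[top]
    return np.maximum(W_sparse, W_sparse.T)

# Toy symmetric association matrix on 6 genes
W = np.abs(np.random.default_rng(1).normal(size=(6, 6)))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
W_s = sparsify(W, S=2)
```

After this step each gene contributes at most S outgoing links, so the number of non-zero entries, and hence the cost of each label propagation iteration, grows linearly in the number of genes.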

We follow the same procedure as [15] and compare prediction accuracies using average area under the receiver operating characteristic (ROC) curve (AUC) in twelve MouseFunc I evaluation categories. Briefly, the evaluation classes are created by grouping GO categories corresponding to all pairwise combinations of the three GO branches (BP, cellular component [CC], and molecular function [MF]) and four specificity levels, that is, number of annotations ([3 to 10], [11 to 30], [31 to 100], and [101 to 300]). We report prediction performance on both the 'test' and the 'novel' benchmarks (see [15] for details).
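Since all comparisons below are scored by AUC, a small self-contained helper (our own, not part of the benchmark code; ties between scores are not handled) makes the metric concrete via the rank-sum identity that also underlies the Wilcoxon-Mann-Whitney test:

```python
import numpy as np

def one_minus_auc(scores, labels):
    """1 - AUC via the rank-sum identity:
    AUC = (sum of positive ranks - n+(n+ + 1)/2) / (n+ * n-),
    with ranks assigned in ascending order of score (no tie handling)."""
    order = np.argsort(scores)
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    auc = (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return 1.0 - auc

scores = np.array([0.9, 0.8, 0.4, 0.3, 0.2])  # discriminant values
labels = np.array([1, 1, 0, 1, 0])            # 1 = annotated to the category
# one positive-negative pair (0.3 vs 0.4) is mis-ordered: 1 - AUC = 1/6
```

Lower values of 1 - AUC indicate better ranking of annotated genes above unannotated ones, which is the convention used throughout the evaluations below.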

We compare GeneMANIA's predictions in the first and second rounds of MouseFunc I (GeneMANIA^{Entry-1} and GeneMANIA^{Entry-2}, respectively), as well as the version of GeneMANIA designed to be implemented on a webserver (GeneMANIA^{WS}), against the best performing methods in MouseFunc I. For GeneMANIA^{WS}, we assume no knowledge about the source of the gene list, whereas in MouseFunc I the competing methods were provided with the name of the GO category from which each gene list was derived. Below we discuss the outcome of our evaluation on the 'test' benchmark and then on the 'novel' benchmark.

On the test benchmark, GeneMANIA^{Entry-1} achieved the highest prediction performance on all evaluation classes containing GO categories with more than ten annotated genes, and GeneMANIA^{Entry-2} achieved the highest prediction performance on the categories with ten or fewer annotated genes. We measure performance using one minus the AUC (1 - AUC), so lower values indicate better performance. GeneMANIA^{Entry-2} had a significantly lower median 1 - AUC than all other methods in predicting BP and MF categories with 3 to 10 annotations (*p* < 0.05, Wilcoxon-Mann-Whitney), while GeneMANIA^{Entry-1} and GeneMANIA^{WS} had significantly lower median 1 - AUC than GeneMANIA^{Entry-2} in predicting BP categories with 31 to 100 and 101 to 300 annotations, CC categories with 101 to 300 annotations, and MF categories with 31 to 100 annotations.

The reduced performance of GeneMANIA^{Entry-1} in the categories with the fewest annotations is likely due to overfitting, because we used unregularized linear regression to set the network weights. In GeneMANIA^{Entry-2}, we switched to ridge regression with the branch-specific weight prior, which is less prone to overfitting. Note that GeneMANIA^{WS}, which uses an equal-weight prior, also shows improved performance on the smaller GO categories.

The reduced performance of GeneMANIA^{Entry-2} compared with GeneMANIA^{Entry-1} on the larger GO categories in the test benchmark is due to our definition of negative examples. In our first entry, we defined as negative examples all genes with a GO annotation but not one in the category being predicted. In our second entry, we refined this definition so that the negative examples for a given category were only those genes annotated to a sibling category, that is, one sharing a parent in the GO hierarchy. Although choosing the genes annotated to sibling categories as negatives improved prediction performance on novel tasks (Figure 5), it degraded prediction performance on the test set. This may be due to the reduction in the number of negative examples. One way to alleviate this effect is to define a range of label biases between [-1, 0] for genes that have GO annotations but are not annotated to the function of interest.

On the novel benchmark, in addition to the ten association networks, eight month out-of-date annotations for test genes were also made available. Annotations in other categories of gene function can be informative for predicting new annotations in a given category (for example [32]); in our second entry to MouseFunc I, we investigated two ways in which old annotations may be helpful in predicting new annotations on the novel benchmark. First, we reasoned that genes previously annotated in parent categories but not sibling categories should be considered as potential new positive examples and as such, as described earlier, we used only genes annotated to sibling categories as negatives. Second, often annotations in one branch of the GO imply annotations in a different branch. For example, transcription factors (MF annotation) are located in the nucleus at least some of the time (CC annotation) and a number of transcription factors are involved in cell cycle regulation (BP category). To include this information, we constructed association networks from each branch of GO (see Materials and methods).
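The sibling-based choice of negatives can be sketched with simple set operations. The category names, gene names, and flat parent map below are hypothetical toy data; the real selection operates over the full GO hierarchy.

```python
def sibling_negatives(target, parents, annotations):
    """Negative examples for `target`: genes annotated to a sibling category
    (one sharing a parent in the hierarchy) but not to `target` itself.
    `parents` maps category -> set of parent categories;
    `annotations` maps category -> set of annotated genes."""
    siblings = {c for c in parents
                if c != target and parents[c] & parents[target]}
    negatives = set()
    for c in siblings:
        negatives |= annotations[c]
    return negatives - annotations[target]

# Hypothetical toy hierarchy and annotations
parents = {"mitotic cell cycle": {"cell cycle"},
           "meiotic cell cycle": {"cell cycle"},
           "DNA repair": {"DNA metabolism"}}
annotations = {"mitotic cell cycle": {"CDC28", "CLB2"},
               "meiotic cell cycle": {"IME1", "CDC28"},
               "DNA repair": {"RAD51"}}
neg = sibling_negatives("mitotic cell cycle", parents, annotations)
```

Genes annotated only to unrelated branches (here, RAD51) are left unlabeled rather than forced to be negatives, which is the behavior the refined definition is meant to achieve.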

Figure 5 shows the prediction accuracies of three versions of GeneMANIA, as well as the best accuracy obtained for each evaluation set, on the 'novel' MouseFunc I benchmark. For the evaluation classes containing the GO categories with 3 to 10 or 11 to 30 annotations, GeneMANIA entries still had the lowest error on the MouseFunc I benchmark, with GeneMANIA^{Entry-2} having significantly smaller median error in five of the six 'small-sized' evaluation categories (*p* < 0.05, Wilcoxon-Mann-Whitney). GeneMANIA did less well in the larger GO categories, possibly suggesting that we are not making full use of the older annotations. In contrast to the test benchmark, here the improved performance of GeneMANIA^{Entry-2} can be mostly attributed to the choice of negative examples (data not shown), although the inclusion of the GO networks also slightly improves prediction performance (data not shown).

In summary, we observed that employing regularization when using linear regression to combine multiple networks drastically improves prediction accuracy on the most specific functional classes. This is because, with only a few positive examples, identifying relevant networks is challenging and prone to overfitting. One way of alleviating this effect is to estimate the relevance of each network based on its average weight over a large number of similar prediction tasks (for example, prediction of functional classes in the same branch of GO). We have also demonstrated that, in the binary classification of genes according to GO classes, the genes used as negative examples have a large impact on the outcome of label propagation.

### Performance of GeneMANIA on the yeast benchmark

Encouraged by the performance of GeneMANIA on the MouseFunc I benchmark, we compared GeneMANIA's performance with that of the TSS algorithm and bioPIXIE [14] on yeast data. To compare GeneMANIA with the TSS algorithm, we used five yeast functional association networks (from [9]) and 400 yeast GO functional classes that we derived from a 2006 version of the GO annotations (see Materials and methods for details). We chose the TSS algorithm because it had previously been shown to be as accurate as an SVM with optimized network weights while requiring orders of magnitude less computation time [17], enabling us to run much more extensive comparisons. We chose the bioPIXIE network because it is currently deployed on a popular website that provides gene function predictions. For bioPIXIE, we evaluate both the network itself and the probabilistic graph search algorithm published with it in comparison with GeneMANIA. Note that the bioPIXIE network was built using different data sources, which are not included in the five yeast network benchmark, so the reported performance concerns the specific published network, not the Bayesian network algorithm used to derive it. In addition, to illustrate the accuracies attainable with additional datasets, we performed function prediction with GeneMANIA using a 15-network yeast benchmark data set (GeneMANIA^{15}) that we derived from recent genomics and proteomics data sources (see Materials and methods). However, we were unable to use these networks as input to the TSS algorithm because of the memory requirements of that algorithm.

Comparing GeneMANIA^{15} (GM-15^{WS}) to GeneMANIA label propagation applied to the bioPIXIE network (GM-biPx) and to the bioPIXIE probabilistic graph search (PGS) algorithm applied to the same network (PGS-biPx) underlines the value of the GeneMANIA algorithm for a webserver. Given the bioPIXIE network as input, GeneMANIA has a significantly (*p* < 0.05, Wilcoxon-Mann-Whitney) lower median 1 - AUC than PGS-biPx in all evaluation classes, showing that using simple heuristics to assign discriminant values leads to a considerable loss in accuracy. The value of being able to tailor the composite functional association network to the input seed list is shown by the comparative performance of GM-biPx and GM-15: while GM-biPx performs well compared with GM-15 in predicting the smaller GO-BP categories, for which the bioPIXIE network was optimized, GM-biPx has much higher error and significantly lower performance (*p* < 0.05, Wilcoxon-Mann-Whitney) for the largest GO-BP categories and for GO categories from other branches of the hierarchy that the bioPIXIE network was not designed to predict. In our tests, GM-biPx outperforms GM-15 on the smaller GO-BP categories; however, the performance of GM-biPx is likely overestimated because the test sets used for evaluating GM-biPx are not independent of the training sets used to build the bioPIXIE network, since we were not able to recover the cross-validation folds used in [14].

We were not able to compare GeneMANIA with the TSS algorithm on the 15 yeast network benchmark; the TSS algorithm (as obtained from [33]), run on the 15 yeast networks, produced an out-of-memory error when solving the system of linear equations in MATLAB (version 7.4) on a quad-core Mac desktop with Intel Xeon 2.66 GHz processors and 4 GB of memory. However, when comparing GeneMANIA with TSS on the five yeast networks from [17] (TSS-5 versus GM-5), we see that GeneMANIA performs as well as or better than TSS; among the evaluation categories shown in Figure 6, GM-5 has a significantly lower median 1 - AUC than TSS-5 on BP and CC categories with 31 to 100 and 101 to 300 annotations (*p* < 0.05, Wilcoxon-Mann-Whitney).

### Running time and prediction accuracy

### The effect of redundancy and random networks on equal weighting

## Discussion and conclusion

We have shown that GeneMANIA is as accurate as, or more accurate than, leading gene function prediction algorithms on yeast and mouse despite requiring, in some cases, orders of magnitude less computation time. We achieve the highest accuracy using a version of our algorithm that requires between 10 and 15 seconds of computation time on a modern desktop computer with lightly optimized MATLAB code. With more careful optimization, we expect that GeneMANIA will be even faster. We have thus demonstrated that it is possible to design a gene function prediction algorithm that performs on-demand function prediction with the most up-to-date annotation lists and data sources while matching or exceeding the accuracy of the leading algorithms.

Our algorithm automatically combines multiple input data sources fast enough to be employed on a web server. With the availability of such an algorithm, it is no longer necessary to maintain static databases of gene function predictions that can quickly become out of date and are restricted to predicting function in pre-defined categories. As proof of this principle, we have generated a website [34] where users can define their own function categories by providing a gene list, and receive predictions in seconds.

Additionally, we have demonstrated that sparsifying data sources can result in increased prediction accuracy as well as decreased computation time. This suggests that, in the context of gene function prediction, many of the derived associations may be uninformative and that only a small selection of a gene's top neighbors is sufficient for accurate function determination.

Here we have not investigated the possibility of using a gene's prior annotations to predict new ones; this type of information can be useful [32], and it can easily be incorporated by using it to generate a functional association network, as we did in our MouseFunc I entry. We have also noticed that, in some cases, a network representation is not the most efficient encoding of input data; for example, our yeast gene localization dataset has a sparse feature-based representation but a dense network-based representation. This can be addressed using the techniques proposed in [35], which allow the propagation of label information on networks where each node is also associated with a feature vector.

## Materials and methods

### Software

The GeneMANIA prototype webserver is available at [34].

### Yeast benchmark datasets

To perform function prediction in yeast with the GeneMANIA algorithm, we used five yeast functional association networks (used in [9] and [17] and available from [33]), constructed an extended yeast benchmark consisting of 15 association networks and 400 GO functional classes, and used the bioPIXIE network (obtained from [36]).

We constructed 15 yeast association networks from various genomics data sources (supplementary Table 1 in Additional data file 2). In addition, we downloaded the GO association file from the *Saccharomyces* Genome Database (1 June 2006) and obtained a set of 400 functional classes that we grouped according to their specificity level and GO branch. In particular, we randomly selected 100 functional classes from each specificity level (3 to 10, 11 to 30, 31 to 100, and 101 to 300) and organized them based on their GO branch. In addition, the 5 yeast network benchmark consists of 13 Munich Information Center for Protein Sequences (MIPS) functional classes. We have included the prediction performance on these functional classes in supplementary Figures 3 and 5 in Additional data file 1.

In addition to the above, we investigated the prediction performance of GeneMANIA using the bioPIXIE [14] network (we refer to this method as GeneMANIA^{bioPIXIE}). bioPIXIE is a composite association network that has been constructed from 925 referenced data sources. It contains 15,551,081 functional associations between 7,034 yeast genes.

### Zhang and coworkers mouse benchmark data

We obtained the mouse tissue expression benchmark data from [37]. For comparison of GeneMANIA with the benchmark data from [12], following their experimental framework, we considered 9,499 genes and 992 GO categories.

### MouseFunc I benchmark data

We constructed ten association networks from the MouseFunc I benchmark data (as detailed below). We used three minor variations of GeneMANIA for our evaluations: GeneMANIA^{Entry-1}, GeneMANIA^{Entry-2}, and GeneMANIA^{WS}.

In GeneMANIA^{Entry-1} we used unregularized linear regression to combine association networks, and we used as negative examples all genes that have some GO annotation but are not annotated to the functional class of interest.

In GeneMANIA^{Entry-2} we used ridge regression to combine association networks, and we set the prior mean weight *α*_{ h } for network *h* to the average weight that network *h* obtained when using ordinary linear regression across all predictions from the same GO branch. We define negative examples for each class of interest *g* as those genes that have annotations to any class that shares a parent with *g*. In addition, we included three GO networks that were constructed from the three GO ontologies.

GeneMANIA^{WS} is the same as GeneMANIA^{Entry-1} except that we used ridge regression, setting all of the elements of the prior mean weight vector to 1/*d*, where *d* is the number of interaction networks.

### Building functional association networks

When constructing association networks, we distinguish two types of data: binary and continuous valued. In binary datasets, for a given feature, all zeros were replaced with *log*(1 - *β*) and all ones were replaced with -*log*(*β*), where *β* is the proportion of examples for which the given feature has a value of 1. This emphasizes similarities between genes that share 'uncommon' features. Subsequently, for both types of data, we constructed similarity matrices using the Pearson correlation coefficient to measure pair-wise similarities. To keep our association networks sparse, for each gene we set all of its links (association weights), except for its top 50 links, to zero.
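The two steps above (reweighting binary features, then correlating and sparsifying) can be sketched in Python as follows; the function names, the `k` parameter standing in for the top-50 cutoff, and the union rule used to keep the sparsified matrix symmetric are our own assumptions, not the paper's implementation:

```python
import numpy as np

def binary_log_transform(X):
    """For each binary feature (column), replace 0 with log(1 - beta) and
    1 with -log(beta), where beta is the fraction of examples with a 1,
    emphasizing similarities on 'uncommon' features."""
    X = np.asarray(X, dtype=float)
    beta = X.mean(axis=0)                     # fraction of ones per feature
    beta = np.clip(beta, 1e-12, 1 - 1e-12)    # guard against log(0)
    return np.where(X == 1, -np.log(beta), np.log(1 - beta))

def association_network(X, k=50):
    """Pearson-correlation similarity between genes (rows of X), keeping
    only each gene's top-k links; a link survives if either endpoint
    ranks it in its top k, so the result stays symmetric."""
    W = np.corrcoef(X)
    np.fill_diagonal(W, 0.0)
    n = W.shape[0]
    keep = np.zeros_like(W, dtype=bool)
    for i in range(n):
        top = np.argsort(W[i])[-min(k, n - 1):]
        keep[i, top] = True
    return np.where(keep | keep.T, W, 0.0)
```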

### Mouse GO networks

For predicting gene function on the MouseFunc I benchmark, in addition to the ten association networks that we constructed from the genomics and proteomics data, we constructed three GO networks. We represented a gene's annotations to a given branch of GO as a binary vector whose *g*^{ th } element is set to '1' if the gene was annotated in GO category *g*, or any of the descendants of *g* in the GO hierarchy, and is set to '0' otherwise. We constructed one GO network for each branch of the hierarchy. However, when predicting GO category *g* in ontology BP (CC, or MF), we constructed the BP GO network without including any annotations to *g* or its descendants.
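A minimal sketch of these annotation vectors, assuming the GO hierarchy is supplied as a map from each category to its children; the function names and data structures are illustrative, not taken from the paper:

```python
def descendants(g, children):
    """All descendants of GO category g, given a child map {cat: set(children)}."""
    seen, stack = set(), [g]
    while stack:
        for ch in children.get(stack.pop(), ()):
            if ch not in seen:
                seen.add(ch)
                stack.append(ch)
    return seen

def go_vectors(genes, categories, direct_ann, children):
    """Gene x category binary matrix: entry is 1 if the gene is annotated
    to the category or any of its descendants, 0 otherwise."""
    rows = []
    for gene in genes:
        ann = direct_ann.get(gene, set())
        rows.append([int(bool(({g} | descendants(g, children)) & ann))
                     for g in categories])
    return rows
```

Excluding the category being predicted (and its descendants) then amounts to dropping those columns before computing the network.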

### Network normalization

We normalize all our functional association networks and the composite process-specific network by dividing each entry *W*_{ ij }, the association between node *i* and node *j*, by the square root of the product of the sum of the elements of row *i* and the sum of the elements in column *j* of the unnormalized network.
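In code, assuming a non-negative association matrix (the guard that leaves isolated, zero-sum nodes untouched is our own addition):

```python
import numpy as np

def normalize_network(W):
    """Divide each W_ij by the square root of (sum of row i) * (sum of column j)."""
    W = np.asarray(W, dtype=float)
    r = W.sum(axis=1)                  # row sums
    c = W.sum(axis=0)                  # column sums (equal to r when W is symmetric)
    r = np.where(r > 0, r, 1.0)        # avoid dividing by zero for isolated nodes
    c = np.where(c > 0, c, 1.0)
    return W / np.sqrt(np.outer(r, c))
```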

### Figure of merit for accuracy

We evaluated the performance of GeneMANIA using the AUC.

An ROC curve is a plot of the true positive rate (sensitivity) as a function of the false positive rate (1 - specificity) for a binary classifier as its discrimination threshold is varied; a perfect classifier yields an AUC of 1 and a random classifier yields an AUC of 0.5 [38]. 1 - AUC is a more informative measure because it corresponds to the expected proportion of negatives that are ranked higher than a random positive: while a gain from 0.90 to 0.95 AUC may seem negligible, it corresponds to a two-fold reduction in the expected proportion of negatives scoring higher than a random positive (1 - AUC).
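This pairwise interpretation of 1 - AUC can be computed directly; the function name is ours, and ties are counted as half a misordering (a common convention, not stated in the text):

```python
import numpy as np

def one_minus_auc(scores, labels):
    """Expected proportion of negatives scoring higher than a random positive
    (labels: +1 for positives, -1 for negatives)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    worse = (neg[None, :] > pos[:, None]).mean()   # negative outranks positive
    ties = (neg[None, :] == pos[:, None]).mean()   # ties count as half
    return worse + 0.5 * ties
```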

### GeneMANIA label propagation algorithm

We denote the vector of discriminant values by **f**, the bias vector by **y**, and the matrix derived from the association network by *W*. We can represent an association network over *n* genes by a symmetric matrix *W* whose non-zero entries indicate the associations in the network. In particular, the (*i*, *j*)^{th} element of *W*, *W*_{ ij }, is the association between genes *i* and *j*, with *W*_{ ij } = 0 indicating no edge between genes *i* and *j*. To ensure that all associations are non-negative, we set any negative associations to zero. For each binary classification task, we have *l* labeled genes and *u* unlabeled genes (*n* = *l* + *u*). We use these labels to specify a bias vector **y**, where *y*_{ i } ∈ {+1, *k*, -1}, indicating that gene *i* is positive, unlabeled, or negative, respectively. In the GeneMANIA label propagation algorithm, the bias on unlabeled genes is set to the mean label value:

$k = (n^{+} - n^{-})/(n^{+} + n^{-})$

where *n*^{+} and *n*^{-} are the numbers of positive and negative genes, respectively. The discriminant values are computed by solving the following objective function:

$\mathbf{f}^{*} = \mathrm{argmin}_{\mathbf{f}} \sum_{i}(f_{i} - y_{i})^{2} + \sum_{i,j} w_{ij}(f_{i} - f_{j})^{2}$   (1)

which ensures that the discriminant values of positive and negative genes remain close to their label bias (first term in the summation) and the discriminant values of genes that are associated (genes that have positive *w*_{ ij }) are not too different from each other (second term in the summation). Equation 1 can be written in matrix notation as:

$\mathbf{f}^{*} = \mathrm{argmin}_{\mathbf{f}} \; (\mathbf{f} - \mathbf{y})^{t}(\mathbf{f} - \mathbf{y}) + \mathbf{f}^{t}L\mathbf{f}$   (2)

where *L* = *D* - *W* is called the graph Laplacian matrix and *D* = *diag*(*d*_{ i }) (that is, *D* is a diagonal matrix with *D*_{ ii } = *d*_{ i } and *D*_{ ij } = 0 if *i* ≠ *j*) and *d*_{ i } = Σ_{ j }*w*_{ ij }. Since the association matrix is symmetric, by definition, *L* is symmetric and positive semi-definite and equation 2 is a quadratic optimization problem with a global minimum. In fact, the solution to equation 2 can be obtained by solving the sparse linear system **y** = (*I* + *L*)**f**.
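For a small dense network this sparse linear system can be solved directly; a minimal Python sketch (the function name is ours, the paper solves the sparse system with the conjugate gradient method rather than a dense solve, and the zero bias on the unlabeled gene in the test corresponds to a balanced label set):

```python
import numpy as np

def label_propagation(W, y):
    """Minimize (f - y)^T (f - y) + f^T L f with L = D - W,
    i.e. solve (I + L) f = y for the discriminant values f."""
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W                 # graph Laplacian
    return np.linalg.solve(np.eye(len(y)) + L, np.asarray(y, dtype=float))
```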

### GeneMANIA network combination

Given *d* association networks encoded as matrices *W*_{1},..., *W*_{ d }, we form a composite network as:

$W^{*} = \sum_{h=1}^{d} \alpha_{h} W_{h}$

where the vector of network weights, **α** = [*α*_{1}, *α*_{2},..., *α*_{ d }], is customized to a particular prediction task by choosing it to maximize a form of kernel-target alignment [39] between the composite network and a 'target' network constructed from the class label vector **y**. Specifically, to obtain the weight vector **α**, we solve the ridge regression problem:

$\mathbf{\alpha} = \mathrm{argmin}_{\mathbf{\alpha}'} \; (\Omega\mathbf{\alpha}' - \mathbf{t})^{t}(\Omega\mathbf{\alpha}' - \mathbf{t}) + (\mathbf{\alpha}' - \tilde{\mathbf{\alpha}})^{t}S(\mathbf{\alpha}' - \tilde{\mathbf{\alpha}})$   (3)

where $\tilde{\mathbf{\alpha}}$ is the prior mean weight vector, and each column Ω_{ h } of Ω is constructed from the matrix *W*_{ h } by collecting its elements *W*_{ ij } for which either one of the genes *i* or *j* is labeled positive and the other negative, or both are labeled positive. Each element of the vector **t** also corresponds to either a pair of negatively and positively labeled genes or two positively labeled genes. Elements of **t** corresponding to a positive and a negative are set to -*n*^{+} × *n*^{-} and those corresponding to two positives are set to *n*^{-} × *n*^{-}, where *n*^{+} and *n*^{-} are the numbers of positively and negatively labeled genes, respectively. The same row ordering is used for all Ω_{ h } and **t**, so that each row corresponds to a single (*i*, *j*) gene pair. When solving equation 3, we also include a bias term, *α*_{0}, by adding a column of '1's, Ω_{0}, to Ω, and discard this bias term when constructing the composite association network. Intuitively, we are attempting to choose network weights such that pairs of positively labeled genes have high similarity, pairs containing a positive and a negative gene have low similarity, and pairs of negatively labeled genes have no influence in determining the weights.
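To make the construction of Ω and **t** concrete, here is a hypothetical Python sketch; the function name, the pair-enumeration order, and the label encoding (+1/-1) are our own, and the target values follow the text literally:

```python
import numpy as np

def build_omega_t(networks, labels):
    """For each positive-positive and positive-negative gene pair, collect
    the corresponding entry of each network (columns of Omega) and the
    target value; a bias column of ones is prepended as Omega_0."""
    labels = np.asarray(labels)
    pos = np.where(labels == 1)[0]
    neg = np.where(labels == -1)[0]
    n_pos, n_neg = len(pos), len(neg)
    pairs, t = [], []
    for a in range(n_pos):                       # positive-positive pairs
        for b in range(a + 1, n_pos):
            pairs.append((pos[a], pos[b]))
            t.append(n_neg * n_neg)
    for i in pos:                                # positive-negative pairs
        for j in neg:
            pairs.append((i, j))
            t.append(-n_pos * n_neg)
    omega = np.ones((len(pairs), len(networks) + 1))   # column 0 is the bias
    for h, W in enumerate(networks, start=1):
        omega[:, h] = [W[i, j] for i, j in pairs]
    return omega, np.asarray(t, dtype=float)
```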

Given the prior mean vector $\tilde{\mathbf{\alpha}}$ and precision matrix *S*, the solution to equation 3 is:

$\mathbf{\alpha} = (\Omega^{t}\Omega + S)^{-1}(\Omega^{t}\mathbf{t} + S\tilde{\mathbf{\alpha}})$

where Ω^{ t } indicates the transpose of Ω. In the text, we describe two different methods to set $\tilde{\mathbf{\alpha}}$, and we specify *S* by setting all its elements to zero, except for its diagonal elements, which we set to *S*_{ hh } = trace(*W*_{ h }^{ t }*W*_{ h }). If we are solving the unregularized version of this problem, we simply set all of the elements of *S* to zero.

To avoid negative network weights, if after solving for **α** we find *α*_{ h } < 0 for any *h* > 0, we set *α*_{ h } = 0, remove Ω_{ h } from Ω, and redo the ridge regression. If through this procedure *α*_{ h } = 0 for all *h* > 0, we set *α*_{ h } = 1/*d* for all *h* > 0.
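A sketch of the ridge solution together with the non-negativity procedure described here; the function name and the convention that column 0 of Ω is the bias column are our own assumptions:

```python
import numpy as np

def fit_weights(omega, t, alpha_prior, S):
    """alpha = (Omega^T Omega + S)^{-1} (Omega^T t + S alpha_prior).
    Negative network weights (columns h > 0; column 0 is the bias) are
    zeroed, their columns removed, and the regression redone."""
    active = list(range(omega.shape[1]))
    while True:
        O = omega[:, active]
        Ss = S[np.ix_(active, active)]
        a = np.linalg.solve(O.T @ O + Ss, O.T @ t + Ss @ alpha_prior[active])
        bad = [k for k, h in enumerate(active) if h > 0 and a[k] < 0]
        if not bad:
            break
        active = [h for k, h in enumerate(active) if k not in bad]
    alpha = np.zeros(omega.shape[1])
    alpha[active] = a
    if not np.any(alpha[1:] > 0):                # all network weights removed
        alpha[1:] = 1.0 / (omega.shape[1] - 1)   # fall back to equal weights
    return alpha
```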

The time complexity of this algorithm scales at worst quadratically in *n*. Computing equation 3 requires O(*n* × *n*^{+} × *d*) time to calculate Ω^{ t }Ω and the matrix inversion can be done in O(*d*^{3}) time.

### The TSS algorithm

We downloaded the TSS algorithm from [33]. There are three parameters to be set: *C* is the constraint that enforces an upper-bound on the sum of the network weights (that is, $\sum _{k}{\alpha}_{k}\le C$), *C*_{0} enforces an upper-bound on individual network weights (that is, 0 ≤ *α*_{ k }≤ *C*_{0}), and *const* is a parameter to adjust for class imbalances. In the code provided by [33], the parameters *C* and *C*_{0} have been previously optimized for function prediction in the 13 MIPS categories with the five yeast network input. For the 400 GO category yeast benchmark, we set these parameters to their default values (that is, *C*_{0} = 0.4).

### bioPIXIE probabilistic graph search algorithm

Given a seed list of positive genes, the probabilistic graph search algorithm adds two sets of genes to it. The first set contains the *n*_{1} genes with the highest total association weight to the seed list and the second set contains the *n*_{2} genes with highest total association weights to the first set. In [14], the authors set *n*_{1} to be between 10 and 20 and *n*_{2} to be 40 - *n*_{1}. In our implementation of [14], we set both *n*_{1} and *n*_{2} to 20. We chose these two parameters empirically, by comparing several settings in the suggested range. Our reported results are robust to slight variation of these two parameters.
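A sketch of this two-stage expansion; the function and parameter names and the tie-breaking order are our own, and the published implementation may differ in detail:

```python
import numpy as np

def pgs(W, seed, n1=20, n2=20):
    """Two-stage probabilistic graph search expansion: add the n1 non-seed
    genes with the highest total association weight to the seed list, then
    the n2 remaining genes with the highest total weight to that first set."""
    W = np.asarray(W, dtype=float)
    seed = list(seed)

    def top(to_set, exclude, k):
        scores = W[:, to_set].sum(axis=1)            # total weight to the set
        order = [g for g in np.argsort(-scores) if g not in exclude]
        return order[:k]

    first = top(seed, set(seed), n1)
    second = top(first, set(seed) | set(first), n2)
    return first, second
```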

## Additional data files

## Notes

### Acknowledgements

SM, DWF, and DR gratefully acknowledge the support from NSERC USRAs and an operating grant assigned to QM. CG and SM were partially supported by a Genome Canada grant administered by the Ontario Genomics Institute.

This article has been published as part of *Genome Biology* Volume 9 Supplement 1, 2008: Quantitative inference of gene function from diverse large-scale datasets. The full contents of the supplement are available online at http://genomebiology.com/supplements/9/S1

## Supplementary material

### References

- 1.Stuart JM, Segal E, Koller D, Kim SK: A gene-coexpression network for global discovery of conserved genetic modules. Science. 2003, 302: 249-255.PubMedCrossRefGoogle Scholar
- 2.Zhang LV, King OD, Wong SL, Goldberg DS, Tong AH, Lesage G, Andrews B, Bussey H, Boone C, Roth FP: Motifs, themes and thematic maps of an integrated
*Saccharomyces cerevisiae*interaction network. J Biol. 2005, 4: 6-PubMedPubMedCentralCrossRefGoogle Scholar - 3.Giaever G, Shoemaker DD, Jones TW, Liang H, Winzeler EA, Astromoff A, Davis RW: Genomic profiling of drug sensitivities via induced haploinsufficiency. Nat Genet. 1999, 21: 278-283.PubMedCrossRefGoogle Scholar
- 4.Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg JM: A comprehensive analysis of protein-protein interactions in
*Saccharomyces cerevisiae*. Nature. 2000, 403: 623-627.PubMedCrossRefGoogle Scholar - 5.von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale data sets of protein-protein interactions. Nature. 2002, 417: 399-403.PubMedCrossRefGoogle Scholar
- 6.Polacco BJ, Babbitt PC: Automated discovery of three-dimensional motifs for protein function annotation. Bioinformatics. 2006, 22: 723-730.PubMedCrossRefGoogle Scholar
- 7.Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA. 1999, 96: 4285-4288.PubMedPubMedCentralCrossRefGoogle Scholar
- 8.Hegyi H, Gerstein M: The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J Mol Biol. 1999, 288: 147-164.PubMedCrossRefGoogle Scholar
- 9.Lanckriet GR, Deng M, Cristianini N, Jordan MI, Noble WS: Kernel-based data fusion and its application to protein function prediction in yeast. Pac Symp Biocomput. 2004, 300-311.Google Scholar
- 10.Walker MG, Volkmuth W, Sprinzak E, Hodgson D, Klingler T: Prediction of gene function by genome-scale expression analysis: prostate cancer-associated genes. Genome Res. 1999, 9: 1198-1203.PubMedPubMedCentralCrossRefGoogle Scholar
- 11.Wu LF, Hughes TR, Davierwala AP, Robinson MD, Stoughton R, Altschuler SJ: Large-scale prediction of
*Saccharomyces cerevisiae*gene function using overlapping transcriptional clusters. Nat Genet. 2002, 31: 255-265.PubMedCrossRefGoogle Scholar - 12.Zhang W, Morris QD, Chang R, Shai O, Bakowski MA, Mitsakakis N, Mohammad N, Robinson MD, Zirngibl R, Somogyi E, Laurin N, Eftekharpour E, Sat E, Grigull J, Pan Q, Peng WT, Krogan N, Greenblatt J, Fehlings M, Kooy van der D, Aubin J, Bruneau BG, Rossant J, Blencowe BJ, Frey BJ, Hughes TR: The functional landscape of mouse gene expression. J Biol. 2004, 3: 21-PubMedPubMedCentralCrossRefGoogle Scholar
- 13.Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D: A combined algorithm for genome-wide prediction of protein function. Nature. 1999, 402: 83-86.PubMedCrossRefGoogle Scholar
- 14.Myers CL, Robson D, Wible A, Hibbs MA, Chiriac C, Theesfeld CL, Dolinski K, Troyanskaya OG: Discovery of biological networks from diverse functional genomic data. Genome Biol. 2005, 6: R114-PubMedPubMedCentralCrossRefGoogle Scholar
- 15.Peña-Castillo L, Tasan M, Myers CL, Lee H, Joshi T, Zhang C, Guan Y, Leone M, Pagnani A, Kyu Kim W, Krumpelman C, Tian W, Obozinski G, Qi Y, Mostafavi S, Ning Lin G, Berriz GF, Gibbons FD, Lanckriet G, Qiu J, Grant C, Barutcuoglu Z, Hill DP, Warde-Farley D, Grouios C, Ray D, Blake JA, Deng M, Jordan MI, Noble WS: A critical assessment of
*Mus musculus*gene function prediction using integrated genomic evidence. Genome Biol. 2008, 9 (Suppl 1): S2-PubMedPubMedCentralCrossRefGoogle Scholar - 16.Sharan R, Ulitsky I, Shamir R: Network-based prediction of protein function. Mol Syst Biol. 2007, 3: 88-PubMedPubMedCentralCrossRefGoogle Scholar
- 17.Tsuda K, Shin H, Scholkopf B: Fast protein classification with multiple networks. Bioinformatics. 2005, 21 (Suppl 2): ii59-i65.PubMedCrossRefGoogle Scholar
- 18.Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29.PubMedPubMedCentralCrossRefGoogle Scholar
- 19.von Mering C, Jensen LJ, Kuhn M, Chaffron S, Doerks T, Kruger B, Snel B, Bork P: STRING 7 - recent developments in the integration and prediction of protein interactions. Nucleic Acids Res. 2007, D358-D362. 35 DatabaseGoogle Scholar
- 20.Yona G, Dirks W, Rahman S, Lin DM: Effective similarity measures for expression profiles. Bioinformatics. 2006, 22: 1616-1622.PubMedCrossRefGoogle Scholar
- 21.Wong SL, Zhang LV, Roth FP: Discovering functional relationships: biochemistry versus genetics. Trends Genet. 2005, 21: 424-427.PubMedCrossRefGoogle Scholar
- 22.Zhou D, Bousuet O, Lal T, Weston J, Schoelkopf B: Learning with local and global consistency. Neural Information Processing Systems. 2003, [http://books.nips.cc/papers/files/nips16/NIPS2003_AA41.pdf]Google Scholar
- 23.Zhu X, Ghahramani Z, Lafferty J: Semi-supervised learning using Gaussian fields and harmonic functions. Proceedings of the Twentieth International Conference on Machine Learning: August 21-24, 2003; Washington, DC, USA. Edited by: Fawcett T, Mishra N. 2003, Menlo Park, CA:AAAI Press, 912-919.Google Scholar
- 24.Vazquez A, Flammini A, Maritan A, Vespignani A: Global protein function prediction from protein-protein interaction networks. Nat Biotechnol. 2003, 21: 697-700.PubMedCrossRefGoogle Scholar
- 25.Nabieva E, Jim K, Agarwal A, Chazelle B, Singh M: Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics. 2005, 21 (Suppl 1): i302-310.PubMedCrossRefGoogle Scholar
- 26.Deng M, Chen T, Sun F: An integrated probabilistic model for functional prediction of proteins. J Comput Biol. 2004, 11: 463-475.PubMedCrossRefGoogle Scholar
- 27.Brown MP, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M, Haussler D: Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci USA. 2000, 97: 262-267.PubMedPubMedCentralCrossRefGoogle Scholar
- 28.Pavlidis P, Weston J, Cai J, Noble WS: Learning gene functional classifications from multiple data types. J Comput Biol. 2002, 9: 401-411.PubMedCrossRefGoogle Scholar
- 29.Golub GH, Van Loan CF: Matrix Computations. 1996, Baltimore, MD: Johns Hopkins University Press, 1: 3Google Scholar
- 30.Lewis DP, Jebara T, Noble WS: Support vector machine learning from heterogeneous data: an empirical analysis using protein sequence and structure. Bioinformatics. 2006, 22: 2753-2760.PubMedCrossRefGoogle Scholar
- 31.Hastie T, Tibshirani R, Friedman JH: The Elements of Statistical Learning: Data Mining, Inference, and Predictions. 2001, Berlin/Heidelberg: Springer-VerlagCrossRefGoogle Scholar
- 32.King OD, Foulger RE, Dwight SS, White JV, Roth FP: Predicting gene function from patterns of annotation. Genome Res. 2003, 13: 896-904.PubMedPubMedCentralCrossRefGoogle Scholar
- 33.ECCB05 paper supplement. [http://www.kyb.tuebingen.mpg.de/bs/people/tsuda/eccb05.html]
- 34.GeneMANIA prediction server prototype. [http://morrislab.med.utoronto.ca/prototype]
- 35.Zhang T, Popescul A, Dom B: Linear prediction model with graph regularization for web-page categorization. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: August 20-23, 2006. 2006, Philadelphia, PA, USA. New York, NY: ACM, 821-826.CrossRefGoogle Scholar
- 36.bioPixie. [http://pixie.princeton.edu/pixie/]
- 37.The functional landscape of mouse gene expression. [http://hugheslab.ccbr.utoronto.ca/supplementary-data/Zhang/]
- 38.Gribskov M, Robinson NL: Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput Chem. 1996, 20: 25-33.PubMedCrossRefGoogle Scholar
- 39.Cristianini N, Elisseef A, Shawe-Taylor J, Kndola J: On kernel-target alignment. Proceedings of Neural Information Processing Systems. 2001, [http://books.nips.cc/papers/files/nips14/LT17.pdf]Google Scholar

## Copyright information

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.