GIBA: a clustering tool for detecting protein complexes
In recent years, high-throughput experimental methods have been developed that generate large datasets of protein-protein interactions (PPIs). However, owing to the experimental methodologies, these datasets contain errors, mainly false positives, which reduce the quality of any derived information.
Typically these datasets can be modeled as graphs, where vertices represent proteins and edges the pairwise PPIs, making it easy to apply automated clustering methods to detect protein complexes or other biologically significant functional groupings.
In this paper, a clustering tool called GIBA (named after the first characters of its developers' nicknames) is presented. GIBA applies a two-step procedure to a given dataset of protein-protein interaction data. First, a clustering algorithm is applied to the interaction data; a filtering step then generates the final candidate list of predicted complexes.
The efficiency of GIBA is demonstrated through the analysis of 6 different yeast protein interaction datasets in comparison with four other available algorithms. We compared the results of the different methods using five different performance metrics.
Moreover, the parameters of the methods that constitute the filter were examined to see how they affect the final results.
GIBA is an effective and easy-to-use tool for the detection of protein complexes in experimentally measured protein-protein interaction networks. The results show that GIBA achieves higher prediction accuracy than previously published methods.
Keywords: Protein Interaction Network; Geometrical Accuracy; Tandem Affinity Purification; Detect Protein Complex; Protein Interaction Dataset
Proteomic data, and PPI data in particular, are of great scientific interest because of their connection with important cellular functions such as extracellular and intracellular signaling and cell communication. Moreover, multi-protein complexes give insight into the functional and topological organization of protein networks. In recent years, new high-throughput methods for identifying pairwise PPIs have been developed that generate enormous datasets. Different methods record different kinds of protein interactions, which is why the datasets they generate differ. The most popular methods are yeast two-hybrid systems, mass spectrometry, tandem affinity purification, microarrays and phage display.
Each method has its strengths and weaknesses; however, every method has a certain error rate in detecting protein-protein interactions. The main errors are under-prediction (false negatives) and over-prediction (false positives) of protein interactions. Moreover, the real "truth" in these datasets is currently unknown, because most protein complexes have not yet been determined experimentally.
Usually, the set of PPIs of an organism is modeled as an undirected graph G = (V, E), where the nodes (V) represent the proteins and the edges (E) the pairwise PPIs. The graph model makes it easy to apply computational methods from graph theory to these noisy datasets in order to extract functional modules such as protein complexes. The goal of these approaches is to detect highly connected subgraphs, which are protein complex candidates.
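The graph model can be illustrated with a minimal Python sketch. The function name `build_ppi_graph` and the protein labels are hypothetical, and the adjacency-set representation is just one convenient choice, not necessarily the one used inside GIBA.

```python
from collections import defaultdict


def build_ppi_graph(interactions):
    """Build an undirected graph G = (V, E) as a dict of adjacency sets."""
    adjacency = defaultdict(set)
    for protein_a, protein_b in interactions:
        if protein_a == protein_b:
            continue  # ignore self-interactions (loop edges)
        adjacency[protein_a].add(protein_b)
        adjacency[protein_b].add(protein_a)
    return dict(adjacency)


# Hypothetical toy interactions; the protein names are placeholders.
pairs = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P3", "P4")]
graph = build_ppi_graph(pairs)
num_vertices = len(graph)
# Every undirected edge is stored twice, once per endpoint.
num_edges = sum(len(neighbours) for neighbours in graph.values()) // 2
```

Here P1, P2 and P3 form a fully connected subgraph, the kind of structure that clustering methods report as a complex candidate.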
The available algorithms rely on very different approaches. The best known is the Molecular Complex Detection algorithm (Mcode). Another algorithm, noted for its efficiency, is MCL (Markov Clustering). King et al. suggested the RNSC algorithm, a cost-based local search algorithm loosely based on the tabu search metaheuristic. Another local search algorithm is the Local Clique Merging Algorithm (LCMA), which first locates cliques in a graph and then tries to expand them. Two algorithms that take a hierarchical approach are the Highly Connected Subgraph method (HCS) and the SideS algorithm. Both repeatedly apply graph minimum cuts until the stopping criterion of the respective algorithm is satisfied.
In this paper, we present a new clustering tool called GIBA that detects important protein modules such as protein complexes. GIBA implements a two-step strategy: in the first step the whole protein-protein interaction graph is divided into clusters, and in the second step these clusters are filtered so that only the ones considered important are kept. Extensive experiments were performed on 6 different yeast datasets, which are derived either from individual experiments (Tong, Krogan and Gavin [1, 17]) or from online databases (DIP and MIPS). These datasets vary in the number of proteins as well as the number of interactions, forming either sparse (Tong dataset) or relatively dense (MIPS and DIP datasets) graphs. Moreover, using the recorded yeast protein complexes of the MIPS database, we compared the results obtained from GIBA with those of 4 other algorithms (Mcode, HCS, SideS and RNSC) and examined the results using 5 different metrics. With appropriate combinations of clustering algorithms and filtering methods, GIBA outperformed the other methods. The experiments and their results are presented in detail in the Results and Discussion section. Finally, the filter methods were evaluated to test how they affect the final results and to determine, as accurately as possible, the most effective set of filter parameters.
The remainder of the paper is organized as follows. In the next section, we present the algorithms and the filter methods that are hosted in the GIBA tool. The Methods section presents the properties of GIBA and the evaluation procedure. The Results and Discussion section reports extensive experiments on datasets with different properties, and also discusses the parameters and methods that compose the filter of the GIBA tool and how they affect the final results. Finally, we present the conclusions of our work and suggest the main directions for future work.
To identify protein complexes accurately in a given protein-protein interaction network, we built a workflow consisting of a two-step procedure. In the first step, the protein-protein interaction network is clustered by the MCL or the RNSC algorithm; in the second step, the results are filtered using one or a combination of 4 different methods: a) density, b) haircut operation, c) best neighbour and d) cutting edge. This two-step approach keeps only those clusters that have a high probability of being real biological complexes. A real biological complex can be defined as a set of proteins that are commonly involved in a biological process. A brief description of the algorithms of the first step (MCL and RNSC) and of the methods used in the filtering process is given below.
Description of the MCL algorithm
The MCL algorithm is a fast and scalable unsupervised clustering algorithm based on the simulation of stochastic flow in graphs. It detects cluster structures in graphs by a mathematical bootstrapping procedure. The process deterministically computes the probabilities of random walks through a graph and uses two operators that transform one set of probabilities into another. It does so using the language of stochastic matrices (also called Markov matrices), which capture the mathematical concept of random walks on a graph.
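The flow simulation can be sketched in a few lines of pure Python. The text above does not name the two operators; this sketch uses expansion (matrix squaring) and inflation (entry-wise powering followed by column renormalization), as MCL is usually described. The added self-loops, the fixed iteration count, the `eps` cut-off and all function names are assumptions made for illustration, not details of the implementation hosted in GIBA.

```python
def normalize_columns(matrix):
    """Rescale each column to sum to 1, giving a column-stochastic matrix."""
    size = len(matrix)
    totals = [sum(matrix[r][c] for r in range(size)) for c in range(size)]
    return [[matrix[r][c] / totals[c] if totals[c] else 0.0
             for c in range(size)] for r in range(size)]


def expand(matrix):
    """Expansion: square the matrix, i.e. take two random-walk steps."""
    size = len(matrix)
    return [[sum(matrix[r][k] * matrix[k][c] for k in range(size))
             for c in range(size)] for r in range(size)]


def inflate(matrix, power):
    """Inflation: raise entries to `power` and renormalize the columns."""
    return normalize_columns([[value ** power for value in row]
                              for row in matrix])


def markov_cluster(adjacency, inflation=2.0, iterations=30, eps=1e-6):
    """Alternate expansion and inflation, then read clusters off the rows."""
    size = len(adjacency)
    # Assumption: self-loops are added so every column has non-zero mass.
    matrix = [[float(adjacency[r][c]) + (1.0 if r == c else 0.0)
               for c in range(size)] for r in range(size)]
    matrix = normalize_columns(matrix)
    for _ in range(iterations):
        matrix = inflate(expand(matrix), inflation)
    # Each row that still carries flow lists the members of one cluster.
    clusters = {frozenset(c for c in range(size) if row[c] > eps)
                for row in matrix}
    clusters.discard(frozenset())
    return clusters


# Toy network: two separate triangles, {0, 1, 2} and {3, 4, 5}.
triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = markov_cluster(triangles)
```

A larger inflation value generally yields finer-grained clusters, which is why the parameter is exposed rather than fixed.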
Description of the RNSC algorithm
The RNSC algorithm searches for a low-cost clustering by first composing an initial random clustering and then iteratively moving one node from one cluster to another in a randomized fashion to improve the clustering cost. To avoid local minima, RNSC makes diversification moves and performs multiple experiments. Furthermore, it maintains a tabu list that prevents cycling back to a previously explored partitioning. Because of the randomness of the algorithm, different runs on the same input data produce different outputs.
Description of the cluster density method
The density of a subgraph is calculated as

$$D = \frac{2|E|}{|V|\,(|V| - 1)}$$

where |E| is the number of edges and |V| the number of vertices of the subgraph.
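A minimal sketch of a density computation follows. It assumes the usual graph-theoretic definition, the ratio of existing edges to the maximum possible number of edges among |V| vertices, which yields values between 0 and 1. The function name is hypothetical.

```python
def cluster_density(num_edges, num_vertices):
    """Density of a subgraph: 2|E| / (|V| (|V| - 1)), in the range [0, 1]."""
    if num_vertices < 2:
        return 0.0  # a single vertex has no possible edges
    return 2.0 * num_edges / (num_vertices * (num_vertices - 1))


triangle_density = cluster_density(3, 3)   # fully connected: all 3 of 3 edges
sparse_density = cluster_density(3, 4)     # 3 of the 6 possible edges
```

A filter based on this value keeps a cluster only when its density reaches the user-defined threshold.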
Description of the haircut operation method
The haircut operation detects vertices with a low degree of connectivity and excludes them from the potential cluster they belong to. The lower the connectivity of a node, the lower the probability that it belongs to a protein complex. Deleting such nodes, which add noise to the cluster, therefore leads to protein complexes that are more likely to be present in nature.
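The haircut operation can be sketched as follows. The sketch assumes that the haircut parameter is a minimum number of connections inside the cluster and that removal is repeated until no low-degree vertex remains; both points, as well as the function name, are assumptions rather than details confirmed by the text.

```python
def haircut(cluster, adjacency, min_degree=2):
    """Drop members with fewer than `min_degree` neighbours inside the cluster."""
    kept = set(cluster)
    changed = True
    while changed:  # removing one vertex can lower the degree of another
        changed = False
        for node in list(kept):
            inside_degree = len(adjacency.get(node, set()) & kept)
            if inside_degree < min_degree:
                kept.discard(node)
                changed = True
    return kept


# Hypothetical cluster: a triangle A-B-C with a dangling vertex D attached to C.
adjacency = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
trimmed = haircut({"A", "B", "C", "D"}, adjacency, min_degree=2)
```

In this toy cluster only D has a single connection, so it is the only vertex removed.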
Description of the best neighbour method
The best neighbor method is best suited to detecting larger protein complexes, which offer extra information about the protein complexes included in a protein interaction dataset. Another advantage of the best neighbor method is that a protein can be assigned to more than one protein complex, since protein complexes are known to share components.
Description of the cutting edge method
The cutting edge metric of a cluster is calculated as

$$\frac{|\text{inside edges}|}{|\text{total edges}|}$$

where |inside edges| is the number of edges inside a cluster and |total edges| is the number of edges that are adjacent to at least one vertex of the cluster. Clusters whose cutting edge metric is below a user-defined threshold are discarded by the filter.
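A minimal sketch of the cutting edge metric, assuming it is the ratio of |inside edges| to |total edges| as the two quantities are defined above. The function name and the toy network are hypothetical.

```python
def cutting_edge(cluster, adjacency):
    """Ratio of edges inside the cluster to all edges touching the cluster."""
    members = set(cluster)
    inside_edges = set()
    total_edges = set()
    for node in members:
        for neighbour in adjacency.get(node, ()):
            edge = frozenset((node, neighbour))  # undirected: order is irrelevant
            total_edges.add(edge)
            if neighbour in members:
                inside_edges.add(edge)
    if not total_edges:
        return 0.0
    return len(inside_edges) / len(total_edges)


# Hypothetical network: a triangle A-B-C plus one edge leaving the cluster (C-D).
adjacency = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
ratio = cutting_edge({"A", "B", "C"}, adjacency)  # 3 inside edges of 4 in total
```

A value close to 1 means the cluster is well separated from the rest of the network.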
To test the efficiency of GIBA, we have compared it with 4 other algorithmic methods: Mcode, HCS, SideS and the RNSC algorithm as previously published. The benchmark used to evaluate the algorithms consists of known yeast protein complexes retrieved from the MIPS database. MIPS protein complexes composed of smaller ones that are also recorded in the MIPS database were removed to avoid redundancy. The final evaluation dataset comprises 220 complexes.
In addition to the collection of MIPS protein complexes, we have used a previously adopted evaluation metric called the geometric similarity index. This method considers a predicted complex valid if its geometric similarity index,

$$\frac{I^2}{A \times B},$$

reaches a fixed threshold, where I is the number of common proteins, A the number of proteins in the predicted complex and B the number of proteins in the recorded complex. In our measurements, we have calculated the mean geometric similarity index of the valid predicted complexes, called the mean score.
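The geometric similarity index and the mean score can be sketched as below, assuming the index is I² / (A × B) and that each predicted complex is scored against its best-matching recorded complex. The threshold is left as a required argument because its value is not stated here; the 0.25 used in the example is an arbitrary placeholder, and all function names are hypothetical.

```python
def geometric_similarity(predicted, recorded):
    """I^2 / (A * B) for a predicted and a recorded complex."""
    predicted, recorded = set(predicted), set(recorded)
    if not predicted or not recorded:
        return 0.0
    common = len(predicted & recorded)
    return common ** 2 / (len(predicted) * len(recorded))


def mean_score(predicted_complexes, recorded_complexes, threshold):
    """Number of valid predictions and their mean best similarity index."""
    valid_scores = []
    for predicted in predicted_complexes:
        best = max((geometric_similarity(predicted, recorded)
                    for recorded in recorded_complexes), default=0.0)
        if best >= threshold:
            valid_scores.append(best)
    if not valid_scores:
        return 0, 0.0
    return len(valid_scores), sum(valid_scores) / len(valid_scores)


# Hypothetical complexes; 0.25 is a placeholder threshold, not the paper's value.
predicted = [{"A", "B", "C"}, {"X", "Y"}]
recorded = [{"A", "B", "C", "D"}, {"E", "F"}]
num_valid, score = mean_score(predicted, recorded, threshold=0.25)
```

The first prediction shares 3 proteins with a recorded complex of size 4, giving 9 / 12 = 0.75; the second shares none and is not counted as valid.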
Furthermore, three matching statistic metrics from previous work were used in the evaluation of the algorithms tested: sensitivity (Sn), positive predictive value (PPV) and geometrical accuracy (Acc_g). These metrics are typically used to measure the correspondence between the result of a classification and a reference. Sensitivity is defined as the fraction of the proteins of a complex recorded in the MIPS database that are found in a cluster. Positive predictive value is the proportion of the members of a cluster that belong to a recorded complex, relative to the total number of members of that cluster found in all recorded complexes. The geometrical accuracy is the geometric mean of the sensitivity and the positive predictive value. Its advantage is that it gives a more "objective" picture of the quality of the results, because it takes high values only if both sensitivity and positive predictive value are high.
To demonstrate the use of our methodology, we have used six datasets derived from various small-scale and high-throughput methods. The multifaceted nature of the datasets enables a more "objective" comparison of the algorithms tested. In this section, we give a short description of the datasets used.
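The three statistics can be sketched from a contingency table whose entry T[i][j] counts the proteins shared by recorded complex i and cluster j. How the per-complex and per-cluster values are aggregated is not spelled out in this section; the sketch assumes a common scheme in which the best match is taken for each complex (for Sn) and each cluster (for PPV) and the results are size-weighted. The function name is hypothetical.

```python
from math import sqrt


def matching_statistics(complexes, clusters):
    """Return (Sn, PPV, Acc_g) for recorded complexes versus predicted clusters."""
    complexes = [set(c) for c in complexes]
    clusters = [set(k) for k in clusters]
    # table[i][j]: number of proteins shared by complex i and cluster j
    table = [[len(c & k) for k in clusters] for c in complexes]

    # Sensitivity: best-covered fraction of each complex, weighted by complex size.
    complex_total = sum(len(c) for c in complexes)
    best_per_complex = sum(max(row, default=0) for row in table)
    sensitivity = best_per_complex / complex_total if complex_total else 0.0

    # PPV: best share of each cluster, relative to its members found in any complex.
    columns = list(zip(*table))
    cluster_total = sum(sum(column) for column in columns)
    best_per_cluster = sum(max(column, default=0) for column in columns)
    ppv = best_per_cluster / cluster_total if cluster_total else 0.0

    # Geometrical accuracy: geometric mean of Sn and PPV.
    return sensitivity, ppv, sqrt(sensitivity * ppv)


# Hypothetical example with two recorded complexes and two predicted clusters.
recorded = [{"A", "B", "C", "D"}, {"E", "F"}]
predicted = [{"A", "B", "C"}, {"D", "E", "F"}]
sn, ppv, acc_g = matching_statistics(recorded, predicted)
```

Because Acc_g is a geometric mean, a method cannot score well by maximizing only one of the two components, for example by returning a single giant cluster.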
The Tong dataset is a network of 7430 edges and 2262 vertices. This genetic interaction network was mapped by crossing mutations in several genes into a set of viable yeast gene deletion mutants and scoring the double mutant progeny for fitness defects. The interactions in this network often bring together functionally related genes, or components that belong to the same pathway, which allows the functions of the interacting elements to be predicted. The genetic network exhibited dense local neighbourhoods; our method aims to go one step further by predicting these neighbourhoods and also splitting them into smaller groups that are functionally more significant.
The Krogan dataset consists of 7088 edges and 2675 vertices and contains different tagged proteins of the yeast Saccharomyces cerevisiae. In a previous analysis, the MCL algorithm was used to cluster and organize the proteins into several groups, about half of which were absent from the MIPS database. We observed that these data contain a small amount of noise, and we have therefore applied our method to filter the groups detected by MCL.
For the Gavin datasets, we have used two networks, the first consisting of 3210 edges and 1352 vertices and the second consisting of 6531 edges and 1430 vertices [1, 17]. In the first dataset, large-scale tandem affinity purification and mass spectrometry were used to characterize multiprotein complexes in Saccharomyces cerevisiae. Extended to the human genome, this dataset provides an outline of the eukaryotic proteome as a network of protein complexes. Using the whole network, we examine how successfully our method isolates the network complexes. The second dataset comes from the first genome-wide screen for complexes in yeast.
The Database of Interacting Proteins (DIP) documents experimentally determined protein-protein interactions. We have used this database to isolate a network consisting of 17491 edges and 4934 vertices. One of the reasons for including this source in our experiments is that, beyond cataloging details of protein-protein interactions, the DIP database helps us understand not only protein functions but also the value of protein-protein relationships. The DIP dataset version used in our experiments was that of 04/03/2007.
The Munich Information Center for Protein Sequences (MIPS) provides resources mainly related to genome information. Most of its databases, which contain information about the genomes of a variety of organisms, are manually curated. In addition, 400 automatically annotated genomes are included. One of the aims of this database is to provide information related to interactions such as PPIs. In this study, we have isolated a network consisting of 12526 edges and 4554 vertices from the MIPS database. The MIPS dataset used in our experiments was created on 05/18/2006.
The GIBA tool is a Java application, while RNSC, MCL and the methods used in the filtering process are implemented in C. Three of the four algorithms used in our experiments (SideS, RNSC and HCS) were also implemented in C. The Mcode algorithm is implemented as a Java plugin for Cytoscape. All the experiments were performed on an Intel Double Core 2.13 GHz processor with 2 GB of RAM, running Microsoft Windows XP. Loop edges were not taken into account.
The filter used for the results of the RNSC algorithm was composed of two of its three previously presented parameters (size and density). We did not use the third parameter (functional homogeneity) because this kind of information was not available for all datasets, and because using it would have biased the comparison with the other algorithms, which do not use this kind of information. The SideS and HCS algorithms do not take any parameters, whereas for Mcode and MCL we used the previously defined optimal parameters for accuracy.
Results and discussion
The GIBA tool
The workflow of the tool is straightforward. First, the user loads a tab-delimited file that contains a simple weighted list of the protein-protein interactions. Then, the user chooses either the MCL or the RNSC algorithm and defines its parameters to cluster the protein interaction network. The parameters of each algorithm are initially set to default values that were previously reported to offer the maximum accuracy. In the third step, the user chooses which methods will constitute the filter and defines the necessary parameters. Depending on the selected algorithm and methods, the corresponding parameters are set to an active state, while all the others are set to an inactive state. Moreover, pop-up error messages inform the user of potentially wrong parameter values. Finally, the user presses the "Run Workflow" button to start the clustering process. The Help Panel of GIBA also provides explanations of the incorporated algorithms and their parameters.
After a successful run, GIBA generates various outputs. The "File loaded" tag shows the contents of the input file, that is, the protein-protein interactions. The proteins that constitute the clusters derived from the first clustering step (the MCL or RNSC results) are shown in the "Clustering results" tag, while the interactions within each cluster are presented in the "Intermediate results" tag. Every file is stored locally on the hard disk, so the user can reuse the intermediate results and skip the time-consuming run of the MCL or RNSC algorithm. The final results, after the filtering, are presented in the "Final results" tag, which shows the number and the labels of the proteins that constitute the final clusters. In addition, the number of interactions in each final cluster is presented. This file is also stored on the local hard disk drive.
Screenshots of GIBA in use, as well as information about the GIBA algorithms and methods, are given in Additional File 1.
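A minimal sketch of reading such an input file follows. The column layout (protein A, protein B, optional weight) is an assumption, since the exact format is not specified here; the function name is hypothetical. Loop edges are dropped, in line with the experimental setup described earlier.

```python
import csv
import io


def read_interactions(handle):
    """Parse tab-delimited lines of the assumed form: protein_a, protein_b, weight."""
    edges = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        protein_a, protein_b = row[0].strip(), row[1].strip()
        # Assumption: a missing weight column means an unweighted interaction.
        weight = float(row[2]) if len(row) > 2 and row[2].strip() else 1.0
        if protein_a == protein_b:
            continue  # loop edges are not taken into account
        edges.append((protein_a, protein_b, weight))
    return edges


# Hypothetical file contents; the protein names and weights are placeholders.
sample = "P1\tP2\t0.9\nP2\tP3\t0.4\nP3\tP3\t1.0\n\n"
edges = read_interactions(io.StringIO(sample))
```

The same function accepts an open file object, for example `open(path, newline="")`, in place of the in-memory sample.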
Comparison with other algorithms
We have compared the GIBA results with those derived from 4 different algorithms: Mcode, SideS, HCS and RNSC as previously published. All the results of our experiments are presented in Additional File 2.
The methods used in the filtering process.
Density = 0.75, Haircut = 2
Cutting_Edge = 0.55, Density = 0.7, Haircut = 3
Cutting_Edge = 0.5, Density = 0.6, Haircut = 2
Cutting_Edge = 0.75, Density = 0.6, Haircut = 2, Best_neighbor = 0.6
Cutting_Edge = 0.5, Density = 0.6, Haircut = 3
Cutting_Edge = 0.5, Density = 0.7, Haircut = 2, Best_neighbor = 0.75
Analysis of GIBA filtering methods
As the results show, GIBA is sensitive to the methods used in the filtering process and to their parameters. Therefore, we have tested the possible combinations of the 4 methods that compose the filtering process to see how they affect the final results and how each method affects the others.
We have chosen a specific range of values for each method parameter:
for the density parameter: [ 0.55, 0.8 ]
for the haircut operation parameter: 2 or 3
for the best neighbor parameter: [ 0.6, 0.75 ] and
for the cutting edge parameter: [ 0.6, 0.75 ].
Choosing a parameter value outside the proposed range would be meaningless: a value above the proposed maximum would make the method too rigorous, so that it would produce very few clusters, whereas a value below the proposed minimum would add noise to the final data.
We have examined all possible combinations, using a parameter step of 0.05, on three datasets with different properties: two online database datasets (MIPS and DIP) and a dataset from an individual experiment (Gavin_2006). This gives 6 × 2 × 4 × 4 = 192 different filter combinations for each dataset.
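The number of filter combinations can be checked with a short sketch. It assumes a step of 0.05 for the three real-valued parameters, which is the step size that reproduces the reported count of 192 for the ranges listed above; the helper name `value_range` is hypothetical.

```python
from itertools import product


def value_range(start, stop, step):
    """Inclusive list of values from start to stop, rounded to 2 decimals."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + index * step, 2) for index in range(count)]


density_values = value_range(0.55, 0.80, 0.05)        # 6 values
haircut_values = [2, 3]                               # 2 values
best_neighbor_values = value_range(0.60, 0.75, 0.05)  # 4 values
cutting_edge_values = value_range(0.60, 0.75, 0.05)   # 4 values

combinations = list(product(density_values, haircut_values,
                            best_neighbor_values, cutting_edge_values))
num_combinations = len(combinations)  # 6 * 2 * 4 * 4
```

Each tuple in `combinations` is one filter configuration that would be run on each of the three datasets.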
Conclusion and future work
In this paper, we have introduced the GIBA tool for identifying protein complexes in pairwise protein-protein interaction datasets. The GIBA workflow is a two-step process: first it clusters the whole input protein network, and then it applies a filtering process to obtain the final clusters. In addition, GIBA is user friendly and provides intermediate results for every step, which can be useful for further analysis. Our experiments demonstrated the efficiency of GIBA compared with 4 other methods.
The main focus of our future work will be the application of machine learning methods to determine how the properties of the initial protein-protein interaction dataset can be exploited by the filtering process to achieve better results. This could lead to the development of a new algorithmic approach that adapts its behavior to the initial protein network.
Availability and requirements
Project name: GIBA: A clustering tool for detecting protein complexes
Project home page: http://www.bioacademy.gr/bioinformatics/projects/GIBA
Operating system(s): Windows.
Programming language: Java and C.
License: GNU GPL.
Any restrictions to use by non-academics: No.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 6, 2009: European Molecular Biology Network (EMBnet) Conference 2008: 20th Anniversary Celebration. Leading applications and technologies in bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S6.
19. Mewes HW, Frishman D, Mayer KF, Munsterkotter M, Noubibou O, Pagel P, Rattei T, Oesterheld M, Ruepp A, Stumpflen V: MIPS: analysis and annotation of proteins from whole genomes in 2005. Nucleic Acids Res 2006, 34(Database issue):D169–172. 10.1093/nar/gkj148
20. Moschopoulos CN, Pavlopoulos GA, Likothanassis SD, Kossida S: An enhanced Markov clustering method for detecting protein complexes. 8th IEEE International Conference on BioInformatics and BioEngineering (BIBE 2008): 8–10 October 2008; Athens 2008.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.