Phylogeny-guided interaction mapping in seven eukaryotes
The assembly of reliable and complete protein-protein interaction (PPI) maps remains one of the significant challenges in systems biology. Computational methods which integrate and prioritize interaction data can greatly aid in approaching this goal.
We developed a Bayesian inference framework which uses phylogenetic relationships to guide the integration of PPI evidence across multiple datasets and species, providing more accurate predictions. We apply our framework to reconcile seven eukaryotic interactomes: H. sapiens, M. musculus, R. norvegicus, D. melanogaster, C. elegans, S. cerevisiae and A. thaliana. Comprehensive GO-based quality assessment indicates a 5% to 44% score increase in predicted interactomes compared to the input data. Further support is provided by gold-standard MIPS, CYC2008 and HPRD datasets. We demonstrate the ability to recover known PPIs in well-characterized yeast and human complexes (26S proteasome, endosome and exosome) and suggest possible new partners interacting with the putative SWI/SNF chromatin remodeling complex in A. thaliana.
Our phylogeny-guided approach compares favorably to two standard methods for mapping PPIs across species. Detailed analysis of predictions in selected functional modules uncovers specific PPI profiles among homologous proteins, establishing interaction-based partitioning of protein families. The evidence also suggests that interactions within core complex subunits are in general more conserved, and easier to transfer accurately to other organisms, than interactions between these subunits.
Keywords: Protein Pair, Reference Dataset, Input Dataset, Chromatin Remodel Complex, Bayesian Network Model
Protein-protein interactions are essential to most cellular processes. Thus large-scale PPI networks can greatly contribute to our understanding of the cellular machinery at the systems level. Experimental techniques such as yeast two-hybrid assays [1, 2, 3, 4] and TAP-MS [5, 6] have generated large amounts of binary PPIs and protein complex data, providing the first snapshots of eukaryotic interactomes. Unfortunately, the available experimental techniques are far from perfect in terms of both accuracy and coverage. For instance, the yeast interactome has recently been estimated to contain from around 37,000 up to 75,500 protein interactions between approximately 6,000 proteins . Although over 80,000 yeast PPIs have already been reported, given the estimated false positive rates of the experiments, the yeast interactome is suggested to be roughly 50% complete . Using a more conservative definition and omitting indirect co-complex associations, the authors of  estimate the number of yeast interactions to be ~18,000 and conclude that three independent Y2H assays cover only around 20% of this amount. In the case of human, only roughly 10% of the entire interactome is estimated to be covered [7, 9]. Furthermore, many doubts and criticisms have been expressed in the literature regarding the low overlap between independent screens, which was originally attributed to a high false-positive rate of these experiments [10, 11, 12]. More recent studies (e.g. ) suggest that the low overlap can largely be explained by low sampling sensitivity and differences in assay types. Considering all these limitations, none of the existing experimental systems can provide a complete and error-proof interaction map of a complex organism within reasonable time and budget. As recently estimated, around 20 independent proteome-scale screens would be required to reliably identify each mappable interaction in a moderately-sized interactome of Drosophila melanogaster.
Simultaneously with the development of experimental techniques, computational methods for predicting PPIs have emerged [14, 15, 16]. These approaches complement experimental methods and can be used to validate noisy data, as well as to select new targets for screening experiments . Available computational techniques exploit various sources of evidence. Among them are ones based on genomic data [17, 18], protein sequences [19, 20], phylogenetic profiles , and classification-based approaches [22, 23, 24]. Other methods explore the premise that interacting proteins often co-evolve and thus similarity of phylogenetic trees can be used to infer interactions [25, 26, 27]. Approaches using maximum likelihood estimation (MLE) for inferring the probability of domain-domain interactions have also been presented. The first such analysis was performed in , where the authors used yeast PPI data to estimate the probability of domain-domain interactions, and subsequently predict the interactions between proteins. Finally, multiple data sources have been integrated in a Bayesian framework in . The last concept was further extended and applied to a wide range of heterogeneous data types from multiple species to construct comprehensive databases of functional associations [30, 31].
In this study we are specifically interested in techniques which integrate and transfer PPI evidence across species. In its simplest form, this idea is implemented in the interolog (the term interlog is also used) mapping approach , which predicts an interaction between a pair of proteins (a, b) if in another species there exists a known interaction between a pair (a', b'), where a' and b' are orthologs of a and b, respectively. The transfer of PPI evidence across species can also be achieved at the level of conserved domains. In  the authors devised a maximum likelihood method, similar to , but using data from multiple organisms. In summary, the method estimates the probability of interactions between each pair of considered domains, based on the PPI evidence from multiple species. Inferred domain-domain interactions constitute integrated evidence, which is in turn used to predict protein-protein interactions. A similar method, but using heterogeneous data sources (including protein fusion and Gene Ontology annotations), was used in . In general, combining interaction evidence from different species makes PPI predictions more robust to experimental noise. False positive observations are unlikely to be reproduced across multiple species . Furthermore, evolutionarily conserved interactions are expected to be biologically significant. Evolutionary pressures are more likely to constrain functional units such as protein complexes rather than single interactions . Hence, if an interaction has experimental support in datasets from diverse species, it is likely to be part of a significant functional module. Highly probable interactions identified in a subset of species can be transferred to other species , as was done in  to predict missing interactions within conserved protein modules.
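The interolog rule described above can be illustrated with a minimal sketch. The function name, the dictionary-based ortholog map and the toy protein identifiers are all assumptions made for illustration; this is not the InteroPORC implementation nor any published tool.

```python
from itertools import product


def interolog_predictions(source_ppis, orthologs):
    """Transfer PPIs from a source species to a target species.

    source_ppis: iterable of (a', b') pairs known to interact in the source species.
    orthologs:   dict mapping a source protein to the set of its target orthologs
                 (hypothetical input; how orthologs are called is left open).
    Returns a set of unordered target pairs, each stored as a frozenset.
    """
    predicted = set()
    for a_src, b_src in source_ppis:
        # Predict (a, b) whenever a' ~ a and b' ~ b are both ortholog pairs.
        for a, b in product(orthologs.get(a_src, ()), orthologs.get(b_src, ())):
            predicted.add(frozenset((a, b)))
    return predicted


# Toy example: yC has no ortholog, so the yB-yC interaction cannot be transferred.
source = [("yA", "yB"), ("yB", "yC")]
orth = {"yA": {"hA1", "hA2"}, "yB": {"hB"}}
preds = interolog_predictions(source, orth)
```

Note that every ortholog of a' receives the same partners here, which is exactly the lack of distinction within protein families that a phylogeny-aware model aims to address.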
We present an approach which uses protein family phylogenies to accurately map PPI evidence between homologous proteins. Contrary to previous studies [25, 26, 27], the phylogenies are not used to assess protein co-evolution, but to account for evolutionary relationships when integrating data from different organisms. Our current work builds on the previously proposed CAPPI framework for comparing PPI networks across species . CAPPI is based on a duplication and divergence model which mimics the processes by which most protein interactions are formed, i.e. by copying from ancestral interactions during protein duplication and subsequently being sustained or lost over time. Using this model we can naturally incorporate inter-dependencies between PPIs and study the available data in an evolutionary context. The only previous works based on these principles are  and  both of which concentrated on inferring ancestral states of the protein interaction network (the analysis in  was limited to a single protein family). Our current work presents the first application of the duplication and divergence model towards genome-scale inference of PPIs in extant species.
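To make the duplication and divergence idea concrete, the following is a simplified simulation of a single duplication event. It is only a sketch under stated assumptions: the retention probability `p_keep`, the function name and the toy network are hypothetical, only the new copy's edges are subject to loss, and self-interactions are ignored. The actual CAPPI model is probabilistic and defined over full protein family phylogenies.

```python
import random


def duplicate_and_diverge(edges, protein, copy_name, p_keep, rng=None):
    """Duplicate `protein` as `copy_name` and let the inherited edges diverge.

    edges:  set of frozenset({u, v}) interactions (self-interactions are skipped).
    p_keep: hypothetical probability that the copy retains an inherited interaction.
    """
    rng = rng or random.Random(0)
    new_edges = set(edges)
    for edge in edges:
        if protein in edge and len(edge) == 2:
            (partner,) = edge - {protein}
            # The copy inherits the ancestral interaction, then keeps or loses it.
            if rng.random() < p_keep:
                new_edges.add(frozenset((copy_name, partner)))
    return new_edges


ancestral = {frozenset(("A", "B")), frozenset(("A", "C")), frozenset(("B", "C"))}
all_kept = duplicate_and_diverge(ancestral, "A", "A2", p_keep=1.0)
none_kept = duplicate_and_diverge(ancestral, "A", "A2", p_keep=0.0)
```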
We use our framework to integrate and infer new PPIs in seven eukaryotes: Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae and Arabidopsis thaliana. We perform a comprehensive validation of our predictions using a GO-based functional similarity measure and assessment based on reference datasets of binary and co-complex PPIs. The obtained results demonstrate CAPPI's ability to identify a large percentage of known interactions in a blind test and provide new hypotheses for experimental verification when all known data are integrated. Our method shows a significant advantage over the standard interlog mapping approach and a maximum-likelihood domain-oriented method. We also analyze specific examples of valid PPI predictions in well-characterized complexes in yeast and human (proteasome, endosome and exosome), and show that core subcomplexes can be accurately recovered based solely on the data from the other species (i.e. without any use of the experimental data from the species of interest). Many of the between-module interactions (possibly species-specific) are harder to transfer from distant organisms. Finally, based on our predictions, we present hypotheses on new proteins interacting with the putative SWI/SNF chromatin remodeling complex in A. thaliana. Our results are freely available at http://bioputer.mimuw.edu.pl/cappi.
Results and Discussion
We consider two modes of application of our framework. The first is the integration mode, which gathers all available input data to provide a reconciled interactome view for each species. The second is the prediction mode, which predicts the interactions for each species based only on the evidence from the other species (blind test). To demonstrate the different aspects of our method and enable a straightforward comparison to the previous approaches we use different combinations of the input datasets and different reliability values, yielding the following sets of inferred interactions (for details see Additional file 1):
CAPPI-Integ: interactions for all seven species inferred using all available experimental datasets.
CAPPI-Integ-3sp: yeast, fly and worm interactions inferred based on experimental datasets from Ito et al. , Uetz et al. , Giot et al.  and Li et al. , with reliability parameters set according to .
CAPPI-Pred: interactions inferred for each species using experimental datasets only from the other six species.
We compare the results of CAPPI with the following methods:
Domain-ML: a maximum likelihood domain-oriented method . Yeast interaction predictions, based on experimental datasets of Ito, Uetz, Giot and Li, were provided by the authors.
Interlog: an interlog-based method implemented in . The program was downloaded from the InteroPORC website http://biodev.extra.cea.fr/interoporc/Default.aspx and run for each species using experimental datasets only from the other six species (same datasets as in CAPPI-Pred).
In the following sections, we investigate the performance of our method on large-scale data, as well as in small-scale experiments focused on specific functional modules.
Integration of interactions in seven eukaryotes
CAPPI-Integ provides an integrated and reconciled view of seven eukaryotic interactomes. Our ultimate goal is to provide a higher quality interactome for each input species. To assess the potential improvement, we perform two separate evaluations using a GO-based functional similarity measure and gold standard reference datasets.
Table: BP score improvement over the input datasets (with Wilcoxon p-values).
The improvement in mean BP score described above is achieved for relatively large predicted datasets (as large as the initial inputs). As we show in Figure 2B, BP scores are actually higher for our top predictions. Figure 2B plots mean similarity scores according to all three ontologies: biological process (BP), molecular function (MF) and cellular component (CC), as functions of the number of predicted interactions. The mean scores for both CAPPI versions are negatively correlated with the size of the output dataset. This enables the user to trade size for quality, obtaining a smaller dataset, but of greater reliability.
Testing against gold standard datasets
We further survey the performance of our method using a set of gold standard binary PPIs pulled from  and , as well as co-complex data from the MIPS  and CYC2008  complex catalogues (see Additional file 1 for details). Once again, we score CAPPI predictions and compare them to the scores of the input datasets. The results are presented in Figure 2C. The figure plots the ratio of true positive and false positive interactions present among a subset of a given size. The true positive interactions are either confirmed by binary PPIs or known to participate in a characterized complex. Unfortunately, negative gold standard sets of non-interacting protein pairs are not available. We take a standard heuristic approach and consider pairs of proteins with different subcellular localization as putative negative examples. We note that in certain situations, e.g. signalling pathways, it is possible that interacting proteins are in fact in different cellular compartments. Note also that in general true interactions constitute only a very small fraction of all possible protein pairs - at most 0.5% in yeast based on recent estimates from . This is reflected in our reference datasets. The positive reference used in this case contains 22,480 PPIs and co-complex pairs while the negative set contains 4,857,065 differentially localized pairs (see also Additional file 1). Identifying a true interaction by pure chance is therefore unlikely. Results presented in Figure 2C confirm the previous observation that reliable interactions are generally ranked high by our method. Reassuringly, both CAPPI datasets contain more confirmed interactions than differentially localized pairs among the top ranked predictions (TP/FP >> 1). Note that a reference interaction can only be identified if a relevant interaction is present in the input experimental evidence for one of the species.
Given that the gold standard datasets generally do not have a large overlap with the input high-throughput datasets, many of the reference interactions will not be inferred by any integration procedure. Importantly, as shown in Figure 2C, CAPPI-Integ-3sp has a much higher TP/FP ratio than the input yeast datasets (Ito and Uetz) used for its training. CAPPI-Integ integrates four more high-throughput yeast datasets and consistently scores higher than three out of four of these inputs; the Gavin (2002) dataset has a higher score, but for a smaller number of interactions.
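The TP/FP ratio among the top-ranked predictions, as plotted in Figure 2C, can be computed along the following lines. This is a minimal sketch: the function name, the handling of pairs absent from both reference sets (they are ignored) and the toy data are assumptions, not a description of the scripts actually used.

```python
def tp_fp_ratio(ranked_pairs, positives, negatives, k):
    """TP/FP ratio among the k top-ranked predicted pairs.

    ranked_pairs: list of (a, b) pairs sorted by decreasing confidence.
    positives:    set of frozensets, e.g. known binary or co-complex pairs.
    negatives:    set of frozensets, e.g. differentially localized pairs.
    Pairs found in neither set are ignored (an assumption of this sketch).
    """
    top = [frozenset(pair) for pair in ranked_pairs[:k]]
    tp = sum(pair in positives for pair in top)
    fp = sum(pair in negatives for pair in top)
    if fp == 0:
        # No false positives: ratio is unbounded, or undefined if TP is 0 too.
        return float("inf") if tp else None
    return tp / fp


ranked = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
positives = {frozenset(("A", "B")), frozenset(("B", "D"))}
negatives = {frozenset(("A", "C")), frozenset(("C", "D"))}
ratios = {k: tp_fp_ratio(ranked, positives, negatives, k) for k in (1, 2, 3, 4)}
```

Evaluating the ratio at increasing values of k gives the kind of curve that lets a user trade dataset size for reliability.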
Prediction of interactions in a blind test
We continue the performance evaluation by testing CAPPI's ability to predict interactions in a blind test. To this end, we compute the CAPPI-Pred dataset by iteratively leaving out PPI data of one of the seven species and predicting its interactions based only on the data from the other six species. We discuss the assessment of yeast and human predicted interactomes based on the two scoring frameworks.
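The leave-one-species-out protocol described above can be sketched as a simple loop. The `predict` callable stands in for any cross-species predictor and is purely hypothetical, as are the toy datasets; the sketch only shows that the target species' own data never enters the training input.

```python
def leave_one_species_out(datasets, predict):
    """Run a blind test for every species.

    datasets: dict mapping species name -> set of observed PPIs.
    predict:  callable (target_species, training_datasets) -> predictions;
              a placeholder for the actual inference procedure.
    """
    results = {}
    for target in datasets:
        # Hold out all experimental data of the target species.
        training = {sp: ppis for sp, ppis in datasets.items() if sp != target}
        results[target] = predict(target, training)
    return results


# Toy data and a placeholder predictor that only records its training species.
toy = {"yeast": {("a", "b")}, "human": {("x", "y")}, "fly": {("p", "q")}}
seen = leave_one_species_out(toy, lambda target, training: sorted(training))
```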
In Figure 3C we plot the ratio of true positives and false positives as a function of the number of yeast PPIs returned by CAPPI-Pred. We evaluate the predictions separately using co-complex datasets (CAPPI-Pred Complex), gold standard binary PPI datasets (CAPPI-Pred PPI), as well as all available reference data (CAPPI-Pred All) - see Additional file 1 for details. An analogous study is performed for the predicted human interactome using the HPRD (complex and binary PPI) catalogues as reference (see Figure 3D). Note that, as for yeast, the positive reference set for human is significantly smaller than the negative reference set. The joint human reference set (All) contains 57,093 protein pairs, which is less than 0.2% of the number of differentially localized pairs - consistent with the expected ratio of true interactions to all protein pairs in human, as estimated in . The results show that CAPPI is able to infer high-scoring PPIs even when no interactions from the predicted interactome are included in the training set. Most of the top predictions are confirmed by experimental data. We observe that while more yeast predictions are confirmed by co-complex pairs than by binary PPI data, the opposite is true in the case of the human predictions. This can be explained by the differences in size of the respective reference datasets for the two species (see Additional file 1). When all available reference data is considered (CAPPI-Pred All), the TP/FP ratios for the top 5,000 interactions in yeast and human are comparable (~0.8).
Filtering co-complex predictions
Evolutionary pressures are more likely to constrain essential functional units than individual interactions . Thus co-complex PPIs should be easier to map accurately across species. This premise was previously explored in , where the authors showed that screening PPI predictions against conserved clusters improves prediction specificity. In an attempt to increase the percentage of co-complex PPIs in our predictions, we filtered the CAPPI-Pred output dataset, leaving only the predicted PPIs placed within conserved dense network regions. To this end, an ancestral interaction network was computed as in , and clustered using the MCL algorithm  to identify dense clusters. Each cluster was projected onto the network of the extant species (yeast or human) and CAPPI-Pred predictions within the projected regions were identified as a result. As shown in Figure (3C and 3D), this procedure significantly boosts the TP/FP ratio for both yeast and human data (see "Filtered Complex" plots). Interestingly, while the fraction of co-complex PPIs was increased, the fraction of confirmed binary PPIs was in general lowered by the filtering (except for the top ranked human predictions), suggesting that many binary PPIs placed outside or between protein complexes are filtered out in this case. This is in line with the observations made in  that binary and co-complex datasets are of complementary nature and often have small overlap.
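The final filtering step, keeping only predictions that fall inside a projected cluster, might look as follows. This sketch assumes the clusters have already been computed (for example by MCL on the ancestral network) and projected onto the extant species; the function name and protein identifiers are hypothetical.

```python
def filter_by_clusters(predictions, clusters):
    """Keep predicted pairs whose two proteins share at least one cluster.

    predictions: list of (a, b) predicted interactions, in ranked order.
    clusters:    list of sets of proteins, already projected onto the
                 extant species (clustering itself is not shown here).
    """
    membership = {}
    for cluster_id, members in enumerate(clusters):
        for protein in members:
            membership.setdefault(protein, set()).add(cluster_id)
    kept = []
    for a, b in predictions:
        # A non-empty intersection means both proteins lie in a common cluster.
        if membership.get(a, set()) & membership.get(b, set()):
            kept.append((a, b))
    return kept


clusters = [{"P1", "P2", "P3"}, {"P4", "P5"}]
predictions = [("P1", "P2"), ("P3", "P4"), ("P4", "P5"), ("P1", "P9")]
kept = filter_by_clusters(predictions, clusters)
```

The between-cluster pair ("P3", "P4") is dropped, which mirrors the observation that binary PPIs placed between complexes are removed by this filter.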
Comparison with previous high-throughput multi-species approaches
Numerous existing computational approaches for predicting protein associations in multiple species can be loosely divided into three categories. The first group of methods contains approaches for predicting interactions de novo from protein sequence. These methods often utilize evolutionary information such as phylogenetic profiles or gene fusion events, but they do not explicitly transfer pre-identified interactions from one species to another. The second group of methods takes as input experimentally identified PPIs, integrates them and transfers the evidence to other species. The third group of studies is directed towards integration of heterogeneous experimental evidence such as PPI, mRNA co-expression, phylogenetic profile similarity, co-localization, domain associations, etc., and attempts to predict various types of functional associations, not limited strictly to protein-protein interactions. CAPPI was specifically designed as a model-based approach for integrating and transferring protein-protein interactions across species and as such it falls into the second category. Here we compare the performance of our method and two well-established frameworks for mapping PPIs: the interlog approach and the domain-based maximum likelihood method.
Comparison with the domain-based maximum likelihood approach
In  the domain-domain interaction prediction method was generalized to multiple species and applied to infer interactions in yeast, worm and fly (we refer to this method as the Domain-ML approach). As a final output, this approach predicts protein-protein interactions based on inferred interactions between conserved domains. Liu et al. trained their method using Ito, Uetz, Giot and Li experimental datasets, so their results can be directly compared to CAPPI-Integ-3sp. Note that only the yeast interaction predictions were provided by the authors. The mean GO scores for Domain-ML and CAPPI are shown in Figure 2B. CAPPI-Integ-3sp significantly outperforms Domain-ML in terms of all three GO scores. The performance evaluation using gold standard data (Figure 2C) also indicates a higher accuracy of CAPPI compared to the domain-based approach.
Comparison with the interlog-based approach
Next, we compare our results with a popular method of interlog mapping. This approach, similarly to CAPPI, relies on protein sequence similarity to transfer the interaction evidence across species. We choose for comparison the interlog mapping implementation from  and use the same input data for predicting our CAPPI-Pred dataset (for details see Additional file 1). Figure (3A and 3B) provides the distributions of GO scores for the Interlog and CAPPI datasets of the same size: 1576 (yeast) and 17105 (human), respectively. CAPPI predictions also contain a larger fraction of highest-scoring interactions (those with GO score > 0.8) and obtain a higher average score. The mean score for the CAPPI-predicted yeast dataset is noticeably higher than that of the Interlog method (0.57 vs. 0.39). CAPPI's advantage is also apparent in the case of the human predictions (mean score 0.42 vs. 0.33). To assess the significance of the difference in score distributions we performed the Wilcoxon test, which returned p-values < 2.2 × 10^-16 in all cases.
Figure (3C and 3D) shows the mean scores for the Interlog output (in blue circles), which can be compared with the CAPPI rankings. In all cases CAPPI achieves a higher fraction of true positive interactions: 0.88 vs. 0.47 for the yeast co-complex predictions, 0.72 vs. 0.40 for the yeast binary PPI prediction, 0.16 vs. 0.14 for the human co-complex predictions, and 0.38 vs. 0.28 for the human binary PPI predictions. As we show in the next section, CAPPI recovers many known interactions within essential functional modules enabling the reconstruction of module subunits. The InteroPORC method is too restrictive in most of the studied cases (see Additional file 1: Table S1), suggesting that a less stringent ortholog search is needed. In fact this is recognised in  where more sensitive methods are considered for predicting interactions in cyanobacterium Synechocystis. An additional advantage of our method lies in the provided ranking (induced by the posterior probabilities), which enables the user to easily identify the most reliable interactions. As an example, for the purpose of selecting human PPI targets for verification, one could make a heuristic decision to consider only around 3,500 top predictions for which the TP/FP ratio is greater than 1 (see Figure 3D).
Case studies: mapping interactions within conserved functional modules
We now zoom in on specific examples of functional units in the interactomes of human, yeast and thale cress, and analyze co-complex interactions inferred by CAPPI-Pred. In all described cases we demonstrate that the general topological features and organization of these complexes, as well as many known pairwise PPIs, can be recovered by our method based solely on data from the other species. We verify the inferred interactions against previously reported experimental data and assess the significance of our predictions. For an example of how the threshold selection impacts the number of interactions and the resulting p-value see Additional file 1: Figure S1. Note that in the following discussion gene names are used to denote corresponding proteins.
Human and yeast proteasome subnetworks
The ubiquitin-proteasome pathway is essential for eliminating damaged proteins and for regulating the intracellular levels of proteins involved in a wide spectrum of cellular functions . It is conserved in eukaryotes, from yeast to human. The 26S proteasome complex contains a 20S catalytic core particle (CP), which is capped on each side by a 19S regulatory particle (RP). The structure of the 20S proteasome from yeast has been resolved . It consists of 28 protein subunits: two α-rings (α 1,...,α 7) and two β-rings (β 1,...,β 7). The 19S proteasome can be further decomposed into two subcomplexes: the base (Rpt1-Rpt6, Rpn1, Rpn2, Rpn10 and Rpn13 - the last one probably not present in human) that binds directly to the 20S proteasome, and the lid (Rpn3, Rpn5-Rpn9, Rpn11, Rpn12 and Sem1), which is a peripheral subcomplex. In addition there are a number of transiently associated factors like p27 and S5b (the latter is apparently not present in yeast). We discuss our predictions of the 26S proteasome interactions from yeast and from human separately.
Human and yeast endosome subnetworks
The ESCRT complexes comprise a major pathway for the lysosomal degradation of transmembrane proteins (see ). We investigate the predicted interactions for the ESCRT complexes in human and yeast and compare the obtained results with the interactions reported in the literature. The list of proteins involved in these complexes was taken from .
Human mRNA decay complexes
A. thaliana SWI/SNF chromatin remodeling complex
The graph in Figure 7 contains the core SWI/SNF proteins - the SWI3-type proteins: At2g47620 (SWI3A), At2g33610 (SWI3B), At1g21700 (SWI3C), At4g34430 (SWI3D), together with the SNF5-type protein At3g17590 (BSH). This core is presented at the bottom of the graph. In addition to the above proteins we considered four groups of Arabidopsis proteins which are reported to play a putative role in chromatin remodeling in this plant (see ). These are: four ATPases which are reported in  as potential members of the SWI/SNF complex (At2g46020 (BRM), At2g28290 (SYD), At3g06010 (Chr 12), At5g19310 (Chr 23)); two SWP73-type proteins (At3g01890 (SWP73A), and At5g14170 (SWP73B)); nine actin-related proteins (At3g27000 (ARP2), At1g13180 (ARP3), At1g18450 (ARP4), At1g73910 (ARP4A), At3g12380 (ARP5), At3g33520 (ARP6), At3g60830 (ARP7), At5g56180 (ARP8) and At5g43500 (ARP9)); and three OSA-type proteins (At1g04880, At1g76110, and At3g13350). We excluded from the graph proteins which did not show any predicted interactions. Altogether we identified 13 of 14 known interactions between the proteins visualized in Figure 7 - the missing one is At3g01890-At1g21700 (see ). We notice some interesting peculiarities of the presented network. Three of the four SWI3-type proteins are predicted to interact with the four ATPases. Only one actin-type protein (At1g18450) has a predicted interaction with the SWI/SNF core and only two more (At3g60830 and At5g56180) can be associated with the complex through member ATPases. The ability to make distinctions within homologous groups is an important feature of our approach. While methods mapping interactions to highly similar orthologs usually make very specific predictions and avoid false positives, they are also likely to miss many true interactions which can be inferred from slightly less similar proteins.
As summarised in Additional file 1: Table S1, the restrictive search applied in InteroPORC fails to map the known interactions in the SWI/SNF complex in A. thaliana. In fact, according to the PORC ortholog clusters, only two proteins (SWI3C and SWP73A) have orthologs in any of the other six eukaryotic species considered here. In this case, a less stringent method is clearly needed. On the other hand, CAPPI bases its prediction on evidence from all homologs and thus is in danger of losing sensitivity and assigning the same interactions to all family members. The above examples demonstrate that we can avoid these potential pitfalls by considering family members in phylogenetic context when integrating and distributing the interaction evidence.
These observations are strengthened when we consider the larger family-oriented view of the SWI/SNF-related network in Additional file 1: Figure S2. This graph was obtained from the one in Figure 7 by expanding the set of proteins to all members of the considered protein families (once again, proteins without any interactions were removed). Interestingly, the four peripheral families represented in the graph can be divided into smaller subfamilies based on the interaction partners of their members. Specifically, of the 14 ATPases presented in the larger graph only the four described above are predicted to interact directly with the core of the SWI/SNF complex. Two of them (At2g46020 (BRM) and At2g28290 (SYD)) have confirmed interactions while for the other two (At3g06010 (Chr 12), At5g19310 (Chr 23)) interaction hypotheses based on sequence similarity were formulated . In fact the entire ATPase family, as detected by our method, contains 48 Arabidopsis proteins (a vast majority not having any predicted interactions to other proteins in the SWI/SNF subnetwork), which makes the presented predictions even more significant. These specific cases of confirmed predictions let us suggest that some of the distinctive members of the other protein families predicted to interact with the putative SWI/SNF complex (At1g18450 and six OSA family members interacting with At3g17590, five SWP73 family members interacting either with At3g17590 or at least one of the SWI3-type proteins, as well as five other actin family members interacting with ATPases At2g46020 and At2g28290), may be valuable targets for future experimental validation.
We have presented a systematic phylogeny-based framework for reconciling PPI datasets across species and inferring missing interactions. Our method naturally incorporates interaction evidence from different species and experimental sources. It considers the reliability of each source and the evolutionary relationships between protein pairs. The approach was successfully applied to compute integrated interactomes for seven eukaryotic species, providing confidence scores for each possible edge in each network. Detailed analysis of our predictions indicates that we can accurately recover known interactions within conserved protein complexes. Confirmed interactions identified in a blind test provide a strong case for our top-ranked predictions, many of which await experimental verification. We also find that while core subcomplexes can be accurately recovered based solely on the data from distant species, many of the between-module interactions are harder to identify this way, suggesting possible rewiring events. One natural direction for future research is to extend our framework to include other kinds of data which may serve as indirect evidence of interaction. Integrating heterogeneous experimental sources while accounting for the phylogenetic model may improve existing catalogues of functional associations.
Bayesian model of network evolution
Integrating diverse experimental data
The above-described model captures the basic notions of protein network evolution. We previously assumed that the PPI data is free of error and complete and we used the model to make inferences about the ancestral interaction networks. However, due to experimental errors and incomplete sampling, the real interactions and non-interacting protein pairs are not certain. This implies that the experimental data should only be used as supporting evidence of putative interactions. To model this accurately in our framework we keep the random variables corresponding to extant interactions unknown and add another level of random variables corresponding to experimental evidence (see Figure 8A). The evidence in each experimental dataset is weighted by the dataset's reliability.
where |A| denotes the number of elements in the set A. Each experimentally observed interaction can now be naturally incorporated into the BN framework. Similarly, each pair (n_x, n_y) not observed to interact in the considered experiment can be incorporated into the model with conditional probabilities corresponding to the false negative rate and true negative rate of the experiment (see Additional file 1 for details). The model can also be easily generalized to incorporate distinct reliability values for each single interaction.
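How per-experiment error rates turn observations into evidence can be illustrated for a single protein pair in isolation. The sketch below applies Bayes' rule with a fixed prior and assumes the experiments are conditionally independent given the true interaction state; the function name, the prior and the error rates are illustrative assumptions. In the full model the prior is not fixed but comes from the evolutionary part of the network.

```python
def posterior_interaction(prior, observations):
    """P(interaction | observations) for one protein pair.

    prior:        assumed prior probability that the pair interacts.
    observations: list of (observed, fp_rate, fn_rate), one entry per
                  experiment that tested the pair; rates are hypothetical.
    """
    like_1, like_0 = 1.0, 1.0
    for observed, fp_rate, fn_rate in observations:
        # P(outcome | interaction): true positive rate or false negative rate.
        like_1 *= (1.0 - fn_rate) if observed else fn_rate
        # P(outcome | no interaction): false positive rate or true negative rate.
        like_0 *= fp_rate if observed else (1.0 - fp_rate)
    numerator = prior * like_1
    return numerator / (numerator + (1.0 - prior) * like_0)


single = posterior_interaction(0.5, [(True, 0.1, 0.5)])
double = posterior_interaction(0.5, [(True, 0.1, 0.5), (True, 0.1, 0.5)])
missed = posterior_interaction(0.5, [(False, 0.1, 0.5)])
```

Two concordant positive observations raise the posterior above that of a single one, while a negative outcome from an experiment with a high false negative rate lowers it only moderately.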
Inferring extant protein interactions via message passing
The integrated BN model, comprising all PPI edges from every level of evolution and from the experimental datasets, is used to infer protein interactions in the input species. Each random variable corresponds either to a possible interaction or to a single experiment outcome, and depends on exactly one other random variable: in the first case, the variable denoting an edge (or non-edge) in the direct evolutionary predecessor; in the second case, the variable denoting an edge in the network of an extant species. The considered BN model is therefore a set of Bayesian trees, where each tree represents the joint distribution of the random variables corresponding to putative interactions (which descended from a single edge in the ancestral graph) and the associated experimental evidence (an example of such a tree is shown in Figure 8B). The tree structure allows us to apply Pearl's message passing (MP) algorithm  to compute the exact posterior probability of interaction between proteins in extant species, in time linear in the number of random variables (see Figure 8B for an example and  or  for details). Specifically, for each pair of nodes (n_x, n_y) in each extant network, we determine the posterior probability that the pair interacts given O, where O denotes all experimental datasets for all species.
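The two-pass computation on a single tree can be sketched as follows. This is a generic sum-product pass over binary variables, not the authors' implementation: the `Node` class, the conditional probability tables and the `posteriors` helper are hypothetical, and experimental outcomes are folded into a per-node `evidence` likelihood instead of being stored as separate leaf variables. For brevity the downward pass recomputes sibling products, so it is linear only when node degrees are bounded.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # cpt[parent_state][own_state]; unused for the root (assumed layout)
    cpt: tuple = ((1.0, 0.0), (0.0, 1.0))
    # likelihood of the attached experimental evidence for states (0, 1)
    evidence: tuple = (1.0, 1.0)
    children: list = field(default_factory=list)


def _upward(node, lam, msg):
    """Collect evidence from the leaves towards the root."""
    belief = list(node.evidence)
    for child in node.children:
        _upward(child, lam, msg)
        # message child -> parent: sum over the child's own state
        m = [sum(child.cpt[p][s] * lam[child.name][s] for s in (0, 1))
             for p in (0, 1)]
        msg[child.name] = m
        belief = [belief[0] * m[0], belief[1] * m[1]]
    lam[node.name] = belief


def _downward(node, pi, lam, msg, post):
    """Distribute evidence from the root back towards the leaves."""
    b = [pi[s] * lam[node.name][s] for s in (0, 1)]
    post[node.name] = b[1] / (b[0] + b[1])
    for child in node.children:
        # parent's belief excluding what this child itself reported
        excl = [pi[p] * node.evidence[p] for p in (0, 1)]
        for other in node.children:
            if other is not child:
                excl = [excl[p] * msg[other.name][p] for p in (0, 1)]
        pi_child = [sum(excl[p] * child.cpt[p][s] for p in (0, 1))
                    for s in (0, 1)]
        _downward(child, pi_child, lam, msg, post)


def posteriors(root, prior_interaction):
    """Return {node name: P(node = 1 | all evidence in the tree)}."""
    lam, msg, post = {}, {}, {}
    _upward(root, lam, msg)
    _downward(root, [1.0 - prior_interaction, prior_interaction],
              lam, msg, post)
    return post
```

For instance, an ancestral edge with two descendant pairs, only one of which has supporting experimental evidence, yields a raised posterior for the unobserved descendant as well, which is the mechanism by which evidence is transferred across species.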
Assessing PPI predictions in large-scale studies
In general, the assessment of PPI predictions poses problems due to the limited number of "gold standard" interactions and the lack of negative test cases. Motivated by previous studies, we employ two scoring schemes to assess the quality of predicted PPIs, as well as those from the input datasets. The first one compares Gene Ontology (GO) annotations  of adjacent gene products and measures their functional similarity. Functional similarity is used as an indirect measure of interaction: the more similar the annotations of the two proteins are, the more confident we are that they interact. We apply a recent information content method , implemented in the SemSim R package by Xiao Gou: http://www.bioconductor.org/packages/2.0/bioc/html/SemSim.html, which extends the measures previously proposed by  and . For each pair of proteins we individually measure the similarity of annotations in each of the three ontologies: biological process (BP), molecular function (MF) and cellular component (CC). This results in a BP score, MF score and CC score, respectively, each ranging from 0 (no similarity) to 1 (maximum similarity). When the context allows, we refer to each of these scores as a GO score of a pair of proteins.
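As background for the information content approach, the following sketch computes the classic Resnik and Lin term similarities on a toy ontology. It is not the extended method implemented in SemSim; the `parents` map, the `prob` table of annotation frequencies and all function names are assumptions chosen for illustration.

```python
import math


def ancestors(term, parents):
    """Return the term together with all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term, prob):
    """IC(t) = -log p(t), where p(t) is the annotation frequency of t."""
    return -math.log(prob[term])


def resnik(a, b, parents, prob):
    """IC of the most informative common ancestor of a and b."""
    common = ancestors(a, parents) & ancestors(b, parents)
    return max((information_content(t, prob) for t in common), default=0.0)


def lin(a, b, parents, prob):
    """Lin similarity, normalized to the range [0, 1]."""
    denom = information_content(a, prob) + information_content(b, prob)
    if denom == 0:
        return 0.0  # both terms are uninformative (e.g. the root)
    return 2.0 * resnik(a, b, parents, prob) / denom
```

A protein-level score such as the BP, MF or CC score would additionally require combining term-level similarities over all annotations of the two proteins; that step depends on the chosen method and is left out of this sketch.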
Our second kind of quality assessment is based on a comparison with a reference dataset. We estimate the ratio of true positive interactions (predictions which are confirmed in a reference dataset) to putative false positive interactions (unconfirmed predictions for which the two proteins have disjoint cellular localizations). A similar procedure was applied in . We use separate reference datasets for binary PPIs (direct physical interactions) and for co-complex PPIs (pairs of proteins co-occurring within the same complex). For details on the reference datasets and the localization data see Additional file 1. Note that proper sensitivity and specificity measures are hard to estimate because the reference sets of positive interactions and negative protein pairs are not comprehensive. Due to the interdependencies between interactions implied by our model, cross-validation cannot be easily applied. Instead, we perform a blind test in which we leave out the data of one species and predict its interactions based only on the data from the other species.
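The ratio described above can be sketched as a simple counting procedure. The data structures are assumptions: `reference` is taken to be a set of unordered pairs, `localization` a map from protein to a set of compartments, and predictions whose proteins lack localization data are counted as neither true nor false positives.

```python
def tp_fp_ratio(predictions, reference, localization):
    """Count confirmed predictions and putative false positives.

    predictions  -- iterable of (protein_a, protein_b) pairs
    reference    -- set of frozenset pairs regarded as known interactions
    localization -- dict: protein -> set of cellular compartments
    Returns (true_positives, putative_false_positives, ratio).
    """
    tp = fp = 0
    for a, b in predictions:
        if frozenset((a, b)) in reference:
            tp += 1
            continue
        loc_a, loc_b = localization.get(a), localization.get(b)
        # only unconfirmed pairs with known, disjoint localizations count
        if loc_a and loc_b and loc_a.isdisjoint(loc_b):
            fp += 1
    ratio = tp / fp if fp else float("inf")
    return tp, fp, ratio
```

Unconfirmed predictions between co-localized proteins are deliberately left uncounted here, since they may be genuine interactions missing from an incomplete reference set.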
Assessing predictions in functional module case-studies
For small-scale functional module case studies we report all interactions predicted among a determined set of proteins for a selected threshold value. To assess the statistical significance of interaction predictions we compute a p-value based on the hypergeometric distribution, where confirmed interactions are regarded as successes and unconfirmed interactions are regarded as failures (Fisher's exact test). As the predictions are made by CAPPI-Pred, which is trained without the use of the input datasets for the predicted species, we use the held-out input data as a reference. Note that some of the reference interactions may in fact be false positives, an inherent risk of using high-throughput data. In this particular test, however, we are interested in assessing whether a significant portion of known PPIs (of which many are from high-throughput studies) can be predicted by a mapping from other organisms. The reference set is further extended in each case by PPIs curated from specific publications characterizing interactions within the studied complexes. These are as follows: [66, 67] for the 26S proteasome PPIs, [56, 57] for the endosome-related PPIs,  for the exosome-related PPIs, and [68, 69, 70, 71, 72] for the SWI/SNF-related PPIs. Note that for A. thaliana there are no high-throughput datasets available, so all reference data for this species come from small-scale studies.
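A one-sided hypergeometric p-value of this kind can be computed as in the sketch below. The choice of population is an assumption made for illustration: all unordered protein pairs within the module are treated as the population, and the reference interactions among them as the successes it contains.

```python
from math import comb


def hypergeom_pvalue(n_pairs, n_reference, n_predicted, n_confirmed):
    """P(X >= n_confirmed) for X ~ Hypergeometric(n_pairs, n_reference, n_predicted).

    n_pairs     -- all candidate protein pairs in the module (assumed population)
    n_reference -- pairs in the reference set
    n_predicted -- pairs predicted to interact at the chosen threshold
    n_confirmed -- predicted pairs that are also in the reference set
    """
    total = comb(n_pairs, n_predicted)
    upper = min(n_predicted, n_reference)
    tail = sum(
        comb(n_reference, i) * comb(n_pairs - n_reference, n_predicted - i)
        for i in range(n_confirmed, upper + 1)
    )
    return tail / total


def module_pairs(n_proteins):
    """Number of unordered pairs among n proteins, excluding self-pairs."""
    return n_proteins * (n_proteins - 1) // 2
```

For example, a module of 5 proteins has `module_pairs(5)` = 10 candidate pairs; if 3 of them are reference interactions and all 3 predictions are confirmed, the p-value is 1/120.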
We would like to thank Andrzej Jerzmanowski, Maciej Kotliński and Marlena Roszczyk for many helpful discussions. This work was supported by the Polish Ministry of Science grants No 0652/B/P01/2009/36 and PBZ-MNiI-2/1/2005, and by the CoE BioExploratorium project WKP 1/1.4.3/1/2004/44/44/115.
- 1. Uetz P, Giot L, Cagney G, Mansfield T, Judson R, Knight J, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg J: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403: 623–627. 10.1038/35001009
- 3. Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N, Klitgord N, Simon C, Boxem M, Milstein S, Rosenberg J, Goldberg DS, Zhang LV, Wong SL, Franklin G, Li S, Albala JS, Lim J, Fraughton C, Llamosas E, Cevik S, Bex C, Lamesch P, Sikorski RS, Vandenhaute J, Zoghbi HY, Smolyar A, Bosak S, Sequerra R, Doucette-Stamm L, Cusick ME, Hill DE, Roth FP, Vidal M: Towards a proteome-scale map of the human protein-protein interaction network. Nature 2005, 437: 1173–1178. 10.1038/nature04209
- 4. Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Koeppen S, Timm J, Mintzlaff S, Abraham C, Bock N, Kietzmann S, Goedde A, Toksöz E, Droege A, Krobitsch S, Korn B, Birchmeier W, Lehrach H, Wanker EE: A human protein-protein interaction network: a resource for annotating the proteome. Cell 2005, 122(6):957–968. 10.1016/j.cell.2005.08.029
- 5. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dümpelfeld B, Edelmann A, Heurtier MA, Hoffman V, Hoefert C, Klein K, Hudak M, Michon AM, Schelder M, Schirle M, Remor M, Rudi T, Hooper S, Bauer A, Bouwmeester T, Casari G, Drewes G, Neubauer G, Rick JM, Kuster B, Bork P, Russell RB, Superti-Furga G: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440: 631–636. 10.1038/nature04532
- 6. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP, Punna T, Peregran-Alvarez JaM, Shales M, Zhang X, Davey M, Robinson MD, Paccanaro A, Bray JE, Sheung A, Beattie B, Richards DP, Canadien V, Lalev A, Mena F, Wong P, Starostine A, Canete MM, Vlasblom J, Wu S, Orsi C, Collins SR, Chandran S, Haw R, Rilstone JJ, Gandi K, Thompson NJ, Musso G, St Onge P, Ghanny S, Lam MHY, Butland G, Altaf-Ul AM, Kanaya S, Shilatifard A, O'Shea E, Weissman JS, Ingles JC, Hughes TR, Parkinson J, Gerstein M, Wodak SJ, Emili A, Greenblatt JF: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 2006, 440: 637–643. 10.1038/nature04670
- 8. Yu H, Braun P, Yildirim MA, Lemmens I, Venkatesan K, Sahalie J, Hirozane-Kishikawa T, Gebreab F, Li N, Simonis N, Hao T, Rual JF, Dricot A, Vazquez A, Murray RR, Simon C, Tardivo L, Tam S, Svrzikapa N, Fan C, de Smet AS, Motyl A, Hudson ME, Park J, Xin X, Cusick ME, Moore T, Boone C, Snyder M, Roth FP, Barabasi AL, Tavernier J, Hill DE, Vidal M: High-Quality Binary Protein Interaction Map of the Yeast Interactome Network. Science 2008, 3: 104–110. 10.1126/science.1158684
- 9. Venkatesan K, Rual JF, Vazquez A, Stelzl U, Lemmens I, Hirozane-Kishikawa T, Hao T, Zenkner M, Xin X, Goh KI, Yildirim MA, Simonis N, Heinzmann K, Gebreab F, Sahalie JM, Cevik S, Simon C, de Smet AS, Dann E, Smolyar A, Vinayagam A, Yu H, Szeto D, Borick H, Dricot A, Klitgord N, Murray RR, Lin C, Lalowski M, Timm J, Rau K, Boone C, Braun P, Cusick ME, Roth FP, Hill DE, Tavernier J, Wanker EE, Barabási AL, Vidal M: An empirical framework for binary interactome mapping. Nature Methods 2009, 6: 83–90. 10.1038/nmeth.1280
- 20. Burger L, van Nimwegen E: Accurate prediction of protein-protein interactions from sequence alignments using a Bayesian method. Mol Syst Biol 2008, 4: 165.
- 30. Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T, Julien P, Roth A, Simonovic M, Bork P, von Mering C: STRING 8-a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Research 2009, (37 Database):D412-D416. 10.1093/nar/gkn760
- 32. Matthews LR, Vaglio P, Reboul J, Ge H, Davis BP, Garrels J, Vincent S, Vidal M: Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs". Genome Research 2001, 11(12):2120–2126. 10.1101/gr.205301
- 41. Kersey P, Bower L, Morris L, Horne A, Petryszak R, Kanz C, Kanapin E, Das U, Michoud K, Phan I, Gattiker R, Kulikova T, Faruque N, Duggan K, Mclaren P, Reimholz B, Duret L, Penel S, Reuter I, Apweiler R: Integr8 and Genome Reviews: integrated views of complete genomes and proteomes. Nucleic Acids Research 2005, 33: 297–302. 10.1093/nar/gki039
- 42. Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, Margalit H, Armstrong J, Bairoch A, Cesareni G, Sherman D, Apweiler R: IntAct: an open source molecular interaction database. Nucleic Acids Research 2004, (32 Database):D452-D455. 10.1093/nar/gkh052
- 43. Chatr-aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G: MINT: the Molecular INTeraction database. Nucleic Acids Research 2007, (35 Database):D572-D574. 10.1093/nar/gkl950
- 44. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Research 2004, (32 Database):D449-D451. 10.1093/nar/gkh086
- 45. Giot L, Bader J, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao Y, Ooi C, Godwin B, Vitols E, Vijayadamodar G, Pochart P, Machineni H, Welsh M, Kong Y, Zerhusen B, Malcolm R, Varrone Z, Collis A, Minto M, Burgess S, McDaniel L, Stimpson E, Spriggs F, Williams J, Neurath K, Ioime N, Agee M, Voss E, Furtak K, Renzulli R, Aanensen N, Carrolla S, Bickelhaupt E, Lazovatsky Y, Dasilva A, Zhong J, Stanyon C, Finley JR, White K, Braverman M, Jarvie T, Gold S, Leach M, Knight J, Shimkets R, McKenna M, Chant J, Rothberg J: A protein interaction map of Drosophila melanogaster. Science 2003, 302: 1727–1736. 10.1126/science.1090289
- 46. Li Siming, Armstrong C, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, Goldberg D, Li N, Martinez M, Rual JF, Lamesch P, Xu L, Tewari M, Wong S, Zhang L, Berriz G, Jacotot L, Vaglio P, Reboul J, Hirozane-Kishikawa T, Li Q, Gabel H, Elewa A, Baumgartner B, Rose D, Yu H, Bosak S, Sequerra R, Fraser A, Mango S, Saxton W, Strome S, Heuvel S, Piano F, Vandenhaute J, Sardet C, Gerstein M, Doucette-Stamm L, Gunsalus K, Harper J, Cusick M, Roth F, Hill D, Vidal M: A map of the interactome network of the metazoan C. elegans. Science 2004, 303: 540–543. 10.1126/science.1091403
- 49. Reguly T, Breitkreutz A, Boucher L, Breitkreutz BJ, Hon G, Myers C, Parsons A, Friesen H, Oughtred R, Tong A, Stark C, Ho Y, Botstein D, Andrews B, Boone C, Troyanskya O, Ideker T, Dolinski K, Batada N, Tyers M: Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae. Journal of Biology 2006, 5: 11.
- 50. Mewes HW, Frishman D, Mayer KF, Münsterkötter M, Noubibou O, Pagel P, Rattei T, Oesterheld M, Ruepp A, Sümpflen V: MIPS: analysis and annotation of proteins from whole genomes in 2005. Nucleic Acids Research 2006, (34 Database):D169-D172. 10.1093/nar/gkj148
- 61. Pearl J: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann; 1988.
- 62. Neapolitan RE: Learning Bayesian Networks. Prentice Hall; 2003.
- 63. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics 2000, 25: 25–29. 10.1038/75556
- 64. Resnik P: Using Information Content to Evaluate Semantic Similarity in a Taxonomy. Proceedings of the 14th International Joint Conference on Artificial Intelligence 1995, 448–453.
- 65. Lin D: An information-theoretic definition of similarity. Proc 15th International Conf on Machine Learning, Morgan Kaufmann, San Francisco, CA 1998, 296–304.
- 68. Sarnowski TJ, Swiezewski S, Pawlikowska K, Kaczanowski S, Jerzmanowski A: AtSWI3B, an Arabidopsis homolog of SWI3, a core subunit of yeast Swi/Snf chromatin remodeling complex, interacts with FCA, a regulator of flowering time. Nucleic Acids Research 2002, 30: 3412–3421. 10.1093/nar/gkf458
- 70. Sarnowski TJ, Ríos G, Jásik J, Swiezewski S, Kaczanowski S, Li Y, Kwiatkowska A, Pawlikowska K, Kozbial M, Kozbial P, Koncz C, Jerzmanowski A: SWI3 subunits of putative SWI/SNF chromatin-remodeling complexes play distinct roles during Arabidopsis development. The Plant Cell 2005, 17: 2454–2472. 10.1105/tpc.105.031203
- 72. Bezhani S, Winter C, Hershman S, Wagner JD, Kennedy JF, Kwon CS, Pfluger J, Su Y, Wagner D: Unique, shared, and redundant roles for the Arabidopsis SWI/SNF chromatin remodeling ATPases BRAHMA and SPLAYED. The Plant Cell 2007, 19: 403–416. 10.1105/tpc.106.048272
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.