Predicting protein-protein binding sites in membrane proteins
Many integral membrane proteins, like their non-membrane counterparts, form either transient or permanent multi-subunit complexes in order to carry out their biochemical function. Computational methods that provide structural details of these interactions are needed since, despite their importance, relatively few structures of membrane protein complexes are available.
We present a method for predicting which residues are in protein-protein binding sites within the transmembrane regions of membrane proteins. The method uses a Random Forest classifier trained on residue type distributions and evolutionary conservation for individual surface residues, followed by spatial averaging of the residue scores. The prediction accuracy achieved for membrane proteins is comparable to that for non-membrane proteins. Also, like previous results for non-membrane proteins, the accuracy is significantly higher for residues distant from the binding site boundary. Furthermore, a predictor trained on non-membrane proteins was found to yield poor accuracy on membrane proteins, as expected from the different distribution of surface residue types between the two classes of proteins. Thus, although the same procedure can be used to predict binding sites in membrane and non-membrane proteins, separate predictors trained on each class of proteins are required. Finally, the contribution of each residue property to the overall prediction accuracy is analyzed and prediction examples are discussed.
Given a membrane protein structure and a multiple alignment of related sequences, the presented method gives a prioritized list of which surface residues participate in intramembrane protein-protein interactions. The method has potential applications in guiding the experimental verification of membrane protein interactions, structure-based drug discovery, and also in constraining the search space for computational methods, such as protein docking or threading, that predict membrane protein complex structures.
Keywords: Solvent Accessible Surface Area, Surface Residue, Residue Type, Random Forest Classifier, Protein Data Bank Entry
Integral membrane proteins constitute a significant fraction of all proteins in sequenced organisms and are also targets of slightly more than half of all current drugs [1, 2]. Similar to non-membrane proteins, many membrane proteins form complexes in order to carry out their biological function. Structural details of these protein-protein interactions can aid in generating experimentally verifiable mechanistic hypotheses for the relevant complexes and can also form a basis for the structure-based discovery of therapeutics to modulate these interactions. However, high-resolution experimental structures of membrane protein complexes are relatively scarce (< 1% of all Protein Data Bank structures), due to technical difficulties in obtaining X-ray or NMR structures. Also, even with an available structure, the annotation of the biological complex in the Protein Data Bank (PDB) file may be incorrect. Furthermore, even as new techniques are developed to speed up the experimental determination of membrane protein structures, the combinatorial nature of protein-protein interactions precludes solving the structures of all possible protein complexes from an organism's proteome.
Computational methods can address these challenges by providing predictions of which residues on the protein surface participate in protein-protein interactions. These predictions can be subsequently verified by, for example, mutagenesis experiments. The predictions can also be used as constraints for predicting the structure of the protein complex by, for example, protein-protein docking. Existing computational methods for predicting protein-protein binding sites can be broadly classified into those that utilize only 1D sequence information and those that require some information about the 3D protein structure. Sequence-only methods [5, 6, 7, 8] have the advantage that they can be applied to proteins for which no experimental structures are available and no close templates can be found for comparative modeling. However, structure provides additional information that helps distinguish binding site residues, such as solvent accessibility and the proximity of residues in 3D space. Because of these additional signals, prediction methods that incorporate this information generally perform better than sequence-only methods, although the use of different data sets and interface residue definitions prevents a direct comparison. Many previous structure-based methods used either scoring functions [9, 10, 11], artificial neural networks (ANNs) [12, 13, 14, 15], or Support Vector Machines (SVMs) [16, 17, 18] trained on various properties within roughly circular surface patches to predict protein-protein binding sites. Two exceptions are a study that limited the predictions to surface pockets and a recent study that used a Random Forest trained on residue types and properties within a sliding 9-residue window for prediction.
Here we consider the problem of predicting protein-protein binding sites within the intramembrane region of integral membrane proteins. The previous studies mentioned above were limited to non-membrane proteins, for which considerably more experimental structures are available. Nonetheless, we find that there are currently a sufficient number of structures for training and validating a predictor that achieves accuracy comparable to our previous results for non-membrane proteins. There are large differences in the frequencies of residue types on the surfaces of membrane and non-membrane proteins due to their hydrophobic and hydrophilic environments, respectively. This means that separate predictors, trained only on data from their respective class of proteins (membrane or non-membrane), are needed. The prediction method employs a Random Forest trained on residue frequencies in a multiple alignment of related protein sequences and the evolutionary rates of each site. Random Forest predictions are first made for individual surface residues and then these are averaged over a local surface region in order to arrive at the final prediction. This procedure was found to yield better accuracy than directly including the properties of surrounding residues in the training data, as was done in previous machine learning based methods. In addition, we compared the residue properties between protein-protein binding sites and the remaining surface and also between membrane and non-membrane proteins in order to discern which properties contribute to the prediction in each case. Also, we examined the relative contribution of each property to the overall prediction accuracy and considered examples of predictions for particular membrane proteins.
Benchmark set of membrane protein complex structures
A diverse set of alpha-helical membrane protein complex structures was first compiled for training and testing the prediction method. Monomers as well as multimeric complexes were included. The initial set of PDB entries for alpha-helical membrane proteins was taken from the PDBTM database [21, 22]. A non-redundant subset of protein complexes, for which no pair of complexes has all proteins differing by less than 30% sequence identity, was then selected from the initial set of structures. Information on generating the biological complex in the PDB structure files (the BIOMT record) was used as an initial guess of the complex structure. Because this information is sometimes erroneous [4, 23], it was compared with the literature and the structure of the complete protein complex was corrected where necessary.
Next, a set of non-redundant proteins, each of which contacts at least one other protein in a complex, was extracted from these structures. Because the individual proteins are taken from structures of protein complexes, their protein-protein binding sites are known. This set of protein structures was then used to train the prediction method and to assess its accuracy. The same procedure was also used to build a set of beta barrel membrane protein complex structures as well as a non-redundant set of proteins taken from these complexes with known protein-protein binding sites. Finally, the alpha-helical and beta barrel sets were combined to make the membrane protein benchmark set.
The final set of membrane protein complexes contained 64 alpha-helical multimeric protein complexes comprised of 149 unique subunits, 17 alpha-helical monomeric complexes, 14 beta barrel homomultimeric complexes, and 23 beta barrel monomers. The details of this benchmark set are provided as additional file 1 accompanying this article.
Only surface residues, defined as those with relative solvent accessible surface area (SASA) ≥ 0.2, that are also within the hydrophobic core of the membrane are considered and included in the training data. The relative SASA is calculated by dividing the residue SASA by the value for the same residue type in an extended conformation surrounded by glycine residues. Residues in the membrane core have z-coordinates with |z| ≤ 15 Å, in which the z-axis is perpendicular to the plane of the membrane predicted by PDBTM and the origin is in the center of the membrane. In other words, the membrane core was assumed to be 30 Å thick, which is in agreement with the approximate values from PDBTM predictions and experimental results on lipid bilayers.
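The residue selection described above can be sketched as a simple filter. This is a minimal illustration, not the authors' code: the function name is hypothetical, the reference SASA for the extended conformation is assumed to be supplied by the caller, and the text does not say which atom or centroid defines the residue z-coordinate.

```python
def is_intramembrane_surface_residue(sasa, reference_sasa, z,
                                     min_relative_sasa=0.2,
                                     half_thickness=15.0):
    """Return True if a residue is solvent exposed and inside the membrane core.

    sasa           -- SASA of the residue in the structure (square angstroms)
    reference_sasa -- SASA of the same residue type in an extended conformation
                      flanked by glycines (supplied by the caller)
    z              -- residue z-coordinate in angstroms, with the membrane
                      center at z = 0 and the z-axis normal to the membrane
                      (which atom defines z is an assumption left to the caller)
    """
    relative_sasa = sasa / reference_sasa
    return relative_sasa >= min_relative_sasa and abs(z) <= half_thickness


# Hypothetical residues: (SASA, reference SASA, z)
residues = [(50.0, 200.0, 10.0), (30.0, 200.0, 10.0), (50.0, 200.0, -16.0)]
selected = [r for r in residues if is_intramembrane_surface_residue(*r)]
print(selected)
```

Only the first hypothetical residue passes both criteria; the second is too buried and the third lies outside the 30 Å membrane core.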
Random Forest predictions were made for each individual residue based on its properties. The training data for each residue consisted of frequencies of each of the 20 standard residues in a multiple sequence alignment of similar sequences and the evolutionary rate. The sequence alignments were created by searching for similar protein sequences in the NCBI nr database with BLAST at an E-value cutoff of 10⁻², removing redundant sequences at the 90% sequence identity level using the CD-HIT program, and generating multiple alignments of the remaining sequences with MUSCLE. Only proteins with at least 20 sequences in the final alignment were included in the training set. This criterion reduced the number of unique proteins included in the training data to 128. The residue frequency for a particular residue type was simply calculated as the fraction of residues of that type in the corresponding multiple sequence alignment column. The evolutionary rate, which varies inversely with conservation, was calculated using the REVCOM method. Because REVCOM accounts for the evolutionary relationships between the protein sequences via an inferred phylogenetic tree, the resulting evolutionary conservation values are more robust to the particular set of sequences and local alignment errors than methods that do not, such as the column entropy. Finally, each surface residue was labeled as either a binding site residue, if it contacted another protein chain in the complex structure (< 4 Å non-H atom separation), or otherwise as a non-binding site residue.
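Two of the steps above, the column residue frequencies and the contact-based labeling, are simple enough to sketch directly. The function names are hypothetical, and the handling of gaps and non-standard residues (excluded from the denominator here) is an assumption, since the text does not specify it. The evolutionary rate calculation with REVCOM is not reproduced.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types


def column_frequencies(column):
    """Fraction of each standard residue type in one alignment column.

    Assumption: gap characters and non-standard residues are ignored, so the
    frequencies sum to 1 over the standard residues present in the column.
    """
    counts = Counter(c for c in column.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def in_binding_site(residue_atoms, partner_atoms, cutoff=4.0):
    """True if any pair of non-hydrogen atoms is closer than the cutoff.

    Both arguments are lists of (x, y, z) coordinates in angstroms for
    non-hydrogen atoms: one residue versus all other chains in the complex.
    """
    return any(math.dist(a, b) < cutoff
               for a in residue_atoms for b in partner_atoms)


freqs = column_frequencies("LLLV-")
print(freqs["L"], freqs["V"])
```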
Most machine learning classification methods, including Random Forests, perform better on balanced input data that has a comparable number of positive and negative examples. Because of this, negative (non-binding site) examples were randomly chosen from the negative data such that there were an equal number of positive and negative examples in the training data. After training the Random Forest classifier on a balanced subset of the training data, predictions were made for all data in the (unbalanced) test set. The input data contained a total of 2391 positive examples for binding site residues.
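The balancing step amounts to random undersampling of the negative class. A minimal sketch follows, assuming labels are coded 1 (binding site) and 0 (non-binding site) and that negatives outnumber positives; the function name and the fixed seed are illustrative choices, not details from the study.

```python
import random


def balance_training_set(examples, labels, seed=0):
    """Keep all positives and an equally sized random sample of negatives."""
    rng = random.Random(seed)
    positives = [x for x, y in zip(examples, labels) if y == 1]
    negatives = [x for x, y in zip(examples, labels) if y == 0]
    if len(negatives) < len(positives):
        raise ValueError("expected at least as many negatives as positives")
    sampled_negatives = rng.sample(negatives, len(positives))
    balanced = [(x, 1) for x in positives] + [(x, 0) for x in sampled_negatives]
    rng.shuffle(balanced)  # avoid any ordering by class
    return balanced


# Toy data: 3 positive and 7 negative residues, identified by index
balanced = balance_training_set(list(range(10)), [1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
print(balanced)
```

Only the training folds would be balanced in this way; as stated above, predictions are still made for every residue in the unbalanced test set.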
A Random Forest binary classifier was trained on the labeled residue data and used to predict whether or not each intramembrane surface residue is in a protein-protein binding site. The Random Forest method was chosen because it is fast and achieves competitive accuracy on standard test classification problems. In addition, unlike the popular alternative methods of Support Vector Machines (SVMs) and Artificial Neural Networks (ANNs), it can utilize heterogeneous training data without rescaling and can also efficiently estimate the contribution of each variable to the prediction performance.
The overall prediction performance was evaluated by 10-fold cross-validation in which the data was randomly divided into 10 approximately equal size sets and predictions were made for each set in turn using a Random Forest trained on the data in the remaining 9 sets. The data was divided so that all residue data for a particular protein was contained entirely within one set. This ensures that the predictions are made for a distinct set of proteins from those used to train the Random Forest classifier so that one obtains an accurate estimate of the prediction performance for novel data.
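The protein-level split can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, and folds are balanced by number of proteins rather than number of residues, which the text does not specify.

```python
import random


def protein_folds(protein_ids, n_folds=10, seed=0):
    """Assign a fold index to every residue so that no protein spans two folds.

    protein_ids -- one protein identifier per residue example
    """
    unique = sorted(set(protein_ids))
    random.Random(seed).shuffle(unique)
    # Deal the shuffled proteins round-robin into the folds
    fold_of = {pid: i % n_folds for i, pid in enumerate(unique)}
    return [fold_of[pid] for pid in protein_ids]


def train_test_indices(fold_labels, test_fold):
    """Residue indices used for training and for testing in one round."""
    train = [i for i, f in enumerate(fold_labels) if f != test_fold]
    test = [i for i, f in enumerate(fold_labels) if f == test_fold]
    return train, test


# Toy data: 20 hypothetical proteins with 3 surface residues each
ids = ["protein%d" % (i // 3) for i in range(60)]
folds = protein_folds(ids)
train, test = train_test_indices(folds, test_fold=0)
print(len(train), len(test))
```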
Briefly, a Random Forest is a set of decision trees in which the input data for each tree is randomized in two ways: by considering only a random subset of the total variables when choosing each split and by using a bootstrap sample of the data. The two main parameters in the method are the total number of trees and the number of variables in each random subset. Because the Random Forest generalization error converges to an asymptotic value as more trees are added, increasing the number of trees does not generally lead to worse overfitting. For the binding site residue prediction, a total of 2000 trees in the Random Forest were found to be sufficient, since adding further trees did not significantly improve the prediction performance but increased the calculation time. The number of variables in each random subset was set to two because this gave the highest cross-validation accuracy, although the accuracy showed little change upon varying this parameter. The prediction score, which varied from 0.0 to 1.0, was calculated as the fraction of decision trees classifying the data as a binding site.
The confidence of each prediction was expressed as a likelihood ratio, R, of the prediction score S in the two classes:
$$R = \frac{P(S \mid \text{binding site})}{P(S \mid \text{non-binding site})}$$
The likelihoods in the numerator and denominator were calculated using Gaussian kernel density estimation of the scores in each respective class. A high value of R for a residue indicates that it is confidently predicted to be in a binding site, a low value indicates that it is confidently predicted to be outside of binding sites, and an intermediate value indicates an ambiguous prediction. The R values are useful for prioritizing the predictions before undertaking time-consuming and costly experimental validation.
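A likelihood ratio of this kind can be sketched with a hand-written Gaussian kernel density estimate. This is only an illustration: it assumes that R is the density of the score among binding site residues divided by the density among non-binding site residues, and the bandwidth (a normal reference rule of thumb) is an assumption, since the text does not state how it was chosen.

```python
import math
import statistics


def gaussian_kde(samples, bandwidth=None):
    """Return a function estimating the probability density of the samples."""
    n = len(samples)
    if bandwidth is None:
        # Normal reference rule of thumb (an assumption, not from the text)
        bandwidth = 1.06 * statistics.stdev(samples) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density


def likelihood_ratio(score, binding_scores, nonbinding_scores):
    """Density of the score in the binding class over the non-binding class.

    Note: the denominator can underflow to zero for scores far from every
    non-binding training score; a production version would guard against it.
    """
    p_binding = gaussian_kde(binding_scores)(score)
    p_nonbinding = gaussian_kde(nonbinding_scores)(score)
    return p_binding / p_nonbinding


# Hypothetical cross-validation scores for each class
binding = [0.7, 0.8, 0.9, 0.75, 0.85]
nonbinding = [0.1, 0.2, 0.3, 0.25, 0.15]
for s in (0.2, 0.5, 0.8):
    print(s, likelihood_ratio(s, binding, nonbinding))
```

With these toy scores the ratio is far below 1 at a score of 0.2, close to 1 at 0.5, and far above 1 at 0.8, matching the qualitative interpretation of R given above.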
Results and Discussion
Distinguishing characteristics of intramembrane protein-protein binding sites
Throughout this section we consider only the intramembrane portion of membrane protein complexes since the general properties of the solvated portions of the complexes, specifically both binding site and non-binding site surfaces, are expected to have similar properties to those of cytosolic proteins. The membrane core was defined to extend 15 Å in both directions perpendicular to the central membrane plane predicted by the TMDET method and available from the PDBTM database. The TMDET method accounts for both the protein backbone geometry and hydrophobicity in order to predict the extent and orientation of the membrane relative to the protein complex. TMDET uses the structure of the complex, and so incorporates the geometrical constraints that all transmembrane segments are delimited by two common membrane boundaries. This is an advantage over sequence-only prediction methods, when experimental structures are available.
Protein-protein binding sites on cytosolic proteins have different distributions of residue types on average than those on the exposed protein surface. Specifically protein-protein binding sites are enriched in large hydrophobic and uncharged polar residues and are depleted of charged residues [18, 33]. This can be partially explained by the favorable solvation energy of burying hydrophobic residues and the unfavorable energy of burying charged residues in the interface.
A different trend in residue frequencies is expected for the intramembrane portion of membrane proteins because their surfaces contact the hydrocarbon tails of the lipid molecules comprising the membrane, so that hydrophobic residues are energetically favorable on the exposed protein surface. Statistical tests using the benchmark set revealed that intramembrane protein-protein binding sites have higher frequencies of phenylalanine, tryptophan, and tyrosine residues and lower frequencies of valine residues than the remaining intramembrane protein surface (p < 0.05; paired Wilcoxon signed-rank tests with multiple testing corrections). In addition, residues occurring within protein-protein binding sites in membrane proteins have lower evolutionary rates, or equivalently higher conservation, than residues on the remaining intramembrane surface (p < 2.2 × 10⁻¹⁶, Wilcoxon rank sum test).
Spatial Averaging of Scores
Our previous method for predicting protein-protein binding sites, as well as those of others [12, 13, 14, 16], included the properties of neighboring residues in the training data. We found that this resulted in better performance than using only the properties of each individual residue. One explanation for the improved accuracy is that the binding sites are contiguous regions on the protein surface so that a given residue in a binding site is likely to be surrounded by other binding site residues. Likewise, surface residues outside of the binding sites are likely to be surrounded by other non-binding site residues. In other words, the binding site residues are spatially clustered and not randomly scattered about the surface. Including data for neighboring residues then provides additional independent information that improves the prediction accuracy.
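In the present method, the final score of each residue, $S_{avg}$, is instead a weighted average of the individual Random Forest scores $S_i$ of nearby residues, with weights $w_i$ that decrease with distance from the central residue:
$$S_{avg} = \frac{\sum_i w_i S_i}{\sum_i w_i}$$
in which the summations are over all residues within the cutoff distance, $r_{max}$. The score of the central residue has a weight of 1 while a residue at the cutoff distance would have the minimum weight, $w_{min}$. Thus the scores for residues closest to the central one make a larger contribution to the average score $S_{avg}$ than those further away. The best values for the two adjustable parameters, which resulted in the highest AUC, were chosen by a grid search. The optimal values were found to be $r_{max}$ = 18 Å and $w_{min}$ = 0.1.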
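The spatial averaging step can be illustrated with a short sketch. The exact functional form of the distance weighting is not given in this text, so a linear decay from 1 at zero distance to the minimum weight at the cutoff is assumed here purely for illustration; the function names and the use of one coordinate per residue are likewise assumptions.

```python
import math


def linear_weight(r, r_max=18.0, w_min=0.1):
    """Assumed weight: 1 at r = 0, decreasing linearly to w_min at r = r_max."""
    return 1.0 - (1.0 - w_min) * (r / r_max)


def spatially_averaged_scores(coords, scores, r_max=18.0, w_min=0.1):
    """Weighted average of residue scores within r_max of each residue.

    coords -- one (x, y, z) position per surface residue, in angstroms
    scores -- the individual Random Forest score of each residue
    """
    averaged = []
    for center in coords:
        numerator = 0.0
        denominator = 0.0
        for position, score in zip(coords, scores):
            r = math.dist(center, position)
            if r <= r_max:
                w = linear_weight(r, r_max, w_min)
                numerator += w * score
                denominator += w
        # The central residue itself (r = 0, weight 1) is always included,
        # so the denominator is never zero.
        averaged.append(numerator / denominator)
    return averaged


# Three hypothetical residues on a line, 9 and 30 angstroms from the first
coords = [(0.0, 0.0, 0.0), (9.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
scores = [1.0, 0.0, 0.5]
smoothed = spatially_averaged_scores(coords, scores)
print(smoothed)
```

In this toy example the first two residues pull each other's scores toward a common value, while the third residue has no neighbors within the cutoff and keeps its own score.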
Overall Prediction Accuracy
The overall prediction accuracy of the method was assessed by the area under the Receiver Operating Characteristic (ROC) curve (AUC) for cross-validation results. The ROC curve is a plot of the sensitivity, or true positive rate, versus (1 - specificity), or false positive rate, and displays the tradeoff between these two quantities as the prediction score cutoff is varied. AUC can vary between 0.0 and 1.0. A value of 1.0 indicates perfect accuracy whereas a value near 0.5 indicates performance no better than random.
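For reference, the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise definition directly; it is quadratic in the number of examples and meant only to make the quantity concrete, not to reproduce the software used in the study.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from pairwise comparisons.

    labels -- 1 for binding site residues, 0 for non-binding site residues
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


print(roc_auc([0.9, 0.3, 0.5, 0.1], [1, 1, 0, 0]))
```

A predictor whose scores are systematically reversed gives an AUC below 0.5 under this definition, which is the sense in which results can be anticorrelated with the true labels.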
It is tempting to use the larger quantity of protein-protein binding site data for non-membrane proteins in order to train a predictor for membrane proteins. This was directly tested by training the same prediction method described above on data from a non-redundant set of 4296 non-membrane proteins, sharing less than 30% sequence identity, and making predictions for the membrane protein benchmark set data. The AUC for the prediction was only 0.36. This AUC value is actually less than the random expected value of 0.5 because the prediction results are anticorrelated, i.e. binding site residues are more often predicted as non-binding site residues and vice versa. One explanation of the anticorrelation is that whereas hydrophobic residues are more prevalent in non-membrane protein binding sites they are instead more prevalent on the lipid-exposed non-binding site surfaces of membrane proteins. Likewise, hydrophilic residues are more prevalent on the solvent-exposed non-binding site surface of non-membrane proteins whereas they are more prevalent in the protein-protein binding sites of membrane proteins. This result confirms the expectation that, because the frequencies of surface residue types differ between membrane and non-membrane proteins as a result of the different physicochemical environments of proteins in each class, separate predictors trained on the same class of proteins (membrane or non-membrane) are required in order to achieve good prediction accuracy.
Also, we found in our previous study that central protein-protein binding site residues had higher prediction reliability than those near the periphery of the binding site. This was attributed to three factors: (1) the 14 nearest residues, whose properties are included in the training data, are more likely to also be within the binding site and so provide additional independent data to improve the prediction accuracy, (2) there is some ambiguity in the binding site boundary depending on how the binding site is defined (for example, based on loss of SASA upon forming a complex or intermolecular atomic contacts), and (3) central residues had greater evolutionary conservation than peripheral binding site residues, resulting in a stronger signal.
Here we define a core residue as one for which all other residues within a Cα separation distance of 8 Å belong to the same class (binding site residue or non-binding site residue). Thus core residues can be either inside or outside of the binding sites but are not near the binding site boundaries. The AUC for the core residues alone was 0.86, which is considerably higher than when residues near the binding site boundaries are included. This is consistent with the results of the earlier study, although that study only examined core residues within protein-protein binding sites. However, unlike that study, there was no significant difference in the evolutionary conservation between the core and peripheral (non-core) binding site residues. This implies that the last factor (#3), mentioned above, that contributes to improved prediction performance for core residues in cytosolic proteins does not contribute for membrane proteins. However, the remaining two factors (#1 and #2) probably also contribute to the improved accuracy for core residues in membrane proteins.
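The core residue definition can be made concrete with a short sketch. The function name is hypothetical, and two details are assumptions because the text leaves them open: the distance comparison is inclusive (≤ 8 Å), and a residue with no neighbors inside the cutoff is counted as core.

```python
import math


def core_residue_flags(ca_coords, labels, cutoff=8.0):
    """Flag residues whose neighbors within the cutoff all share their class.

    ca_coords -- C-alpha (x, y, z) coordinates of the surface residues
    labels    -- 1 for binding site residues, 0 for non-binding site residues
    """
    flags = []
    for i, center in enumerate(ca_coords):
        same_class = all(
            labels[j] == labels[i]
            for j, position in enumerate(ca_coords)
            if j != i and math.dist(center, position) <= cutoff
        )
        flags.append(same_class)
    return flags


# Four hypothetical residues on a line; the class boundary lies between
# the second and third residue.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
labels = [1, 1, 0, 0]
print(core_residue_flags(coords, labels))
```

The two residues adjacent to the boundary are flagged as non-core, while the residues away from it, whether inside or outside the binding site, are flagged as core.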
Relative Importance of Residue Properties to Prediction Accuracy
Although the quantity of training data for membrane proteins is considerably less than for non-membrane proteins, the fact that the importance of the column residue frequencies exhibits the same dependence on their frequencies of occurrence suggests that this trend is not due to a lack of sufficient data. Rather, the simplest interpretation of these trends is that the overall abundance of each residue type, which determines how prevalent the residue type is in the training data, generally dominates any differences in residue frequencies between the two classes (binding site and non-binding site residues). For example, even though the statistical tests showed that tyrosine residues are more prevalent in binding sites whereas leucine residues are not, the column frequencies of leucine residues are more important than those of tyrosine residues because the training data contains significantly more leucine residues, thus giving them a larger contribution to the overall prediction accuracy.
We next briefly examine two examples in which the protein-protein binding site predictions aid in identifying or confirming the correct biologically relevant complex from X-ray structures. Again, cross-validation predictions, in which the predictor was trained on data for dissimilar protein complexes, were used in order to provide a realistic assessment of the prediction performance.
The protein-protein binding site prediction method for membrane proteins described in this study was found to yield accuracy that was comparable to that for non-membrane proteins. Although there are considerably fewer experimental structures of membrane proteins than non-membrane proteins, because the predictions are made for individual surface residues there is a sufficient quantity of independent examples for training a Random Forest classifier that gives accurate results. Also, as expected from the different occurrence frequencies of surface residue types in membrane and non-membrane proteins, a predictor trained on non-membrane proteins gave poor accuracy when applied to membrane proteins. Thus separate predictors for membrane and non-membrane proteins are needed. In addition, a prediction procedure that is different from the ones used in previous studies was found to give better accuracy. Random Forest predictions were first made for individual surface residues and then the resulting scores of nearby residues were averaged in order to arrive at the final prediction score. Predictions could not be made for some proteins due to an insufficient number of related protein sequences needed for the multiple sequence alignment; however, this is expected to improve with the rapidly growing number of available protein sequences.
The prediction method presented here is expected to have applications in guiding experimental investigations of membrane protein interactions and also in the prediction of protein complex structures using computational methods such as docking or threading. In addition to these applications, several future areas of investigation are possible. First, because the method relies only on residue-level information, it is expected to give accurate results for homology models, which are generally correct for regions with well-defined secondary structure but often have errors in loops or side chain conformations. A study of the prediction accuracy for homology models of varying quality would help quantify what accuracy can be expected. Second, because the method relies on a multiple sequence alignment of similar sequences, the choice of included sequences can affect the final prediction accuracy. The implicit assumption that the proteins with sequences in the multiple alignment have the same protein-protein binding site may be incorrect, particularly if distantly related sequences are included. It would be useful to have a method for selecting the optimal set of sequences to include in the alignment. Finally, contiguous binding patches could be calculated from the individual residue predictions. This would then give a lower bound on the number of independent binding sites on the protein surface.
This work was funded by the Mayo Clinic.
- 20. Sikic M, Tomic S, Vlahovicek K: Prediction of protein-protein interaction sites in sequences and 3D structures by Random Forests. PLOS Comp Biol 2009, 5(1). 10.1371/journal.pcbi.1000278
- 22. Tusnady GE, Dosztanyi Z, Simon I: PDB_TM: selection and membrane localization of transmembrane proteins in the protein data bank. Nucleic Acids Res 2005, 33(Database issue):D275–278.
- 30. R Development Core Team: R: A language and environment for statistical computing. Vienna, Austria 2009.
- 31. Liaw A, Wiener M: Classification and regression by randomForest. R News 2002, 2(3):18–22. [http://www.r-project.org/doc/Rnews/Rnews_2002-3.pdf]
- 34. Lupo D, Li XD, Durand A, Tomizaki T, Cherif-Zahar B, Matassi G, Merrick M, Winkler FK: The 1.3-Å resolution structure of Nitrosomonas europaea Rh50 and mechanistic implications for NH3 transport by Rhesus family proteins. Proc Natl Acad Sci USA 2007, 104(49):19303–19308. 10.1073/pnas.0706563104
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.