# Regression applied to protein binding site prediction and comparison with classification


## Abstract

### Background

The structural genomics centers provide hundreds of protein structures of unknown function. Therefore, developing methods that automatically determine a protein's function is imperative. The function of a protein can be elucidated by studying the network of its physical interactions. In this context, identifying a potential binding site between proteins is of primary interest. In the literature, methods for predicting the location of a potential binding site are generally based on classification tools. The aim of this paper is to show that regression tools are more efficient than classification tools for patch-based binding site predictors. For this purpose, we developed a patch-based binding site localization method usable with either regression or classification tools.

### Results

We compared the predictive performances of regression tools with those of machine learning classifiers. Using leave-one-out cross-validation, we showed that regression tools provide better predictions than classification ones. Among regression tools, the Multilayer Perceptron ranked highest in prediction quality. We also compared the predictive performance of our patch-based method using the Multilayer Perceptron with that of three other methods usable through a web server. Our method performed similarly to the other methods.

### Conclusion

Regression is more efficient than classification when applied to our binding site localization method. When possible, using regression instead of classification in other existing binding site predictors will probably improve their results. Furthermore, the method presented in this work is flexible because the size of the predicted binding site is adjustable. This adaptability is useful when either the false positive or the false negative rate has to be limited.

## Keywords

Support Vector Machine, Positive Predictive Value, Random Forest, Statistical Tool, Partial Least Square Regression

## List of abbreviations

- PDB
Protein Data Bank

- SES
Solvent Excluded Surface

- SAS
Solvent Accessible Surface

- PPV
Positive Predictive Value

- Lin.Reg.
Multiple Linear Regression

- SVM
Support Vector Machine

- PLSR
Partial Least Square Regression

- PCR
Principal Component Regression

- MLP
Multilayer Perceptron

- RFR
Random Forest Regression

- RFC
Random Forest Classifier

- NBC
Naive Bayes Classifier.

## Background

Structural genomics [1] is an important field, the objective of which is the determination of the three-dimensional structures of all the proteins coded by a genome. Recent advances in this field increase our understanding of protein function [2] but also have an impact on the pharmaceutical industry [3, 4]. The structural genomics centers provide hundreds of protein structures of unknown function [5].

Developing automated protein function predictors working from structure is imperative and is nowadays an active research field in bioinformatics. One of the various approaches for assigning a function to a protein is to study the network of its physical interactions [6, 7]. In this context, identifying a potential binding site between proteins is of primary interest. The localization of such a binding site can also reduce the search space required by protein docking algorithms to predict the best match between two proteins. This localization is also important in studies of interactions between proteins and small molecules, such as ligands or substrates.

Bartlett et al. [8] presented an analysis of the properties of catalytic residues. They showed that catalytic residues share common properties, including the propensity of residue types, conservation and solvent accessibility. Petrova and Wu [9] compared 26 machine learning classifiers to specifically identify the catalytic residues within a whole enzyme. Gutteridge et al. [10] trained a neural network classifier to score the residues of a protein by their likelihood to be catalytic. The location of the active site was then determined by searching for clusters of high-ranking residues.

Keskin et al., in their algorithm PRISM [11], used spatial similarity to predict protein-protein binding site locations. Shulman-Peleg et al. proposed a method, called SiteEngine [12, 13], comparing the properties at different regions of protein surfaces, such as conformational properties, to find proteins with similar functions in databases. A similar approach was taken by Shatsky et al. [14] to recognize binding patterns common to a set of protein structures. Jones and Thornton [15] presented a method to predict protein-protein interaction sites. The prediction was based on the calculation of relative combined scores for patches constructed on the surface of protein structures. Zhou and Shan [16] published a method based on a neural network for predicting individual residues in protein-protein interfaces. Since then, many methods predicting individual residues or patches that overlap with the interface have been proposed [9, 17, 18, 19, 20, 21, 22].

Bradford et al. [18] proposed a protein binding site predictor based on a Support Vector Machine (SVM) discriminating the patches constructed by segmentation of the surface. After the segmentation and the computation of relevant properties for each patch of each protein, the SVM was trained and became usable to distinguish binding site from non-binding site surface parts. Bradford and Westhead [23] used a Bayesian network in combination with a surface patch analysis to design a protein-protein binding site predictor. Petsalaki et al. [24] presented a different approach to predict peptide binding sites on protein surfaces, based on the construction of spatial position-specific scoring matrices for each of the 20 standard amino acids. Several groups compared different tools usable for the prediction of protein-protein binding sites [25, 26].

In the literature, patch-based binding site localization methods generally use supervised classifiers to find the relationship between the patch properties and their overlap with the true binding site. Machine learning or statistical techniques are used to discriminate patches totally superposed with the real binding site from patches sharing no surface with it. In these cases, the response variable is categorical because it only takes two distinct values, so supervised binary classifiers are used to predict the response variable from the predictor variables. However, when predicting the location of the binding site of a protein, many of the constructed patches are only partially superposed with the real binding site. Using regression tools in place of classification tools makes it possible to include partially superposed patches during the training process, which should improve the predictive performance of the model.

The first contribution of this paper is the description of a patch-based protein-protein binding site localization method accommodating either regression or classification tools. The predictive performances of several regression and classification tools are compared using leave-one-out cross-validation. Our method is also compared with three methods usable through web servers. The first method, named SHARP^{2} [19], is a patch-based binding site localization method using a combined scoring function. The second method, named PINUP [20], is based on an empirical scoring function including a side-chain energy term, a solvent accessibility term and a conservation term. The third method, named cons-PPISP [17], is based on a neural network taking a PSI-BLAST sequence profile and solvent accessibility as input. Another contribution is to consider the travel depth as one of the surface properties. The travel depth is a notion introduced by Coleman and Sharp [27] to quantify the depth of pockets within a molecular surface. The third contribution of our method is the construction of the predicted binding site. In the literature, for patch-based methods, the predicted binding site is chosen among the original patches according to the ranking of their probability to be the real binding site. Our method takes into account the output of the statistical model on the whole surface to construct a new patch, which forms the predicted binding site. The size of this predicted binding site can be adjusted according to the application. Whereas a too small predicted binding site increases the risk of false negative results, a too large one increases the risk of false positive results. Limiting false negative results avoids missing a binding site even if the predicted zone is too large, whereas limiting false positive results ensures that the predicted zone is part of the real binding site.

## Methods

The aim of this work is to develop a method for predicting the location of potential binding sites in a protein from its 3D structure. The development of this method is composed of two successive parts: training and application.

Input data of the training part are the PDB files of several protein complexes. External surfaces of the proteins are constructed by using the 3D structures encoded in the PDB files. Some characteristics are extracted and mapped on these surfaces. Then, the surfaces are segmented into patches. The mean values of the properties and an index quantifying the overlap with the real binding site are computed for each patch. Finally, a statistical model is used to find a relationship between the patches properties and their overlap indices.

Input data of the application part are one or more PDB files. Patches are constructed in the same way as in the training part except that the overlap index with the real binding site cannot be computed. The location of the real binding site is indeed unknown. The statistical model previously trained is then used to allocate a score to each patch. Finally, a predicted binding site is located by using the score mapped on the protein surface.

To validate the method, the application part can be tested on a protein whose binding site location is known. In this case, the location is considered unknown during the prediction of the binding site location. The predicted location of the binding site is then compared with the real one to assess the performance of our method.

### PDB

The input data of the algorithm, in both the training and application parts, are a set of protein 3D conformations. For the tests performed in this work, they were downloaded from the Protein Data Bank [28]. Two datasets are used in this work, both composed of known protein complexes. The first one, used to train the models and to compare the statistical tools, is the one used by Bradford et al. [18] (See additional file 1: Dataset1). This dataset is composed of proteins in their bound conformation and had already been filtered at 20% sequence identity. To compare our method with existing methods, the dataset used consists of 35 proteins in the enzyme/inhibitor category of Docking Benchmark 2.0 [29], after filtering at 35% identity (See additional file 2: Dataset2). The genuine binding sites were identified using the bound complexes, but the tests were performed on the unbound structures.

### Surface

With the Van der Waals Surface Model, the electron clouds around atoms are approximated by rigid spheres of different radii, called the Van der Waals radii of the atoms. With the SAS Model, the surface is traced by the possible positions of the center of a ball representing a molecule of solvent (e.g., water) rolling over the protein. The SES Model is comparable but considers the exterior surface of the ball. In this work, we used the SES Model, which is approximated by a 3D mesh. A mesh is a collection of points, edges and faces defining surfaces in a 3D environment. As no conformational changes are taken into account in our method, proteins are considered to be rigid objects. This allows us to work with surface points without considering what occurs inside the protein.

### Binding site

A residue is considered to lie in the binding site if more than 1 Å² of its SES area is buried upon complex formation [31]. As the proteins of the training dataset originate from PDB files containing protein complexes, the locations of their binding sites are already known.
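As an illustrative sketch of this criterion, interface residues can be flagged by comparing per-residue SES areas computed for the unbound protein and for the complex. The input layout and function below are assumptions for illustration, not the paper's implementation:

```python
# Sketch: flag binding site residues from per-residue SES areas (assumed inputs).
BURIED_AREA_CUTOFF = 1.0  # Å², the threshold used in the text

def interface_residues(area_unbound, area_complex, cutoff=BURIED_AREA_CUTOFF):
    """Return residues whose SES area shrinks by more than `cutoff` upon binding.

    `area_unbound` / `area_complex`: dicts mapping residue id -> SES area (Å²),
    computed beforehand by any SES tool.
    """
    return sorted(
        res for res, area in area_unbound.items()
        if area - area_complex.get(res, 0.0) > cutoff
    )

# Toy example: only A:12 loses more than 1 Å² of surface in the complex.
unbound = {"A:12": 35.0, "A:13": 2.0, "A:14": 0.5}
bound = {"A:12": 10.0, "A:13": 1.5, "A:14": 0.5}
print(interface_residues(unbound, bound))  # ['A:12']
```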

### Properties

Properties of different types are assigned to mesh points. These properties can be classified into three groups: geometric properties, composition and conservation score. In order to make the method more uniform, all properties are considered surface point properties: each point of the SES is related to the closest atom center and inherits its properties.

#### Geometric properties

Two geometric properties are considered: the local curvature and the travel depth.

The travel depth of a surface point is defined as the shortest distance that a solvent molecule must travel to reach this point from the convex hull of the SES. It is more global than the curvature while keeping a good surface resolution. It is computed with a surface-based algorithm using octrees [32].

#### Composition

The composition is the proportion of each surface amino acid composing a surface patch. For instance, if a fifth of the points in a patch corresponds to a leucine and the rest to a proline, the composition of the patch is Leu: 0.2, Pro: 0.8, other amino acids: 0. Other physico-chemical properties are redundant with the composition and are not significant for the statistical model.
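This computation reduces to counting the surface points contributed by each residue type, as in this minimal sketch (the list of per-point residue types is an assumed input):

```python
from collections import Counter

def patch_composition(point_residue_types):
    """Fraction of a patch's surface points contributed by each amino acid type."""
    counts = Counter(point_residue_types)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

# A fifth of the points on a leucine, the rest on a proline.
points = ["LEU"] * 2 + ["PRO"] * 8
print(patch_composition(points))  # {'LEU': 0.2, 'PRO': 0.8}
```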

#### Conservation score

The conservation score is a property related to structural conservation throughout evolution. Key amino acid positions are often under strong evolutionary constraints: they are important for maintaining the 3D structure of a protein and/or its functions. Thus, the biological importance of a residue is often correlated with its level of evolutionary conservation within the protein family. The web-based tool ConSurf http://consurf.tau.ac.il/ was used to calculate the evolutionary conservation score of each residue [33].

### Patches

The surface is segmented into patches. Each patch is assigned an overlap index, the Positive Predictive Value (*PPV*), reflecting its similarity with the true binding site:

PPV = S(P ∩ I)/S(P),

where *S* denotes the area on the SES, *P* is the analyzed patch and *I* is the true binding site. To give an intuitive vision of this measure, *PPV* is equal to one when the patch is completely included in the binding site. In the training part, the true binding site is also considered as an additional patch.
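On a triangulated surface, the areas in this definition can be approximated by counting mesh points; this sketch assumes patches and the binding site are represented as sets of point identifiers:

```python
def patch_ppv(patch_points, site_points):
    """Approximate PPV of a patch: fraction of its points inside the true site."""
    return len(patch_points & site_points) / len(patch_points)

# A patch fully inside the binding site has PPV 1; half inside gives 0.5.
site = {1, 2, 3, 4, 5}
print(patch_ppv({2, 3}, site))        # 1.0
print(patch_ppv({4, 5, 6, 7}, site))  # 0.5
```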

### Statistical tools

The objective of the statistical model is to assign a score to patches of a protein of unknown binding site location.

During the training part, the input data of the statistical model are the patches associated with their PPV and the properties extracted in the previous steps. The objective of the training is to find a relationship between the extracted patch properties (predictor variables) and the patch PPVs (response variable). This relationship is used afterwards, in the application part, to assign a score to the patches resulting from the segmentation of a target protein.

In this section, different statistical tools are presented. They can be separated into two categories: the classifiers, which are more commonly used to treat this kind of problem, and the regression tools. The output of a classifier is the probability of each patch to represent the real binding site. The output of a regression model is a predicted PPV for each patch. The statistical model output, whether a probability or a predicted PPV, is called the score of the patch. The score has to be highly correlated with the genuine PPV for the method to succeed.

#### Multiple linear regression (Lin. Reg.)

Multiple Linear Regression [34] is a statistical technique that fits the relationship between the response variable and the predictor variables, usually, by minimizing the sum of the squared deviations. The best multiple linear model can be found by using a stepwise selection of the predictor variables.

#### Partial least square regression (PLSR)

Partial Least Square Regression [35] is an extension of the multiple linear regression. Using multiple linear regression with a high number of predictor factors leads to over-fitting: the model fits the sampled data closely but fails to correctly predict new data responses. In a partial least square regression, a few latent factors are extracted and replace the original predictor factors for the response fitting. Partial least square is useful to find a good predictive model without necessarily understanding the existing relationship between variables.

#### Principal component regression (PCR)

Principal Component Regression [36] uses principal component analysis to fit the relationship between the predictor variables and the response variable. The first step is to compute the principal components of the predictor variables. The second step is the regression on a subset of the principal components.

#### Multilayer perceptron (MLP)

Multilayer Perceptron [37] is a machine learning technique using multiple layers of neurons. A neuron is a simple processing element, connected to the neurons of the previous and of the following layers. The connections between the neurons are characterized by weights which are adjusted during the training step. Multilayer Perceptrons can be used for regression as well as for supervised classification task. In this work, it was used as a regression tool.

#### Random forest for regression (RFR) and classification (RFC)

Random Forest [38] is a machine learning tool used for classification and regression tasks. In both cases, a random forest is a set of trees created by bootstrapping samples of the training data set. Random forests for regression use a set of regression trees. In this case, the prediction is made by averaging the predictions of the different trees. Random forests for classification use a set of classification trees. In this case, the prediction is made by a majority vote on the predictions of the different trees.

Regression trees and classification trees are both decision trees. A decision tree is a technique used to predict a response variable using a set of predictor variables. Regression trees are used for continuous response variables, whereas classification trees are used for discrete response variables. In both cases, the tree uses a series of *"if then else"* conditions based on the values of the predictors to predict the response. During the training phase, the decision tree is built through an iterative process of data splitting, which continues until each node reaches a fixed minimum size. For instance, if there are three predictor variables, the data can be separated from the root according to the value of the first variable. At each new node, the data can then be separated according to the value of other variables, and at children nodes they can be separated according to the first variable again, and so on. During the prediction phase, a new data point starts from the root and follows the branches corresponding to the values of its predictor variables. The leaves determine the value of the response variable.
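The data-splitting step can be illustrated on a single predictor: pick the threshold that minimizes the total squared error of the two groups it creates. This is a toy sketch; a real regression tree applies such splits recursively over all predictors:

```python
def best_split(xs, ys):
    """Return the threshold on x minimizing the summed squared error of y
    within the two groups it creates (one step of regression tree training)."""
    def sse(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    pairs = sorted(zip(xs, ys))
    best_err, best_thr = float("inf"), None
    for i in range(1, len(pairs)):
        err = sse([y for _, y in pairs[:i]]) + sse([y for _, y in pairs[i:]])
        if err < best_err:
            best_err = err
            best_thr = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_thr

# Low responses on the left, high on the right: the split lands between 0.3 and 0.8.
xs = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
ys = [0.0, 0.0, 0.1, 0.9, 1.0, 1.0]
print(best_split(xs, ys))  # ≈ 0.55
```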

#### Naive bayes classifier (NBC)

Naive Bayes Classifier [39] is a machine learning tool using Bayes' theorem to compute the probability that a case belongs to a category. The NBC relies on a conditional independence assumption: given the value of the response variable, the predictor variables are assumed to be independent. Although this assumption rarely holds in reality, the naive Bayes classifier often performs better than expected.

#### Support vector machine (SVM)

Support Vector Machine [40] is a machine learning tool used for classification and regression. When used for binary classification, the objective of the SVM is to map the training data into a property space with the aid of a kernel function, and to construct an N-dimensional hyperplane that optimally separates the cases according to the category of their response variable.

### Predicted binding site construction

After the statistical analyses, the score assigned to each patch is used to construct a score map on the protein SES. Each point of the surface is associated with the geodesically closest patch center and receives the score assigned to that patch. Next, the points of the mesh representing the SES are added successively to the predicted binding site in descending order of score. The growth of the predicted binding site stops when either a predetermined size or a score threshold is reached.
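A minimal sketch of this greedy growth, assuming each mesh point carries a score and an area contribution (both hypothetical inputs), stopping at a predetermined size:

```python
def grow_predicted_site(point_scores, point_areas, max_area):
    """Add mesh points in descending score order until `max_area` is reached."""
    site, area = [], 0.0
    for point in sorted(point_scores, key=point_scores.get, reverse=True):
        if area + point_areas[point] > max_area:
            break
        site.append(point)
        area += point_areas[point]
    return site

scores = {"a": 0.9, "b": 0.8, "c": 0.1}
areas = {"a": 2.0, "b": 2.0, "c": 2.0}
print(grow_predicted_site(scores, areas, max_area=4.0))  # ['a', 'b']
```

The stopping rule makes the predicted site size adjustable, which is the flexibility discussed in the Conclusion.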

## Results and discussion

The two major consecutive estimation steps of our method are the statistical prediction and the predicted binding site construction. The predicted binding site construction is divided into two steps: the score mapping on the SES and the thresholding of the score map. This section presents results comparing, on the one hand, the different statistical tools and, on the other hand, our method with other existing methods. The different statistical tools were compared after the statistical prediction by calculating the correlation coefficient between predictions and real PPVs. Next, they were compared after the mapping step by analyzing the score distribution inside and outside the genuine binding site. Finally, overlap indices between the predicted and the real binding site were computed. These indices were used to compare the results obtained with the different statistical models, as well as to compare the results of our method with those of other existing methods.

### Statistical prediction

The performance of our method strongly depends on the ability of the statistical model to correctly classify a patch or to predict the corresponding PPV. For both classification and regression tools, the output is a number between 0 and 1 called the *score* of the patch.

Each classification and regression tool was evaluated through leave-one-out cross-validation, which is an iterated process. At each step, the data set, comprising the patches and their properties, was divided into two subsets: a test data set, which contained the patches of one protein, and a training data set, which contained the patches of all the other proteins. The statistical model was trained on the training data set and used to assign a score to the patches of the test data set. The scores were then compared to the real PPVs through the computation of the Pearson correlation coefficient. The protein in the test data set was different at each iteration, until predictions and correlation coefficients were obtained for the patches of each protein in the complete data set.
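The grouping of patches by protein during cross-validation can be sketched as follows; the tuple layout `(protein_id, features, ppv)` is an assumption of this sketch:

```python
def leave_one_protein_out(patches):
    """Yield (held_out_protein, train, test) splits; each test set contains
    all patches of exactly one protein, mirroring leave-one-out by protein."""
    proteins = sorted({protein for protein, _, _ in patches})
    for held_out in proteins:
        train = [p for p in patches if p[0] != held_out]
        test = [p for p in patches if p[0] == held_out]
        yield held_out, train, test

patches = [("1abc", [0.1], 0.7), ("1abc", [0.4], 0.2), ("2xyz", [0.9], 0.9)]
for protein, train, test in leave_one_protein_out(patches):
    print(protein, len(train), len(test))
```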

### Binding site localization

After statistical prediction, a predicted binding site was constructed in two steps, namely the score map construction and the thresholding on the score map. The score map construction was validated by comparing the score distribution inside and outside the genuine binding site, using a histogram. The thresholding on the score map was validated by analyzing the PPV and the sensitivity of the resulting predicted binding site. The validation results for the different statistical tools are compared in this section.

#### Score map histograms

A histogram is given for the case where the exact PPVs are taken as the score, another represents the expected values for a random method, and each of the others corresponds to a specific statistical tool. Red hatched bars represent the proportion of scores for surface points belonging to the binding site, and green plain bars the proportion for points belonging to the rest of the surface. An ideal histogram would show 100% of the observations in the first bin for the non-site parts and 100% of the observations in the last bin for the site parts. In the histograms corresponding to classification (RFC, SVM, NBC), many binding site parts have a small score, which leads to many false negative results. On the other hand, in the histograms corresponding to regression (Lin. Reg., RFR, MLP, PLSR, PCR), as in the "Exact" histogram, the repartition is more distinct between site and non-site zones, which makes the prediction more accurate.

#### Binding site prediction

The predicted binding site *P* was compared with the true binding site *I* using two indices, the *PPV* and the sensitivity:

PPV = S(P ∩ I)/S(P), Sensitivity = S(P ∩ I)/S(I),

where *S* denotes the area on the SES, *P* is the analyzed patch and *I* is the true binding site. Our whole method was tested using different statistical tools. A Precision-Recall graph comparing the results for all these tools appears in Figure 7. Curves corresponding to regression tools generally lie above those corresponding to classification tools, which is consistent with the previous observations.

These indices were compared with their expected values for a randomly placed patch:

E(PPV) = S(I)/S_{T} and E(Sensitivity) = S(P)/S_{T},

where *S*_{T} is the whole surface area. *S*(*I*) and *S*(*P*) are considered constant, so in these formulas the only variable is the overlap between *P* and *I*. The expected proportion of *P* lying inside *I* is the same as anywhere else on the surface, so *E*(*S*(*P* ∩ *I*)) is this proportion weighted by *S*(*I*), i.e. *E*(*S*(*P* ∩ *I*)) = *S*(*P*)*S*(*I*)/*S*_{T}. The reasoning is the same for the proportion of *I* inside *P*. To quantify the success rate of our method with each statistical tool, the percentage of proteins of the data set with a higher index than expected (via random selection) was calculated (Table 1). As for the other tests, the results are better for regression than for classification. In 84% of the cases, the MLP gave a better result than the expected value, whereas the best classification tool (NBC) reached 78%.

Success rate for different statistical tools.

Statistical Tool | Success Rate (%)
---|---
Exact | 100
Lin. Reg. | 81.67
PLSR | 82.78
PCR | 76.67
RFR | 77.78
MLP | 83.89
SVM | 75.00
NBC | 77.78
RFC | 69.44
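Under the random-placement reasoning above, E(S(P ∩ I)) = S(P)·S(I)/S_T, from which the expected indices follow directly; a small sketch with assumed areas:

```python
def expected_indices(area_patch, area_site, area_total):
    """Expected PPV and sensitivity of a randomly placed patch,
    using E[overlap] = S(P) * S(I) / S_T."""
    expected_overlap = area_patch * area_site / area_total
    return expected_overlap / area_patch, expected_overlap / area_site

# A 100 Å² patch and a 50 Å² site on a 1000 Å² surface.
print(expected_indices(100.0, 50.0, 1000.0))  # (0.05, 0.1)
```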

#### Comparison with other methods

Our method using the MLP was compared with three methods usable through web servers: cons-PPISP [17], PINUP [20] and SHARP^{2} [19]. These three methods return a score for each residue. These scores were mapped on the protein surfaces and the binding site localizations were predicted as was done for the scores resulting from the different statistical models. The tests were performed on the second dataset (See additional file 2: Dataset2). A Precision-Recall graph comparing the results for all these methods appears in Figure 9. The performance of our method using the MLP is higher than that of the SHARP^{2} method and comparable to those of the PINUP and cons-PPISP methods. The percentage of proteins of the data set with a higher index than expected (via random selection) was also calculated (Table 2). For both our method using the MLP and the PINUP method, the result was better than expected in 71% of the cases. These results are slightly worse than for the other dataset, probably because the training dataset was made of proteins in their bound conformations.

Success rate for different methods.

Method | Success Rate (%)
---|---
MLP | 71.43
Cons-PPISP | 69.70
PINUP | 71.43
SHARP^{2} | 62.86

## Conclusion

In this paper, we present a patch-based binding site prediction method using either classification or regression tools. This work was motivated by the fact that the patch-based methods presented in the literature generally use classification tools. In that case, a binary classifier is trained to discriminate patches totally superposed with the real binding site from patches sharing no surface with it. Using regression instead of classification makes it possible to include patches partially overlapping the genuine binding site during the training step. The variable estimated by the regression is the overlap between the real binding site and the patches constructed on the protein surface.

Using leave-one-out cross-validation, we showed that regression tools have better predictive performance than classification ones. As the patches constructed during the application step partially overlap the genuine binding site, the predictions for these patches are generally more correlated with their PPV when regression is used. Among regression tools, the Multilayer Perceptron is the most efficient. In 84% of cases on dataset 1 (See additional file 1: Dataset1), the method using an MLP for regression gave a better prediction than the value expected via random selection. Our method used with the MLP was also compared with three methods usable through a web server. It performed better than SHARP^{2}, which is also a patch-based method, and equivalently to the two other methods.

To sum up, regression tools appeared to be more efficient than classification tools for a new patch-based method comparable with existing binding site prediction methods. When possible, using regression instead of classification in other predictors will probably improve their results, not only when patches are used, but whenever the output is a continuous variable.

The method presented in this paper is also flexible. Indeed, the final predicted binding site is obtained by applying a threshold to the surface score map derived from the predictions of the regression or classification tool. So, the size of the final predicted binding site is adaptable to the approximate false positive and false negative rates required by the final application. Moreover, the output of our method is a single predicted binding site instead of a list of top-ranked patches [18, 23] or residues [9].

In future works, our method will be applied on real cases and combined with other bioinformatics tools. For instance, the combination of our method with dynamic molecular simulation will be used to study the impact of residue mutation on the location of the binding site.

## Notes

### Acknowledgements

JG is funded by Nanotic/Tsarine, a project of the Région Wallonne of Belgium (convention number: 516248). JA is funded by Nanotic/Dediccas, a project of the Région Wallonne of Belgium (convention number: 516250).

## Supplementary material

## References

- 1. Zhang C, Kim S: Overview of structural genomics: from structure to function. *Current Opinion in Chemical Biology* 2003, 7: 28–32. 10.1016/S1367-5931(02)00015-7
- 2. Watson J, Laskowski R, Thornton J: Predicting protein function from sequence and structural data. *Current Opinion in Structural Biology* 2005, 15(3):275–284. 10.1016/j.sbi.2005.04.003
- 3. Drews J: Drug Discovery: A Historical Perspective. *Science* 2000, 287(5460):1960–4. 10.1126/science.287.5460.1960
- 4. Fetrow J, Skolnick J: Method for prediction of protein function from sequence using the sequence-to-structure-to-function paradigm with application to Glutaredoxins/Thioredoxins and T1 Ribonucleases. *Journal of Molecular Biology* 1998, 281(5):949–968. 10.1006/jmbi.1998.1993
- 5. Sali A: 100,000 protein structures for the biologist. *Nature Structural Biology* 1998, 5(12):1029–1031. 10.1038/4136
- 6. Keskin O, Nussinov R: Similar binding sites and different partners: implications to shared proteins in cellular pathways. *Structure* 2007, 15(3):341–354. 10.1016/j.str.2007.01.007
- 7. Keskin O, Gursoy A, Ma B, Nussinov R, et al.: Principles of Protein-Protein Interactions: What are the Preferred Ways For Proteins To Interact? *Chem Rev* 2008, 108(4):1225–1244. 10.1021/cr040409x
- 8. Bartlett G, Porter C, Borkakoti N, Thornton J: Analysis of catalytic residues in enzyme active sites. *Journal of Molecular Biology* 2002, 324: 105–121. 10.1016/S0022-2836(02)01036-7
- 9. Petrova N, Wu C: Prediction of catalytic residues using Support Vector Machine with selected protein sequence and structural properties. *BMC Bioinformatics* 2006, 7:312.
- 10. Gutteridge A, Bartlett G, Thornton J: Using A Neural Network and Spatial Clustering to Predict the Location of Active Sites in Enzymes. *Journal of Molecular Biology* 2003, 330(4):719–734. 10.1016/S0022-2836(03)00515-1
- 11. Keskin O, Nussinov R, Gursoy A: PRISM: Protein-Protein Interaction Prediction by Structural Matching. *Methods in Molecular Biology* 2008, 484: 505–21.
- 12. Shulman-Peleg A, Nussinov R, Wolfson H: Recognition of functional sites in protein structures. *Journal of Molecular Biology* 2004, 339(3):607–633. 10.1016/j.jmb.2004.04.012
- 13. Shulman-Peleg A, Nussinov R, Wolfson H: SiteEngines: recognition and comparison of binding sites and protein-protein interfaces. *Nucleic Acids Research* 2005, 33(Web Server issue):W337. 10.1093/nar/gki482
- 14. Shatsky M, Shulman-Peleg A, Nussinov R, Wolfson H: The multiple common point set problem and its application to molecule binding pattern detection. *Journal of Computational Biology* 2006, 13(2):407–428. 10.1089/cmb.2006.13.407
- 15. Jones S, Thornton J: Prediction of protein-protein interaction sites using patch analysis. *Journal of Molecular Biology* 1997, 272: 133–143. 10.1006/jmbi.1997.1233
- 16. Zhou H, Shan Y: Prediction of protein interaction sites from sequence profile and residue neighbor list. *Proteins: Structure, Function, and Genetics* 2001, 44(3):336–43. 10.1002/prot.1099
- 17. Chen H, Zhou HX: Prediction of interface residues in protein-protein complexes by a consensus neural network method: test against NMR data. *Proteins* 2005, 61: 21–35. 10.1002/prot.20514
- 18. Bradford JR, Westhead DR: Improved prediction of protein-protein binding sites using a support vector machines approach. *Bioinformatics* 2005, 21(8):1487–1494. 10.1093/bioinformatics/bti242
- 19. Murakami Y, Jones S: SHARP2: protein-protein interaction predictions using patch analysis. *Bioinformatics* 2006, 22(14):1794–1795. 10.1093/bioinformatics/btl171
- 20. Liang S, Zhang C, Liu S, Zhou Y: Protein binding site prediction using an empirical scoring function. *Nucleic Acids Research* 2006, 34(13):3698–3707. 10.1093/nar/gkl454
- 21. Qin S, Zhou H: meta-PPISP: a meta web server for protein-protein interaction site prediction.
*Bioinformatics*2007, 23(24):3386–3387. 10.1093/bioinformatics/btm434CrossRefPubMedGoogle Scholar - 22.Li N, Sun Z, Jiang F: Prediction of protein-protein binding site by using core interface residue and support vector machine.
*BMC bioinformatics*2008, 9: 553. 10.1186/1471-2105-9-553PubMedCentralCrossRefPubMedGoogle Scholar - 23.Bradford JR, Needham CJ, Bulpitt AJ, Westhead DR: Insights into Protein-Protein Interfaces using a Bayesian Network Prediction Method.
*Journal of Molecular Biology*2006, 362(2):365–386. 10.1016/j.jmb.2006.07.028CrossRefPubMedGoogle Scholar - 24.Petsalaki E, Stark A, García-Urdiales E, Russell R: Accurate Prediction of Peptide Binding Sites on Protein Surfaces.
*PLoS Computational Biology* 2009, 5(3):e1000335. doi:10.1371/journal.pcbi.1000335
- 25. Zhou H, Qin S: Interaction-site prediction for protein complexes: a critical assessment. *Bioinformatics* 2007, 23(17):2203–2209. doi:10.1093/bioinformatics/btm323
- 26. Tuncbag N, Kar G, Keskin O, Gürsoy A, Nussinov R: A survey of available tools and web servers for analysis of protein-protein interactions and interfaces. *Briefings in Bioinformatics* 2009, 10(3):217–232. doi:10.1093/bib/bbp001
- 27. Coleman RG, Sharp KA: Travel Depth, a New Shape Descriptor for Macromolecules: Application to Ligand Binding. *Journal of Molecular Biology* 2006, 362(3):441–458. doi:10.1016/j.jmb.2006.07.022
- 28. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. *Nucleic Acids Research* 2000, 28:235–242. doi:10.1093/nar/28.1.235
- 29. Mintseris J, Wiehe K, Pierce B, Anderson R, Chen R, Janin J, Weng Z: Protein-protein docking benchmark 2.0: an update.
*Proteins* 2005, 60(2):214–216.
- 30. Connolly ML: Molecular surfaces: A review. *Network Science* 1996. [http://www.netsci.org/Science/Compchem/feature14.html]
- 31. Jones S, Thornton J: Principles of protein-protein interactions. *Proceedings of the National Academy of Sciences of the United States of America* 1996, 93:13–20. doi:10.1073/pnas.93.1.13
- 32. Giard J, Alface P, Macq B: Fast and accurate travel depth estimation for protein active site prediction. *Proceedings of SPIE* 2008, 6812:68120Q.
- 33. Landau M, Mayrose I, Rosenberg Y, Glaser F, Martz E, Pupko T, Ben-Tal N: ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures. *Nucleic Acids Research* 2005, 33(Web Server issue):W299–W302. doi:10.1093/nar/gki370
- 34. Vittinghoff E, Glidden D, Shiboski S, McCulloch C: *Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models*. New York: Springer; 2005.
- 35. Tobias R: An introduction to partial least squares regression. In *Proceedings of the Twentieth Annual SAS Users Group International Conference*. Cary, NC: SAS Institute Inc; 1995:1250–1257.
- 36. Xie Y, Kalivas J: Evaluation of principal component selection methods to form a global prediction model by principal component regression. *Analytica Chimica Acta* 1997, 348(1–3):19–27. doi:10.1016/S0003-2670(97)00035-4
- 37. Zhang G, Patuwo BE, Hu MY: Forecasting with artificial neural networks: The state of the art. *International Journal of Forecasting* 1998, 14:35–62. doi:10.1016/S0169-2070(97)00044-7
- 38. Chen C, Liaw A, Breiman L: Using random forest to learn imbalanced data. Technical report, Statistics Department, University of California at Berkeley; 2004.
- 39. Zhang H, Su J: Naive Bayesian Classifiers for Ranking. In *Proceedings of the European Conference on Machine Learning*; 2004:501–512.
- 40. Gunn S: Support Vector Machines for Classification and Regression. Technical report, School of Electronics and Computer Science, Faculty of Engineering, Science and Mathematics; 1998.

## Copyright information

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.