Machine Learning Scoring Functions Based on Random Forest and Support Vector Regression
Abstract
Accurately predicting the binding affinities of large sets of diverse molecules against a range of macromolecular targets is an extremely challenging task. Scoring functions that attempt this computational prediction from structural data are essential for analysing the outputs of molecular docking, which is in turn an important technique for drug discovery, chemical biology and structural biology. Conventional scoring functions assume a predetermined, theory-inspired functional form for the relationship between the variables that characterise the complex and its predicted binding affinity. The inherent problem with this approach is the difficulty of explicitly modelling the various contributions of intermolecular interactions to binding affinity.
Recently, a new family of 3D structure-based regression models for binding affinity prediction has been introduced that circumvents the need for such modelling assumptions. These machine learning scoring functions have been shown to widely outperform conventional scoring functions. However, to date no direct comparison among machine learning scoring functions has been made. Here, the performance of the two most popular machine learning scoring functions for this task is analysed under exactly the same experimental conditions.
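As a rough illustration of what comparing two regressors "under exactly the same experimental conditions" can look like, the sketch below fits a random forest and a support vector regression model (the two methods named in the title) on one shared training set and scores both on one shared test set with the same metrics. Everything specific here is an assumption made for illustration: the use of scikit-learn, the synthetic features and affinities, the hyperparameters, and the choice of Pearson correlation and RMSE. It is not the authors' protocol or data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit on the training set; return Pearson R and RMSE on the test set."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    pearson_r = float(np.corrcoef(y_test, pred)[0, 1])
    rmse = float(np.sqrt(np.mean((y_test - pred) ** 2)))
    return {"pearson_r": pearson_r, "rmse": rmse}


# Synthetic stand-in data (hypothetical): 10 count-like features per complex.
rng = np.random.default_rng(0)
X = rng.poisson(lam=20.0, size=(400, 10)).astype(float)
# Hypothetical nonlinear "affinity" with a little noise; not real measurements.
y = 0.05 * X[:, 0] + 0.002 * X[:, 1] * X[:, 2] + rng.normal(0.0, 0.1, 400)

# One fixed train/test partition, shared by both models.
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Hyperparameters are arbitrary placeholders, not tuned values.
models = {
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    # SVR is sensitive to feature scale, so it is wrapped with a scaler.
    "svr": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1)),
}

# Same data, same split, same metrics for every model.
results = {
    name: evaluate(model, X_train, y_train, X_test, y_test)
    for name, model in models.items()
}
for name, metrics in results.items():
    print(name, metrics)
```

Holding the partition, features and metrics fixed means any difference in the reported numbers comes from the regression method rather than from the evaluation setup.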
Keywords
molecular docking; scoring functions; machine learning; chemical informatics; structural bioinformatics