Machine Learning Scoring Functions Based on Random Forest and Support Vector Regression

  • Pedro J. Ballester
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7632)


Accurately predicting the binding affinities of large sets of diverse molecules against a range of macromolecular targets is an extremely challenging task. Scoring functions, which attempt this prediction computationally from structural data, are essential for analysing the outputs of molecular docking, itself an important technique for drug discovery, chemical biology and structural biology. Conventional scoring functions assume a predetermined, theory-inspired functional form for the relationship between the variables that characterise the complex and its predicted binding affinity. The inherent problem with this approach is the difficulty of explicitly modelling the various contributions of intermolecular interactions to binding affinity.

Recently, a new family of 3D structure-based regression models for binding affinity prediction has been introduced that circumvents the need for such modelling assumptions. These machine learning scoring functions have been shown to widely outperform conventional scoring functions. However, to date no direct comparison among machine learning scoring functions has been made. Here the performance of the two most popular machine learning scoring functions for this task, based on Random Forest and Support Vector Regression, is analysed under exactly the same experimental conditions.
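As a rough illustration of what comparing two regression models "under exactly the same experimental conditions" can look like, the sketch below fits a Random Forest and a Support Vector Regression model on one shared train/test split and reports the Pearson correlation and RMSE for each. It is not the chapter's actual protocol: the synthetic data, the feature count, the function names `make_synthetic_complexes` and `compare_models`, and all hyperparameters are assumptions chosen for illustration, and scikit-learn is used here simply as a convenient library.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def make_synthetic_complexes(n_samples=300, n_features=10, seed=0):
    """Hypothetical stand-in data: count-like features and a nonlinear 'affinity'.

    Real work would use descriptors computed from protein-ligand structures
    and measured binding affinities instead.
    """
    rng = np.random.default_rng(seed)
    X = rng.poisson(lam=5.0, size=(n_samples, n_features)).astype(float)
    noise = rng.normal(scale=0.5, size=n_samples)
    y = 0.3 * X[:, 0] + 0.1 * X[:, 1] * X[:, 2] + noise
    return X, y


def compare_models(X, y, n_train=200, seed=0):
    """Fit both models on one shared split; return {name: (pearson_r, rmse)}."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    train, test = order[:n_train], order[n_train:]

    # Illustrative hyperparameters only; a real study would tune them.
    models = {
        "random_forest": RandomForestRegressor(n_estimators=200, random_state=seed),
        # SVR is sensitive to feature scale, so standardise inside a pipeline.
        "svr": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1)),
    }

    results = {}
    for name, model in models.items():
        model.fit(X[train], y[train])
        pred = model.predict(X[test])
        pearson_r = float(np.corrcoef(y[test], pred)[0, 1])
        rmse = float(np.sqrt(np.mean((y[test] - pred) ** 2)))
        results[name] = (pearson_r, rmse)
    return results


if __name__ == "__main__":
    X, y = make_synthetic_complexes()
    for name, (r, rmse) in compare_models(X, y).items():
        print(f"{name}: Pearson r = {r:.3f}, RMSE = {rmse:.3f}")
```

The key design choice is that both models see the same training indices, the same test indices and the same metrics, so any difference in the reported numbers comes from the models rather than from the evaluation setup.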


Keywords: molecular docking, scoring functions, machine learning, chemical informatics, structural bioinformatics



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Pedro J. Ballester, European Bioinformatics Institute, Cambridge, UK
