Application of Breiman’s Random Forest to Modeling Structure-Activity Relationships of Pharmaceutical Molecules

  • Vladimir Svetnik
  • Andy Liaw
  • Christopher Tong
  • Ting Wang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3077)


Leo Breiman’s Random Forest ensemble learning procedure is applied to the problem of Quantitative Structure-Activity Relationship (QSAR) modeling for pharmaceutical molecules. This entails using a quantitative description of a compound’s molecular structure to predict that compound’s biological activity as measured in an in vitro assay. Without any parameter tuning, Random Forest with default settings performs as well as or better than three other prominent QSAR methods on six publicly available data sets: Decision Tree, Partial Least Squares, and Support Vector Machine. In addition to reliable prediction accuracy, Random Forest provides variable importance measures, which can be used in a variable reduction wrapper algorithm. Comparisons among several such wrappers, and between Random Forest and Bagging, are presented.
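
As a rough illustration of the idea of a variable reduction wrapper driven by Random Forest importance, the sketch below repeatedly fits a forest with library defaults, ranks the descriptors, and keeps the top half. It is not the authors' procedure: it assumes scikit-learn's `RandomForestClassifier` rather than the software used in the paper, uses that library's impurity-based `feature_importances_` as the importance measure, and runs on synthetic data from `make_classification`. The function name `reduce_variables` and the parameters `drop_fraction` and `min_vars` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def reduce_variables(X, y, drop_fraction=0.5, min_vars=2, seed=0):
    """Hypothetical wrapper: shrink the descriptor set step by step.

    At each step a forest is fit on the currently kept columns, the
    out-of-bag (OOB) error is recorded, and the least important
    columns are dropped. Returns a list of (kept_indices, oob_error).
    """
    kept = np.arange(X.shape[1])
    history = []
    while True:
        # Library defaults apart from enabling the OOB score and
        # fixing the seed so that the run is repeatable.
        forest = RandomForestClassifier(oob_score=True, random_state=seed)
        forest.fit(X[:, kept], y)
        history.append((kept.copy(), 1.0 - forest.oob_score_))
        if len(kept) <= min_vars:
            break
        # Keep the most important fraction of the remaining columns.
        n_keep = max(min_vars, int(len(kept) * (1 - drop_fraction)))
        order = np.argsort(forest.feature_importances_)[::-1]
        kept = np.sort(kept[order[:n_keep]])
    return history


# Synthetic stand-in for a descriptor matrix: 4 informative columns
# and 12 irrelevant ones (an assumption made purely for illustration).
X, y = make_classification(
    n_samples=200, n_features=16, n_informative=4,
    n_redundant=0, random_state=0,
)
history = reduce_variables(X, y)
for kept, err in history:
    print(len(kept), "variables, OOB error", round(err, 3))
```

Because the OOB error recorded here is computed on the same data that produced the importance ranking, it is likely to be an optimistic estimate of the reduced model's error; an honest estimate would repeat the whole reduction inside an outer cross-validation loop.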


Support Vector Machine; Partial Least Squares; Variable Reduction; Irrelevant Variable; Variable Importance Measure





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Vladimir Svetnik (1)
  • Andy Liaw (1)
  • Christopher Tong (1)
  • Ting Wang (1)
  1. Biometrics Research RY33-300, Merck & Co., Inc., Rahway, USA
