Application of Breiman’s Random Forest to Modeling Structure-Activity Relationships of Pharmaceutical Molecules
Leo Breiman’s Random Forest ensemble learning procedure is applied to the problem of Quantitative Structure-Activity Relationship (QSAR) modeling for pharmaceutical molecules. This entails using a quantitative description of a compound’s molecular structure to predict that compound’s biological activity as measured in an in vitro assay. Without any parameter tuning, the performance of Random Forest with default settings on six publicly available data sets is already as good as or better than that of three other prominent QSAR methods: Decision Tree, Partial Least Squares, and Support Vector Machine. In addition to reliable prediction accuracy, Random Forest provides variable importance measures that can be used in a variable reduction wrapper algorithm. Comparisons among several such wrappers, and between Random Forest and Bagging, are presented.
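The idea of a variable reduction wrapper driven by importance scores can be illustrated with a small, dependency-free sketch. It is not the procedure evaluated in the paper. Everything here is an assumption made for illustration: the forest is a toy ensemble of depth-1 regression trees (stumps) rather than fully grown trees, importance is the accumulated reduction in squared error rather than a permutation-based measure, the wrapper simply halves the variable set at each step, and the data are synthetic. The names `fit_stump`, `fit_forest`, `oob_mse`, and `reduce_variables` are hypothetical.

```python
import random

def fit_stump(X, y, rows, feats):
    """Best single-split regression stump over the candidate features."""
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for f in feats:
        values = sorted(set(X[i][f] for i in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y[i] for i in rows if X[i][f] <= t]
            right = [y[i] for i in rows if X[i][f] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - lm) ** 2 for v in left)
                   + sum((v - rm) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, f, t, lm, rm)
    return best

def fit_forest(X, y, feats, n_trees=100, seed=0):
    """Toy forest: bootstrap rows, random feature subset, one stump per tree."""
    rng = random.Random(seed)
    n = len(X)
    mtry = max(1, len(feats) // 3)  # assumed subset size for regression
    stumps, importance = [], {f: 0.0 for f in feats}
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        stump = fit_stump(X, y, rows, rng.sample(feats, mtry))
        if stump is None:
            continue
        sse, f, t, lm, rm = stump
        mean = sum(y[i] for i in rows) / n
        total = sum((y[i] - mean) ** 2 for i in rows)
        importance[f] += total - sse                  # squared-error reduction
        oob = set(range(n)) - set(rows)               # rows this tree never saw
        stumps.append((f, t, lm, rm, oob))
    return stumps, importance

def oob_mse(stumps, X, y):
    """Mean squared error using only out-of-bag predictions for each row."""
    errs = []
    for i in range(len(X)):
        preds = [lm if X[i][f] <= t else rm
                 for f, t, lm, rm, oob in stumps if i in oob]
        if preds:
            errs.append((sum(preds) / len(preds) - y[i]) ** 2)
    return sum(errs) / len(errs)

def reduce_variables(X, y, n_keep=1):
    """Refit, rank by importance, keep the top half; repeat until n_keep remain."""
    feats = list(range(len(X[0])))
    history = []
    while True:
        stumps, imp = fit_forest(X, y, feats)
        history.append((list(feats), oob_mse(stumps, X, y)))
        if len(feats) <= n_keep:
            break
        ranked = sorted(feats, key=lambda f: imp[f], reverse=True)
        feats = sorted(ranked[:max(n_keep, len(feats) // 2)])
    return history

# Synthetic data (an assumption): only descriptors 0 and 1 drive the response.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(8)] for _ in range(60)]
y = [3 * row[0] + 2 * row[1] + data_rng.gauss(0, 0.1) for row in X]

history = reduce_variables(X, y, n_keep=2)
for feats, mse in history:
    print(len(feats), feats, round(mse, 3))
```

Each printed row shows how many descriptors remain, which ones, and the out-of-bag error at that step. Because variables are selected and scored on the same data, the out-of-bag error along such a path can be optimistic; an outer cross-validation loop around the whole wrapper would be a more cautious design.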