Validation tools for variable subset regression

  • Knut Baumann
  • Nikolaus Stiefl


Variable selection is applied frequently in QSAR research. Because the selection process influences the characteristics of the finally chosen model, thorough validation of the selection technique is essential. Here, a validation protocol is presented briefly, and two of the tools that form part of this protocol are introduced in more detail. The first tool, based on permutation testing, makes it possible to assess the inflation of internal figures of merit (such as the cross-validated prediction error). The second tool, based on noise addition, can be used to determine the complexity, and with it the stability, of models generated by variable selection. The statistical information obtained is important in deciding whether or not to trust the predictive ability of a specific model. The graphical output of the validation tools is easy to interpret and gives a reliable impression of model performance. Among other applications, the tools were employed to study the influence of leave-one-out and leave-multiple-out cross-validation on model characteristics; this study confirmed that leave-multiple-out cross-validation yields more stable models. To study the performance of the entire validation protocol, it was applied with default settings to eight different QSAR data sets. In all cases internal and external model performance was good, indicating that the protocol serves its purpose well.
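The permutation idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' protocol. The toy selection rule (keep the single descriptor with the highest leave-one-out q²), the synthetic data, and all function names are assumptions made purely for illustration. The key point it shows is that the whole selection step is repeated on each permuted response, so the resulting q² values reflect what variable selection can achieve by chance alone.

```python
import random


def loo_q2(x, y):
    """Leave-one-out cross-validated q^2 for a one-descriptor least-squares line."""
    n = len(y)
    press = 0.0
    for i in range(n):
        # Training data with object i left out.
        xt = x[:i] + x[i + 1:]
        yt = y[:i] + y[i + 1:]
        mx = sum(xt) / (n - 1)
        my = sum(yt) / (n - 1)
        sxx = sum((a - mx) ** 2 for a in xt)
        sxy = sum((a - mx) * (b - my) for a, b in zip(xt, yt))
        slope = sxy / sxx if sxx > 0 else 0.0
        pred = my + slope * (x[i] - mx)
        press += (y[i] - pred) ** 2
    mean_y = sum(y) / n
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - press / ss_tot


def select_best(columns, y):
    """Toy variable selection (assumption): keep the descriptor with the highest q^2."""
    scores = [loo_q2(col, y) for col in columns]
    best = max(range(len(columns)), key=scores.__getitem__)
    return best, scores[best]


def permutation_test(columns, y, n_perm=100, seed=0):
    """Rerun the complete selection on permuted responses."""
    rng = random.Random(seed)
    best_real, q2_real = select_best(columns, y)
    q2_perm = []
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)  # destroys any real descriptor-response relation
        q2_perm.append(select_best(columns, y_perm)[1])
    # Share of runs (including the real one) that reach at least the real q^2.
    p_value = (1 + sum(q >= q2_real for q in q2_perm)) / (n_perm + 1)
    return best_real, q2_real, q2_perm, p_value


# Synthetic example: 30 objects, 10 descriptors, only descriptor 0 is relevant.
data_rng = random.Random(1)
n_obj, n_var = 30, 10
columns = [[data_rng.gauss(0, 1) for _ in range(n_obj)] for _ in range(n_var)]
y = [2.0 * columns[0][i] + data_rng.gauss(0, 0.5) for i in range(n_obj)]

best_real, q2_real, q2_perm, p_value = permutation_test(columns, y)
print("selected descriptor:", best_real)
print("q2 on real response:", round(q2_real, 3))
print("largest q2 on permuted responses:", round(max(q2_perm), 3))
print("permutation p-value:", round(p_value, 3))
```

In this sketch, the spread of `q2_perm` indicates how far selection alone can inflate the cross-validated figure of merit; a real model whose q² lies well inside that spread would deserve little trust.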


chance correlation, cross-validation, validation, variable selection





Copyright information

© Springer 2004

Authors and Affiliations

  1. Department of Pharmacy and Food Chemistry, University of Wuerzburg, Am Hubland, Wuerzburg, Germany
