Journal of Computer-Aided Molecular Design, Volume 25, Issue 1, pp 67–80

Toward better QSAR/QSPR modeling: simultaneous outlier detection and variable selection using distribution of model features

  • Dongsheng Cao
  • Yizeng Liang
  • Qingsong Xu
  • Yifeng Yun
  • Hongdong Li


Building a robust and reliable QSAR/QSPR model requires attention to two tasks: selecting the optimal variable subset from a large pool of molecular descriptors, and detecting outliers among the samples. The two problems are similar in some respects and complementary in others. For a particular learning algorithm applied to a particular data set, one should therefore consider how variable selection and outlier detection interact. In this paper, we describe a consistent methodology for performing variable subset selection and outlier detection simultaneously, based on statistical distributions that are simulated by building many cross-predictive linear models. The approach exploits the fact that the distribution of linear model coefficients provides a mechanism for ranking and interpreting the effects of variables, while the distribution of prediction errors provides a mechanism for differentiating outliers from normal samples. Statistics of these distributions, namely the mean and the standard deviation, provide a feasible way to describe the information contained in the original samples. Several examples demonstrate the predictive ability of the proposed approach through comparison with different approaches and their combinations.
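The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `monte_carlo_distributions`, the use of ordinary least squares as the linear model, the 80/20 random split, and the number of models are all assumptions chosen for illustration. The sketch repeatedly draws a random training subset, fits a linear model, and records the coefficients together with the prediction errors on the left-out samples, then summarizes both distributions by their mean and standard deviation.

```python
import numpy as np


def monte_carlo_distributions(X, y, n_models=500, train_fraction=0.8, seed=0):
    """Simulate coefficient and prediction-error distributions.

    Illustrative assumptions: ordinary least squares with an intercept,
    random train/test splits, and errors recorded only when a sample is
    left out of the training subset.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    n_train = int(round(train_fraction * n_samples))

    coefs = np.empty((n_models, n_features))
    # NaN marks "sample was in the training subset for this model".
    errors = np.full((n_models, n_samples), np.nan)

    for m in range(n_models):
        perm = rng.permutation(n_samples)
        train, test = perm[:n_train], perm[n_train:]

        # Fit y = b0 + X b on the training subset.
        A = np.column_stack([np.ones(n_train), X[train]])
        beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        coefs[m] = beta[1:]

        # Prediction errors for the left-out samples only.
        pred = beta[0] + X[test] @ beta[1:]
        errors[m, test] = y[test] - pred

    return {
        "coef_mean": coefs.mean(axis=0),
        "coef_std": coefs.std(axis=0, ddof=1),
        "err_mean": np.nanmean(errors, axis=0),
        "err_std": np.nanstd(errors, axis=0, ddof=1),
    }
```

Under these assumptions, a variable whose coefficient mean is large relative to its standard deviation would rank as informative, and a sample whose mean prediction error lies far from zero would be a candidate outlier. The thresholds used to make either decision are left open here, since the abstract does not state them.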


QSAR/QSPR; Outlier detection; Variable selection; Monte Carlo; Statistical distribution



We would like to thank the reviewers for their useful discussions, comments and suggestions throughout this work. This work was financially supported by the National Nature Foundation Committee of P.R. China (Grants No. 20875104 and No. 10771217) and by the international cooperation project on traditional Chinese medicines of the Ministry of Science and Technology of China (Grant No. 2007DFA40680). The studies were approved by the university’s review board.

Supplementary material

Supplementary material 1: 10822_2010_9401_MOESM1_ESM.doc (DOC, 969 kb)



Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  1. Research Center of Modernization of Traditional Chinese Medicines, Central South University, Changsha, People’s Republic of China
  2. School of Mathematical Sciences and Computing Technology, Central South University, Changsha, People’s Republic of China
