Molecular Diversity, 12:157

Classification of bioaccumulative and non-bioaccumulative chemicals using statistical learning approaches

  • Xiuli Sun
  • Yan Li
  • Xianjie Liu
  • Jun Ding
  • Yonghua Wang
  • Hui Shen
  • Yaqing Chang
Full Length Paper


This work aimed to develop in silico models that reliably distinguish bioaccumulative from non-bioaccumulative compounds, as defined by the bioconcentration factor (BCF), using a diverse data set of 238 organic molecules. Partial least squares (PLS) analysis, C4.5, support vector machine (SVM), and random forest (RF) algorithms were applied to build quantitative structure-activity relationship (QSAR) classifiers, and their performance was evaluated with 5-fold cross-validation and verified on an independent evaluation data set. The overall prediction accuracies (Q) of the optimal PLS, C4.5, SVM, and RF models are 84.5–87.7% in the internal cross-validation, with prediction accuracies (CO) of 86.3–91.1% on the external test sets. C4.5 performs slightly better than the other three methods, with a Q of 87.7% and a CO of 91.1% on the test sets. These results demonstrate the reliability of the in silico models, which should be valuable for the environmental risk assessment of such substances.
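As a rough illustration of how an overall accuracy Q can be obtained from 5-fold cross-validation, the following minimal sketch may help. It does not reproduce the paper's descriptors, data, or classifiers: the data are synthetic, a simple nearest-centroid rule stands in for PLS, C4.5, SVM, and RF, and all function names are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_centroids(X, y):
    """Stand-in classifier (assumption): mean descriptor vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class with the nearest centroid (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))


def cross_validated_accuracy(X, y, k=5, seed=0):
    """Overall accuracy Q: fraction of samples classified correctly
    when each sample is predicted by a model trained without its fold."""
    correct = 0
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_X = [X[i] for i in range(len(X)) if i not in held]
        train_y = [y[i] for i in range(len(X)) if i not in held]
        model = fit_centroids(train_X, train_y)
        correct += sum(predict(model, X[i]) == y[i] for i in held_out)
    return correct / len(X)


def make_synthetic_data(n_per_class=50, seed=1):
    """Two well-separated Gaussian classes in a 2-D 'descriptor' space.
    Purely synthetic; class 1 plays the role of 'bioaccumulative'."""
    rng = random.Random(seed)
    X, y = [], []
    for label, centre in ((0, 0.0), (1, 3.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic_data()
    print(f"5-fold Q = {cross_validated_accuracy(X, y):.3f}")
```

An external-test accuracy such as CO would be computed the same way, except that the model is trained once on the full training set and scored on compounds that were never used for training or model selection.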


In silico prediction; Bioconcentration; Quantitative structure-activity relationships (QSAR); Statistical methods

Supplementary material

11030_2008_9092_MOESM1_ESM.xls (442 kb)
ESM 1 (XLS 443 kb)



Copyright information

© Springer Science+Business Media B.V. 2008

Authors and Affiliations

  • Xiuli Sun (1)
  • Yan Li (2)
  • Xianjie Liu (1)
  • Jun Ding (1)
  • Yonghua Wang (1)
  • Hui Shen (1)
  • Yaqing Chang (1)
  1. Key Lab of Mariculture and Biotechnology, Ministry of Agriculture, Dalian Fisheries University, Dalian, China
  2. School of Chemical Engineering, Dalian University of Technology, Dalian, China
