Classification of bioaccumulative and non-bioaccumulative chemicals using statistical learning approaches

  • Full Length Paper
  • Molecular Diversity

Abstract

The present work aimed to develop in silico models that reliably classify compounds as bioaccumulative or non-bioaccumulative, as defined by the bioconcentration factor (BCF), using a diverse data set of 238 organic molecules. Partial least squares (PLS) analysis, C4.5, support vector machine (SVM), and random forest (RF) algorithms were applied, and their performance in classifying these compounds on the basis of quantitative structure-activity relationships (QSAR) was evaluated with 5-fold cross-validation and verified with an independent evaluation data set. The overall prediction accuracies (Q) of the optimal PLS, C4.5, SVM, and RF models are 84.5–87.7% in the internal cross-validation, with prediction accuracies (CO) of 86.3–91.1% on the external test sets. C4.5 performs slightly better than the other three methods, with a Q of 87.7% and a CO of 91.1% on the test sets. These results support the reliability of the in silico models, which should be valuable for the environmental risk assessment of such substances.
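
The evaluation protocol described above (an internal 5-fold cross-validation accuracy Q plus an accuracy CO on held-out compounds) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, the single descriptor is hypothetical, the 190/48 split is arbitrary, and a one-threshold decision stump stands in for the PLS, C4.5, SVM, and RF models actually used in the paper.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of compounds assigned to the correct class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_stump(xs, ys):
    """Return the threshold on one descriptor that best separates the classes.

    A compound is predicted as class 1 when its descriptor value >= threshold.
    This is a stand-in classifier, not C4.5 or any model from the paper.
    """
    best_t, best_acc = None, -1.0
    for t in sorted(set(xs)):
        acc = accuracy(ys, [int(x >= t) for x in xs])
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def k_fold_accuracy(xs, ys, k=5, seed=0):
    """Pooled accuracy over k held-out folds (internal cross-validation)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all indices
    y_true, y_pred = [], []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        t = fit_stump([xs[i] for i in train], [ys[i] for i in train])
        y_true += [ys[i] for i in held_out]
        y_pred += [int(xs[i] >= t) for i in held_out]
    return accuracy(y_true, y_pred)

# Synthetic stand-in data: 238 "compounds", one hypothetical descriptor,
# label 1 = bioaccumulative, 0 = non-bioaccumulative (assumed encoding).
rng = random.Random(42)
labels = [i % 2 for i in range(238)]
descr = [rng.gauss(2.0 + 2.5 * y, 1.0) for y in labels]

# Arbitrary split into a training set and an external test set (assumption).
train_x, train_y = descr[:190], labels[:190]
test_x, test_y = descr[190:], labels[190:]

q = k_fold_accuracy(train_x, train_y, k=5)           # internal accuracy, Q
threshold = fit_stump(train_x, train_y)              # refit on all training data
co = accuracy(test_y, [int(x >= threshold) for x in test_x])  # external accuracy, CO
print(f"Q = {q:.3f}, CO = {co:.3f}")
```

Reporting both numbers matters because Q is computed on compounds that also influenced model selection, whereas CO comes from compounds the model never saw during training.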

Author information

Correspondence to Yonghua Wang.

Electronic Supplementary Material

Below is the Electronic Supplementary Material.

ESM 1 (XLS 443 kb)

About this article

Cite this article

Sun, X., Li, Y., Liu, X. et al. Classification of bioaccumulative and non-bioaccumulative chemicals using statistical learning approaches. Mol Divers 12, 157–169 (2008). https://doi.org/10.1007/s11030-008-9092-x

  • DOI: https://doi.org/10.1007/s11030-008-9092-x
