Abstract
This work aimed to develop in silico models that reliably classify compounds as bioaccumulative or non-bioaccumulative, as defined by the bioconcentration factor (BCF), using a diverse data set of 238 organic molecules. Partial least squares (PLS) analysis, C4.5, support vector machine (SVM), and random forest (RF) algorithms were applied, and their performance in classifying these compounds on the basis of quantitative structure-activity relationships (QSAR) was evaluated and verified with 5-fold cross-validation and an independent evaluation data set. The overall prediction accuracies (Q) of the optimal PLS, C4.5, SVM, and RF models are 84.5–87.7% in the internal cross-validation, with prediction accuracies (CO) of 86.3–91.1% on the external test sets. C4.5 performs slightly better than the other three methods, with a Q of 87.7% and a CO of 91.1% on the test sets. These results support the reliability of the in silico models, which should be valuable for the environmental risk assessment of such substances.
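As a rough illustration of how an overall prediction accuracy Q can be obtained from 5-fold cross-validation, the sketch below splits a data set into five folds, trains on four, and counts correct predictions on the held-out fold. It is not the authors' implementation. The nearest-centroid classifier is a simple stand-in for PLS, C4.5, SVM, or RF, the two-descriptor data are synthetic, and all function names (`k_fold_indices`, `cross_val_accuracy`, `make_synthetic`, and so on) are hypothetical.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_centroids(X, y):
    """Stand-in classifier (assumption): mean descriptor vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroids(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean)."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
    )


def cross_val_accuracy(X, y, k=5, seed=0):
    """Overall accuracy Q: fraction of held-out samples classified correctly."""
    correct = 0
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict_centroids(model, X[i]) == y[i] for i in fold)
    return correct / len(X)


def make_synthetic(n=238, seed=1):
    """Synthetic two-descriptor data; NOT the 238 molecules from the paper."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # 1 = "bioaccumulative", 0 = "non-bioaccumulative"
        X.append([rng.gauss(2.0 * label, 1.0), rng.gauss(-1.0 * label, 1.0)])
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic()
    print(f"5-fold Q = {cross_val_accuracy(X, y):.3f}")
```

Because every compound is held out exactly once, Q is the number of correct held-out predictions divided by the data set size. An accuracy on an external test set, such as the CO reported here, would instead come from a model fitted once on all training compounds and applied to compounds never used in training.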
Cite this article
Sun, X., Li, Y., Liu, X. et al. Classification of bioaccumulative and non-bioaccumulative chemicals using statistical learning approaches. Mol Divers 12, 157–169 (2008). https://doi.org/10.1007/s11030-008-9092-x
DOI: https://doi.org/10.1007/s11030-008-9092-x