Modelability Criteria: Statistical Characteristics Estimating Feasibility to Build Predictive QSAR Models for a Dataset

Golbraikh, Alexander; Fourches, Denis; Sedykh, Alexander; Muratov, Eugene; Liepina, Inta; Tropsha, Alexander

doi:10.1007/978-1-4899-7445-7_7

Abstract

It is not always possible to build predictive Quantitative Structure-Activity Relationship (QSAR) models for a given chemical dataset. In this work, we propose several statistical criteria that can establish with high confidence, prior to actual modeling, whether a predictive model can be built for a dataset, i.e., whether the dataset is modelable. These criteria are fast to calculate, and using them in QSAR studies could dramatically reduce the time, effort, and computational resources that modelers spend on at least some datasets, especially those that are not modelable. The calculation of modelability criteria is based on the k-nearest neighbors approach. For all datasets, we have proposed dataset diversity (MODI_DIV) and new activity cliff indices (MODI_ACI) as modelability criteria. For datasets with binary end points, we have proposed two further criteria: the correct classification rate (MODI_CCR), where CCR = 0.5(sensitivity + specificity), for leave-one-out (LOO) cross-validation in the entire descriptor space, and the correct classification rate for similarity search (MODI_ssCCR) in the entire descriptor space with leave-20%-out (five-fold) cross-validation. For binary datasets, all of these modelability criteria were tested on 42 datasets with previously generated QSAR models. The latter two criteria (MODI_CCR and MODI_ssCCR) were found to correlate highly with the predictivity of QSAR models (QSAR_CCR) and were additionally tested on 60 ToxCast end points with recently published QSAR modeling results (Thomas RS, Black MB, Li L, Healy E, Chu T-M, Bao W, Andersen MD, Wolfinger RD. Toxicol Sci: Off J Soc Toxicol 128(2):398–417, 2012). These modelability criteria can be used to classify many datasets as modelable or non-modelable, and they can be generalized to datasets with compounds belonging to more than two categories or classes.
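As a rough illustration of the MODI_CCR idea described above, the sketch below labels each compound with the class of its nearest neighbor among all other compounds (leave-one-out) and then computes CCR = 0.5(sensitivity + specificity). The choice of k = 1, the Euclidean distance, the function names, and the toy descriptor values are all assumptions made for this example; the abstract does not specify them, and this is not the authors' implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def loo_nearest_neighbor_labels(X, y):
    """Leave-one-out 1-NN: give each compound the class of its nearest
    neighbor among all *other* compounds (k = 1 is an assumption)."""
    n = len(X)
    preds = []
    for i in range(n):
        nearest = min(
            (j for j in range(n) if j != i),
            key=lambda j: euclidean(X[i], X[j]),
        )
        preds.append(y[nearest])
    return preds


def ccr(y_true, y_pred):
    """Correct classification rate: 0.5 * (sensitivity + specificity)."""
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == 0)
    if pos == 0 or neg == 0:
        raise ValueError("both classes (0 and 1) must be present")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


def modi_ccr(X, y):
    """Hypothetical helper: LOO nearest-neighbor CCR over all descriptors."""
    return ccr(y, loo_nearest_neighbor_labels(X, y))


# Toy one-descriptor dataset (made-up values, for illustration only).
X_toy = [[0.0], [1.0], [2.5], [10.0], [11.0], [12.5]]
y_clustered = [0, 0, 0, 1, 1, 1]    # neighbors share a class
y_alternating = [0, 1, 0, 1, 0, 1]  # neighbors disagree

print("clustered labels:  ", modi_ccr(X_toy, y_clustered))
print("alternating labels:", modi_ccr(X_toy, y_alternating))
```

When nearest neighbors tend to share a class, the score approaches 1; when they tend to disagree, it drops, which is the intuition behind using such a quantity as a quick check before modeling.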
Additionally, criteria that take into account errors of prediction, MODI_CAT_i and MODI_CLASS_i, were proposed for datasets with compounds belonging to more than two (i > 2) categories or classes, and for continuous end points divided into i > 2 bins. For continuous end points, two modelability criteria were proposed: the LOO cross-validation q² for similarity search with different numbers of nearest neighbors in the entire descriptor space (MODI_q²), and the similarity search coefficient of determination (MODI_ssR²) in the entire descriptor space. Our preliminary studies demonstrated high correlation between the external predictivity of QSAR models (QSAR_R²) and each of MODI_q² and MODI_ssR². On the other hand, for datasets with either binary or continuous response variables, MODI_DIVs and MODI_ACIs were found to be less useful for establishing dataset modelability.
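For continuous end points, the MODI_q² idea described above can be sketched in the same spirit: predict each compound's activity from its k nearest neighbors with the compound itself left out, then compute a cross-validated q² for several values of k. The sketch assumes the conventional definition q² = 1 − PRESS/SS_total, an unweighted mean over neighbors, and Euclidean distance; the function names and toy values are hypothetical, and none of these details are confirmed by the abstract.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def loo_knn_predictions(X, y, k):
    """Leave-one-out kNN regression: predict each activity as the unweighted
    mean activity of the k nearest *other* compounds (an assumption)."""
    n = len(X)
    preds = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: euclidean(X[i], X[j]),
        )
        nearest = others[:k]
        preds.append(sum(y[j] for j in nearest) / len(nearest))
    return preds


def q2(y_true, y_pred):
    """Cross-validated q^2 = 1 - PRESS / SS_total (conventional definition)."""
    mean_y = sum(y_true) / len(y_true)
    press = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_total = sum((t - mean_y) ** 2 for t in y_true)
    if ss_total == 0:
        raise ValueError("activities must not all be identical")
    return 1.0 - press / ss_total


def modi_q2(X, y, ks=(1, 2, 3)):
    """Hypothetical helper: LOO q^2 for several numbers of neighbors."""
    return {k: q2(y, loo_knn_predictions(X, y, k)) for k in ks}


# Toy one-descriptor dataset (made-up values, for illustration only).
X_toy = [[0.0], [1.0], [2.1], [3.3], [4.6], [6.0]]
y_smooth = [0.0, 2.0, 4.2, 6.6, 9.2, 12.0]     # activity tracks the descriptor
y_scrambled = [9.2, 0.0, 12.0, 2.0, 6.6, 4.2]  # same values, shuffled

smooth = modi_q2(X_toy, y_smooth)
scrambled = modi_q2(X_toy, y_scrambled)
for k in sorted(smooth):
    print(f"k={k}: smooth q2={smooth[k]:.3f}, scrambled q2={scrambled[k]:.3f}")
```

A smooth structure-activity relationship yields a clearly positive q², whereas shuffling the same activities among the compounds drives it below zero.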

References

Dragon Descriptors. http://www.talete.mi.it/products/dragon_description.htm. Accessed 21 Aug 2012
Molecular Operating Environment (MOE). http://www.chemcomp.com/software.htm. Accessed 21 Aug 2012
Molconn-Z descriptors. http://www.edusoft-lc.com/molconn. Accessed 08 Sept 2013
Mold2 descriptors. http://www.fda.gov/ScienceResearch/BioinformaticsTools/Mold2/default.htm. Accessed 08 Sept 2013
CDK Descriptor Calculator. http://pele.farmbio.uu.se/nightly/dnames.html. Accessed 08 Sept 2013
Volsurf Descriptors. http://www.moldiscovery.com/soft_volsurf.php. Accessed 08 Sept 2013
Adriana Descriptors. http://molecular-networks.com/node/45. Accessed 08 Sept 2013
Martin TM, Harten P, Venkatapathy R, Das S, Young DM (2008) A hierarchical clustering methodology for the estimation of toxicity. Toxicol Mech Method 18(2–3):251–266
Kuz’min VE, Artemenko AG, Muratov EN (2008) Hierarchical QSAR technology based on the simplex representation of molecular structure. J Comput Aided Mol Des 22(6–7):403–421
Isida Fragments. http://infochim.u-strasbg.fr/recherche/Download/FragmentorNomenclature_of_ISIDA_fragments_2011.pdf. Accessed 08 Sept 2013
Adams MJ (2004) Chemometrics in analytical spectroscopy. Royal Society of Chemistry, Cambridge, UK
Wold S, Sjöström M, Eriksson L (2001) PLS-regression: a basic tool of chemometrics. Chemometrics Intel Lab Syst 58(2):109–130
Zheng W, Tropsha A (2000) Novel variable selection quantitative structure–property relationship approach based on the k-nearest-neighbor principle. J Chem Inf Comput Sci 40(1):185–194
Vapnik VN (1995) The nature of statistical learning theory. Springer, New York
Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81–106
Breiman L, Friedman JH, Olshen RA, Stone CJ (1998) Classification and regression trees. Chapman & Hall/CRC, New York
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Breiman L, Cutler A. Random forests. http://www.stat.berkeley.edu/~breiman/RandomForests/. Accessed 08 Sept 2013
Chirico N, Gramatica P (2011) Real external predictivity of QSAR models: how to evaluate it? Comparison of different validation criteria and proposal of using the concordance correlation coefficient. J Chem Inf Model 51(9):2320–2335
Chirico N, Gramatica P (2012) Real external predictivity of QSAR models. Part 2. New intercomparable thresholds for different validation criteria and the need for scatter plot inspection. J Chem Inf Model 52(8):2044–2058
Kovatcheva A, Golbraikh A, Oloff S, Feng J, Zheng W, Tropsha A (2005) QSAR modeling of datasets with enantioselective compounds using chirality sensitive molecular descriptors. SAR QSAR Environ Res 16(1–2):93–102
Tropsha A, Golbraikh A (2010) Predictive quantitative structure–activity relationships modeling: development and validation of QSAR models. In: Faulon J-L, Bender A (eds) Handbook of chemoinformatics algorithms. Chapman & Hall/CRC, London, pp 213–233
Kovatcheva A, Golbraikh A, Oloff S, Xiao Y-D, Zheng W, Wolschann P, Buchbauer G, Tropsha A (2004) Combinatorial QSAR of ambergris fragrance compounds. J Chem Inf Comput Sci 44(2):582–595
de Cerqueira Lima P, Golbraikh A, Oloff S, Xiao Y-D, Tropsha A (2006) Combinatorial QSAR modeling of P-glycoprotein substrates. J Chem Inf Model 46(3):1245–1254
ToxCast™. http://epa.gov/ncct/toxcast. Accessed 11 Jan 2012
U.E.-N.C. for C. Toxicology, Computational Toxicology Research Program (CompTox). http://www.epa.gov/ncct/toxrefdb/. Accessed 08 Sept 2013
Thomas RS, Black MB, Li L, Healy E, Chu T-M, Bao W, Andersen MD, Wolfinger RD (2012) A comprehensive statistical analysis of predicting in vivo hazard using high-throughput in vitro screening. Toxicol Sci: Off J Soc Toxicol 128(2):398–417
Veber DF, Johnson SR, Cheng H-Y, Smith BR, Ward KW, Kopple KD (2002) Molecular properties that influence the oral bioavailability of drug candidates. J Med Chem 45(12):2615–2623
Shen M, LeTiran A, Xiao Y-D, Golbraikh A, Kohn H, Tropsha A (2002) Quantitative structure-activity relationship analysis of functionalized amino acid anticonvulsant agents using k nearest neighbor and simulated annealing PLS methods. J Med Chem 45(13):2811–2823
Goret M, Wang-Bell M, Golbraikh A, Tropsha A (2006) QSAR analysis of a dataset of 91 functionalized amino acids anticonvulsant agents using k nearest neighbor. Unpublished results
Boyd WA, McBride SJ, Rice JR, Snyder DW, Freedman JH (2010) A high-throughput method for assessing chemical toxicity using a Caenorhabditis elegans reproduction assay. Toxicol Appl Pharmacol 245(2):153–159
Sedykh A, Zhu H, Tang H, Zhang L, Richard A, Rusyn I, Tropsha A (2011) Use of in vitro HTS-derived concentration-response data as biological descriptors improves the accuracy of QSAR models of in vivo toxicity. Environ Health Persp 119(3):364–370
Tropsha A, Golbraikh A (2007) Predictive QSAR modeling workflow, model applicability domains, and virtual screening. Curr Pharm Des 13(34):3494–3504
Golbraikh A (2000) Molecular dataset diversity indices and their applications to comparison of chemical databases and QSAR analysis. J Chem Inf Comput Sci 40(2):414–425
Fourches D, Muratov E, Tropsha A (2010) Trust, but verify: on the importance of chemical structure curation in cheminformatics and QSAR modeling research. J Chem Inf Model 50(7):1189–1204
Guha R, Van Drie JH (2008) Structure–activity landscape index: identifying and quantifying activity cliffs. J Chem Inf Model 48(3):646–658
Zhang L, Sedykh A, Tripathi A, Zhu H, Afantitis A, Mouchlis VD, Melagraki G, Rusyn I, Tropsha A (2013) Identification of putative estrogen receptor-mediated endocrine disrupting chemicals using QSAR- and structure-based virtual screening approaches. Toxicol Appl Pharmacol 23. doi:10.1016/j.taap.2013.04.032 (pii: S0041-008X(13)00216-0). Epub ahead of print
ChemBL databases. https://www.ebi.ac.uk/chembl/. Accessed 08 Sept 2013
PDSP database. http://pdsp.med.unc.edu/indexR.html. Accessed 08 Sept 2013
USEPA ECOTOX database (2008). http://cfpub.epa.gov/ecotox. Accessed 08 Sept 2013
U.E.-N.C. for C. Toxicology, Computational Toxicology Research Program (CompTox). http://www.epa.gov/ncct/toxcast/. Accessed 08 Sept 2013
CDK Descriptor Names. http://pele.farmbio.uu.se/nightly/dnames.html. Accessed 08 Sept 2013
R: Classification and Regression with Random Forest. http://rss.acs.unt.edu/Rdoc/library/randomForest/html/randomForest.html. Accessed 08 Sept 2013
Polishchuk PG, Muratov EN, Artemenko AG, Kolumbin OG, Muratov NN, Kuz’min VE (2009) Application of random forest approach to QSAR prediction of aquatic toxicity. J Chem Inf Model 49(11):2481–2488
Zhu H, Tropsha A, Fourches D, Varnek A, Papa E, Gramatica P, Öberg T, Phuong D, Cherkasov A, Tetko IV (2008) Combinatorial QSAR modeling of chemical toxicants tested against Tetrahymena pyriformis. J Chem Inf Model 48(4):766–784
Schultz TW (1997) TETRATOX: Tetrahymena pyriformis population growth impairment endpoint – a surrogate for fish lethality. Toxicol Mech Method 7(4):289–309. http://informahealthcare.com/doi/abs/10.1080/105172397243079
Schultz TW, Netzeva TI (2004) Development and evaluation of QSARs for ecotoxic endpoints: the benzene response-surface model for Tetrahymena toxicity. In: Cronin MTD, Livingstone DJ (eds) Modeling environmental fate and toxicity. CRC Press, Boca Raton
Schultz TW, TETRATOX. http://www.vet.utk.edu/TETRATOX/index.php. Accessed 08 June 2013
ChemiDplus Advanced Database National Library of Medicine 2011 (NLM). http://chem.sis.nlm.nih.gov/chemidplus/. Accessed 24 Feb 2011
USEPA, User’s Guide for T.E.S.T. (Toxicity Estimation Software Tool). http://www.epa.gov/ORD/NRMRL/std/cppb/qsar/testuserguide.pdf. Accessed 27 Oct 2009
Zhu H, Martin TM, Ye L, Sedykh A, Young DM, Tropsha A (2009) Quantitative structure-activity relationship modeling of rat acute toxicity by oral exposure. Chem Res Toxicol 22(12):1913–1921
Zhu H, Ye L, Richard A, Golbraikh A, Wright FA, Rusyn I, Tropsha A (2009) A novel two-step hierarchical quantitative structure-activity relationship modeling work flow for predicting acute toxicity of chemicals in rodents. Environ Health Persp 117(8):1257–1264
Martin TM, Harten P, Young DM, Muratov EN, Golbraikh A, Zhu H, Tropsha A (2012) Does rational selection of training and test sets improve the outcome of QSAR modeling? J Chem Inf Model 52(10):2570–2578
Hamelink JL (1977) Current bioconcentration test methods and theory. In: Mayer FL, Hamelink JL (eds) Aquatic toxicology and hazard evaluation. ASTM STP 634, American Society for Testing and Materials, Baltimore, pp 149–161
OEHHA Toxicity Criteria Database. http://www.oehha.ca.gov/risk/ChemicalDB/index.asp. Accessed 08 May 2013
Regional Screening Levels | Region 9: Superfund | US EPA. http://www.epa.gov/region9/superfund/prg/. Accessed 08 May 2013
O. US EPA, Integrated Risk Information System (IRIS). http://www.epa.gov/iris/. Accessed 08 May 2013
O. of P.P. US EPA, Pesticide Reregistration Status | Pesticides | US EPA. http://www.epa.gov/oppsrrd1/reregistration/status.htm. Accessed 08 May 2013
Pharmaceutical Press. Martindale: the complete drug reference, 37th edn. http://www.pharmpress.com/product/9780853699330/martindale. Accessed 08 July 2013
U.E.-N.C. for C. Toxicology, Computational Toxicology Research Program (CompTox). http://www.epa.gov/ncct/dsstox/sdf_fdamdd.html. Accessed 08 Sept 2013
Tang H, Wang XS, Huang X-P, Roth X-P, Butler KV, Kozikowski AP, Jung M, Tropsha A (2009) Novel inhibitors of human histone deacetylase (HDAC) identified by QSAR modeling of known inhibitors, virtual screening, and experimental validation. J Chem Inf Model 49(2):461–476
Kennard RW, Stone L (1969) Computer aided design of experiments. Technometrics 11(1):137–148
Golbraikh A, Shen M, Xiao Z, Xiao Y-D, Lee K-H, Tropsha A (2003) Rational selection of training and test sets for the development of validated QSAR models. J Comput Aided Mol Des 17(2–4):241–253
Kuz’min VE, Artemenko AG, Muratov EN, Volineckaya IL, Makarov VA, Riabova OB, Wutzler P, Schmidtke M (2007) Quantitative structure-activity relationship studies of [(biphenyloxy)propyl]isoxazole derivatives. Inhibitors of human rhinovirus 2 replication. J Med Chem 50(17):4205–4213
Golbraikh A, Muratov E, Fourches D, Tropsha A. Data set modelability by QSAR. J Chem Inf Model. 8 Jan 2014 [Epub ahead of print]

Acknowledgements

The authors thank Dr. Guiyu Zhao (AstraZeneca, Shanghai, China) for providing QSAR modeling results for 34 GPCRome datasets, Dr. Jessica Wignall (University of North Carolina at Chapel Hill) for providing QSAR modeling results for 10 regulatory information datasets, and Dr. Jonathan Freedman (National Institute of Environmental Health Sciences, Research Triangle Park, NC) and Dr. Ruchir Shah (SciOme, LLC, Research Triangle Park, NC) for providing experimental results for the C. elegans datasets.

Author information

Authors and Affiliations

University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Alexander Golbraikh, Denis Fourches, Alexander Sedykh, Eugene Muratov & Alexander Tropsha
A.V. Bogatsky Physical-Chemical Institute NAS of Ukraine, Odessa, Ukraine
Eugene Muratov
Latvian Institute of Organic Synthesis, Riga, Latvia
Inta Liepina

Corresponding author

Correspondence to Alexander Tropsha.

Editor information

Editors and Affiliations

Department of Chemistry and Biochemistry, Interdisciplinary Center for Nanotoxicity, Jackson State University, Jackson, Mississippi, USA
Jerzy Leszczynski
Environmental Laboratory, US Army Engineer Research and Development Center, Vicksburg, Mississippi, USA
Manoj K. Shukla

Cite this chapter

Golbraikh, A., Fourches, D., Sedykh, A., Muratov, E., Liepina, I., Tropsha, A. (2014). Modelability Criteria: Statistical Characteristics Estimating Feasibility to Build Predictive QSAR Models for a Dataset. In: Leszczynski, J., Shukla, M. (eds) Practical Aspects of Computational Chemistry III. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7445-7_7

DOI: https://doi.org/10.1007/978-1-4899-7445-7_7
Published: 25 March 2014
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4899-7444-0
Online ISBN: 978-1-4899-7445-7
eBook Packages: Chemistry and Materials Science, Chemistry and Material Science (R0)
