Importance of Data Curation in QSAR Studies Especially While Modeling Large-Size Datasets

Ambure, Pravin; Cordeiro, M. Natália Dias Soeiro

doi:10.1007/978-1-0716-0150-1_5


Part of the book series: Methods in Pharmacology and Toxicology (MIPT)


Abstract

A huge amount of chemical and biological data is now available in online databases and can be easily retrieved and studied by many researchers, including QSAR modelers, to extract meaningful information. It is well known, however, that data retrieved from these databases may contain errors in both the chemical structures and the biological data. The implications can be severe, particularly for QSAR modelers, since models developed from such erroneous data will be false or non-predictive. Proper curation of the retrieved chemical and biological data is therefore crucial and mandatory prior to any QSAR modeling. For large datasets, however, manual data curation becomes practically impossible. This chapter reviews and discusses the data curation tools normally applied for this purpose, paying special attention to those that can be used to semiautomate the curation process, for example through a workflow built with the freely available KNIME software.
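The chapter's own workflow is built in KNIME and is not reproduced on this page. As a rough illustration of what a semiautomated curation step can look like, the sketch below drops unusable records, merges duplicate structures, and flags duplicates whose activity values disagree. The function name `curate`, the `(structure_key, activity)` record layout, and the `max_spread` threshold of 1.0 are all assumptions made for this example, and the structure keys are assumed to be already standardized (a real workflow would normalize structures with a cheminformatics toolkit first).

```python
from collections import defaultdict
from statistics import mean


def curate(records, max_spread=1.0):
    """Merge duplicate structures and flag conflicting activity values.

    records: iterable of (structure_key, activity) pairs (hypothetical layout);
    activity is a number such as a pIC50 value, or None when missing.
    Returns (curated, flagged): curated maps key -> mean activity, flagged
    lists keys whose replicate values differ by more than max_spread.
    """
    grouped = defaultdict(list)
    for key, activity in records:
        # Trim stray whitespace so trivially different keys compare equal.
        key = (key or "").strip()
        # Skip rows with no structure identifier or no measured activity.
        if not key or activity is None:
            continue
        grouped[key].append(float(activity))

    curated, flagged = {}, []
    for key, values in grouped.items():
        if max(values) - min(values) > max_spread:
            # Conflicting replicates are set aside for manual review.
            flagged.append(key)
        else:
            curated[key] = mean(values)
    return curated, flagged


records = [
    ("CCO", 5.0),
    ("CCO ", 5.4),        # same structure key with trailing whitespace
    ("c1ccccc1", 6.0),
    ("c1ccccc1", 8.5),    # disagrees with the previous value
    ("", 4.0),            # missing structure
    ("CCN", None),        # missing activity
]
curated, flagged = curate(records)
```

Setting conflicting duplicates aside rather than averaging them keeps the automated part conservative: only records that clearly agree are merged, and the rest are left for a human to inspect.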



Acknowledgments

This work was supported by UID/QUI/50006/2019 with funding from FCT/MCTES through national funds.

Author information

Authors and Affiliations

LAQV@REQUIMTE, Department of Chemistry and Biochemistry, University of Porto, Porto, Portugal
Pravin Ambure & M. Natália Dias Soeiro Cordeiro


Corresponding author

Correspondence to M. Natália Dias Soeiro Cordeiro.

Editor information

Editors and Affiliations

Drug Theoretics and Cheminformatics Laboratory, Department of Pharmaceutical Technology, Jadavpur University, Kolkata, India
Kunal Roy


Cite this protocol

Ambure, P., Cordeiro, M.N.D.S. (2020). Importance of Data Curation in QSAR Studies Especially While Modeling Large-Size Datasets. In: Roy, K. (ed.) Ecotoxicological QSARs. Methods in Pharmacology and Toxicology. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-0150-1_5


DOI: https://doi.org/10.1007/978-1-0716-0150-1_5
Published: 17 January 2020
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-0149-5
Online ISBN: 978-1-0716-0150-1
eBook Packages: Springer Protocols
