Importance of Data Curation in QSAR Studies Especially While Modeling Large-Size Datasets

  • Protocol
Ecotoxicological QSARs

Abstract

A huge amount of chemical and biological data is now available in online databases and can easily be retrieved and studied by researchers, including QSAR modelers, to extract meaningful information. It is well known, however, that data retrieved from these databases may contain errors in both the chemical structures and the biological data. The implications can be severe, particularly for QSAR modelers, since models developed from such erroneous data are likely to be false or non-predictive. Proper curation of the retrieved chemical and biological data is therefore mandatory prior to any QSAR modeling. For large datasets, however, manual curation becomes practically impossible. This chapter reviews and discusses the data curation tools normally applied for such endeavors, paying special attention to those that can be used to semiautomate the curation process, for example through a workflow built with the freely available KNIME software.
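
To make the idea of semiautomated curation concrete, the following is a minimal sketch of one common step: merging duplicate structures and flagging records whose reported activities disagree. It is not the chapter's KNIME workflow. The function name `curate`, the `tolerance` threshold, and the assumption that each structure key is an already standardized identifier (for example a canonical SMILES string) are all illustrative choices made here.

```python
from statistics import mean


def curate(records, tolerance=0.5):
    """Merge duplicate structures and flag conflicting activity values.

    records: iterable of (structure_key, activity) pairs. The key is assumed
    to be an already standardized structure identifier (hypothetical input).
    tolerance: largest spread between replicate activities that is still
    treated as agreement (an arbitrary illustrative threshold).
    """
    grouped = {}
    for key, activity in records:
        # Drop records with a missing structure or a missing activity.
        if not key or not key.strip() or activity is None:
            continue
        grouped.setdefault(key.strip(), []).append(float(activity))

    kept, flagged = {}, {}
    for key, values in grouped.items():
        if max(values) - min(values) <= tolerance:
            # Replicates agree: keep a single averaged value.
            kept[key] = mean(values)
        else:
            # Replicates disagree: set aside for manual inspection.
            flagged[key] = values
    return kept, flagged


records = [
    ("CCO", 5.0),
    ("CCO", 5.2),
    ("c1ccccc1", 4.0),
    ("c1ccccc1", 6.5),
    ("", 3.0),
    ("CCN", None),
]
kept, flagged = curate(records)
```

In this sketch only the flagged entries need a human decision, which is the point of semiautomation for large datasets: the routine cases are handled in bulk and manual effort is reserved for the conflicts.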

Acknowledgments

This work was supported by UID/QUI/50006/2019 with funding from FCT/MCTES through national funds.

Author information

Corresponding author

Correspondence to M. Natália Dias Soeiro Cordeiro.

Copyright information

© 2020 Springer Science+Business Media, LLC, part of Springer Nature

About this protocol

Cite this protocol

Ambure, P., Cordeiro, M.N.D.S. (2020). Importance of Data Curation in QSAR Studies Especially While Modeling Large-Size Datasets. In: Roy, K. (eds) Ecotoxicological QSARs. Methods in Pharmacology and Toxicology. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-0150-1_5

  • DOI: https://doi.org/10.1007/978-1-0716-0150-1_5

  • Publisher Name: Humana, New York, NY

  • Print ISBN: 978-1-0716-0149-5

  • Online ISBN: 978-1-0716-0150-1

  • eBook Packages: Springer Protocols
