Importance of Data Curation in QSAR Studies Especially While Modeling Large-Size Datasets

  • Pravin Ambure
  • M. Natália Dias Soeiro Cordeiro
Part of the Methods in Pharmacology and Toxicology book series (MIPT)


A huge amount of chemical and biological data is now available in online databases and can be easily retrieved and studied by researchers, including QSAR modelers, to extract meaningful information. The retrieved data, however, may contain errors in both the chemical structures and the biological measurements. The implications can be severe, particularly for QSAR modelers, since models developed from erroneous data can be expected to be false or non-predictive. Proper curation of the retrieved chemical and biological data is therefore mandatory prior to any QSAR modeling. For large datasets, however, manual curation becomes impractical. This chapter reviews and discusses the data curation tools normally applied to this task, paying special attention to those that can be used to semiautomate the curation process, for example through a workflow built with the freely available KNIME software.

Key words

Data curation; Online databases; Structural errors; Duplicate analysis; Activity cliffs; Curation tools; QSAR
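
As an illustration of duplicate analysis, one of the curation steps named above, the following is a minimal Python sketch and not the chapter's own workflow. It assumes each record already carries a standardized structure key (for example an InChIKey computed elsewhere) and a log-scale activity value. The function name `merge_duplicates`, the `max_spread` threshold of 0.5, and the sample records are all hypothetical choices made for the example.

```python
from collections import defaultdict
from statistics import mean


def merge_duplicates(records, max_spread=0.5):
    """Group (structure_key, activity) records by structure key.

    Keys whose activities agree (range <= max_spread) are merged to their
    mean; keys with discordant activities are set aside for manual review.
    The threshold is an illustrative assumption, not a recommended value.
    """
    groups = defaultdict(list)
    for key, activity in records:
        groups[key].append(activity)

    merged, discordant = {}, {}
    for key, values in groups.items():
        if max(values) - min(values) <= max_spread:
            merged[key] = mean(values)
        else:
            discordant[key] = values
    return merged, discordant


# Hypothetical records: placeholder structure keys with log-scale activities.
records = [
    ("KEY-A", 6.1), ("KEY-A", 6.3),   # concordant duplicates
    ("KEY-B", 5.0),                   # unique compound
    ("KEY-C", 4.2), ("KEY-C", 7.9),   # discordant duplicates
]
merged, discordant = merge_duplicates(records)
```

In this sketch, concordant duplicates are collapsed to a single averaged entry, while discordant ones are kept out of the modeling set until they have been checked against the original source.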



This work was supported by UID/QUI/50006/2019 with funding from FCT/MCTES through national funds.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Authors and Affiliations

  • Pravin Ambure (1)
  • M. Natália Dias Soeiro Cordeiro (1)
  1. LAQV@REQUIMTE, Department of Chemistry and Biochemistry, University of Porto, Porto, Portugal
