A comparison of molecular representations for lipophilicity quantitative structure–property relationships with results from the SAMPL6 logP Prediction Challenge

  • Raymond Lui
  • Davy Guan
  • Slade Matthews


Effective representation of a molecule is required to develop useful quantitative structure–property relationships (QSPR) for accurate prediction of chemical properties. The octanol–water partition coefficient logP, a measure of lipophilicity, is an important property for pharmacological and toxicological endpoints used in the pharmaceutical and regulatory spheres. We compare physicochemical descriptors, structural keys, and circular fingerprints in their ability to effectively represent a chemical space and characterise molecular features to correlate with lipophilicity. Exploratory landscape continuity analyses revealed that whole-molecule physicochemical descriptors could map together compounds that were similar in both molecular features and logP, indicating higher potential for use in logP QSPRs compared to the substructural approach of structural keys and circular fingerprints. Indeed, logP QSPR models parameterised by physicochemical descriptors consistently achieved the lowest error. Our best performing model was a stochastic gradient descent-optimised multilinear regression with 1438 descriptors, returning an internal benchmark RMSE of 1.03 log units. This corroborates the well-established notion that lipophilicity is an additive, whole-molecule property. We externally tested the model by participating in the 2019 SAMPL6 logP Prediction Challenge and blindly predicting logP for 11 protein kinase inhibitor fragment-like molecules. Our model returned an RMSE of 0.49 log units, placing eighth overall and third in the empirical methods category (submission ID ‘hdpuj’). Permutation feature importance analyses revealed that physicochemical descriptors could characterise predictive molecular features highly relevant to the kinase inhibitor fragment-like molecules.
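
The modelling workflow described above (a multilinear regression fitted by stochastic gradient descent, scored by RMSE, then inspected with permutation feature importance) can be illustrated with a minimal scikit-learn sketch. This is not the authors' pipeline: the descriptor matrix and target below are synthetic stand-ins, the 20 columns replace the 1438 real descriptors, and the hyperparameters are assumptions chosen only so the example runs quickly.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a descriptor matrix: 500 "molecules" x 20 "descriptors".
X = rng.normal(size=(500, 20))

# Hypothetical additive target: only the first three descriptors contribute.
true_weights = np.zeros(20)
true_weights[:3] = [1.5, -1.0, 0.5]
y = X @ true_weights + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Descriptors are standardised first because SGD is sensitive to feature scale.
model = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=2000, tol=1e-4, random_state=0),
)
model.fit(X_train, y_train)

# Root-mean-square error on the held-out split.
rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))

# Permutation importance: shuffle one column at a time and measure
# how much the held-out RMSE degrades.
result = permutation_importance(
    model,
    X_test,
    y_test,
    scoring="neg_root_mean_squared_error",
    n_repeats=10,
    random_state=0,
)
top3 = np.argsort(result.importances_mean)[::-1][:3]

print(f"held-out RMSE: {rmse:.3f}")
print("most important descriptor columns:", top3.tolist())
```

With real data, `X` would hold computed descriptors and `y` experimental logP values. Because the synthetic target is strictly additive, a linear model recovers it well; that property is built into the toy data and says nothing about real lipophilicity data.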


Keywords: QSPR, logP, physicochemical properties, machine learning, SAMPL6



We thank the National Institutes of Health (Grant No. R01-GM124270) for their support in funding the SAMPL6 Challenges and associated experimental work.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Pharmacoinformatics Laboratory, Discipline of Pharmacology, School of Medical Sciences, Faculty of Medicine and Health, The University of Sydney, Sydney, Australia
