A comparison of molecular representations for lipophilicity quantitative structure–property relationships with results from the SAMPL6 logP Prediction Challenge
- 18 Downloads
Effective representation of a molecule is required to develop useful quantitative structure–property relationships (QSPR) for accurate prediction of chemical properties. The octanol–water partition coefficient logP, a measure of lipophilicity, is an important property for pharmacological and toxicological endpoints used in the pharmaceutical and regulatory spheres. We compare physicochemical descriptors, structural keys, and circular fingerprints in their ability to effectively represent a chemical space and characterise molecular features to correlate with lipophilicity. Exploratory landscape continuity analyses revealed that whole-molecule physicochemical descriptors could map together compounds that were similar in both molecular features and logP, indicating higher potential for use in logP QSPRs compared to the substructural approach of structural keys and circular fingerprints. Indeed, logP QSPR models parameterised by physicochemical descriptors consistently performed with the lowest error. Our best performing model was a stochastic gradient descent-optimised multilinear regression with 1438 descriptors, returning an internal benchmark RMSE of 1.03 log units. This corroborates the well-established notion that lipophilicity is an additive, whole-molecule property. We externally tested the model by participating in the 2019 SAMPL6 logP Prediction Challenge and blindly predicting for 11 protein kinase inhibitor fragment-like molecules. Our model returned an RMSE of 0.49 log units, placing eighth overall and third in the empirical methods category (submission ID ‘hdpuj’). Permutation feature importance analyses revealed that physicochemical descriptors could characterise predictive molecular features highly relevant to the kinase inhibitor fragment-like molecules.
KeywordsQSPR logP Physicochemical properties Machine learning SAMPL6
We thank the National Institutes of Health (Grant No. R01-GM124270) for their support in funding the SAMPL6 Challenges and associated experimental work.
- 15.Lowe EW et al (2011) Comparative analysis of machine learning techniques for the prediction of logP. In: 2011 IEEE symposium on computational intelligence in bioinformatics and computational biology (CIBCB), IEEE, ParisGoogle Scholar
- 19.Pedregosa F et al (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830Google Scholar