Finding the molecular scaffold of nuclear receptor inhibitors through high-throughput screening based on proteochemometric modelling
Keywords: Proteochemometric modelling, Nuclear receptor, Molecular scaffold, Cheminformatics
As ligand-dependent transcription factors, nuclear receptors (NRs) can be activated by important molecules such as steroid hormones, endogenous hormones, glucocorticoids and thyroid hormones [1, 2]. After activation, NRs regulate the expression of specific genes and thereby participate in essential physiological processes such as development, homeostasis and metabolism [1, 2]. Because NRs affect the expression of a large number of genes associated with diseases such as diabetes and hepatic adipose infiltration, they are considered appropriate therapeutic targets for drug discovery. To date, 48 nuclear receptors have been discovered in humans, and 23 of them are certified as drug targets by the U.S. Food and Drug Administration (FDA). Meanwhile, over 13% of FDA-approved drugs target these nuclear receptors. The discovery of novel nuclear receptor inhibitors is therefore of particular significance for the treatment of NR-related metabolic diseases. In drug design, a scaffold is the fixed part of a molecule that is essential for its biological activity, and scaffold-based strategies are widely used in drug discovery [5, 6, 7]. Finding a new scaffold often leads to the discovery of a new class of inhibitors with the potential to become future drugs [8, 9, 10]. Finding novel bioactive scaffolds is thus an essential step in drug design.
To discover the molecular scaffold of a class of molecules such as NR inhibitors, a large number of bioactive molecular structures must be screened and clustered to find the consensus structural domain. Traditionally, this screening involves titration experiments and is time-consuming, expensive and labor-intensive; it can be assisted by computer-aided drug design (CADD). In recent decades, methods including virtual screening [12, 13], molecular docking [14, 15], de novo drug design [16, 17, 18], pharmacophore modeling [19, 20, 21] and molecular dynamics [22, 23] have been introduced to find bioactive molecules for further drug design. In the early 1960s, the quantitative structure-activity relationship (QSAR) approach was established to discover the relationship between ligand and target. In general, conventional QSAR approaches use structural information and bioactivity values to predict the relationship between ligand and target. However, their predictive ability is limited to a single target, and they are unable to map relationships between multiple ligands and multiple targets. Their predictive ability is further limited because only ligand information is used for model construction [25, 26, 27]. To overcome these shortcomings, an approach that describes both ligand and target in order to analyze their relations quantitatively was introduced in 2001 and termed proteochemometric (PCM) modeling. The main advantage of PCM modeling is that it integrates information on both ligand and target, which makes the model applicable to multi-target screening, including GPCRs [29, 30, 31], proteases [32, 33, 34], kinases [35, 36] and reverse transcriptases [37, 38]. However, to the authors' knowledge, PCM modeling for NR-inhibitor prediction has rarely been reported.
In this article, two major steps, PCM modelling and scaffold finding, were carried out to guide the design of NR inhibitors. First, based on a total of 11 nuclear receptors and 9633 compounds with EC50 values derived from ONRLDB, a series of PCM models was generated to predict the inhibition ability of NR inhibitors. After rigorous validation on both internal and external validation datasets, our PCM model showed potential for high-throughput NR-inhibitor screening. It should be noted that the NR targets in the external dataset were not included in our training set. This means that, for NR proteins without enough bioactivity data to establish a traditional QSAR model, our model may still be able to support NR-inhibitor screening. Further, after molecular clustering based on our PCM model, novel bioactive scaffolds for NR inhibitors can be discovered. The potential bioactive scaffolds for different NR targets are proposed for future drug discovery of NR inhibitors.
Results and discussion
Construction of proteochemometric modeling
10-fold cross-validation results of different machine learning methods
Further, the contribution of the chemical descriptors was analyzed. Statistical analysis showed that the lipo-hydro partition coefficient (MolLogP in RDKit) makes the largest contribution among all ligand descriptors, which suggests that it may be a key property of molecules with potential inhibition ability (Additional file 2: Fig. S1). For both active and inactive compounds, the distribution of MolLogP follows a normal distribution, and the two distributions differ significantly according to a t test (P value < 0.0001). These results indicate that the lipo-hydro partition coefficient is important for the activity of NR inhibitors: active compounds typically have a MolLogP of around 5.775, while the MolLogP of inactive compounds is around 5.380. The importance and P values of the top 10 chemical structure descriptors can be found in Additional file 3: Table S2.
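The comparison of MolLogP between active and inactive compounds can be illustrated with a small sketch. The article does not state which variant of the t test was used, so Welch's unequal-variance form is assumed here. The function name welch_t and the sample values are hypothetical and are not taken from the dataset; a P value would additionally require the t distribution, which is left out to keep the sketch dependency-free.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # statistics.variance is the sample variance (n - 1 denominator);
    # dividing by n gives the squared standard error of each mean
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical MolLogP values, for illustration only
active_logp = [5.9, 5.6, 6.1, 5.7, 5.8]
inactive_logp = [5.3, 5.5, 5.2, 5.4, 5.6]
t_stat, dof = welch_t(active_logp, inactive_logp)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

A large positive t statistic indicates that the mean MolLogP of the first group is higher than that of the second relative to the spread of both groups.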
Evaluation of proteochemometric modeling
In this study, PCM modeling was systematically evaluated through both internal and external validation. The results of the PCM models at different bioactivity cutoffs are shown in Fig. 1b, and detailed model performance for all four protein descriptors can be found in Additional file 4: Table S3. All PCM models performed well in internal validation, achieving AUC values above 0.870 at every cutoff. In external validation, all PCM models also achieved satisfactory performance, with AUC values over 0.746. These results indicate that our PCM model is well suited to the prediction of NR-related inhibitors. In addition, the performance of the PCM models increased as the cutoff increased. This is probably caused by the imbalance between positive and negative data at different cutoffs. For example, when compounds with EC50 ≤ 1 were set as positive data and those with EC50 > 1 as negative data, the positive/negative ratios of the training set, testing set and external validation set were all close to 1 (Additional file 5: Table S4). When the cutoff was raised to 10, these ratios increased sharply to 12.14, 12.95 and 22.76, respectively (Additional file 5: Table S4). Several reports have also pointed out that the 1 μM cutoff may be more reasonable because it contains less noise. Therefore, the EC50 cutoff was set to 1 μM for further analysis.
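The effect of the cutoff on class balance, and the AUC metric itself, can be sketched as follows. The EC50 values and model scores are hypothetical, and the names label_by_cutoff, class_ratio and auc are illustrative. The AUC is computed through its rank-based definition (the probability that a positive compound is scored above a negative one), which is equivalent to the area under the ROC curve but is not necessarily the routine the authors used.

```python
def label_by_cutoff(ec50_values, cutoff):
    """Label a compound 1 (positive) if EC50 <= cutoff (in μM), else 0."""
    return [1 if v <= cutoff else 0 for v in ec50_values]


def class_ratio(labels):
    """Positive/negative ratio; infinite when there are no negatives."""
    pos = sum(labels)
    neg = len(labels) - pos
    return pos / neg if neg else float("inf")


def auc(labels, scores):
    """Probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


ec50 = [0.02, 0.4, 0.9, 1.5, 3.0, 4.2, 8.0, 25.0]  # hypothetical, in μM
for cutoff in (1, 5, 10):
    # The positive/negative ratio grows as the cutoff is raised
    print(cutoff, class_ratio(label_by_cutoff(ec50, cutoff)))

labels = label_by_cutoff(ec50, 1)
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]  # hypothetical model outputs
print(auc(labels, scores))
```

With a fixed set of compounds, raising the cutoff can only move compounds from the negative to the positive class, so the ratio is non-decreasing in the cutoff.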
Finding the molecular scaffolds for NR inhibitors
In general, the predictions of our PCM model matched the experimental values well. For the three peroxisome proliferator-activated receptor (PPAR) targets, the top 10 clusters of each target, NR1C1 (Fig. 2a), NR1C2 (Fig. 2c) and NR1C3 (Fig. 2d), were detected and marked in the corresponding sub-graphs. Both unique and overlapping scaffolds were detected among the PPAR targets. For example, target NR1C1 contains 7 bioactive scaffolds (marked S1 to S7), 2 inactive scaffolds (marked S8 and S9) and 1 mixed scaffold (marked S10) that contains both active and inactive compounds. Among these, scaffolds S1 and S6 were active for both NR1C1 and NR1C3 (Fig. 2d), while S8 and S9 were both inactive scaffolds. A different pattern was found for target NR1C2 (Fig. 2c). In NR1C2, 7 new scaffold clusters, marked S11 to S18, were detected. In addition, S8, a major inactive scaffold for NR1C1 and NR1C3, was identified as an active scaffold for NR1C2. Likewise, scaffold S2, which is active for NR1C1 and mixed for NR1C3, was identified as an inactive scaffold for NR1C2. The results for the two targets outside the PPAR group were quite different: entirely new scaffolds were discovered, as illustrated in Fig. 2e, f. Together, these results show that, even within the same protein family, the inhibitor scaffolds of different NR targets remain distinguishable.
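The labelling of a scaffold cluster as active, inactive or mixed can be sketched as below. The article does not give the exact rule, so the sketch assumes the simplest reading: a scaffold is active if every member compound is active at the chosen EC50 cutoff, inactive if none is, and mixed otherwise. The function name classify_scaffold, the scaffold keys and the EC50 values are hypothetical.

```python
def classify_scaffold(member_ec50, cutoff=1.0):
    """Label a scaffold cluster from its members' EC50 values (in μM)."""
    if not member_ec50:
        raise ValueError("cluster has no members")
    # Assumed activity rule: a compound is active when EC50 <= cutoff
    active = [v <= cutoff for v in member_ec50]
    if all(active):
        return "active"
    if not any(active):
        return "inactive"
    return "mixed"


# Hypothetical clusters for one target, keyed by an illustrative scaffold label
clusters = {
    "S_a": [0.05, 0.3, 0.8],
    "S_b": [4.0, 12.0],
    "S_c": [0.2, 6.5, 0.9],
}
scaffold_labels = {name: classify_scaffold(v) for name, v in clusters.items()}
print(scaffold_labels)
```

Running the same function on the member compounds measured against a different target can yield a different label for the same scaffold, which is the pattern described above for S8 and S2.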
It should also be noted that the bioactivity of a compound relies on multiple factors such as side-chain composition, functional groups, substituents and chirality. For instance, scaffold S10 (N-benzylbenzamide) covers several compounds, including compounds 1–3 (Fig. 2b). The molecular structures of these three compounds are extremely similar except for their chirality. The stereogenic centers of compound 1 (Benzenepropanoic acid, α-ethyl-4-methoxy-3-[[[[4-(trifluoromethyl)phenyl]methyl]amino]carbonyl]-, (αS)-) and compound 2 (Benzenepropanoic acid, α-ethyl-4-methoxy-3-[[[[4-(trifluoromethyl)phenyl]methyl]amino]carbonyl]-, (αR)-) have the absolute configurations S and R, respectively. Compound 3 is defined as a mixture of stereoisomers that may contain both the S and R configurations.
Computer-aided drug design (CADD) can assist and shorten the process of new drug discovery. To achieve this, one essential task is to pre-estimate the activity of different compounds against different target proteins. By introducing PCM models into CADD, the relationships between multiple compounds and multiple targets can be determined. Based on high-throughput screening of compounds, bioactive molecules can be clustered and essential molecular scaffolds can be detected to guide the future development of therapeutic drugs.
To perform high-throughput screening of bioactive inhibitors for targets from the NR family, 7267 bioactivity data points for 11 nuclear receptors were collected to establish an in silico model. Both internal and external validation showed that our PCM models are sensitive for NR-inhibitor prediction, which may be a benefit of our descriptors. For the target descriptors, the generalized sequence similarity descriptors contain information from 30 background targets from the NR family. Models based on these descriptors achieved better prediction performance on both the internal and external validation sets, which suggests that these descriptors can be extended to multiple targets from the NR family. For the chemical descriptors, the lipo-hydro partition coefficient makes the largest contribution to classification and the MolLogP parameter distinguishes active from inactive compounds, which may provide a clue for the future discovery of therapeutic NR inhibitors.
Another essential issue in PCM model construction is the choice of a suitable machine learning method. In this study, five machine learning methods, including both regression and classification approaches, were tested for PCM modeling. The results showed that the performance of the RF and DT classifiers was significantly higher than that of the other methods, which suggests that these algorithms may be more applicable to NR-inhibitor prediction.
After high-throughput screening of NR inhibitors, bioactive molecules can be clustered according to structural similarity, and the molecular scaffold enriched in each cluster can be detected to assist the process of drug design. In this article, the models selected after evaluation were used for molecular clustering for five major NR targets. The results showed that our PCM model can successfully predict potential NR inhibitors, in good agreement with the experimental EC50 values. For each NR target, our algorithms are able to predict potential therapeutic inhibitors and discover molecular scaffolds for future drug development. Currently, this method has been established for NR proteins, and it can be extended to other protein targets as experimental data accumulate.
Protein target descriptor
Here, both sequence similarity descriptors and structure similarity descriptors were used to characterize those five nuclear receptors. First, 30 protein targets from the NR family were derived from the Protein Data Bank (PDB) as background. For the 11 protein targets in our dataset, the sequence and structure similarities to these 30 background protein structures were calculated by pairwise alignment. Sequence alignment was performed with the Smith-Waterman algorithm, while structure alignment was performed with jFATCAT. In this way, two types of generalized target descriptor, the sequence similarity descriptor (T1) and the structure similarity descriptor (T2), were obtained for each protein target. For comparison, specific descriptors based on 5 protein targets from our training set instead of the 30 background protein targets were also established, recorded as T3 (specific sequence similarity descriptor based on 5 protein targets) and T4 (specific structure similarity descriptor based on 5 protein targets). The two generalized target descriptors can be found in Additional file 10: Table S8-1, 2, and the two specific target descriptors are listed in Additional file 11: Table S9-1, 2.
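The construction of a sequence similarity descriptor can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the Smith-Waterman score uses a plain match/mismatch scheme with a linear gap penalty (protein alignments usually rely on a substitution matrix such as BLOSUM62), and each score is normalised by the target's self-alignment score, a choice the article does not specify. The function names and the short sequences are hypothetical.

```python
def smith_waterman_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty."""
    prev = [0] * (len(seq_b) + 1)  # previous row of the scoring matrix
    best = 0
    for a in seq_a:
        curr = [0]
        for j, b in enumerate(seq_b, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            # Local alignment: scores are floored at zero
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def similarity_descriptor(target_seq, background_seqs):
    """One normalised similarity value per background sequence."""
    if not target_seq:
        raise ValueError("target sequence is empty")
    self_score = smith_waterman_score(target_seq, target_seq)
    return [smith_waterman_score(target_seq, bg) / self_score
            for bg in background_seqs]


# Hypothetical toy sequences; the article uses 30 background NR targets,
# which would give a descriptor of length 30 per target
background = ["MKTAYIAK", "MKTAYLAK", "GGSGGS"]
descriptor = similarity_descriptor("MKTAYIAK", background)
print(descriptor)
```

Each target is thus represented by a fixed-length vector whose length equals the number of background sequences, so targets that were never seen during training can still be described in the same feature space.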
Chemical structure descriptors were calculated using RDKit (release version 2016). RDKit provides a range of chemical structure descriptors covering both chemical and physical properties, such as molecular weight, hydrogen bond donor count, hydrogen bond acceptor count, rotatable bond count and LogP. In addition, RDKit contains many types of chemical descriptors derived from other tools and from the literature, such as MOE-type descriptors for partial charges, MR contributions, LogP contributions, EState indices and surface area contributions, integrated from the Molecular Operating Environment (MOE). In total, 187 descriptors were used to characterize the structural features of each inhibitor (Additional file 12: Table S10).
In this study, 4 proteochemometric models were created from the training set based on different combinations of descriptors (T1-L, T2-L, T3-L, T4-L). All models were implemented in scikit-learn (version 0.18.1) using Random Forest (RF) with default parameters. For classification, different EC50 thresholds were selected to distinguish positive from negative data. Here, three thresholds (EC50 < 1 μM, EC50 < 5 μM and EC50 < 10 μM) were used for classification.
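How a target descriptor (T) and a ligand descriptor (L) might be combined into one training row can be sketched as below. Concatenating the two vectors is the usual PCM convention and is assumed here; the article does not spell out the combination step. The function name build_pcm_rows, the identifiers and all numeric values are hypothetical. The resulting rows could then be passed to a classifier such as the Random Forest mentioned above, which is not shown so that the sketch has no third-party dependency.

```python
def build_pcm_rows(records, target_desc, ligand_desc, cutoff=1.0):
    """Return (X, y) for (target_id, ligand_id, EC50 in μM) records."""
    X, y = [], []
    for target_id, ligand_id, ec50 in records:
        # One row per target-ligand pair: target features, then ligand features
        X.append(list(target_desc[target_id]) + list(ligand_desc[ligand_id]))
        # Positive class when EC50 is below the chosen threshold
        y.append(1 if ec50 < cutoff else 0)
    return X, y


# Hypothetical descriptors: 2 target features and 3 ligand features
target_desc = {"NR1C1": [1.0, 0.6], "NR1C2": [0.6, 1.0]}
ligand_desc = {"lig1": [5.8, 410.2, 2], "lig2": [5.3, 388.4, 1]}
records = [
    ("NR1C1", "lig1", 0.3),
    ("NR1C1", "lig2", 7.0),
    ("NR1C2", "lig1", 2.5),
]
X, y = build_pcm_rows(records, target_desc, ligand_desc, cutoff=1.0)
print(X)
print(y)
```

Because the same ligand appears in several rows with different target features, a single model can learn that a compound is active against one receptor and inactive against another.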
Molecular scaffold searching
For each protein target, the similarity of the corresponding molecules was analyzed using the Rubberbanding Forcefield approach in DataWarrior (release version 4.5.2). First, all molecules were translated into a series of descriptors encoding various aspects of chemical structure, including both 2-D and 3-D structural information. Next, the full similarity matrix between all molecules was calculated and the most similar neighbors of every molecule were located. Then, all molecules were relocated stepwise so that similar molecules were positioned close to each other. Finally, molecules with a structural similarity over 0.95 were clustered together. For each cluster, the major Bemis-Murcko scaffold (covering over 80% of the molecules in the cluster) was defined as the representative scaffold. For several clusters no major scaffold could be detected; in those cases, the maximum common substructure of each pair of scaffolds was calculated with RDKit and the major substructure was defined as the representative scaffold. After that, the Bemis-Murcko scaffold for each cluster was derived and analyzed.
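The selection of a representative scaffold by the 80% coverage rule can be sketched as follows. The sketch assumes that a Bemis-Murcko scaffold has already been computed for every molecule in a cluster and is available as a string; the function name representative_scaffold and the example clusters are hypothetical. Returning None stands for the case in which no major scaffold exists and the maximum common substructure step described above would be needed instead.

```python
from collections import Counter


def representative_scaffold(scaffolds, min_coverage=0.8):
    """Return the scaffold covering more than min_coverage of a cluster, else None."""
    if not scaffolds:
        return None
    scaffold, count = Counter(scaffolds).most_common(1)[0]
    # "Covering over 80%" is read as a strict inequality
    return scaffold if count / len(scaffolds) > min_coverage else None


# Hypothetical clusters; each entry is the scaffold string of one molecule
benzylbenzamide = "O=C(NCc1ccccc1)c1ccccc1"  # N-benzylbenzamide
benzene = "c1ccccc1"
cluster_a = [benzylbenzamide] * 9 + [benzene]       # 90% coverage
cluster_b = [benzylbenzamide] * 2 + [benzene] * 2   # 50% coverage

print(representative_scaffold(cluster_a))
print(representative_scaffold(cluster_b))
```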
TYQ and DFW developed the algorithm. TYQ and JXQ wrote the manuscript. DFW constructed the PCM model. ZWC supervised the whole project and modified the manuscript. All authors read and approved the final manuscript.
This work was supported in part by grants from the National Key R&D Program (2017YFC0908400, SQ2017YFC170310), the Fundamental Research Funds for the Central Universities (1350219165), the National Postdoctoral Program for Innovative Talents (BX201600033) and the China Postdoctoral Science Foundation Funded Project (2017M611451).
The authors declare no competing financial interests.
Availability of data and materials
All raw data used in this study are contained in the supplementary files.
Consent for publication
Ethics approval and consent to participate
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 12. Geromichalos GD, Alifieris CE, Geromichalou EG, Trafalis DT (2016) Overview on the current status of virtual high-throughput screening and combinatorial chemistry approaches in multi-target anticancer drug discovery; Part I. J Buon 21(4):764–779
- 15. Ragno R, Mai A, Sbardella G, Artico M, Massa S, Musiu C, Mura M, Marturana F, Cadeddu A, La Colla P (2004) Computer-aided design, synthesis, and anti-HIV-1 activity in vitro of 2-alkylamino-6-[1-(2,6-difluorophenyl)alkyl]-3,4-dihydro-5-alkylpyrimidin-4(3H)-ones as novel potent non-nucleoside reverse transcriptase inhibitors, also active against the Y181C variant. J Med Chem 47(4):928–934
- 34. Prusis P, Junaid M, Petrovska R, Yahorava S, Yahorau A, Katzenmeier G, Lapins M, Wikberg JES (2013) Design and evaluation of substrate-based octapeptide and non substrate-based tetrapeptide inhibitors of dengue virus NS2B-NS3 proteases. Biochem Biophys Res Commun 434(4):767–772
- 38. van Westen GJP, Wegner JK, Geluykens P, Kwanten L, Vereycken I, Peeters A, IJzerman AP, van Vlijmen HWT, Bender A (2011) Which compound to select in lead optimization? Prospectively validated proteochemometric models guide preclinical development. PLoS ONE 6(11):e27518. https://doi.org/10.1371/journal.pone.0027518
- 39. Nanduri R, Bhutani I, Somavarapu AK, Mahajan S, Parkesh R, Gupta P (2015) ONRLDB-manually curated database of experimentally validated ligands for orphan nuclear receptors: insights into new drug discovery. Database-Oxford
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.