Introduction

Quantitative Structure-Activity Relationship (QSAR) modelling is an important field of research in current medicinal chemistry. QSAR models relate the structure of chemical compounds to their biological activities:

$${\text{activity = }}f ( {\text{structure)}} .$$

The aim of building such models is to explain and/or to predict the activity of a group of compounds and thus to facilitate and direct search for new active substances.

In QSAR, the structure of a chemical compound is represented mathematically by molecular descriptors. These can be based on physicochemical properties measured experimentally (e.g. partition coefficient LogP), quantities calculated by quantum chemistry methods (e.g. HOMO/LUMO energies) (Karelson et al. 1996) or be derived from other theoretical bases (e.g. chemical graph theory, (Balaban 1985; Helguera et al. 2008) theory of quantitative chirality (Ostrowski et al. 2012; Jamróz et al. 2012 etc.). The number of currently available descriptors is enormous (Dearden 2016). There are several applications designed specifically at their calculation [for example DRAGON by Talete Srl that computes ca. 5000 descriptors (Talete Srl 2010)] and such a functionality is present in probably all drug design and discovery suites like Accelrys Discovery Studio (Accelrys Software Inc. 2009), Schrödinger Suite (2017), molecular operating environment (Chemical Computing Group ULC 2017) to mention only a few.

In a typical situation a researcher has at his or her disposal a scarce number of compounds with determined activity (like 20 to several dozen) and an alluring plenitude of molecular descriptors (hundreds or thousands) to be used for constructing QSAR equations. This makes the danger of overfitting data a very likely one.

The common statistical parameters, like coefficient of determination, standard deviation, significance etc. are not able to discern ‘good’ models from overfitted ones (Rücker et al. 2007). This cannot be also done by any kind of internal validation procedures, like leave-one-out, leave-many-out etc. An ultimate test of validity and utility of a given QSAR model is always the external validation on an independent, large enough, properly designed set of new derivatives (Gramatica 2007). This is, however, rarely possible due to the lack of resources and/or time. In such circumstances, perhaps the only affordable way to see if studied QSAR models work better than the pure chance is to simulate the ‘predictive power’ of the pure chance. Two tests could be of help here: y-scrambling and pseudo-descriptors test (Clark et al. 2001; Rücker et al. 2007).

The y-scrambling (y-randomization, response randomization) is a form of a permutation test, where the values of the response variable (y) are randomly ascribed (scrambled) to different compounds, while the descriptors values (x’s) are left intact. Scrambled data are then used for training QSAR models. In the pseudo-descriptors test, the descriptors (x’s) are replaced by random numbers (pseudo-descriptors) that are also subsequently used to train QSAR equations.

Both tests are run over several to several dozen times, and from each run best coefficient of determination r 2, leave-one-out cross-validation correlation coefficient q 2 and perhaps other adequate statistical parameters are collected. The mean highest r 2 (mhr2) and q 2 (mhq2) along with their standard deviations (SD) are calculated. This allows to assess the ‘predictive power’ of the pure chance, and the truly good models should have their r 2 and q 2 significantly better than this.

Unfortunately, these simple tests are very often not included into QSAR studies. One of the reasons, apart from their time-consuming character, might be in a difficulty in obtaining random data for simulations. Not every researcher is enough computer proficient to generate them on his own, and not everyone has access to good statistical software that could accomplish this without much trouble. The software, in majority, if not in all cases, is also not suited to working with common formats of chemical table files like SDF (Dalby et al. 1992) that are usually accepted by QSAR modelling software. The need for manual operations on numerous, large spreadsheets of numbers and chemical files can be an actual obstacle, and discouraged researchers omit these insightful tests.

In order to facilitate data preparation for the tests, a simple and free software tool SCRAMBLE’N’GAMBLE is proposed. It is a stand-alone Java application with both graphic user interface as well as a command-line manageability. SCRAMBLE’N’GAMBLE reads in comma-separated files (csv) and chemical table files by MDL (sdf) containing descriptors and activity data. It can perform y-scrambling as well as generate pseudo-descriptors given number of times and output the results into a csv file, but also directly into a sdf file immediately usable in most QSAR programs. SCRAMBLE’N’GAMBLE is available free of charge at: http://www.drugdesign.pl/scramble-n-gamble/.

In order to demonstrate the importance of simulating random chance performance along with building QSAR models, let us expose the following Cases: I. classical QSAR modelling (descriptors based on 2D structures) of steroids’ affinity for the sex-hormone-binding globulin, II. classical QSAR modelling (descriptors based on 2D and 3D structures) of steroids’ affinity for the corticosteroid-binding globulin, III. Fujita-Ban QSAR modelling of the effective dose of some fentanyls in the mouse hot plate test and IV. a classification model for discerning glucocorticoid receptor binders and non-binders.

Experimental

Molecules, activity data, descriptor calculation and modelling procedure

In all Cases, a general workflow was as follows. First, molecules with activity data for a given molecular target were collected and divided into a training set and a test set. Second, molecular descriptors were calculated. Constant and near-constant descriptors were deleted from the pool, and further reduction was done by checking intercorrelations between descriptors. In pairs where the coefficient of correlation was larger than 0.90, one of the descriptors was randomly excluded. Third, QSAR models were trained. Fourth, random data for y-scrambling and pseudo-descriptors test were generated using SCRAMBLE’N’GAMBLE and the tests were performed by training QSAR models in the same way as the ones based on true data were trained. Fifth, the performance of the latter was checked on test sets.

Details of the workflow for singular Cases are given in Table 1.

Table 1 Details of the workflow for singular cases

Evaluation of regression models

For regression models in Cases I and II, standard statistical metrics were applied. These are:

r 2 coefficient of determination in the training set,

q 2 cross-validated coefficient of determination in the training set (internal validation, leave-one-out procedure)

R 2 coefficient of determination in the test set (external validation).

The r 2 and q 2 values were compared to mean highest r 2 (mhr2) and q 2 (mhq2) from y-scrambling and pseudo-descriptors tests in order to check whether the models perform better than chance models.

Additionally, \({}^{c}R_{p}^{2}\) and \(r_{{m \left( {\text{test}} \right)}}^{2}\) parameters were applied calculated as proposed by the Roy group (Pratim Roy et al. 2009; Mitra et al. 2010). Both these metrics should be greater than 0.5 for an acceptable model. The fulfilment of this criterion with regard to \(r_{{m \left( {\text{test}} \right)}}^{2}\) parameter ensures that a model predicts the exact values of the response data. High \({}^{c}R_{p}^{2}\) values allow to consider a model to be robust and not just the outcome of a chance correlation.

For the models in Case II, additional parameters were checked. First, the internal cross-validation was performed also in leave-three-out procedure, giving \(q_{(L3O)}^{2}\)—a cross-validated coefficient of determination in the training set (leave-three-out). Furthermore, another type of randomization experiment was performed (Wold et al. 1998). Here y-scrambled data (25 runs) were used to refit the Case II models. The obtained r 2 and q 2 values were then plotted against the correlation coefficients of original y and permuted y data. The resulting intercepts (\(R_{\text{int}}^{2}\) and \(Q_{\text{int}}^{2}\)) are expected to be below 0.4 and 0.05 respectively for valid models.

Evaluation of decision trees

For all decision trees (Case IV) the number of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) was collected. The following metrics were used for assessment of the decision trees: accuracy (ACC), precision (PREC), sensitivity (SENS), specificity (SPEC), fall-out (FALL) and F1-score (F1). They are given by the expressions:

$${\text{ACC = }}\frac{\text{TP + TN}}{\text{TP + TN + FP + FN}}$$
(1)
$${\text{PREC = }}\frac{\text{TP}}{\text{TP + FP}}$$
(2)
$${\text{SENS = }}\frac{\text{TP}}{\text{TP + FN}}$$
(3)
$${\text{SPEC = }}\frac{\text{TN}}{\text{TN + FP}}$$
(4)
$${\rm{FALL = 1}} - {\rm{SPEC}}$$
(5)
$${\text{F1 = }}\frac{{ 2 {\text{TP}}}}{{ 2 {\text{TP + FP + FN}}}}{ = 2} \times \frac{{{\text{PREC}} \times {\text{SENS}}}}{\text{PREC + SENS}}.$$
(6)

Results and discussion

Software description

SCRAMBLE’N’GAMBLE is a fast and user-friendly software for generation of random data for the purposes of QSAR model validation. The program can read and output both comma-separated files (csv) as well as chemical table files by MDL (sdf) containing molecular descriptors and activity data. Upon selecting which fields should be scrambled or replaced with random data (pseudo-descriptors), the user is able to obtain a required number of randomized data sets in csv or sdf files. The latter are most often accepted by QSAR modelling software. SCRAMBLE’N’GAMBLE may be run in a graphic user interface mode (Fig. 1), but it is also manageable in the command-line mode.

Fig. 1
figure 1

General view of SCRAMBLE’N’GAMBLE interface

The generation of random (or to be said more precisely: pseudo-random) numbers is achieved using Mersenne Twister 19937 generator (Matsumoto and Nishimura 1998) implemented in UncommonMaths Java library (Dyer 2006). The generator has been shown to generate high quality random numbers and pass many statistical tests for randomness. It is possible to select a distribution from which random numbers will be generated: uniform, normal, binomial, Poisson or exponential. The user may also want to keep original distributions of variables, and in such case the program will perform x-scrambling. SCRAMBLE’N’GAMBLE is available free of charge at: http://www.drugdesign.pl/scramble-n-gamble/.

The examples and importance of performing random data tests in QSAR validation are provided by considering four close-to-real-world modelling situations.

Case I

Sex-hormone-binding globulin (SHBG) is a transport glycoprotein produced in all vertebrates except for birds. SHBG binds preferentially sex hormones (androgens and oestrogens) in the bloodstream and in this way it has impact on the concentration of their free, supposedly biologically active, fractions. Its role in various endocrine disorders is well described (Anderson 1974; Cunningham et al. 1983; Key et al. 2002; Hammond 2011; Caldwell and Jirikowski 2014). Environmental toxicology points also to the importance of SHBG in the endocrine disruption in men and animals caused by exogenous substances (Wilson et al. 2007; Saxena et al. 2014; Hong et al. 2015).

In QSAR studies, the Cramer data set of 21 steroids (Fig. 2) binding to SHBG became a benchmark set for validating novel QSAR methodologies or descriptors (Cramer et al. 1988; Coats 1998). Therefore, it is a good point for illustrating the danger of chance correlations.

Fig. 2
figure 2figure 2

Steroid molecules used in Cases I and II. The figures under names are binding affinities to corticosteroid-binding globulin (upper figures, Case II) and sex-hormone-binding globulin (lower figures, Case I)

In our study, we trained QSAR models of up to 3 independent variables, using 89 2D molecular descriptors. The top 10 models are presented in Table 2. Their statistical parameters are not the best ones, but they could be perceived as acceptable by some QSAR modellers (r 2 = 0.762–0.811, q 2 = 0.613–0.706). On the other hand, the equations are physically uninterpretable as almost all descriptors (except for P_VSA_s_4 and NsssCH) cannot be translated (at least without great effort) into the language of atoms, functional groups or other chemical structures. Still, many authors ‘interpret’ similar models just by providing brief descriptions of how the descriptors are calculated and conclude that the equation(s) could serve for screening chemical libraries in search of new active compounds.

Table 2 Top 10 QSAR models obtained by the GFA procedure (Case I)

The moderately optimistic r 2 and q 2 become not optimistic at all if one looks at the outcomes of the models trained on y-scrambled activity data or those trained on pseudo-descriptors (Table 3). It turns out that none of the obtained ‘real’ models is better than the 99th percentile (+2.3 SD) of the models found in the y-randomization or pseudo-descriptors tests (mhr2 + 2.3 SD of models trained on pseudo-descriptors is as high as 0.825). Further, external validation on several ligands (6–12, depending on the applicability domain of a given model, (Tables 2 and SI-2) extracted from the extended steroid set (Cherkasov et al. 2008) yields very poor results, with the coefficient of determination in the test set (R 2) not higher than 0.270.

Table 3 Predictive power of the chance models (Case I)

The models in Table 2 are thus: internally quite good but uninterpretable and not better than random models. As such, they could be expected to have poor predictive power, what is then shown in external validation (Tables 2 and SI-2).

Case II

In the second of the studied cases, we used the same Cramer steroid data set (Cramer et al. 1988; Coats 1998), but this time the target property was binding affinity for the corticosteroid-binding globulin (CBG). CBG is another steroid transporting protein, but contrary to SHBG, it binds preferentially corticosteroids and progestogens, while androgens or oestrogens have only moderate affinity for it (Rosner 1990). The protein is implicated in the inflammatory response by modulating the corticosteroid concentration at the site of inflammation (Klieber et al. 2007). On the other hand, under physiological conditions it buffers blood cortisol levels. In CBG-deficient individuals observed are symptoms of extreme tiredness, hypotension or chronic muscle pain (Marathe and Torpy 2012; Torpy et al. 2013). Some research has been also made on the role of CBG in glucose metabolism (Fernández-Real et al. 1999), obesity (Ousova et al. 2004) or sperm motility (Teves et al. 2010). Recently, an interesting proposition was put forward to use engineered CBGs as drug delivery agents (Chan et al. 2014).

In the study, we divided the CBG set into training and test subsets (in proportion 21:10). The GFA procedure was used to find equations of up to 3 variables, using 49 descriptors derived from 2D and 3D molecular structures. The top 10 models are presented in Table 4.

Table 4 Top 10 QSAR models obtained by the GFA procedure (Case II)

The presented models have good statistical parameters (r 2 = 0.863–0.892, \(q_{(LOO)}^{2}\) = 0.694–0.823, \(q_{(L3O)}^{2}\) = 0.600–0.826). A look at the performance of the chance models allows to conclude that in this modelling situation (21 data points and 49 molecular descriptors) the probability of chance correlations is lower than in the Case I (Table 5) . All obtained QSAR models are significantly better than y-scrambled or pseudo-descriptor models. Their additional advantage is clear physical meaning of the variables used (except for two topological descriptors). External validation on several ligands (7–8, depending on the applicability domain of a given model, Tables 4 and SI-3) yields both poor and good results. Three models have external R 2 much lower than 0.5, but on the other hand in the case of the best two (model 3 and 7) the value is 0.732 and 0.691, which is a decent outcome. Model 3 fulfils also the widely accepted criteria for QSAR model predictive power (Golbraikh and Tropsha 2002): q 2 > 0.5, R 2 > 0.6, \(\frac{{R^{2} - R_{0}^{2} }}{{R^{2} }} < 0.1\) and 0.85 ≤ k ≤ 1.15, where \(R_{0}^{2}\) denotes external coefficient of determination forced through the origin, and k is a slope of the regression line through the origin. Here, the value of \(r_{{m \left( {\text{test}} \right)}}^{2}\) parameter is 0.630 and this further supports the predictive power of the model with regard to exact affinity values of the test compounds. Note also that the model has good \(R_{\text{int}}^{2}\) and \(Q_{\text{int}}^{2}\) metrics (their values provided in Table SI-4 in Electronic Supporting Material).

Table 5 Predictive power of the chance models (Case II)

Experimental structures of the corticosteroid-binding globulin co-crystallized with cortisol or progesterone (Fig. 3. PDB accession codes: 2V95, 4BB2) allow to interpret the models in structural terms (Klieber et al. 2007; Gardill et al. 2012). The interaction of corticosteroids or progesterone with CBG depends mainly on hydrogen bonds formed by polar functions at C and D steroidal rings (IUPAC steroid nomenclature). Although in our models, no charge descriptors for C- and D-rings atoms are present, this is accounted for by shape descriptors like Shadow_Zlength or srcm2. The presence or absence of pharmacophoric polar elements (C17 chain with a keto group, C11 hydroxyl group etc.) affects the size of the molecule or non-superposability on its mirror image and thus these important features are indirectly included into equations. On the other hand, q3 descriptor depicts electrostatics of the A ring. If we plot q3 and K aff, there appear three clusters (Fig. 4). The lowest q3 values characterize molecules with a hydroxyl group attached to C3 atom. The middle three are those with C3-keto group but with the charge modified due to a C2-substituent or saturation of the C4–C5 double bond (dihydrotestosterone). The third cluster contains molecules with C3-keto group. There exists some rough correlation between q3 and K aff (r 2 = 0.690) showing that the C3-keto group (with its geometry and electrostatics) is preferred over C3-hydroxyl, perhaps due to a formation of more favourable hydrogen bonds network with water and surrounding amino acids of the binding site. The clustering achieved by q3 is refined by the shape descriptors (bearing also indirectly information on the most important pharmacophoric elements) or the topological JX descriptor (the role of which is not easily interpretable on its own) and thus good QSAR models are obtained.

Fig. 3
figure 3

Progesterone in the binding site of CBG (PDB accession code: 4BB2)

Fig. 4
figure 4

Plot of q3 and K aff

Concluding, the models obtained in Case II are not only internally good, but also significantly better than chance correlations in this modelling situation. Further, they are well-interpretable. As such, they may be expected to possess some predictive power, what is shown by external validation.

Case III

Case III represents a different modelling situation than the previous two, since it was attempted to build Fujita-Ban models (Fujita and Ban 1971). This type of QSAR analysis uses variables that are discrete indicators (taking 0 or 1 values) of presence or absence of particular structural elements in a molecule. Fujita-Ban models have a clear physical sense, but on the other hand they contain multiple parameters. The ratio of the number of equation variables to the number of data points is usually larger than in ‘typical’ QSARs with variables of a continuous character.

In this Case, we considered a group of 36 active (training set) and 10 inactive (test set) fentanyl derivatives (3-methyl-1,4-disubstituted piperidines) (Lalinde et al. 1990) (Table SI-1). Fentanyls or more basically 4-anilidopiperidines are one of the most important groups of analgesics. Since the discovery of fentanyl in the late 1950s (Janssen et al. 1963), numerous derivatives with varying activity have been synthesized and described (Vardanyan and Hruby 2014). Four of them are present in medicinal practice and these are fentanyl, alfentanil, sufentanil and remifentanil. They are used for pain management in terminally ill cancer patients and anaesthesia. Fentanyls act at the µ-opioid receptor (MOR), belonging to the family A of G-protein coupled receptors (GPCR). Unfortunately, this class of analgesics is not free of typical unwanted side effects of opioids (Chaney 1995) nor of their potential for abuse (Skulska et al. 2005; Algren et al. 2013; Mounteney et al. 2016).

The dependent variable for QSAR model building was the effective dose ED50 in mouse hot plate test (analgesic activity test). Multiple Linear Regression correlated indicator variables with the activity to give an equation the terms of which are presented in Table 6. The plot of experimental vs predicted activities is given in Fig. 5. The equation has a moderate r 2 of 0.718 and large errors of terms coefficients, rendering a few of the terms insignificant. On the other hand the predictive power of chance models in this particular modelling situation is rather low, and even such moderately good QSAR model is better than the best predictions trained on random data (Table 7). Large errors may be attributed to inaccuracies of the experimental data (in vivo testing), but still the model is able to predict inactivity of six of 10 compounds not used in model training. In the case of the remaining four, it predicts low or very low activity (Table SI-1).

Table 6 Fujita-Ban QSAR model of fentanyls activity (Case III)
Fig. 5
figure 5

Plot of predicted and experimental activities for the QSAR model in Case III

Table 7 Predictive power of the chance models (Case III)

As to the model interpretability, it must be said that statistical insignificance of the terms causes any interpretations to be only rough in their nature, even though all terms are physically well-defined. Nevertheless, the coefficients of L-descriptors (Table 6) seem to fit the Structure-Activity Relationship knowledge on fentanyl derivatives, with the following order of L-substitution preference: thienylethyl (as in sufentanil) > phenylethyl (as in fentanyl) > tetrazolylethyl (as in alfentanil) (Volpe et al. 2011). Regarding the R-part of the molecules, it is clearly visible that R-methoxymethyl is more favourable for analgesic activity than its branched (R–CH(CH3)OCH3 ) or rigidified (R-furoyl) counterparts. The freedom of rotation and lack of steric hindrance may allow more facile formation of hydrogen bonds. Unfortunately, the role of 3-Me stereochemistry is not well rendered in the model by the statistically insignificant coefficient. In general, however, it is well-known that 3-cis substituents are more active (Vuckovic et al. 2009). No clear conclusions may be drawn about X substituents, again due to the insignificance of the coefficients.

The model presented in Case III is most probably not a random one, but still it is rather inaccurate. As mentioned, large coefficient errors are attributable to the inaccuracies of in vivo data. Thus, even though the model is not random and partially interpretable, it may be of only partial utility.

Case IV

In the last Case, the objective was to create a classification model able to discern glucocorticoid receptor (GR) binders and non-binders. GR is a nuclear receptor-binding corticosteroid and acts as a transcription factor to up- or downregulate the expression of certain genes (Luisi et al. 1991; Yudt and Cidlowski 2002). It is involved in maintaining homeostasis by affecting inflammatory responses, cellular proliferation and differentiation in target tissues (Funder 1997). GR ligands include classical steroidal glucocorticoids which are used for tackling diseases involving inflammation (van der Velden 1998; Barnes 1998), for immunosuppression (Coutinho and Chapman 2011) or for cancer treatment (Coleman 1992; Vaidya et al. 2010). Current medicinal chemistry focuses on development of selective glucocorticoid receptor modulators (based on scaffolds different from the steroidal), which would be void of typical side effects of steroidal glucocorticoids (De Bosscher 2010).

For the modelling purposes, we decided to mimic a most common real-world situation (as for example in virtual screening experiments), where the number of receptor binders is much smaller than that of non-binders. Therefore, we decided to keep the original proportion of actives vs decoys occurring in the DUD-E data set (Mysinger et al. 2012) that is 1:36. The machine learning algorithm obtained the classification model presented in Fig. 6. It is a simple decision tree with a maximal node depth being three. The model has good statistical parameters of internal predictions (Table 8). Models trained on random data have significantly lower accuracies, precisions and specificities and significantly higher fall-out rates, but on the other hand they are comparably sensitive. F1-score, a measure considering both precision and sensitivity, is however much better for the model trained on true data. The quality of the decision tree may also be assessed by comparison to no-model predictions: ‘all binders’, ‘all non-binders’ or ‘coin-toss’. The analysis of their parameters (Table 8) gives optimistic results, with precision and F1-score again much better in the case of the model trained on true data.

Fig. 6
figure 6

Scheme of the classification tree (Case IV). Eig02_EA(ri) eigenvalue n. 2 from edge adjacency matrix weighted by resonance integral, GATS7e Geary autocorrelation of lag 7 weighted by Sanderson electronegativity, nRCONR2 number of tertiary amides (aliphatic), NssssC number of atoms of type ssssC (>C<), where < or > are two single bonds, qnmax maximum negative charge

Table 8 Statistical parameters of the decision tree (Case IV) and comparison with different random models and no-model predictions

Regarding the interpretability of the model, it must be concluded that even though some of the descriptors used in the model are physically well understandable, the tree does not allow to provide explicit statements about what structural features are important for GR binding. The model is thence uninterpretable. Still, when applied for classification of the test set containing 1850 molecules (50 binders and 1800 decoys), it performs correctly for about 93% of cases. The precision (0.26) and F1-score (0.40) are here similar to the ones for the internal predictivity.

Thus, the classification tree in the Case IV is both internally good as well as better than random. The model is not easily interpretable, however physical interpretability is not what is usually expected of classification models. The most important here is good predictivity, what is shown in external validation.

Conclusions

Since the danger of overfitting QSAR models, when working on large descriptor pools, is very high, it is desirable to perform tests showing the performance of models built on random data. In this study we introduce a simple software tool SCRAMBLE’N’GAMBLE that is aimed at facilitating data preparation for y-scrambling and pseudo-descriptors tests. As shown in the Cases studied in the paper, these tests may be applied to all sorts of QSAR techniques, including both classical linear equations, Fujita-Ban models or classification trees. Their results indicate what the quality of a studied model is like in comparison to chance models obtained from random data. While the non-randomness is not the ultimate hallmark of QSAR models’ possible utility, it is a good practice to consider it along with internal statistical parameters and interpretability of the model. On the other hand, if a model performs no better than chance, it is very probable that it will not be of any use in predicting activities of novel compounds. SCRAMBLE’N’GAMBLE (available for free at: http://www.drugdesign.pl/scramble-n-gamble/) is hoped to help QSAR researchers to perform y-scrambling and pseudo-descriptors testing.