1 Introduction

In drug discovery, new molecules undergo clinical trials with human subjects only after passing numerous checks for safety and potency in biological test systems. Often desired is a drug suitable for oral administration, i.e., a molecule that can cross the cellular membranes separating the gastrointestinal system from the blood vessels. Cell assays with MDCK I cells are used to assess membrane penetration. After absorption, the blood vessels distribute the molecule across the organism and bring it to its site of action. Blood contains many proteins that bind a substantial fraction of the compound; this is measured as plasma protein binding (PPB). On its way, the molecule passes through the liver, which contains enzymes able to metabolize many types of chemical substances, thus reducing the active drug’s concentration (clearance). In drug discovery, a suspension of human liver microsomes is used to assess intrinsic clearance. An important measure to optimize for a bioactive molecule is its plasma exposure after oral administration, often expressed as the “area under the curve” (AUC), i.e., the concentration of the active molecule in blood plasma integrated over time. Bioavailability depends on multiple properties of the molecule, including cell layer permeability and clearance in the liver.

Medicinal chemists need quantitative models that allow them to prioritize the most promising molecules for biological testing. Quantitative structure–activity relationship (QSAR) modeling is a central technology in drug discovery and has been investigated in multiple publications [1, 2]. However, the chronological order in which information becomes available in drug discovery project workflows has rarely been investigated. A chemical series in drug discovery usually starts with one or a few molecules, often with modest activity on the target protein. These starting compounds are modified by medicinal chemists to improve their properties. In a maturing drug discovery project, compounds become “smarter” because they incorporate more information from previous measurements. Here, we show how to set up quantitative machine learning models in an industrial drug discovery context in order to predict biological properties.

2 Methods

2.1 Data Sets (Table 1)

Each data set contained molecules from the same chemical series that shared an identical ‘backbone’ chemical substructure. Our first dataset contained measurements of the half-maximal effective concentration (EC50) on the target protein for 1400 molecules. Far fewer data points were available for the other four datasets, which originated from active drug discovery programs. PPB is an important measure to assess the concentration of the free (unbound) molecule in the blood.

Table 1. Data sets used for machine learning models.

  Dataset   Measured property                              Molecules
  1         EC50 on the target protein                     1400
  2         Plasma protein binding (PPB)                   129
  3         Permeability (MDCK I cells)                    89
  4         Intrinsic clearance (human liver microsomes)   179
  5         AUC after oral administration (rat)            182

Dataset 2 contained 129 molecules. The MDCK I permeability dataset (dataset 3) contained data from 89 molecules that were tested in a cell permeability assay. High permeability values are desired if the compound is to pass the intestinal membranes in humans. The intrinsic clearance dataset (dataset 4) contained data for 179 molecules; to determine their metabolic stability, the molecules were measured in a human liver microsome (HLM) assay. Dataset 5 contained area under the concentration-vs-time curve (AUC) values from rats for 182 compounds, representing the overall bioavailability of the compound. After the compound was administered to rats, blood samples were taken and the concentration of the test compound in the blood was measured.
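
The text does not state how the AUC was computed from the sampled blood concentrations. A minimal sketch, assuming the trapezoidal rule that is standard in pharmacokinetics, with hypothetical sampling times and concentrations:

    # Sketch: AUC from sampled blood concentrations via the trapezoidal rule.
    # The sampling times and concentration values below are hypothetical.
    import numpy as np

    t = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 24.0])    # time after dosing [h]
    c = np.array([120., 310., 450., 380., 210., 90., 10.])  # concentration [nmol/l]
    auc = np.trapz(c, t)  # area under the concentration-vs-time curve [nmol/l * h]
    print(f"AUC = {auc:.0f} nmol/l * h")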

To create quantitative computer models for molecules, it is necessary to encode each molecule as a vector x. We decided to use the Skeleton Spheres descriptor [3], where one row of the matrix X represents one molecule. For each molecule in a data set, there is a single response value \( y_{i} \). In every data set, the molecules were highly similar in descriptor space, just as they were in chemical space.
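
The Skeleton Spheres descriptor [3] is not available in common open-source toolkits, so the sketch below illustrates the encoding step with an RDKit Morgan fingerprint as a stand-in descriptor; it builds a matrix X with one row per molecule, as described above:

    # Sketch: encoding molecules as rows of a descriptor matrix X.
    # An RDKit Morgan fingerprint serves as a stand-in for the Skeleton
    # Spheres descriptor [3], which is not publicly available.
    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import AllChem

    def encode(smiles_list, n_bits=1024, radius=2):
        """Return a matrix X with one descriptor vector per molecule."""
        X = np.zeros((len(smiles_list), n_bits), dtype=np.uint8)
        for i, smi in enumerate(smiles_list):
            mol = Chem.MolFromSmiles(smi)
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
            X[i, list(fp.GetOnBits())] = 1
        return X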

2.2 Machine Learning Techniques

Five modeling techniques were applied to construct regression models for the data sets: k-nearest-neighbor (KNN) regression, PLSR, PLSR with power transformation, random forest regression, and support vector machine (SVM) regression. All parameters of these machine learning models were optimized by an exhaustive search. The median model was used as a baseline; any successful machine learning model should be significantly better than this baseline. Almost as simple is the KNN regression model. Partial least squares regression (PLSR) is a multivariate linear regression technique [4] that requires the number of factors as its only input parameter. PLSR with power transformation includes a Box–Cox transformation; it is often used to model biological data, which are notoriously non-normally distributed [5]. For random forests, we used the implementation from Li [6]. The Java program library libsvm was used for the support vector machine regression [7].
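
As an illustration, a minimal sketch of the median baseline and the five techniques, using scikit-learn estimators as stand-ins for the implementations cited above (the original work used Li’s random forest [6] and the Java libsvm [7]); the hyperparameter values shown are placeholders for the exhaustive search:

    # Sketch: median baseline plus the five regression techniques,
    # with scikit-learn estimators standing in for the cited
    # implementations. Hyperparameters are placeholders.
    from sklearn.compose import TransformedTargetRegressor
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.dummy import DummyRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.preprocessing import PowerTransformer
    from sklearn.svm import SVR

    models = {
        "median baseline": DummyRegressor(strategy="median"),
        "KNN": KNeighborsRegressor(n_neighbors=5),
        "PLSR": PLSRegression(n_components=3),
        # PLSR on a Box-Cox-transformed response; Box-Cox requires y > 0.
        "PLSR + power transformation": TransformedTargetRegressor(
            regressor=PLSRegression(n_components=3),
            transformer=PowerTransformer(method="box-cox"),
        ),
        "random forest": RandomForestRegressor(n_estimators=500, random_state=0),
        "SVM regression": SVR(kernel="rbf", C=1.0, epsilon=0.1),
    }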

2.3 Successive Regression

To assess the predictive power of a machine learning tool in a drug discovery project, it is necessary to consider the point in time at which a compound was made. Therefore, we ordered all molecules in a dataset according to the point in time at which they were synthesized. A two-step process was implemented to ensure an unbiased estimate of the predictive power of a model. The first step was the selection of one meta-parameter set for every machine learning technique. The algorithm started with the first 20% of the molecule descriptors, \( X_{0,0.2} \), together with the measured response values \( y_{0,0.2} \), to determine the meta-parameters of the machine learning models via an exhaustive search. An eleven-fold Monte Carlo cross-validation was employed to split these data into training and validation sets [8], with 10% of the data left out as the validation set in each split. With this setup, the average error for every meta-parameter set was calculated, and for each machine learning technique t, the meta-parameter set \( M_{min,t} \) with the minimum average error was chosen. This meta-parameter set was used to construct a model from all data in \( X_{0,0.2} \), \( y_{0,0.2} \). In the second step, an independent test set was compiled from the next 10% of the data, \( X_{0.3} \), \( y_{0.3} \). The average prediction error of \( \widehat{y_{0.3}} \) gave an unbiased estimate for the model, because the machine learning algorithm \( M_{min,t,0.2} \) had not seen these data before prediction. Subsequently, step one was repeated, this time with the former test data added to the training data. The meta-parameters \( M_{min,t,0.3} \) were now determined with \( X_{0,0.3} \), \( y_{0,0.3} \), and the prediction was done for \( y_{0.4} \). This process was repeated eight times, up to a model built from \( X_{0,0.9} \), \( y_{0,0.9} \) with a prediction for \( y_{1.0} \). With this method, we assessed how the predictive power depends on the point in time at which the data were obtained in a drug discovery project. The 10% test set, next in time, provided an unbiased estimate of the model’s quality.
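
A minimal sketch of this protocol under stated assumptions: mean absolute error as the error measure, scikit-learn’s GridSearchCV for the exhaustive search, and ShuffleSplit for the Monte Carlo cross-validation (the text does not name the tools used):

    # Sketch of the successive-regression protocol. X, y must be sorted
    # chronologically by synthesis date. Meta-parameters are chosen on the
    # growing training window by 11-fold Monte Carlo cross-validation with
    # 10% left out per split; the next 10% of compounds form the unbiased
    # test set. The error measure (MAE) is an assumption.
    import numpy as np
    from sklearn.base import clone
    from sklearn.model_selection import GridSearchCV, ShuffleSplit

    def successive_regression(X, y, estimator, param_grid):
        n, errors = len(y), []
        for frac in np.linspace(0.2, 0.9, 8):   # training window: 20% ... 90%
            train_end = int(round(frac * n))
            test_end = min(int(round((frac + 0.1) * n)), n)
            mc_cv = ShuffleSplit(n_splits=11, test_size=0.1, random_state=0)
            search = GridSearchCV(clone(estimator), param_grid, cv=mc_cv,
                                  scoring="neg_mean_absolute_error")
            search.fit(X[:train_end], y[:train_end])   # exhaustive search
            # Predict the next 10% of compounds, unseen during model selection.
            y_hat = np.ravel(search.predict(X[train_end:test_end]))
            errors.append(np.mean(np.abs(y_hat - y[train_end:test_end])))
        return errors

For example, successive_regression(X, y, SVR(), {"C": [0.1, 1, 10]}) would return the eight chronological test errors for the SVM regression model.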

3 Results and Conclusions

For all five data sets, increasing portions of the chronologically sorted biological data were used as training data to build models that predicted the next 10% of the data. For the largest data set (Fig. 1), all ‘first-step models’ already had more predictive power than the median model. As the project developed over time, the variance in the EC50 values declined, as indicated by the decreasing median error.

Fig. 1. Results for the successive prediction of EC50 values for dataset 1. The x-axis specifies the fraction of the full dataset used as training data; the y-axis indicates the average error in nmol/l for the prediction of EC50.

For the PPB dataset, the first training set, \( X_{0,0.2} \) and \( y_{0,0.2} \), contained only 26 molecules. All ‘first-step models’ predicted the test data better than the median model. From \( X_{0,0.5} \) and \( y_{0,0.5} \) onward, all predictions were superior to the median model.

The successive prediction of intrinsic clearance was more successful than that of the MDCK I permeability dataset; the machine learning models outperformed the median model (data not shown). On the AUC (bioavailability) dataset 5, no machine learning technique performed better than the median model (data not shown).

Summary. All five datasets showed a high inner similarity in the Skeleton Spheres descriptor domain. No time-dependent learning curve was observed for the five biological datasets. The introduction of newly designed compounds increased the model error in several cases, even though the added compounds shared large parts of their molecular structure with the training set compounds.

Conclusions. The uncertainty in prediction is correlated with the underlying biological complexity of the modeled parameter. Meaningful models for most of the PPB test sets were created by all techniques. The MDCK I permeability (dataset 3) can be partially explained by diffusion, which is relatively easy to model; however, MDCK I cell membranes contain active transporter proteins, and whether a molecule will be a substrate for a transporter is hard to predict. The intrinsic clearance (dataset 4) depends on the activity of approximately 20 enzymes, which makes the modeling more challenging than activity prediction for a single target enzyme. The bioavailability measured as AUC (dataset 5) is the result of multiple processes in animals, including cell layer permeability and intrinsic clearance. Consequently, the bioavailability AUC model had the highest uncertainty, followed by the target protein bioactivity data and the intrinsic clearance model, while the most reliable models were created for the PPB and the permeability datasets. Naturally, all predictions include the uncertainty of the measurements of the response data. We presented time-series prediction results for four important measurements in drug discovery: PPB, MDCK I permeability, intrinsic clearance in human liver microsomes, and oral bioavailability as AUC in rats. The chronology-based prediction gives a good estimate of how well the results of biological tests can be modeled in drug discovery projects.