A constructive approach for discovering new drug leads: Using a kernel methodology for the inverse-QSAR problem
Abstract
Background
The inverse-QSAR problem seeks to find a new molecular descriptor from which one can recover the structure of a molecule that possesses a desired activity or property. Surprisingly, there are very few papers providing solutions to this problem. It is a difficult problem because the molecular descriptors involved with the inverse-QSAR algorithm must adequately address the forward QSAR problem for a given biological activity if the subsequent recovery phase is to be meaningful. In addition, one should be able to construct a feasible molecule from such a descriptor. The difficulty of recovering the molecule from its descriptor is the major limitation of most inverse-QSAR methods.
Results
In this paper, we describe the reversibility of our previously reported descriptor, the vector space model molecular descriptor (VSMMD) based on a vector space model that is suitable for kernel studies in QSAR modeling. Our inverse-QSAR approach can be described using five steps: (1) generate the VSMMD for the compounds in the training set; (2) map the VSMMD in the input space to the kernel feature space using an appropriate kernel function; (3) design or generate a new point in the kernel feature space using a kernel feature space algorithm; (4) map the feature space point back to the input space of descriptors using a preimage approximation algorithm; (5) build the molecular structure template using our VSMMD molecule recovery algorithm.
Conclusion
The empirical results reported in this paper show that our strategy of using kernel methodology for an inverse Quantitative Structure-Activity Relationship is sufficiently powerful to find a meaningful solution for practical problems.
Keywords
Feature Space; Input Space; Molecular Descriptor; Reproduce Kernel Hilbert Space; Outgoing Edge
Background
The structural conformation and physicochemical properties of both the ligand and its receptor site determine the level of binding affinity that is observed in such an interaction. If the structural properties of the receptor site are known (for example, there is crystallographic data) then techniques involving approximations of potential functions can be applied to estimate or at least compare binding affinities of various ligands [1]. When this information is sparse or not available, as is the case for many membrane proteins, it becomes necessary to estimate affinities using only the properties of the ligand. This ligand-based prediction strategy is often used in applications such as virtual screening of molecular databases in a drug discovery procedure.
In a more general setting we strive to establish the quantitative dependency between the molecular properties of a ligand and its binding affinity. To restate this goal using current terminology: we want to analyze the Quantitative Structure-Activity Relationship (QSAR) of the ligand with respect to this type of receptor. A common approach for a QSAR analysis is the use of a machine learning strategy that processes sample data to learn a function that will predict binding affinities. The input of such a function is a descriptor: a vector of molecular properties [2] that characterize the ligand. These vector entries may be physicochemical properties, for example, molecular weight, surface area, orbital energies, or they may describe topological indices that encode features such as the properties of individual atoms and bonds. Topological indices can be rapidly computed and have been validated by a variety of experiments investigating the correlation of structure and biological activities.
The inverse-QSAR problem seeks to find a new molecular descriptor from which one can recover the structure of a molecule that possesses a desired activity or property. Surprisingly, there are relatively few papers providing solutions to this problem [3]. Lewis [4] has investigated automated strategies for working with fragment-based QSAR-driven transforms that are applied to known molecules with the objective of providing new and promising drug leads. Brown et al. [5] have studied the inverse QSPR problem using multi-objective optimization. They used an interesting partial least squares (PLS) related algorithm in the design of novel chemical entities (NCEs) that satisfy a given property range or objective. The physical properties considered in their paper were mean molecular polarizability and aqueous solubility. A problem with the opposite emphasis has also been investigated: Masek et al. [6] considered sharing chemical information that is encoded in such a way as to prevent recovery of the original molecule (a strategy used in the facilitation of prediction software while protecting intellectual property rights).
In general, the inverse-QSAR problem is difficult because the molecular descriptors involved with the inverse-QSAR algorithm must adequately address the forward QSAR problem for a given biological activity if the subsequent recovery phase is to be meaningful. In addition, one should be able to construct a feasible molecule from such a descriptor. The difficulty of recovering the molecule from its descriptor is the major limitation of most inverse-QSAR methods.
Most of the proposed techniques are stochastic in nature [7, 8, 9]; however, a limited number of deterministic approaches have been developed, including the approach of Kier and Hall [10, 11, 12, 13] based on a count of paths and an approach based on signature descriptors (see Faulon et al. [14, 15, 16, 17]).
The key to an effective method lies in the use of a descriptor that facilitates the reconstruction of the corresponding molecular structure. Ideally, such a descriptor should be informative, have good correlative abilities in QSAR applications, and most importantly, be computationally efficient. A computationally efficient descriptor should have a low degeneracy, that is, it should lead to a limited number of solutions when a molecular recovery algorithm is applied.
Currently, kernel methods are popular tools in QSAR modeling and are used to predict attributes such as activity towards a therapeutic target, ADMET properties (absorption, distribution, metabolism, excretion, and toxic effects), and adverse drug reactions. Various kernel methods based on different molecular representations have been proposed for QSAR modelling [18]. They include the SMILES string kernel [19], graph kernels [20, 21, 22] and a pharmacophore kernel [23]. However, none of these kernel methods has been used for the inverse-QSAR problem.
In this paper, we investigate the reversibility of our previously reported descriptor, the vector space model molecular descriptor (VSMMD) [24]. VSMMD is based on a vector space model that is suitable for kernel studies in QSAR modeling. Our approach to the inverse-QSAR problem consists of first deriving a new image point in the kernel feature space and then finding the corresponding preimage descriptor in the input space. Then, we use a recovery algorithm to generate a chemical structure template to be used in the specification of new drug candidates. Template formats will vary with respect to their specificity. Depending on the nature of the recovery process, a template may specify a unique molecule or a family of molecules. In the latter case, molecules meeting the specification could be obtained by means of high throughput screening. In the "Methods" section, we provide a detailed description of our inverse-QSAR approach using our VSMMD approach. In the "Results" section, we present the experimental results of our descriptors in the vector space setting.
Methods
Vector Space Model Molecular Descriptor (VSMMD)
The vector space model molecular descriptor (VSMMD) [24] can be categorized as belonging to the constitutional descriptors that provide feature counts related to the two dimensional structure of a molecule. Functionally, our descriptor bears a resemblance to various other descriptors such as reduced graphs (see Gillet [25]) and circular fragments (Glenn et al. [26]). One may also see our fragment oriented approach as somewhat reminiscent of the path count strategy of Kier and Hall [10, 11, 12, 13]. In the setting of inverseQSAR, we used VSMMD because we were familiar with its capabilities and we had software applications to do the generation of the descriptors. The primary significance of the current study is not the formulation of a new and competitive descriptor but rather the development of strategies to facilitate the reverse engineering workflow that one needs to go from feature space point back to the specification of a new ligand.
The VSMMD strategy is based on the extraction of molecular fragments that are comprised of small sets of bonded atoms. The atom count for a fragment is at least two and at most c, where c is some prespecified value such as 2, 3, or 4.
To illustrate, we describe the steps taken in processing a molecule when the atom count for a fragment is limited to 2. Figure 3 shows one of the pyrrole compounds that is a member of the data set used in reference [27].
1. The atoms and bonds are labelled as prescribed by Figure 2.
2. The molecular descriptor is created by extracting from the molecule a complete set of small fragments.
3. Frequency counts are evaluated for all the fragments so that a multiset or bag of fragments can be generated.
When these steps are completed, the multiset counts are placed into a vector that has a position for each of the different possible fragments recorded in the dictionary. Note that all molecules end up with descriptors of the same length. If fragment size is limited to 2 atoms this vector would have a dimensionality of 7*7*3 = 147.
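The three steps above can be sketched for the two-atom case as follows. This is only an illustration under stated assumptions: the atom and bond labels are hypothetical placeholders (the actual labelling scheme is the one prescribed by Figure 2), and each bond is read in the order in which it is listed, which may differ from the convention used in the original software.

```python
from collections import Counter
from itertools import product

# Hypothetical label sets: 7 atom types and 3 bond types, chosen only to
# match the 7*7*3 = 147 dimensionality quoted in the text.
ATOM_TYPES = ["t1", "t2", "t3", "t4", "t5", "t6", "t7"]
BOND_TYPES = ["single", "double", "aromatic"]

# The dictionary has one position per (atom type, bond type, atom type) triple.
DICTIONARY = list(product(ATOM_TYPES, BOND_TYPES, ATOM_TYPES))
INDEX = {fragment: i for i, fragment in enumerate(DICTIONARY)}


def vsmmd(atom_labels, bonds):
    """Build a fixed-length fragment-count vector for one molecule.

    atom_labels: list of atom-type labels, one per atom.
    bonds: list of (i, j, bond_type) tuples with atom indices i and j.
    """
    bag = Counter()
    for i, j, bond_type in bonds:
        # Assumption: the fragment is read from atom i to atom j as listed.
        bag[(atom_labels[i], bond_type, atom_labels[j])] += 1
    vector = [0] * len(DICTIONARY)
    for fragment, count in bag.items():
        vector[INDEX[fragment]] = count
    return vector
```

Every molecule processed this way yields a vector of the same length, so the vectors can be compared position by position.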
VSM and VSMMD compared
The motivation for the VSMMD descriptor comes from the "bag-of-words" approach [28] that is based on the vector space model (VSM) used in information retrieval. Roughly speaking, the atom fragments extracted in the VSMMD process correspond to the document words and phrases extracted in the VSM. The molecule is then analogous to a text document and much of the analysis used in the bag-of-words strategy can be brought over to the VSMMD setting.
In practice, we have found that the success of VSMMD is greatly enhanced by utilizing fragments that contain more than two atoms, for example, c = 3 or c = 4. This adoption of a higher level of structure corresponds to the incorporation of phrase structures in the VSM. From a molecular perspective, the larger fragment will incorporate information related to rotation around a single bond. As the value of c increases there is a point of diminished returns due to the combinatorial explosion of fragment possibilities. To help reduce this "curse of dimensionality", we can remove the vector entries that have been observed to have frequency counts equal to zero across all molecules under consideration.
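The pruning of unused vector entries described above can be sketched as follows. The function name and the list-of-lists representation are assumptions made for this illustration.

```python
def drop_empty_columns(vectors):
    """Remove positions whose count is zero in every descriptor.

    vectors: list of equal-length count vectors, one per molecule.
    Returns (reduced_vectors, kept_indices) so that the reduced positions
    can still be traced back to the original dictionary entries.
    """
    if not vectors:
        return [], []
    width = len(vectors[0])
    # Keep a column if at least one molecule has a nonzero count there.
    kept = [j for j in range(width) if any(v[j] != 0 for v in vectors)]
    reduced = [[v[j] for j in kept] for v in vectors]
    return reduced, kept
```

Returning the kept indices matters for the later recovery step, where each remaining vector entry must still be interpretable as a specific fragment.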
General format of the fragment dictionary (A_{T} = Atom Type, B_{T} = Bond Type).
Type ID | General Fragment Type | Atom Count
1 | A_{T} B_{T} A_{T} | 2
2 | A_{T} B_{T} A_{T} B_{T} A_{T} | 3
3 | A_{T} B_{T} A_{T} B_{T} A_{T} B_{T} A_{T} | 4
4 | A_{T} B_{T} A_{T} B_{T} A_{T}, with a branch B_{T} A_{T} attached to the central atom | 4
In the most general case, a molecular descriptor is represented by a bag of fragments, each fragment corresponding to an entry in the dictionary.
Analysis based on VSMMD
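Each molecular descriptor d_{i} is mapped to its vector of fragment frequencies, ɸ_{L}(d_{i}) = (q(f_{1}, d_{i}), q(f_{2}, d_{i}), ⋯, q(f_{m}, d_{i})), where q(f_{l}, d_{i}) is the frequency of the fragment f_{l} in the descriptor d_{i}.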
The use of the linear kernel ɸ_{L}(·) in this last equation is deliberate since we want to view this mapping as the type of kernel function that is used in the bag-of-words strategy described by Shawe-Taylor and Cristianini [28]. Via this mapping, each molecular descriptor is taken over to an m-dimensional vector, where m is the size of the dictionary. Although m could be very large, the typical vector generated in this way is usually quite sparse (just as vectors in the VSM are sparse).
The Gaussian vector space kernel of equation (5) gave the best results. It provides a rich hypothesis space and is often used in machine learning studies. One parameter, σ, had to be evaluated through cross-validation.
With the Gaussian vector space kernel, we can apply a kernel-based method to generate a predictor of any one of several biological activities, for example: ADMET properties or affinity of ligands used as therapeutic agents. In our previous studies [24], we gave empirical evidence to demonstrate that VSMMD can capture important chemical information allowing it to perform very well in both forward-QSAR studies and the determination of binding mode information.
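A minimal sketch of a Gaussian kernel over fragment-count vectors is given below. The parameterization exp(-||x - y||^2 / (2σ^2)) is a common textbook form and is an assumption here; the exact scaling used in the paper's equation may differ.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Gaussian kernel between two equal-length count vectors.

    Assumed form: exp(-||x - y||^2 / (2 * sigma^2)).
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def kernel_matrix(descriptors, sigma):
    """Matrix K with K[i][j] = k(d_i, d_j) over all descriptors."""
    return [[gaussian_kernel(di, dj, sigma) for dj in descriptors]
            for di in descriptors]
```

In practice σ would be chosen by cross-validation, as noted above; the kernel matrix is then the only view of the data that a kernel-based learner needs.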
The Reproducing Kernel Hilbert Space (RKHS)
Kernel-based learning algorithms work by embedding the data into a Hilbert space, often called the feature space, followed by a search for linear relations within this Hilbert space. Kernels are functions that can be used to formulate similarity comparisons. They provide a general framework to represent data, subject to certain mathematical conditions. Data are not represented individually by kernels. Instead, data are represented through a set of pairwise comparisons.
By using a kernel function, the embedding in the Hilbert space is actually performed implicitly, that is by specifying the inner product between each pair of points rather than by giving their coordinates explicitly. This approach has several advantages, the most important being the fact that often the inner product in the embedding space can be computed much more easily using a kernel rather than using the coordinates of the points themselves.
In machine learning, using kernels is a strategy for converting a linear classifier algorithm into a nonlinear one by using a nonlinear function to map the original observations into a higher-dimensional feature space; this makes a linear classification in the new feature space equivalent to a nonlinear classification in the original input space.
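The implicit embedding can be illustrated with a kernel whose feature map is small enough to write out. The homogeneous degree-2 polynomial kernel is used here purely as a textbook example; it is not the kernel used in this study.

```python
def poly2_kernel(x, y):
    """Degree-2 polynomial kernel: the squared inner product (x . y)^2."""
    return sum(a * b for a, b in zip(x, y)) ** 2


def poly2_features(x):
    """Explicit feature map: all ordered products x_i * x_j."""
    return [xi * xj for xi in x for xj in x]


def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))
```

For any x and y, `poly2_kernel(x, y)` equals `inner_product(poly2_features(x), poly2_features(y))`, but the kernel needs only the original coordinates while the explicit map grows quadratically with the input dimension.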
For more details about the RKHS, readers can refer to [28].
Designing Molecules Using a Feature Space Algorithm
Suppose we have a set of n molecular descriptors S, designated as S = {d_{1}, d_{2}, ⋯, d_{n}}, where each d_{i} is in the input space X. Let us assume we are using a Gaussian vector space kernel as defined in equation (5). Under this kernel, any point d_{i} ∈ X is implicitly mapped to an image ɸ(d_{i}) in the feature space FS. With this kernel mapping, we can define the set ɸ(S) = {ɸ(d_{1}), ɸ(d_{2}), ⋯, ɸ(d_{n})} ⊂ FS.
In this subsection, we will evaluate various properties of the data set ɸ(S). We provide a set of elementary algorithms to do various calculations such as distance between two descriptor image points in the feature space.
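As an example of such an elementary calculation, the distance between two image points can be obtained from kernel evaluations alone, using the standard identity ||ɸ(x) - ɸ(y)||^2 = k(x, x) - 2k(x, y) + k(y, y). The sketch below assumes the kernel is supplied as a Python function; the Gaussian form shown is one possible choice.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Assumed Gaussian form: exp(-||x - y||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def feature_space_distance(x, y, kernel):
    """Distance between phi(x) and phi(y) without computing phi explicitly."""
    squared = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(squared, 0.0))
```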
The feature space centroid derived from highly active compounds
The inverse in the input space is called the preimage. We will discuss pertinent details in the preimage problem subsection to follow.
There are several studies that use a feature space centroid to generate new data. Kwok and his colleagues used a feature space centroid to generate a new data point for handwritten digit recognition [29] and speech processing [30]. In both applications, the preimage has been shown to be robust and meaningful. In the next subsection we describe another strategy for the derivation of a new feature space point.
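The centroid of the image points is never computed explicitly, but distances to it can still be evaluated through the kernel. The sketch below uses the standard expansion ||ɸ(x) - μ||^2 = k(x, x) - (2/n) Σ_i k(x, d_i) + (1/n^2) Σ_i Σ_j k(d_i, d_j); the function name and argument layout are assumptions for this illustration.

```python
def squared_distance_to_centroid(x, points, kernel):
    """Squared feature-space distance from phi(x) to the centroid of phi(points)."""
    n = len(points)
    # Cross term between phi(x) and the centroid.
    cross = sum(kernel(x, d) for d in points) / n
    # Squared norm of the centroid itself.
    centroid_norm = sum(kernel(di, dj) for di in points for dj in points) / (n * n)
    return kernel(x, x) - 2.0 * cross + centroid_norm
```

With a linear kernel this reduces to the ordinary squared Euclidean distance to the mean, which gives a simple sanity check.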
Minimum enclosing and maximum excluding hyperspheres
In the last subsection, we derived a new descriptor image point using the centroid of feature space images derived from the highly active compounds. In this subsection, we use the highly active compounds to derive two hyperspheres with the same center. The center of the hyperspheres is then mapped back from the feature space to the input space to generate the descriptor of a new candidate molecule.
where β is the Lagrange multiplier associated with the constraint Δr^{2} ≥ 0, and C is some constant to be determined with a validation data set. The center a of the hyperspheres is then mapped back from the feature space to the input space to generate the descriptor of a new candidate molecule.
The Preimage Problem
However, the problem of finding ɸ^{-1}(·) is typically an ill-posed problem. A problem is said to be ill-posed if the solution is not unique, does not exist, or is not a continuous function of the data [32].
Using equation (25), the inversion problem becomes an optimization problem. There are several algorithms that attempt to solve this optimization problem. Schölkopf et al. [33] proposed an iterative fixed-point algorithm. Kwok and Tsang [29] proposed another method that exploits the correspondence between distances in the input space and the feature space. Alternatively, a standard gradient optimization method can be used to find an approximation of the preimage [33]. Note that all these methods are only guaranteed to find a local optimum.
where I is a p × p identity matrix, and 1 is a p dimensional vector with each component equal to 1.
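To make the fixed-point idea concrete, the sketch below implements the well-known update for a Gaussian kernel, z ← Σ_i γ_i k(z, x_i) x_i / Σ_i γ_i k(z, x_i), where the target feature space point is Σ_i γ_i ɸ(x_i). It is a generic illustration of the fixed-point approach, not the preimage routine used in this study; the kernel scaling, the starting point, and the stopping rule are assumptions.

```python
import math


def fixed_point_preimage(points, gammas, sigma, start=None, iters=100, tol=1e-10):
    """Approximate the preimage of sum_i gammas[i] * phi(points[i]).

    Assumes a Gaussian kernel exp(-||x - y||^2 / (2 * sigma^2)) and that the
    weights do not sum to zero when no starting point is supplied.
    """
    dim = len(points[0])
    if start is None:
        total = sum(gammas)
        # Assumed initial guess: the weighted mean of the training points.
        z = [sum(g * p[k] for g, p in zip(gammas, points)) / total
             for k in range(dim)]
    else:
        z = list(start)
    for _ in range(iters):
        weights = [g * math.exp(-sum((a - b) ** 2 for a, b in zip(z, p))
                                / (2.0 * sigma ** 2))
                   for g, p in zip(gammas, points)]
        denom = sum(weights)
        if abs(denom) < 1e-12:
            break  # The update is undefined; keep the current estimate.
        new_z = [sum(w * p[k] for w, p in zip(weights, points)) / denom
                 for k in range(dim)]
        shift = max(abs(a - b) for a, b in zip(new_z, z))
        z = new_z
        if shift < tol:
            break
    return z
```

For a feature space centroid, each weight is 1/n. As noted above, such an iteration is only guaranteed to reach a local optimum, so the result can depend on the starting point.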
The Need for a Nonlinear Kernel
The nonlinear implicit mapping provided by the kernel operation allows us to generate an inner product in the feature space by computing a kernel function that has arguments taken from the input space. More significantly, when a nonlinear kernel is used, linear operations in the feature space correspond to nonlinear operations in the input space. This is important because the nonlinear mapping will involve various cross products of components within a vector of the input space. As a consequence, linear structures within the feature space correspond to nonlinear or "warped" structures in the input space.
To illustrate this, we ran a small experiment with the COX2 training set (described below in the Results section). As described earlier, the new feature space point a, generated by extracting the center of the enclosing hypersphere, was mapped back to the input space to get its preimage. We then formed the set of points S_{fe} in the input space (taken from the training set) that produced the ten closest neighbours of a under the kernel mapping. This set S_{fe} was compared with the set S_{in} containing the ten closest neighbours of the preimage in the input space (these neighbours were derived using the Euclidean metric). Because of the warping effect, preimages of close neighbours in the feature space are not necessarily the closest neighbours of the preimage in the input space; in fact, the intersection of S_{in} and S_{fe} contains only 3 descriptors. More significantly, the average affinity of molecules in S_{in} is 8.03 while the average affinity of molecules in S_{fe} is 8.73. This provides empirical evidence that the nonlinear mapping provided by the kernel function helps us to select input space descriptors that are more significant when considering their corresponding affinities.
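The comparison of the two neighbour sets can be sketched as follows, assuming the feature space point is given as a weighted combination a = Σ_i α_i ɸ(d_i) of training images and that a Gaussian kernel is used. The function names, the kernel scaling, and the toy data in the usage note are all assumptions for this illustration; they do not reproduce the COX2 experiment.

```python
import math


def gaussian_kernel(x, y, sigma):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def feature_space_neighbours(train, alphas, sigma, k):
    """Indices of the k training points whose images lie closest to a."""
    def score(x):
        # ||phi(x) - a||^2 minus the constant ||a||^2, which does not
        # change the ranking of the training points.
        cross = sum(al * gaussian_kernel(x, d, sigma)
                    for al, d in zip(alphas, train))
        return gaussian_kernel(x, x, sigma) - 2.0 * cross
    return sorted(range(len(train)), key=lambda i: score(train[i]))[:k]


def input_space_neighbours(train, z, k):
    """Indices of the k training points closest to z in the Euclidean metric."""
    def squared_distance(x):
        return sum((a - b) ** 2 for a, b in zip(x, z))
    return sorted(range(len(train)), key=lambda i: squared_distance(train[i]))[:k]
```

The overlap of the two sets is then `set(feature_space_neighbours(...)) & set(input_space_neighbours(...))`, which is the quantity compared in the experiment described above.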
Recovering the Molecule
In order to solve the recovery problem for chemical structures, we have to investigate a way to derive a graph representing the 2D structure of a molecule that has the preimage as its descriptor. There are several related studies that attempt to find such a graph; see, for example, Bakir et al. [34], who worked with a stochastic search algorithm. However, this recovery problem is not well studied from a computational viewpoint. Tatsuya and Fukagawa [35, 36] proposed a dynamic programming algorithm for inferring a chemical structure from a descriptor. However, the algorithms are not practical for a large data set. Previous studies focused on creating a real chemical structure, which imposed too many constraints on the problem due to the complexity of chemical structure. In our study, we do not attempt to recover a real chemical structure; instead, we generate a chemical structure template with physicochemical properties only. This simplifies the problem and makes it practical for real data sets.
Reversible VSMMD
In the next subsection, we define the notion of a structure template and show how it can be derived.
Forming the De Bruijn graph
If we consider a molecule to be comprised of molecular fragments then it is clear that there is a hierarchical organization of these fragments. A linear fragment with an atom count of three can be seen as containing two smaller fragments each with an atom count of two and of course the two fragments overlap in the central atom. If we restrict a fragment to have an atom count of two, then it will contain two elementary fragments, namely two atoms, each labelled with their atom types.
Informally: The purpose of a De Bruijn graph is to provide a data structure that shows how small fragments combine to build larger fragments. Since we wish to handle ring structures using the simplification of a "super atom", we will abuse these concepts slightly and consider the fragments under discussion to be fragments within a template as described in the previous subsection.
Suppose we are dealing with fragments that have an atom count designated as f_{a}. The De Bruijn graph D is constructed in the following way: We provide a vertex for each fragment that has an atom count equal to f_{a} - 1. In our simplified case, f_{a} = 2 and each vertex will represent an atom labelled with a physicochemical property. We then add a bidirectional edge from vertex a to vertex b if the fragments associated with these vertices are within a larger fragment with atom count equal to f_{a}. Each edge is weighted with a value representing the number of times that this larger type of fragment occurs in the template.
From the VSMMD, we know the exact number of vertices that should appear in the chemical structure template. With this information, we can derive a chemical structure template from the De Bruijn graph by finding an Euler circuit of D.
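For the simplified case f_{a} = 2, the construction can be sketched as below. The input is assumed to be a mapping from two-atom fragments, written as (atom label, bond label, atom label), to their counts in the descriptor; this representation and the example labels in the usage note are hypothetical.

```python
from collections import Counter


def de_bruijn_graph(fragment_counts):
    """Build the vertex set and weighted bidirectional edges for f_a = 2.

    fragment_counts: dict mapping (atom_a, bond, atom_b) to a count.
    Returns (vertices, edges) where edges maps (atom_u, bond, atom_v),
    with atom_u <= atom_v, to the number of occurrences of that fragment.
    """
    vertices = set()
    edges = Counter()
    for (atom_a, bond, atom_b), count in fragment_counts.items():
        if count <= 0:
            continue  # Fragments absent from the descriptor add nothing.
        vertices.update((atom_a, atom_b))
        # A bidirectional edge: both reading directions share one entry.
        atom_u, atom_v = sorted((atom_a, atom_b))
        edges[(atom_u, bond, atom_v)] += count
    return vertices, edges
```

For instance, counts of 2 for ("C", "single", "N") and 1 for ("N", "single", "C") give a single C to N edge of weight 3, and each traversal of that edge in a circuit consumes one fragment instance.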
All possible Euler circuits
An Euler circuit is a circuit on the graph such that each edge is traversed exactly once. Each traversal of an edge corresponds to the consumption of one instance of a component in the VSMMD. The problem of finding an Eulerian circuit of a graph is well known and there exists a linear time algorithm for its derivation [37]. The following is the pseudocode of the Euler circuit algorithm.
EULER(q)
  Path ← empty
  for each unmarked edge e leaving q do
    Mark(e)
    Path ← e · EULER(opposite vertex(e)) · Path
  return Path
In order to generate all possible chemical structures associated with the VSMMD, we have to find all Euler circuits. Different chemical structure templates correspond to the different possible orderings when traversing outgoing edges of each vertex. This produces a factorial explosion with respect to the number of outgoing edges of each vertex. Thus, finding all Euler circuits is not feasible.
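A single Euler circuit can be found in linear time with Hierholzer's algorithm, which is shown here in an iterative form rather than the recursive form of the pseudocode. The sketch assumes an undirected multigraph given as a list of vertex pairs, and it assumes that an Euler circuit exists (the graph is connected and every vertex has even degree); it does not enumerate all circuits.

```python
from collections import defaultdict


def euler_circuit(edges, start):
    """Return one Euler circuit as a list of vertices, beginning at start.

    edges: list of (u, v) pairs; repeated pairs represent parallel edges.
    """
    adjacency = defaultdict(list)
    for index, (u, v) in enumerate(edges):
        adjacency[u].append((v, index))
        adjacency[v].append((u, index))
    used = [False] * len(edges)
    stack = [start]
    circuit = []
    while stack:
        u = stack[-1]
        # Discard edges already traversed from the other endpoint.
        while adjacency[u] and used[adjacency[u][-1][1]]:
            adjacency[u].pop()
        if adjacency[u]:
            v, index = adjacency[u].pop()
            used[index] = True
            stack.append(v)
        else:
            # No unmarked edge leaves u, so u is final in the circuit.
            circuit.append(stack.pop())
    return circuit[::-1]
```

The order in which outgoing edges are tried determines which of the many possible circuits is returned, which is exactly the freedom that leads to the factorial explosion mentioned above.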
Probabilistic Euler paths
To overcome this, we have developed an algorithm that generates Eulerian circuits by doing a guided walk of the graph. During the walk we choose an outgoing edge in a probabilistic fashion. The choice is dependent on statistics that are gathered from the descriptors in the training set. To accomplish this, we have built a statistical model that is used to estimate the probability of an Euler circuit.
To estimate the conditional probabilities P(e_{i} | e_{1}, e_{2}, ⋯, e_{i-1}), we need training data consisting of a large number of Euler circuits each corresponding to some particular molecular template. One can obtain these conditional probability distributions from the training data by keeping statistics on the dependency between the next edge to traverse and the history of the previously traversed edges. Seen as probabilities of traversal, we of course use normalized values so that the probabilities of all possible "next-edges" sum to 1.0.
where c(e_{i-t}, ⋯, e_{i-1}, e_{i}) is the number of times the edge sequence e_{i-t}, ⋯, e_{i-1}, e_{i} is seen in the training data.
A threshold h can also be used as a cutoff to limit the number of edges that the algorithm should examine in an effort to sidestep the factorial explosion that can occur without this limitation. With this understanding, we can compute the overall probability of each Euler circuit using equation (34).
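The counting and normalization described above can be sketched as a simple history-based model. Edges are assumed to be hashable labels, the history length t is a parameter, and unseen histories fall back to a uniform choice; these details, along with the function names, are assumptions for this illustration rather than the paper's exact estimator.

```python
from collections import Counter, defaultdict


def train_edge_model(circuits, t):
    """Count how often each edge follows each history of up to t edges.

    circuits: list of edge sequences taken from training templates.
    Returns counts[history][edge], with history stored as a tuple.
    """
    counts = defaultdict(Counter)
    for circuit in circuits:
        for i, edge in enumerate(circuit):
            history = tuple(circuit[max(0, i - t):i])
            counts[history][edge] += 1
    return counts


def next_edge_probabilities(counts, history, candidates, t):
    """Normalized probabilities over the candidate next edges."""
    key = tuple(history[max(0, len(history) - t):])
    seen = counts.get(key, Counter())
    total = sum(seen[e] for e in candidates)
    if total == 0:
        # Assumed fallback: unseen history, so all candidates are equally likely.
        return {e: 1.0 / len(candidates) for e in candidates}
    return {e: seen[e] / total for e in candidates}
```

During a guided walk, `candidates` would be the unmarked edges leaving the current vertex, optionally truncated to the h most probable ones.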
There are several related research papers that attempt to retrieve the order of elements that are part of a larger structure using Eulerian circuits. Cortes et al. [38] retrieve the order of words in documents using an Eulerian circuit approach. Pevzner et al. [39] assemble DNA fragments using an Eulerian circuit.
As mentioned in [29], in general there is no exact preimage in the input space; the preimage returned by the algorithm is an approximation and so it is compromised by approximation errors. Because of these approximation errors, the following problems may exist:

The preimage vector may consist of noninteger components.

The preimage vector may not form a fully connected De Bruijn Graph.
Our solution to overcome the first problem is to round the components to obtain integer counts. To deal with the case where the graph is not connected, the all-possible Euler circuits algorithm is called at each vertex whose outgoing edges are not all marked. The resulting path is the concatenation of the paths returned by different calls to the all-possible Euler algorithm. More precisely, a bidirectional edge with the largest conditional probability based on previously traversed edges in the path (using the same Markov model that we set up in the previous subsection) is added to connect two Euler paths together.
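The two repairs can be sketched as a rounding step followed by a connectivity check. Rounding halves upward and clipping negative values to zero are assumptions made for this illustration, as is the representation of edges as vertex pairs.

```python
import math


def round_preimage(vector):
    """Round each component to a nonnegative integer count."""
    # Assumption: halves round up and negative values are clipped to zero.
    return [max(0, math.floor(x + 0.5)) for x in vector]


def connected_components(vertices, edges):
    """Group vertices into connected components of an undirected graph."""
    neighbours = {v: set() for v in vertices}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    components = []
    unvisited = set(vertices)
    while unvisited:
        frontier = [unvisited.pop()]
        component = set(frontier)
        while frontier:
            current = frontier.pop()
            for nxt in neighbours[current]:
                if nxt in unvisited:
                    unvisited.remove(nxt)
                    component.add(nxt)
                    frontier.append(nxt)
        components.append(component)
    return components
```

If more than one component is returned, a path is generated for each component and the paths are then joined by an added connecting edge, as described above.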
Results
Data
In our previous work [24], eight different data sets were used to test the ability of the VSMMD to predict biological activities. All these data sets contain real valued QSAR inhibitor data. The eight QSAR data sets are from Sutherland et al. [27]. We chose one data set from these eight data sets to demonstrate the effectiveness of the recovery algorithm when applied to our VSMMD.
The same inverse-QSAR procedure was applied to the remaining seven data sets; the closest matching molecule in the test set for the generated chemical template is shown in the last subsection.
In our experiments, the data were separated into the same training and testing sets as specified by Sutherland et al. [27].
Implementation Details
To identify the physicochemical properties of each atom, we implemented our descriptor generation program with help from the Chemistry Development Kit (CDK) [42, 43], programmed in Java. As illustrated in Figure 3, descriptors were calculated and the kernel matrix K was generated in a few seconds for each complete data set. For the preimage algorithm and the feature space algorithm, we used MATLAB to perform the required calculations. For the recovery phase, we implemented the Probabilistic Euler Paths algorithm in Java.
Verification of the Inverse Mapping – Test Result
Inverse-QSAR Test Results
Recall that our inverse-QSAR approach contains five steps. The first two steps are to perform QSAR analysis. In the first step, we generated the VSMMD for the compounds in the training set. Then, in the second step, we implicitly mapped the VSMMD to the kernel feature space using an appropriate kernel function for classification. The results of the forward QSAR can be found in our previous paper [24].
The third step was to design or to generate a new point in the kernel feature space using a kernel feature space algorithm. To demonstrate our approach, we formed a new point in the feature space by using the ten most active compounds in the training set. The center of the minimum enclosing and maximum excluding hypersphere was obtained using equation (19). Figure 17 shows these ten compounds.
Notes on Test Results
In order to investigate whether the preimage VSMMD is reasonable, we performed a more detailed analysis of the COX2 data set. In the preimage VSMMD as shown in Figure 20, the cyclopentene ring can be found in one of the descriptors. From medicinal chemistry studies, we know that cyclopentene derivatives were among the first series of diaryl-substituted cycles to become well known as COX2 inhibitors [44, 45]. This empirical evidence demonstrates that our generated preimage VSMMD is able to capture important properties of the ten most active molecules.
When we performed a high throughput screening on the test set using the generated chemical structure template, the following molecule was identified as an exact match to the template.
Discussion
In kernel-based learning, the usual assumption is that the data pairs (x_{i}, y_{i}) in the training set come from a source that provides these samples in an independent identically distributed (i.i.d.) fashion according to an unknown probability distribution P(x, y). Furthermore, the test examples are assumed to come from the same distribution [46]. In an ideal situation, the collection of molecular descriptors in the training set follows a probability distribution that is only determined by the interactions between ligands and the binding site. In practice this does not happen. The selection of members in the training set may involve a significant amount of bias due to human involvement in its creation:

Selection of members of the training set may be restricted by rules that exclude molecules that are not "drug like".

Since the training set involves molecules that have been assessed for binding affinity, they had to be synthesized and may be part of a suite of molecules for which the synthesis was not overly complicated.

Furthermore, the molecular descriptors in the training set may show various types of repetition, (for example, the repeated occurrence of some type of scaffold). This may or may not be intended.
As a consequence of these issues, the learning algorithm will produce a predictor that is taking into account both a biological process and the human activity intrinsic to the formation of the training set. More significantly, there is the demand that future test molecules come from the same probability distribution. Statistical learning theory will guarantee certain generalization bounds, but only if these demands are met. In effect, the theory tells us that if test samples come from a source, such as a virtual screening library that is not characterized by the same rules of formation as the training set – then all bets are off.
In the constructive approach that has been described in this paper, it is clear that we are also limited by the information that is intrinsic to a training set. But beyond this, the strategy significantly differs from virtual screening. Instead of trying to find a new molecule in a database that should exhibit the same P(x, y) characteristics, we sidestep this requirement (which may be difficult to guarantee) and we build a new drug candidate using only the information that is strictly contained in the training set itself.
Conclusion
While molecular fragments have been used in research studies for dealing with quantitative structureactivity relationship problems, we have further evolved this strategy to include a reverse engineering mechanism.
1. the use of a kernel feature space algorithm to design or modify descriptor image points in a feature space,
2. the deployment of a preimage algorithm to map the descriptor image points in the feature space back to the input space of the descriptors, and
3. the design of a probabilistic strategy to convert new descriptors into meaningful chemical graph templates.
As reported in earlier papers, our modeling has produced very effective algorithms to predict drug-binding affinities and to predict multiple binding modes [24]. We have now extended our modeling approach to the development of algorithms that derive new descriptors and then facilitate the reverse engineering of such a descriptor. This is a very desirable capability for a molecular descriptor [3].
The most important aspect of our research is the presentation of strategies that actually generate the structure of a new drug candidate. This is substantially different from methodologies that depend on database screening to get new drug candidates. While our approach can support such an endeavour, it is not our primary goal. In fact, we are quite concerned that database screening, done using a predictor derived from a statistical learning algorithm, is subject to procedural demands that may be difficult to maintain. We are referring to statistical learning theory that guarantees the success of a predictor, but only when the test sample is drawn from a data source that has the same probability distribution as that characterizing the training set.
In applications of statistical learning to database screening, the predictor may be applied to test molecules that have very little relationship to the training data. In these cases, the predictor is optimistically treated as if it actually incorporates an algorithm that has some firm and direct relationship to the biological context of the problem. As mentioned by Good et al. [47], the conclusion of a QSAR analysis can be profoundly altered by how the test set was derived. Traditionally, this concern was addressed through the design of complicated and time-consuming validation experiments [48] to ensure that the predictor would not generate a misleading conclusion. In our approach, we have avoided such concerns. While the training set is still used to generate a new image point in the feature space, the reverse engineering just described allows us to develop a template for a new drug candidate that is independent of issues related to probability distribution constraints placed on test set molecules.
References
 1. Sharp KA: Potential functions for virtual screening and ligand binding calculations: Some theoretical considerations. Virtual Screening in Drug Discovery. Edited by: Alvarez J, Shoichet B. 2005, New York: Taylor & Francis, 229-248.
 2. Todeschini R, Consonni V: Handbook of molecular descriptors. 2000, Weinheim: Wiley-VCH.
 3. Faulon JL, Brown W, Martin S: Reverse engineering chemical structures from molecular descriptors: how many solutions?. J Comput-Aided Mol Des. 2005, 19: 637-650. 10.1007/s10822-005-9007-1.
 4. Lewis RA: A general method for exploiting QSAR models in lead optimization. J Med Chem. 2005, 48: 1638-1648. 10.1021/jm049228d.
 5. Brown N, McKay B, Gasteiger J: A novel workflow for the inverse QSPR problem using multi-objective optimization. J Comput-Aided Mol Des. 2006, 20: 333-341. 10.1007/s10822-006-9063-1.
 6. Masek BB, Shen L, Smith KM, Pearlman RS: Sharing chemical information without sharing chemical structure. J Chem Inf Model. 2008, 48: 256-261. 10.1021/ci600383v.
 7. Sheridan RP, Kearsley SK: Using a genetic algorithm to suggest combinatorial libraries. J Chem Inf Comput Sci. 1995, 35: 310-320.
 8. Venkatasubramanian V, Chan K, Caruthers JM: Evolutionary design of molecules with desired properties using the genetic algorithm. J Chem Inf Comput Sci. 1995, 35: 188-195.
 9. Kvasnicka V, Pospichal J: Simulated annealing construction of molecular graphs with required properties. J Chem Inf Comput Sci. 1996, 36: 516-526.
 10. Hall LH, Kier LB, Frazer JW: Design of molecules from quantitative structure-activity relationship models. 2. Derivation and proof of information-transfer relating equations. J Chem Inf Comput Sci. 1993, 33: 148-152.
 11. Hall LH, Dailey RS, Kier LB: Design of molecules from quantitative structure-activity relationship models. 3. Role of higher-order path counts: path 3. J Chem Inf Comput Sci. 1993, 33: 598-603.
 12. Kier LB, Hall LH, Frazer JW: Design of molecules from quantitative structure-activity relationship models. 1. Information-transfer between path and vertex degree counts. J Chem Inf Comput Sci. 1993, 33: 143-147.
 13. Skvortsova MI, Baskin II, Slovokhotova OL, Palyulin VA, Zefirov NS: Inverse problem in QSAR, QSPR studies for the case of topological indexes characterizing molecular shape (Kier indexes). J Chem Inf Comput Sci. 1993, 33: 630-634.
 14. Churchwell CJ, Rintoul MD, Martin S, Visco DP, Kotu A, Larson RS, Sillerud LO, Brown DC, Faulon JL: The signature molecular descriptor. 3. Inverse quantitative structure-activity relationship of ICAM-1 inhibitory peptides. J Mol Graphics Modell. 2004, 22: 263-273. 10.1016/j.jmgm.2003.10.002.
 15. Faulon JL, Visco DP, Pophale RS: The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies. J Chem Inf Comput Sci. 2003, 43: 707-720.
 16. Faulon JL, Churchwell CJ, Visco DP: The signature molecular descriptor. 2. Enumerating molecules from their extended valence sequences. J Chem Inf Comput Sci. 2003, 43: 721-734.
 17. Faulon JL, Collins MJ, Carr RD: The signature molecular descriptor. 4. Canonizing molecules using extended valence sequences. J Chem Inf Comput Sci. 2004, 44: 427-436.
 18. Azencott CA, Ksikes A, Swamidass SJ, Chen JH, Ralaivola L, Baldi P: One- to four-dimensional kernels for virtual screening and the prediction of physical, chemical, and biological properties. J Chem Inf Model. 2007, 47: 965-974. 10.1021/ci600397p.
 19. Swamidass SJ, Chen J, Bruand J, Phung P, Ralaivola L, Baldi P: Kernels for small molecules and the prediction of mutagenicity, toxicity and anticancer activity. Bioinformatics. 2005, 21 (supplement 1): i359-i368. 10.1093/bioinformatics/bti1055.
 20. Fröhlich H, Wegner JK, Sieker F, Zell A: Optimal assignment kernels for attributed molecular graphs. Bonn, Germany. 225-232.
 21. Mahe P, Ueda N, Akutsu T, Perret JL, Vert JP: Graph kernels for molecular structure-activity relationship analysis with support vector machines. J Chem Inf Model. 2005, 45: 939-951. 10.1021/ci050039t.
 22. Ralaivola L, Swamidass SJ, Saigo H, Baldi P: Graph kernels for chemical informatics. Neural Net. 2005, 18: 1093-1110. 10.1016/j.neunet.2005.07.009.
 23. Mahe P, Ralaivola L, Stoven V, Vert JP: The pharmacophore kernel for virtual screening with support vector machines. J Chem Inf Model. 2006, 46: 2003-2014. 10.1021/ci060138m.
 24. Burkowski FJ, Wong WWL: Predicting multiple binding modes in QSAR studies using a vector space model molecular descriptor in reproducing kernel Hilbert space. International Journal of Computational Biology and Drug Design. 2009.
 25. Gillet VJ, Willett P, Bradshaw J: Similarity searching using reduced graphs. J Chem Inf Comput Sci. 2003, 43: 338-345.
 26. Glenn RC, Bender A, Arnby CH, Carlsson L, Boyer S, Smith J: Circular fingerprints: Flexible molecular descriptors with applications from physical chemistry to ADME. IDrugs. 2006, 9: 199-204.
 27. Sutherland JJ, O'Brien LA, Weaver DF: A comparison of methods for modeling quantitative structure-activity relationships. J Med Chem. 2004, 47: 5541-5554. 10.1021/jm0497141.
 28. Shawe-Taylor J, Cristianini N: Kernel methods for pattern analysis. 2004, Cambridge, UK: Cambridge University Press.
 29. Kwok JTY, Tsang IWH: The preimage problem in kernel methods. IEEE Trans Neural Net. 2004, 15: 1517-1525. 10.1109/TNN.2004.837781.
 30. Mak B, Hsiao R, Ho S, Kwok JT: Embedded kernel eigenvoice speaker adaptation and its implication to reference speaker weighting. IEEE Trans Speech Audio Proc. 2006, 14: 1267-1280. 10.1109/TSA.2005.860836.
 31. Liu Y, Zheng YF: Minimum enclosing and maximum excluding machine for pattern description and discrimination. 18th International Conference on Pattern Recognition. 129-132.
 32. Mika S, Schölkopf B, Smola AJ, Müller KR, Scholz M, Rätsch G: Kernel PCA and denoising in feature spaces. Proceedings of the conference on Advances in neural information processing systems II. 1998, 536-542.
 33. Schölkopf B, Knirsch P, Smola AJ, Burges CJC: Fast approximation of support vector kernel expansions, and an interpretation of clustering as approximation in feature spaces. DAGM-Symposium. 125-132.
 34. Bakir GH, Zien A, Tsuda K: Learning to find graph preimages. Lecture Notes in Computer Science. 2004, 3175: 253-261.
 35. Tatsuya A, Fukagawa D: Inferring a graph from path frequency. Lecture Notes in Computer Science. 2005, 3537: 371-382.
 36. Tatsuya A, Fukagawa D: Inferring a chemical structure from a feature vector based on frequency of labelled paths and small fragments. APBC. 2007, 165-174.
 37. Robin JW: Introduction to Graph Theory. 1996, Addison Wesley.
 38. Cortes C, Mohri M, Weston J: A general regression technique for learning transductions. Proceedings of the 22nd international conference on Machine learning; Bonn, Germany. 153-160.
 39. Pevzner PA, Tang H, Waterman MS: An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA. 2001, 98: 9748-9753. 10.1073/pnas.171285098.
 40. Huang HC, Chamberlain TS, Seibert K, Koboldt CM, Isakson PC: Diaryl indenes and benzofurans – novel classes of potent and selective cyclooxygenase-2 inhibitors. Bioorg Med Chem Lett. 1995, 5: 2377-2380. 10.1016/0960-894X(95)00414-O.
 41. Chavatte P, Yous S, Marot C, Baurin N, Lesieur D: Three-dimensional quantitative structure-activity relationships of cyclooxygenase-2 (COX-2) inhibitors: A comparative molecular field analysis. J Med Chem. 2001, 44: 3223-3230. 10.1021/jm0101343.
 42. Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: The chemistry development kit (CDK): an open-source Java library for chemo- and bioinformatics. J Chem Inf Comput Sci. 2003, 43: 493-500.
 43. Steinbeck C, Hoppe C, Kuhn S, Floris M, Guha R, Willighagen EL: Recent developments of the chemistry development kit (CDK) – an open-source Java library for chemo- and bioinformatics. Curr Pharm Des. 2006, 12: 2111-2120. 10.2174/138161206777585274.
 44. Leval X, Delarge J, Somers F, Tullio P, Henrotin Y, Pirotte B, Dogne J: Recent advances in inducible cyclooxygenase (COX-2) inhibition. Curr Med Chem. 2000, 7: 1041-1062.
 45. Reitz DB, Li JJ, Norton MB, Reinhard EJ, Collins JT, Anderson GD, Gregory SA, Koboldt CM, Perkins WE: Selective cyclooxygenase inhibitors: Novel 1,2-diarylcyclopentenes are potent and orally active COX-2 inhibitors. J Med Chem. 1994, 37: 3878-3881. 10.1021/jm00049a005.
 46. Müller KR, Mika S, Rätsch G, Tsuda K, Schölkopf B: An introduction to kernel-based learning algorithms. IEEE Trans Neural Net. 2001, 12: 181-201. 10.1109/72.914517.
 47. Good AC, Hermsmeier MA: Measuring CAMD technique performance. 2. How "drug-like" are drugs? Implications of random test set selection exemplified using drug-likeness classification models. J Chem Inf Model. 2007, 47: 110-114. 10.1021/ci6003493.
 48. Good AC, Hermsmeier MA, Hindle SA: Measuring CAMD technique performance: a virtual screening case study in the design of validation experiments. J Comput-Aided Mol Des. 2005, 18: 529-536. 10.1007/s10822-004-4067-1.
Copyright information
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.