Molecular de-novo design through deep reinforcement learning
Abstract
Keywords
De novo design · Recurrent neural networks · Reinforcement learning
Abbreviations
DMPK: drug metabolism and pharmacokinetics
DRD2: dopamine receptor D2
QSAR: quantitative structure–activity relationship
RNN: recurrent neural network
RL: reinforcement learning
Log: natural logarithm
BPTT: backpropagation through time
A: sequence of tokens constituting a SMILES
Prior: an RNN trained on SMILES from ChEMBL, used as a starting point for the Agent
Agent: an RNN derived from a Prior, trained using reinforcement learning
J: Jaccard index
ECFP6: extended connectivity fingerprints with diameter 6
SVM: support vector machine
FCFP4: extended connectivity fingerprints with diameter 4 and feature invariants
Background
Drug discovery is often described using the metaphor of finding a needle in a haystack. In this case, the haystack comprises on the order of \(10^{60}{-}10^{100}\) synthetically feasible molecules [1], out of which we need to find a compound which satisfies a plethora of criteria such as bioactivity, drug metabolism and pharmacokinetic (DMPK) profile, synthetic accessibility, etc. The fraction of this space that we can synthesize and test at all—let alone efficiently—is negligible. By using algorithms to virtually design and assess molecules, de novo design offers ways to reduce the chemical space into something more manageable for the search of the needle.
Early de novo design algorithms [1] used structure based approaches to grow ligands to sterically and electronically fit the binding pocket of the target of interest [2, 3]. A limitation of these methods is that the molecules created often possess poor DMPK properties and can be synthetically intractable. In contrast, the ligand based approach is to generate a large virtual library of chemical structures, and search this chemical space using a scoring function that typically takes into account several properties such as DMPK profiles, synthetic accessibility, bioactivity, and query structure similarity [4, 5]. One way to create such a virtual library is to use known chemical reactions alongside a set of available chemical building blocks, resulting in a large number of synthetically accessible structures [6]; another possibility is to use transformational rules based on the expertise of medicinal chemists to design analogues to a query structure. For example, Besnard et al. [7] applied a transformation rule approach to the design of novel dopamine receptor type 2 (DRD2) receptor active compounds with specific polypharmacological profiles and appropriate DMPK profiles for a central nervous system indication. Although using either transformation or reaction rules can reliably and effectively generate novel structures, they are limited by the inherent rigidness and scope of the predefined rules and reactions.
A third approach, known as inverse Quantitative Structure Activity Relationship (inverse QSAR), tackles the problem from a different angle: rather than first generating a virtual library and then using a QSAR model to score and search this library, inverse QSAR aims to map a favourable region in terms of predicted activity to the corresponding molecular structures [8, 9, 10]. This is not a trivial problem: first the solutions of molecular descriptors corresponding to the region need to be resolved using the QSAR model, and these then need be mapped back to the corresponding molecular structures. The fact that the molecular descriptors chosen need to be suitable both for building a forward predictive QSAR model as well as for translation back to molecular structure is one of the major obstacles for this type of approach.
The Recurrent Neural Network (RNN) is commonly used as a generative model for data of a sequential nature, and has been used successfully for tasks such as natural language processing [11] and music generation [12]. Recently, there has been increasing interest in using this type of generative model for the de novo design of molecules [13, 14, 15]. By using a data-driven method that attempts to learn the underlying probability distribution over a large set of chemical structures, the search over the chemical space can be reduced to only molecules seen as reasonable, without introducing the rigidity of rule-based approaches. Segler et al. [13] demonstrated that an RNN trained on the canonicalized SMILES representation of molecules can learn both the syntax of the language and the distribution in chemical space. They also show how further training of the model using a focused set of actives towards a biological target can produce a fine-tuned model which generates a high fraction of predicted actives.
In two recent studies, reinforcement learning (RL) [16] was used to fine-tune pretrained RNNs. Yu et al. [15] use an adversarial network to estimate the expected return for state–action pairs sampled from the RNN; by increasing the likelihood of highly rated pairs, the generative network is improved for tasks such as poem generation. Jaques et al. [17] use Deep Q-learning to improve a pretrained generative RNN by introducing two ways to score the sequences generated: one is a measure of how well the sequences adhere to music theory, and one is the likelihood of sequences according to the initial pretrained RNN. Using this concept of prior likelihood, they reduce the risk of forgetting what was initially learnt by the RNN, compared to a reward based only on adherence to music theory. The authors demonstrate significant improvements over both the initial RNN and an RL-only approach. They later extend this method to several other tasks, including the generation of chemical structures, and optimize toward molecular properties such as cLogP [18] and QED drug-likeness [19]. However, they report that the method depends on a reward function incorporating handwritten rules to penalize undesirable types of sequences, and even then it can lead to exploitation of the reward, resulting in unrealistically simple molecules that are more likely to satisfy the optimization requirements than more complex structures [17].
In this study we propose a policy-based RL approach to tune RNNs for episodic tasks [16], in this case the task of generating molecules with given desirable properties. Through learning an augmented episodic likelihood, a composite of the prior likelihood [17] and a user-defined scoring function, the method aims to fine-tune an RNN pretrained on the ChEMBL database [20] towards generating desirable compounds. Compared to maximum likelihood estimation fine-tuning [13], this method can make use of negative as well as continuous scores, and may reduce the risk of catastrophic forgetting [21]. The method is applied to several different tasks of molecular de novo design, and we investigate how it affects the behaviour of the generative model on a mechanistic level.
Methods
Recurrent neural networks
Learning the data
BPTT involves gradients that, through the chain rule, contain terms which are multiplied by themselves many times, and this can lead to the phenomenon known as exploding and vanishing gradients: if these terms are not unity, the gradients quickly become either very large or very small. To combat this issue, Hochreiter and Schmidhuber introduced the Long Short-Term Memory cell [23], which, through a more controlled flow of information, can decide what information to keep and what to discard. The Gated Recurrent Unit is a simplified implementation of the Long Short-Term Memory architecture that achieves much of the same effect at a reduced computational cost [24].
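The gating idea can be illustrated with a minimal, pure-Python GRU cell with a scalar state. This is a toy sketch, not the multi-layer Tensorflow implementation used in this study, and the weight values below are hypothetical, chosen only to show the update gate carrying the state through time unchanged:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h_prev, x, w):
    """One GRU step with scalar state and input.

    w is a dict of scalar weights; real implementations use matrices.
    z: update gate -- how much of the candidate state replaces the old state.
    r: reset gate  -- how much of the old state feeds the candidate.
    """
    z = sigmoid(w["wz"] * x + w["uz"] * h_prev)
    r = sigmoid(w["wr"] * x + w["ur"] * h_prev)
    h_cand = math.tanh(w["wh"] * x + w["uh"] * (r * h_prev))
    return (1.0 - z) * h_prev + z * h_cand

# Hypothetical weights that keep the update gate nearly closed for
# positive inputs, so the state is carried through almost unchanged.
w = {"wz": -10.0, "uz": 0.0, "wr": 0.0, "ur": 0.0, "wh": 1.0, "uh": 1.0}
h = 0.5
for x in [1.0, 2.0, 3.0]:
    h = gru_step(h, x, w)
```

With the update gate driven towards zero, the new state stays approximately equal to the old one; this controlled flow of information is what mitigates the vanishing-gradient problem.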
Generating new samples
Tokenizing and onehot encoding SMILES
There are many different ways to represent a single molecule using SMILES. Algorithms that always represent a certain molecule with the same SMILES are referred to as canonicalization algorithms [26]. However, different implementations of the algorithms can still produce different SMILES.
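The tokenization and one-hot encoding this section refers to can be sketched as follows. The regular expression and token set are illustrative, not the exact vocabulary used in this study; the key point is that multi-character tokens such as "Cl", "Br", and bracketed atoms like "[nH]" must be treated as single tokens:

```python
import re

# Bracketed atoms first, then two-character element symbols, then any
# single character; an EOS token marks the end of the sequence.
TOKEN_RE = re.compile(r"(\[[^\]]+\]|Br|Cl|.)")

def tokenize(smiles):
    return TOKEN_RE.findall(smiles) + ["EOS"]

def one_hot(tokens, vocab):
    """Encode each token as a vector with a single 1 at its vocab index."""
    index = {t: i for i, t in enumerate(vocab)}
    return [[1 if index[t] == i else 0 for i in range(len(vocab))]
            for t in tokens]

tokens = tokenize("c1ccc(Cl)cc1")
vocab = sorted(set(tokens))
encoded = one_hot(tokens, vocab)
```

Without the multi-character rules, "Cl" would be split into a carbon token followed by a meaningless "l" token.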
Reinforcement learning
Consider an Agent that, given a certain state \(s\in {\mathbb {S}}\), has to choose which action \(a\in {\mathbb {A}}(s)\) to take, where \({\mathbb {S}}\) is the set of possible states and \({\mathbb {A}}(s)\) is the set of possible actions for that state. The policy \(\pi (a \mid s)\) of an Agent maps a state to the probability of each action taken therein. Many problems in reinforcement learning are framed as Markov decision processes, which means that the current state contains all information necessary to guide our choice of action, and that nothing more is gained by also knowing the history of past states. For most real problems this is an approximation rather than a truth; however, we can generalize this concept to that of a partially observable Markov decision process, in which the Agent can interact with an incomplete representation of the environment. Let \(r(a \mid s)\) be the reward, which acts as a measure of how good it was to take an action at a certain state, and let the long-term return \(G(a_t, s_t) = \sum _{t}^{T}{r_t}\) be the cumulative rewards collected from time t up to time T. Since molecular desirability is in general only sensible for a completed SMILES, we will refer only to the return of a complete sequence.
Given a set of actions taken from some states and the rewards thus received, reinforcement learning concerns how to improve the Agent's policy in such a way as to increase the expected return \({\mathbb {E}}[G]\). A task which has a clear endpoint at step T is referred to as an episodic task [16], where T corresponds to the length of the episode. Generating a SMILES is an example of an episodic task, which ends once the EOS token is sampled.
The states and actions used to train the Agent can be generated either by the Agent itself or by some other means. If they are generated by the Agent itself, the learning is referred to as on-policy; if they are generated by some other means, it is referred to as off-policy [16].

Policy-based methods can explicitly learn an optimal stochastic policy [16], which is our goal.

The method used starts with a Prior sequence model. The goal is to fine-tune this model according to a specified scoring function. Since the Prior model already constitutes a policy, learning a fine-tuned policy may require only small changes to it.

The episodes in this case are short and fast to sample, reducing the impact of the variance in the estimate of the gradients.
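As a toy illustration of this policy-based setup, the sketch below fine-tunes a categorical "policy" toward an augmented target equal to the prior log-likelihood plus σ times a user-defined score, minimizing the squared difference between that target and the agent's log-likelihood. Everything here is a simplified stand-in: the i.i.d. token policy, the five-token vocabulary, and the σ and learning-rate values are illustrative, not the RNN setup used in the study. The scoring function mirrors the sulphur-avoidance task discussed later:

```python
import math, random

random.seed(0)
VOCAB = ["C", "N", "O", "S", "EOS"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sample_seq(logits):
    """Sample tokens i.i.d. until EOS (a toy stand-in for an RNN)."""
    probs = softmax(logits)
    seq = []
    while True:
        r, acc = random.random(), 0.0
        for tok, p in zip(VOCAB, probs):
            acc += p
            if r < acc:
                break
        if tok == "EOS" or len(seq) >= 20:
            return seq
        seq.append(tok)

def log_prob(seq, logits):
    probs = softmax(logits)
    lp = sum(math.log(probs[VOCAB.index(t)]) for t in seq)
    return lp + math.log(probs[VOCAB.index("EOS")])

def score(seq):                          # user-defined scoring function:
    return -1.0 if "S" in seq else 1.0   # penalize sulphur

prior_logits = [0.0] * len(VOCAB)        # uniform toy "Prior"
agent_logits = list(prior_logits)
sigma, lr = 6.0, 0.005

for step in range(3000):
    seq = sample_seq(agent_logits)
    # Augmented target: prior log-likelihood plus sigma times the score;
    # the cost is the squared difference to the agent's log-likelihood.
    target = log_prob(seq, prior_logits) + sigma * score(seq)
    delta = target - log_prob(seq, agent_logits)
    # For a categorical policy, d logP / d logit_k = count_k - n * prob_k.
    probs = softmax(agent_logits)
    counts = [seq.count(t) for t in VOCAB]
    counts[VOCAB.index("EOS")] += 1
    n = sum(counts)
    for k in range(len(VOCAB)):
        grad = -2.0 * delta * (counts[k] - n * probs[k])
        agent_logits[k] -= lr * grad

p_sulphur = softmax(agent_logits)[VOCAB.index("S")]
```

Even this toy agent illustrates the trade-off: the σ-weighted score pulls the policy away from the prior (sulphur becomes rare), while the prior-likelihood term keeps the update anchored so the token distribution is not distorted more than the score demands.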
The prior network
Maximum likelihood estimation was employed to train the initial RNN, composed of 3 layers with 1024 Gated Recurrent Units (forget bias 5) in each layer. The RNN was trained on the RDKit [27] canonical SMILES of 1.5 million structures from ChEMBL [20], where the molecules were restricted to those containing between 10 and 50 heavy atoms and elements \(\in \{H, B, C, N, O, F, Si, P, S, Cl, Br, I\}\). The model was trained with stochastic gradient descent for 50,000 steps using a batch size of 128, utilizing the Adam optimizer [28] (\(\beta _1 = 0.9\), \(\beta _2 = 0.999\), and \(\epsilon = 10^{-8}\)) with an initial learning rate of 0.001 and a learning rate decay of 0.02 every 100 steps. Gradients were clipped to \([-3, 3]\). Tensorflow [29] was used to implement the Prior as well as the RL Agent.
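Two of the hyperparameters above can be written out as small helpers. Note that "a learning rate decay of 0.02 every 100 steps" admits more than one reading; the multiplicative interpretation below (multiply the rate by 1 − 0.02 every 100 steps) is an assumption, not a detail confirmed by the text:

```python
def learning_rate(step, initial=0.001, decay=0.02, every=100):
    """Staircase exponential decay (assumed interpretation):
    multiply the rate by (1 - decay) once every `every` steps."""
    return initial * (1.0 - decay) ** (step // every)

def clip(gradients, bound=3.0):
    """Element-wise gradient clipping to [-bound, bound]."""
    return [max(-bound, min(bound, g)) for g in gradients]
```

Clipping bounds the size of any single update, which complements the exploding-gradient mitigation provided by the gated cells.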
The agent network
In all the tasks investigated below, the scoring function is fixed during the training of the Agent. If the scoring function were instead defined by a discriminator network whose task is to distinguish sequences generated by the Agent from ‘real’ SMILES (e.g. a set of actives), the method could be described as a type of Generative Adversarial Network [31], in which the Agent and the discriminator are trained jointly in a game where each strives to beat the other. This is the approach taken by Yu et al. [15] to fine-tune a pretrained sequence model for poem generation. Guimaraes et al. [32] demonstrate how such a method can be combined with a fixed scoring function for molecular de novo design.
The DRD2 activity model
In one of our studies the objective of the Agent is to generate molecules that are predicted to be active against a biological target. The dopamine type 2 receptor DRD2 was chosen as the target, and the corresponding bioactivity data were extracted from ExCAPE-DB [33]. In this dataset there are 7218 actives (pIC50 > 5) and 343,204 inactives (pIC50 < 5). A subset of 100,000 inactive compounds was randomly selected. In order to decrease the nearest-neighbour similarity between the training and testing structures [34, 35, 36], the actives were grouped into clusters based on their molecular similarity. The Jaccard [37] index, for binary vectors also known as the Tanimoto similarity, based on the RDKit implementation of binary Extended Connectivity Molecular Fingerprints with a diameter of 6 (ECFP6 [38]), was used as the similarity measure, and the actives were clustered using the Butina clustering algorithm [39] in RDKit with a clustering cutoff of 0.4. In this algorithm, centroid molecules are selected, and everything with a similarity higher than 0.4 to a centroid is assigned to the same cluster; the centroids are chosen so as to maximize the number of molecules assigned to any cluster. The clusters were sorted by size and iteratively assigned to the test, validation, and training sets (4 clusters assigned each iteration) to give a distribution of \(\frac{1}{6}\), \(\frac{1}{6}\), and \(\frac{4}{6}\) of the clusters respectively. The inactive compounds, of which less than 0.5% were found to belong to any of the clusters formed by the actives, were split randomly into the three sets using the same ratios.
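The clustering step can be sketched in pure Python, with fingerprints represented as sets of bit indices. The study used RDKit's ECFP6 fingerprints and Butina implementation; this is an illustrative re-implementation of the same idea, not the code used:

```python
def jaccard(fp_a, fp_b):
    """Jaccard/Tanimoto similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def butina_cluster(fps, cutoff=0.4):
    """Butina-style clustering: the molecule with the most unassigned
    neighbours above the cutoff becomes a centroid and claims those
    neighbours; repeat until every molecule is assigned."""
    n = len(fps)
    neighbours = [
        {j for j in range(n) if j != i and jaccard(fps[i], fps[j]) > cutoff}
        for i in range(n)
    ]
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        centroid = max(unassigned,
                       key=lambda i: len(neighbours[i] & unassigned))
        members = {centroid} | (neighbours[centroid] & unassigned)
        clusters.append(sorted(members))
        unassigned -= members
    return clusters

# Toy fingerprints: two similar pairs, dissimilar across pairs.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {7, 8}]
clusters = butina_cluster(fps, cutoff=0.4)
```

Assigning whole clusters (rather than individual molecules) to the splits is what keeps close analogues of a test-set active out of the training set.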
A support vector machine (SVM) classifier with a Gaussian kernel was built in Scikit-learn [40] on the training set as a predictive model for DRD2 activity. The optimal C and gamma values utilized in the final model were obtained from a grid search for the highest ROC-AUC performance on the validation set.
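Since model selection here is driven by ROC-AUC on the validation set, it is worth recalling that ROC-AUC can be computed directly from scores via the rank (Mann–Whitney) formulation: the probability that a randomly drawn active is scored above a randomly drawn inactive, counting ties as one half. A reference sketch (quadratic in the number of samples, fine for illustration):

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (active, inactive) pairs where the
    active is scored higher; ties count as half. labels are 1/0."""
    pairs = 0.0
    wins = 0.0
    for l_pos, s_pos in zip(labels, scores):
        if l_pos != 1:
            continue
        for l_neg, s_neg in zip(labels, scores):
            if l_neg != 0:
                continue
            pairs += 1
            if s_pos > s_neg:
                wins += 1
            elif s_pos == s_neg:
                wins += 0.5
    return wins / pairs

labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.2, 0.6, 0.6]
auc = roc_auc(labels, scores)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why it is a natural criterion for the grid search over C and gamma.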
Results and discussion
Structure generation by the Prior

Once the aromatic “n” has been sampled, the model has come to expect a ring opening (i.e. a number), since aromatic moieties by definition are cyclic.

Once an aromatic ring has been opened, the aromatic atoms “c”, “n”, “o”, and “s” become probable, until 5 or 6 steps later when the model thinks it is time to close the ring.

The model has learnt the RDKit canonicalized SMILES format of increasing ring numbers, and expects the first ring to be numbered “1”. Ring numbers can be reused, as in the first two rings in this example. Only once “1” has been sampled does it expect a ring to be numbered “2”, and so on.
Learning to avoid sulphur

Can the models achieve the task of generating valid SMILES corresponding to structures that do not contain sulphur?

Will the models exploit the reward function by converging on naïve solutions such as “C” if handwritten rules are not imposed?

Are the distributions of physicochemical properties of the generated structures similar to those of sulphur-free structures generated by the Prior?
Comparison of model performance and properties for non-sulphur-containing structures generated by the different models
Metric                      Prior        Agent        Action basis  REINFORCE    REINFORCE + Prior

Fraction of valid SMILES    0.94 ± 0.01  0.95 ± 0.01  0.95 ± 0.01   0.98 ± 0.00  0.98 ± 0.00
Fraction without sulphur    0.66 ± 0.01  0.98 ± 0.00  0.92 ± 0.02   0.98 ± 0.00  0.92 ± 0.01
Average molecular weight    371 ± 1.70   367 ± 3.30   372 ± 0.94    585 ± 27.4   232 ± 5.25
Average cLogP               3.36 ± 0.04  3.37 ± 0.09  3.39 ± 0.02   11.3 ± 0.85  3.05 ± 0.02
Average NumRotBonds         5.39 ± 0.04  5.41 ± 0.07  6.08 ± 0.04   30.0 ± 2.17  2.8 ± 0.11
Average NumAromRings        2.26 ± 0.02  2.26 ± 0.02  2.09 ± 0.02   0.57 ± 0.04  2.11 ± 0.02
Randomly selected SMILES generated by the different models
Prior:
  CCOC(=O)C1=C(C)OC(N)=C(C#N)C1c1ccccc1C(F)(F)F
  COC(=O)CC(C)=NNc1ccc(N(C)C)cc1[N+](=O)[O-]
  Cc1ccccc1CNS(=O)(=O)c1ccc2c(c1)C(=O)C(=O)N2
Agent:
  CC(C)(C)NC(=O)c1ccc(OCc2ccccc2C(F)(F)F)nc1c1ccccc1
  CC(=O)NCC1OC(=O)N2c3ccc(c4cccnc4)cc3OCC12
  OCCCNCc1cccc(c2cccc(c3nc4ccccc4[nH]3)c2OCCOc2ncc(Cl)cc2Br)c1
Action basis:
  CCN1CC(C)(C)OC(=O)c2cc(c3ccc(Cl)cc3)ccc21
  CCC(CC)C(=O)Nc1ccc2cnn(c3ccc(C(C)=O)cc3)c2c1
  CCCCN1C(=O)c2ccccc2NC1c1ccc(OC)cc1
REINFORCE:
  CC1CCCCC12NC(=O)N(CC(=O)Nc1ccccc1C(=O)O)C2=O
  CCCCCCCCCCCCCCCCCCCCCCCCCCCCNC(=O)OCCCCCC
  CCCCCCCCCCCCCCCCCCCCCC1CCC(O)C1(CCC)CCCCCCCCCCCCCCC
REINFORCE + Prior:
  Nc1ccccc1C(=O)Oc1ccccc1
  O=c1cccccc1Oc1ccccc1
  Nc1ccc(c2ccccc2O)cc1
Similarity guided structure generation
Celecoxib itself as well as many other similar structures appear in the ChEMBL training set used to build the Prior. An interesting question is whether the Agent could succeed in generating Celecoxib when these structures are not part of the chemical space covered by the Prior. To investigate this, all structures with a similarity to Celecoxib higher than 0.5 (corresponding to 1804 molecules) were removed from the training set and a new reduced Prior was trained. The prior likelihood of Celecoxib for the canonical and reduced Priors was compared, as well as the ability of the models to generate structures similar to Celecoxib. As expected, the prior probability of Celecoxib decreased when similar compounds were removed from the training set, from \(\log _e P = -12.7\) to \(\log _e P = -19.2\), representing a reduction in likelihood by a factor of about 700. An Agent was then trained using the same hyperparameters as before, but starting from the reduced Prior. After 400 steps, the Agent again managed to find Celecoxib, albeit requiring more time to do so. After 1000 steps, Celecoxib was the most commonly generated structure (about a third of the generated structures), followed by demethylated Celecoxib (also about a third), whose SMILES is more likely according to the Prior with \(\log _e P = -15.2\) but has a lower similarity (\(J = 0.87\)), resulting in an augmented likelihood equal to that of Celecoxib.
These experiments demonstrate that the Agent can be optimized using fingerprint-based Jaccard similarity as the objective, but making copies of the query structure is hardly useful. A more useful example is that of generating structures that are moderately similar to the query structure. The Agent was therefore trained for 3000 steps, starting from both the canonical as well as the reduced Prior, using \(k = 0.7\) and \(\sigma = 12\). The Agents based on the canonical Prior quickly converged to their targets, while the Agents based on the reduced Prior converged more slowly. For the Agent based on the reduced Prior where \(k=1\), the fact that Celecoxib and demethylated Celecoxib are given similar augmented likelihoods means that the average similarity converges to around 0.9 rather than 1.0. For the Agent based on the reduced Prior where \(k=0.7\), the lower prior likelihood of compounds similar to Celecoxib translates to a lower augmented likelihood, which lowers the average similarity of the structures generated by the Agent.
We have investigated how various factors affect the learning behaviour of the Agent. In real drug discovery applications, we might be more interested in finding structures with modest similarity to our query molecules rather than very close analogues. For example, one of the structures sampled after 200 steps shown in Fig. 7 displays a type of scaffold hopping where the sulphur functional group on one of the outer aromatic rings has been fused to the central pyrazole. The similarity to Celecoxib of this structure is 0.4, which may be a more interesting solution for scaffoldhopping purposes. One can choose hyperparameters and similarity criterion tailored to the desired output. Other types of similarity measures such as pharmacophoric fingerprints [42], Tversky substructure similarity [43], or presence/absence of certain pharmacophores could also be explored.
Target activity guided structure generation
The third example, perhaps the one most interesting and relevant for drug discovery, is to optimize the Agent towards generating structures with predicted biological activity. This can be seen as a form of inverse QSAR, where the Agent is used to implicitly map high predicted probability of activity to molecular structure. DRD2 was chosen as the biological target. The clustering split of the DRD2 activity dataset as described in “The DRD2 activity model” section resulted in 1405, 1287, and 4526 actives in the test, validation, and training sets respectively. The average similarity to the nearest neighbour in the training set for the test set compounds was 0.53. For a random split of actives into sets of the same sizes, this similarity was 0.69, indicating that the clustering significantly decreased the training–test set similarity; this mimics the hit-finding practice in drug discovery of seeking hits that are structurally distinct from the compounds used to train the model. Most of the DRD2 actives are also included in the ChEMBL dataset used to train the Prior. To explore the effect of not having the known actives included in the Prior, a reduced Prior was trained on a subset of the ChEMBL training set from which all DRD2 actives had been removed.
Performance of the DRD2 activity model
Set  Training  Validation  Test 

Accuracy  1.00  0.98  0.98 
ROC-AUC  1.00  0.99  1.00 
Precision  1.00  0.96  0.97 
Recall  1.00  0.73  0.82 
Comparison of properties for structures generated by the canonical Prior, the reduced Prior, and corresponding Agents
Metric                                                            Prior  Agent  Prior^a  Agent^a

Fraction valid SMILES                                             0.94   0.99   0.94     0.99
Fraction predicted actives                                        0.03   0.97   0.02     0.96
Fraction similar to train active                                  0.02   0.79   0.02     0.75
Fraction similar to test active                                   0.01   0.46   0.01     0.38
Fraction of test actives recovered (\(\times 10^{-3}\))           13.5   126    2.85     72.6
Probability of generating a test set active (\(\times 10^{-3}\))  0.17   40.2   0.05     15.0

^a Derived from the reduced Prior, i.e. with the known DRD2 actives removed from the Prior training set
In some cases, the model recovered exact matches from the training and test sets (cf. Segler et al. [13]). The fractions of test actives recovered by the canonical and reduced Priors were 1.3 and 0.3% respectively. The Agent derived from the canonical Prior managed to recover 13% of the test actives; the Agent derived from the reduced Prior recovered 7%. For the Agent derived from the reduced Prior, where the DRD2 actives were excluded from the Prior training set, this means that the model has learnt to generate “novel” structures that have been seen neither by the DRD2 activity model nor the Prior, and that are experimentally confirmed actives. We can formalize this observation by calculating the probability of a given generated sequence belonging to the set of test actives. For the canonical and reduced Priors, this probability was \(0.17\times 10^{-3}\) and \(0.05\times 10^{-3}\) respectively. Removing the actives from the Prior thus resulted in a threefold reduction in the probability of generating a structure from the set of test actives. For the corresponding Agents, the probabilities rose to \(40.2\times 10^{-3}\) and \(15.0\times 10^{-3}\) respectively, an enrichment of roughly a factor of 250 over the Prior models. Again the consequence of removing the actives from the Prior was a threefold reduction in the probability of generating a test set active: the difference between the two Priors is directly mirrored by their corresponding Agents. Apart from generating a higher fraction of structures that are predicted to be active, both Agents also generate a significantly higher fraction of valid SMILES (Table 4). Sequences that are not valid SMILES receive a score of \(-1\), which means that the scoring function naturally encourages valid SMILES.
Removing the known actives from the training set of the Prior resulted in an Agent which shows a decrease in all metrics measuring the overlap between the known actives and the structures generated, compared to the Agent derived from the canonical Prior. Interestingly, the fraction of predicted actives did not change significantly. This indicates that the Agent derived from the reduced Prior has managed to find a similar chemical space to that of the canonical Agent, with structures that are equally likely to be predicted as active, but are less similar to the known actives. Whether or not these compounds are active will be dependent on the accuracy of the target activity model. Ideally, any predictive model to be used in conjunction with the generative model should cover a broad chemical space within its domain of applicability, since it initially has to assess representative structures of the dataset used to build the Prior [13].
Conclusion
To summarize, we believe that an RNN operating on the SMILES representation of molecules is a promising method for molecular de novo design. It is a datadriven generative model that does not rely on predefined building blocks and rules, which makes it clearly differentiated from traditional methods. In this study we extend upon previous work [13, 14, 15, 17] by introducing a reinforcement learning method which can be used to tune the RNN to generate structures with certain desirable properties through augmented episodic likelihood.
The model was tested on the task of generating sulphur-free molecules as a proof of principle, and the method using augmented episodic likelihood was compared with traditional policy gradient methods. The results indicate that the Agent can find solutions reflecting the underlying probability distribution of the Prior, representing a significant improvement over both traditional REINFORCE [30] algorithms and previously reported methods [17]. To evaluate whether the model could be used to generate analogues of a query structure, the Agent was trained to generate structures similar to the drug Celecoxib. Even when all analogues of Celecoxib were removed from the Prior, the Agent could still locate the intended region of chemical space that was not part of the Prior. Furthermore, when trained towards generating predicted actives against the dopamine receptor type 2 (DRD2), the Agent generated structures of which more than 95% are predicted to be active, and it could recover test set actives even when they were included in neither the activity model nor the Prior. Our results indicate that the method could be a useful tool for drug discovery.
It is clear that the qualities of the Prior are reflected in the corresponding Agents it produces. An exhaustive study which explores how parameters such as training set size, model size, regularization [44, 45], and training time would influence the quality and variety of structures generated by the Prior would be interesting. Another interesting avenue for future research might be that of token embeddings [46]. The method was found to be robust with respect to the hyperparameters \(\sigma\) and the learning rate.
All of the aforementioned examples used single parameter based scoring functions. In a typical drug discovery project, multiple parameters such as target activity, DMPK profile, synthetic accessibility etc. all need to be taken into account simultaneously. Applying this type of multiparametric scoring functions to the model is an area requiring further research.
Notes
Authors' contributions
MO contributed concept and implementation. All authors codesigned experiments. All authors contributed to the interpretation of results. MO wrote the manuscript. HC, TB, and OE reviewed and edited the manuscript. All authors read and approved the final manuscript.
Acknowledgements
The authors thank Thierry Kogej and Christian Tyrchan for general assistance and discussion, and Dominik Peters for his expertise in LaTeX.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
The source code and data supporting the conclusions of this article are available at https://github.com/MarcusOlivecrona/REINVENT, doi: 10.5281/zenodo.572576.

Project name: REINVENT

Project home page: https://github.com/MarcusOlivecrona/REINVENT

Archived version: http://doi.org/10.5281/zenodo.572576

Operating system: Platform independent

Programming language: Python

Other requirements: Python 2.7, Tensorflow, RDKit, Scikit-learn

License: MIT.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Funding
MO, HC, and OE are employed by AstraZeneca. TB has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie SklodowskaCurie grant agreement No 676434, “Big Data in Chemistry” (“BIGCHEM”, http://bigchem.eu). The article reflects only the authors’ view and neither the European Commission nor the Research Executive Agency (REA) are responsible for any use that may be made of the information it contains.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary material
References
 1.Schneider G, Fechner U (2005) Computerbased de novo design of druglike molecules. Nat Rev Drug Discov 4(8):649–663. doi: 10.1038/nrd1799 CrossRefGoogle Scholar
 2.Böhm HJ (1992) The computer program ludi: a new method for the de novo design of enzyme inhibitors. J Comput Aided Mol Des 6(1):61–78. doi: 10.1007/BF00124387 CrossRefGoogle Scholar
 3.Gillet VJ, Newell W, Mata P, Myatt G, Sike S, Zsoldos Z, Johnson AP (1994) Sprout: recent developments in the de novo design of molecules. J Chem Inf Comput Sci 34(1):207–217. doi: 10.1021/ci00017a027 CrossRefGoogle Scholar
 4.Ruddigkeit L, Blum LC, Reymond JL (2013) Visualization and virtual screening of the chemical universe database gdb17. J Chem Inf Model 53(1):56–65. doi: 10.1021/ci300535x CrossRefGoogle Scholar
 5.Hartenfeller M, Zettl H, Walter M, Rupp M, Reisen F, Proschak E, Weggen S, Stark H, Schneider G (2012) Dogs: reactiondriven de novo design of bioactive compounds. PLOS Comput Biol 8:1–12. doi: 10.1371/journal.pcbi.1002380 CrossRefGoogle Scholar
 6.Schneider G, Geppert T, Hartenfeller M, Reisen F, Klenner A, Reutlinger M, Hähnke V, Hiss JA, Zettl H, Keppner S, Spänkuch B, Schneider P (2011) Reactiondriven de novo design, synthesis and testing of potential type II kinase inhibitors. Future Med Chem 3(4):415–424. doi: 10.4155/fmc.11.8 CrossRefGoogle Scholar
 7.Besnard J, Ruda GF, Setola V, Abecassis K, Rodriguiz RM, Huang XP, Norval S, Sassano MF, Shin AI, Webster LA, Simeons FRC, Stojanovski L, Prat A, Seidah NG, Constam DB, Bickerton GR, Read KD, Wetsel WC, Gilbert IH, Roth BL, Hopkins AL (2012) Automated design of ligands to polypharmacological profiles. Nature 492(7428):215–220. doi: 10.1038/nature11691 CrossRefGoogle Scholar
 8.Miyao T, Kaneko H, Funatsu K (2016) Inverse qspr/qsar analysis for chemical structure generation (from y to x). J Chem Inf Model 56(2):286–299. doi: 10.1021/acs.jcim.5b00628 CrossRefGoogle Scholar
 9.Churchwell CJ, Rintoul MD, Martin S Jr, Visco DP, Kotu A, Larson RS, Sillerud LO, Brown DC, Faulon JL (2004) The signature molecular descriptor: 3. Inversequantitative structureactivity relationship of icam1 inhibitory peptides. J Mol Graph Model 22(4):263–273. doi: 10.1016/j.jmgm.2003.10.002 CrossRefGoogle Scholar
 10.Wong WW, Burkowski FJ (2009) A constructive approach for discovering new drug leads: using a kernel methodology for the inverseqsar problem. J Cheminform 1:44. doi: 10.1186/1758294614 CrossRefGoogle Scholar
 11. Mikolov T, Karafiát M, Burget L, Černocký J, Khudanpur S (2010) Recurrent neural network based language model. In: Kobayashi T, Hirose K, Nakamura S (eds) 11th annual conference of the International Speech Communication Association (INTERSPEECH 2010), Makuhari, Chiba, Japan. ISCA, 26–30 Sept 2010
 12. Eck D, Schmidhuber J (2002) A first look at music composition using LSTM recurrent neural networks. Technical report, Istituto Dalle Molle di Studi sull'Intelligenza Artificiale
 13. Segler MHS, Kogej T, Tyrchan C, Waller MP (2017) Generating focussed molecule libraries for drug discovery with recurrent neural networks. arXiv:1701.01329
 14. Gómez-Bombarelli R, Duvenaud DK, Hernández-Lobato JM, Aguilera-Iparraguirre J, Hirzel TD, Adams RP, Aspuru-Guzik A (2016) Automatic chemical design using a data-driven continuous representation of molecules. CoRR. arXiv:1610.02415
 15. Yu L, Zhang W, Wang J, Yu Y (2016) SeqGAN: sequence generative adversarial nets with policy gradient. CoRR. arXiv:1609.05473
 16. Sutton R, Barto A (1998) Reinforcement learning: an introduction, 1st edn. MIT Press, Cambridge
 17. Jaques N, Gu S, Turner RE, Eck D (2016) Tuning recurrent neural networks with reinforcement learning. CoRR. arXiv:1611.02796
 18. Leo A, Hansch C, Elkins D (1971) Partition coefficients and their uses. Chem Rev 71(6):525–616. doi: 10.1021/cr60274a001
 19. Bickerton GR, Paolini GV, Besnard J, Muresan S, Hopkins AL (2012) Quantifying the chemical beauty of drugs. Nat Chem 4:90–98. doi: 10.1038/nchem.1243
 20. Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, Overington JP (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40:1100–1107. doi: 10.1093/nar/gkr777. Version 22
 21. Goodfellow IJ, Mirza M, Xiao D, Courville A, Bengio Y (2013) An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv:1312.6211
 22. Goodfellow I, Bengio Y, Courville A (2016) Deep learning, 1st edn. MIT Press, Cambridge
 23. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780. doi: 10.1162/neco.1997.9.8.1735
 24. Chung J, Gulcehre C, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555
 25. SMILES. http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html. Accessed 7 Apr 2017
 26. Weininger D, Weininger A, Weininger JL (1989) SMILES. 2. Algorithm for generation of unique SMILES notation. J Chem Inf Comput Sci 29(2):97–101. doi: 10.1021/ci00062a008
 27. RDKit: open source cheminformatics. Version 2016.09.3. http://www.rdkit.org/
 28. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. CoRR. arXiv:1412.6980
 29. TensorFlow. Version 1.0.1. http://www.tensorflow.org
 30. Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn 8(3):229–256
 31. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ (eds) Advances in neural information processing systems 27 (NIPS 2014), Montreal, Quebec, Canada. NIPS Foundation, 8–13 Dec 2014
 32. Lima Guimaraes G, Sanchez-Lengeling B, Cunha Farias PL, Aspuru-Guzik A (2017) Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models. arXiv:1705.10843
 33. Sun J, Jeliazkova N, Chupakin V, Golib-Dzib JF, Engkvist O, Carlsson L, Wegner J, Ceulemans H, Georgiev I, Jeliazkov V, Kochev N, Ashby TJ, Chen H (2017) ExCAPE-DB: an integrated large scale dataset facilitating big data analysis in chemogenomics. J Cheminform 9(1):17. doi: 10.1186/s13321-017-0203-5
 34. Sheridan RP (2013) Time-split cross-validation as a method for estimating the goodness of prospective prediction. J Chem Inf Model 53(4):783–790
 35. Unterthiner T, Mayr A, Steijaert M, Wegner JK, Ceulemans H, Hochreiter S (2014) Deep learning as an opportunity in virtual screening. In: Deep learning and representation learning workshop. NIPS, pp 1058–1066
 36. Mayr A, Klambauer G, Unterthiner T, Hochreiter S (2016) DeepTox: toxicity prediction using deep learning. Front Environ Sci 3:80
 37. Jaccard P (1901) Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bull Soc Vaud Sci Nat 37:547–579
 38. Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50(5):742–754. doi: 10.1021/ci100050t
 39. Butina D (1999) Unsupervised data base clustering based on Daylight's fingerprint and Tanimoto similarity: a fast and automated way to cluster small and large data sets. J Chem Inf Comput Sci 39(4):747–750. doi: 10.1021/ci9803381
 40. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830. Version 0.17
 41. Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M et al (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489
 42. Reutlinger M, Koch CP, Reker D, Todoroff N, Schneider P, Rodrigues T, Schneider G (2013) Chemically advanced template search (CATS) for scaffold-hopping and prospective target prediction for 'orphan' molecules. Mol Inform 32(2):133–138
 43. Senger S (2009) Using Tversky similarity searches for core hopping: finding the needles in the haystack. J Chem Inf Model 49(6):1514–1524
 44. Zaremba W, Sutskever I, Vinyals O (2014) Recurrent neural network regularization. CoRR. arXiv:1409.2329
 45. Wan L, Zeiler M, Zhang S, LeCun Y, Fergus R (2013) Regularization of neural networks using DropConnect. In: Proceedings of the 30th international conference on machine learning, vol 28. ICML'13, pp 1058–1066
 46. Bengio Y, Ducharme R, Vincent P, Janvin C (2003) A neural probabilistic language model. J Mach Learn Res 3:1137–1155
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.