Scaling Out Transformer Models for Retrosynthesis on Supercomputers

  • Conference paper
  • In: Intelligent Computing

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 283)

Abstract

Retrosynthesis is the task of determining which smaller precursor molecules can be combined to build a target molecule. As shown in previous work, good results can be achieved on this task with deep learning techniques, for example with Transformer networks. Here the retrosynthesis task is treated as a machine translation problem in which the Transformer network predicts the precursor molecules given a string representation of the target molecule. Previous research has focused on performing the training procedure on a single machine; in this article we investigate the effect of scaling the training of Transformer networks for the retrosynthesis task to supercomputers. We examine the issues that arise when scaling Transformers to multiple machines, such as learning rate scheduling and the choice of optimizer, and present strategies that improve results compared to previous research. By training on multiple machines we increase the top-1 accuracy by 2.5% to 43.6%. In an attempt to improve results further, we experiment with increasing the number of parameters in the Transformer network, but find that these larger models are prone to overfitting, which can be attributed to the small dataset used for training. On these runs we achieve a scaling efficiency of nearly 70%.
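The abstract describes the approach only at a high level: retrosynthesis is cast as sequence-to-sequence translation over string representations of molecules (SMILES), and the main scaling issues are the learning rate schedule and the optimizer when training data-parallel across many workers. The sketch below illustrates these two ingredients. It is a minimal, hypothetical example assuming a Horovod-plus-PyTorch data-parallel setup; the model size, warmup length, base learning rate, and optimizer settings are illustrative placeholders, not the configuration used in the paper.

```python
# Minimal sketch (not the authors' code): data-parallel setup for a
# SMILES-to-SMILES "translation" Transformer. Assumes Horovod and PyTorch
# are installed; tokenization and batching of SMILES strings are omitted.
import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()
if torch.cuda.is_available():
    # Pin each worker process to one GPU (device placement of the model
    # and batches is omitted for brevity).
    torch.cuda.set_device(hvd.local_rank())

# Placeholder model: token-embedding and output-projection layers are left
# out; nn.Transformer consumes already-embedded (seq_len, batch, d_model)
# tensors.
d_model = 256
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=4, num_decoder_layers=4)

# Scale the base learning rate with the number of workers, mirroring the
# growth of the effective batch size (linear-scaling heuristic).
base_lr = 1e-3                                  # illustrative value
optimizer = torch.optim.Adam(model.parameters(),
                             lr=base_lr * hvd.size(),
                             betas=(0.9, 0.998))
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Every worker must start from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

def warmup_inv_sqrt(step, warmup=8000):
    """Linear warmup followed by inverse-square-root decay.

    Returns a multiplier (peaking at 1.0 after `warmup` steps) that is
    applied to the worker-scaled base learning rate.
    """
    step = max(step, 1)
    return min(step / warmup, (warmup / step) ** 0.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer,
                                              lr_lambda=warmup_inv_sqrt)
# Per training step: forward/backward on the local shard, then
# optimizer.step() (Horovod averages gradients across workers) and
# scheduler.step().
```

With, say, 16 workers the effective batch is 16 times larger than on a single machine, which is why the base rate is scaled up and a warmup phase is used; whether linear or square-root scaling of the rate works better, and how long the warmup should last, are exactly the kind of scheduling choices the paper investigates.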

Notes

  1. https://userinfo.surfsara.nl/.

  2. https://www.top500.org/system/176908.

Acknowledgment

We thank the anonymous referees for their constructive comments, which helped to improve the paper. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No. 814416. We would also like to acknowledge Intel for providing us the resources to run on the Endeavour Supercomputer.

Author information

Corresponding author

Correspondence to Joris Mollinga.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Mollinga, J., Codreanu, V. (2022). Scaling Out Transformer Models for Retrosynthesis on Supercomputers. In: Arai, K. (eds) Intelligent Computing. Lecture Notes in Networks and Systems, vol 283. Springer, Cham. https://doi.org/10.1007/978-3-030-80119-9_4
