Scaling Out Transformer Models for Retrosynthesis on Supercomputers

Conference paper
Part of the Lecture Notes in Networks and Systems book series (LNNS, volume 283)


Retrosynthesis is the task of building a molecule from smaller precursor molecules. As shown in previous work, good results can be achieved on this task with the help of deep learning techniques, for example with the help of Transformer networks. Here the retrosynthesis task is treated as a machine translation problem where the Transformer network predicts the precursor molecules given a string representation of the target molecule. Previous research has focused on performing the training procedure on a single machine but in this article we investigate the effect of scaling the training of the Transformer networks for the retrosynthesis task on supercomputers. We investigate the issues that arise when scaling Transformers to multiple machines such as learning rate scheduling and choice of optimizer, and present strategies that improve results compared to previous research. By training on multiple machines we are able to increase the top-1 accuracy by \(2.5\%\) to \(43.6\%\). In an attempt to improve results further, we experiment with increasing the number of parameters in the Transformer network but find that models are prone to overfitting, which can be attributed to the small dataset used for training the models. On these runs we manage to achieve a scaling efficiency of nearly \(70\%\).


Computer aided retrosynthesis Deep learning Transformer High performance computing Deep learning on supercomputers 



We thank the anonymous referees for their constructive comments, which helped to improve the paper. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No. 814416. We would also like to acknowledge Intel for providing us the resources to run on the Endeavour Supercomputer.


  1. 1.
    Bai, R., Zhang, C., Wang, L., Yao, C., Ge, J., Duan, H.: Transfer learning: making retrosynthetic predictions based on a small chemical reaction dataset scale to a new level. Molecules 25(10), 2357 (2020)CrossRefGoogle Scholar
  2. 2.
    Bjerrum, E.J.: Smiles enumeration as data augmentation for neural network modeling of molecules. arXiv preprint arXiv:1703.07076 (2017)
  3. 3.
    Brown, T.B., et al.: Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)
  4. 4.
    Cavdar, D., et al.: Densifying assumed-sparse tensors. In: Weiland, M., Juckeland, G., Trinitis, C., Sadayappan, P. (eds.) ISC High Performance 2019. LNCS, vol. 11501, pp. 23–39. Springer, Cham (2019). Scholar
  5. 5.
    Codreanu, V., Podareanu, D., Saletore, V.: Scale out for large minibatch SGD: residual network training on imagenet-1k with improved accuracy and reduced time to train. arXiv preprint arXiv:1711.04291 (2017)
  6. 6.
    Coley, C.W., Rogers, L., Green, W.H., Jensen, K.F.: Computer-assisted retrosynthesis based on molecular similarity. ACS Central Sci. 3(12), 1237–1245 (2017)CrossRefGoogle Scholar
  7. 7.
    Corey, E.J., Long, A.K., Rubenstein, S.D.: Computer-assisted analysis in organic synthesis. Science 228(4698), 408–418 (1985)CrossRefGoogle Scholar
  8. 8.
    Dai, H., Li, C., Coley, C.W., Dai, B., Song, L.: Retrosynthesis prediction with conditional graph logic network. In: Advances in Neural Information Processing Systems, pp. 8872–8882 (2019)Google Scholar
  9. 9.
    Goodman, L., Reddy, R.: Effects of branching factor and vocabulary size on performance. Speech understanding systems: summary of results of the five-year research effort at Carnegie-Mellon University, p. 39Google Scholar
  10. 10.
    Goyal, P., et al.: Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)
  11. 11.
    Harlap, A., et al.: Pipedream: fast and efficient pipeline parallel dnn training. arXiv preprint arXiv:1806.03377 (2018)
  12. 12.
    Hoffer, E., Hubara, I., Soudry, D.: Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In: Advances in Neural Information Processing Systems, pp. 1731–1741 (2017)Google Scholar
  13. 13.
    Huang, Y., et al.: Gpipe: efficient training of giant neural networks using pipeline parallelism. In: Advances in Neural Information Processing Systems, pp. 103–112 (2019)Google Scholar
  14. 14.
    Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., Wilson, A.G.: Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407 (2018)
  15. 15.
    Karpov, P., Godin, G., Tetko, I.V.: A transformer model for retrosynthesis. In: Tetko, I.V., Kůrková, V., Karpov, P., Theis, F. (eds.) ICANN 2019. LNCS, vol. 11731, pp. 817–830. Springer, Cham (2019). Scholar
  16. 16.
    Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  17. 17.
    Liu, B., et al.: Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Central Sci. 3(10), 1103–1113 (2017)CrossRefGoogle Scholar
  18. 18.
    Lowe, D.M.: Extraction of chemical structures and reactions from the literature (2012)Google Scholar
  19. 19.
    Ott, M., Edunov, S., Grangier, D., Auli, M.: Scaling neural machine translation. arXiv preprint arXiv:1806.00187 (2018)
  20. 20.
    Popel, M., Bojar, O.: Training tips for the transformer model. Prague Bull. Math. Linguist. 110(1), 43–70 (2018)CrossRefGoogle Scholar
  21. 21.
    Segler, M.H., Preuss, M., Waller, M.P.: Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555(7698), 604–610 (2018)CrossRefGoogle Scholar
  22. 22.
    Segler, M.H.S., Waller, M.P.: Neural-symbolic machine learning for retrosynthesis and reaction prediction. Chem.-A Eur. J. 23(25), 5966–5971 (2017)CrossRefGoogle Scholar
  23. 23.
    Sergeev, A., Balso, M.D.: Horovod: fast and easy distributed deep learning in tensorflow. arXiv preprint arXiv:1802.05799 (2018)
  24. 24.
    Smith, L.N.: Cyclical learning rates for training neural networks. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464–472. IEEE (2017)Google Scholar
  25. 25.
    Tetko, I.V., Karpov, P., Van Deursen, R., Godin, G.: Augmented transformer achieves 97% and 85% for top5 prediction of direct and classical retro-synthesis. arXiv preprint arXiv:2003.02804 (2020)
  26. 26.
    Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)Google Scholar
  27. 27.
    Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph attention networks. In: International Conference on Learning Representations (2017)Google Scholar
  28. 28.
    Weininger, D.: Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28(1), 31–36 (1988)Google Scholar
  29. 29.
    You, Y., et al.: Large batch optimization for deep learning: training bert in 76 minutes. arXiv preprint arXiv:1904.00962 (2019)
  30. 30.
    Zheng, S., Rao, J., Zhang, Z., Xu, J., Yang, Y.: Predicting retrosynthetic reactions using self-corrected transformer neural networks. J. Chem. Inf. Model. 60(1), 47–55 (2019)CrossRefGoogle Scholar

Copyright information

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

Authors and Affiliations

  1. 1.SURF CorporativeAmsterdamThe Netherlands

Personalised recommendations