Scaling Out Transformer Models for Retrosynthesis on Supercomputers

  • Conference paper
  • In: Intelligent Computing

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 283)

Abstract

Retrosynthesis is the task of determining which smaller precursor molecules can be combined to build a target molecule. As shown in previous work, good results can be achieved on this task with deep learning techniques, for example with Transformer networks. Here the retrosynthesis task is treated as a machine translation problem in which the Transformer network predicts the precursor molecules given a string representation of the target molecule. Previous research has focused on performing the training procedure on a single machine; in this article we investigate the effect of scaling the training of Transformer networks for the retrosynthesis task to supercomputers. We examine the issues that arise when scaling Transformers to multiple machines, such as learning rate scheduling and the choice of optimizer, and present strategies that improve results compared to previous research. By training on multiple machines we increase the top-1 accuracy by 2.5% to 43.6%. In an attempt to improve results further, we experiment with increasing the number of parameters in the Transformer network, but find that these larger models are prone to overfitting, which can be attributed to the small dataset used for training. On these runs we achieve a scaling efficiency of nearly 70%.
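The abstract describes the approach only at a high level: retrosynthesis is cast as sequence-to-sequence translation over string representations of molecules (SMILES), and the main scaling issues are the learning rate schedule and the optimizer when training data-parallel across many workers. The sketch below illustrates these two ingredients. It is a minimal, hypothetical example assuming a Horovod-plus-PyTorch data-parallel setup; the model size, warmup length, base learning rate, and optimizer settings are illustrative placeholders, not the configuration used in the paper.

```python
# Minimal sketch (not the authors' code): data-parallel setup for a
# SMILES-to-SMILES "translation" Transformer. Assumes Horovod and PyTorch
# are installed; tokenization and batching of SMILES strings are omitted.
import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()
if torch.cuda.is_available():
    # Pin each worker process to one GPU (device placement of the model
    # and batches is omitted for brevity).
    torch.cuda.set_device(hvd.local_rank())

# Placeholder model: token-embedding and output-projection layers are left
# out; nn.Transformer consumes already-embedded (seq_len, batch, d_model)
# tensors.
d_model = 256
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=4, num_decoder_layers=4)

# Scale the base learning rate with the number of workers, mirroring the
# growth of the effective batch size (linear-scaling heuristic).
base_lr = 1e-3                                  # illustrative value
optimizer = torch.optim.Adam(model.parameters(),
                             lr=base_lr * hvd.size(),
                             betas=(0.9, 0.998))
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Every worker must start from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

def warmup_inv_sqrt(step, warmup=8000):
    """Linear warmup followed by inverse-square-root decay.

    Returns a multiplier (peaking at 1.0 after `warmup` steps) that is
    applied to the worker-scaled base learning rate.
    """
    step = max(step, 1)
    return min(step / warmup, (warmup / step) ** 0.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer,
                                              lr_lambda=warmup_inv_sqrt)
# Per training step: forward/backward on the local shard, then
# optimizer.step() (Horovod averages gradients across workers) and
# scheduler.step().
```

With, say, 16 workers the effective batch is 16 times larger than on a single machine, which is why the base rate is scaled up and a warmup phase is used; whether linear or square-root scaling of the rate works better, and how long the warmup should last, are exactly the kind of scheduling choices the paper investigates.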

Notes

  1. https://userinfo.surfsara.nl/.

  2. https://www.top500.org/system/176908.

Acknowledgment

We thank the anonymous referees for their constructive comments, which helped to improve the paper. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No. 814416. We would also like to acknowledge Intel for providing us the resources to run on the Endeavour Supercomputer.

Author information

Corresponding author

Correspondence to Joris Mollinga.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Mollinga, J., Codreanu, V. (2022). Scaling Out Transformer Models for Retrosynthesis on Supercomputers. In: Arai, K. (eds) Intelligent Computing. Lecture Notes in Networks and Systems, vol 283. Springer, Cham. https://doi.org/10.1007/978-3-030-80119-9_4
