Abstract
One of the most popular training algorithms for deep neural networks is Adaptive Moment Estimation (Adam), introduced by Kingma and Ba. Despite its success in many applications, there is no satisfactory convergence analysis: only local convergence can be shown for batch mode under some restrictions on the hyperparameters, and counterexamples exist for incremental mode. Recent results show that, for simple quadratic objective functions, limit cycles of period 2 exist in batch mode, but only for atypical hyperparameters and only for the algorithm without bias correction. We extend the convergence analysis to all choices of the hyperparameters for quadratic functions. This finally answers the question of convergence for Adam in batch mode in the negative. We analyze the stability of these limit cycles and relate our analysis to other results where approximate convergence was shown, but under the additional assumption of bounded gradients, which does not hold for quadratic functions. The investigation heavily relies on the use of computer algebra due to the complexity of the equations.
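The non-convergence described above is easy to observe numerically. The following Python sketch is an illustration only, not the authors' code: it applies the standard Adam update with bias correction, as given by Kingma and Ba, in batch mode to the scalar quadratic f(x) = 0.5*c*x^2 with exact gradient c*x. The function name and all hyperparameter values are illustrative assumptions and are not taken from the paper.

import math

def adam_on_quadratic(x0=1.0, c=1.0, alpha=0.1, beta1=0.9, beta2=0.999,
                      eps=1e-8, steps=5000, tail=6):
    # Standard Adam with bias correction, run in batch mode on f(x) = 0.5*c*x**2.
    x, m, v = x0, 0.0, 0.0
    last = []
    for t in range(1, steps + 1):
        g = c * x                               # exact (full-batch) gradient
        m = beta1 * m + (1.0 - beta1) * g       # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g   # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)          # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)          # bias-corrected second moment
        x = x - alpha * m_hat / (math.sqrt(v_hat) + eps)
        last.append(x)                          # keep only the last few iterates
        if len(last) > tail:
            last.pop(0)
    return last

print(adam_on_quadratic())  # inspect the tail of the trajectory

Because eps is tiny, the effective step length near the minimizer approaches alpha in absolute value, so for many parameter choices the late iterates keep alternating around the minimizer x* = 0 instead of settling there; this is the kind of period-2 behaviour whose existence and stability the paper analyzes.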
References
Bock, S.: Rotationsermittlung von Bauteilen basierend auf neuronalen Netzen [Determination of component rotation based on neural networks]. M.Sc. thesis, Ostbayerische Technische Hochschule Regensburg (2017, unpublished)
Bock, S., Weiß, M.G.: A proof of local convergence for the Adam optimizer. In: International Joint Conference on Neural Networks IJCNN 2019, Budapest (2019)
Chen, X., Liu, S., Sun, R., Hong, M.: On the convergence of a class of Adam-type algorithms for non-convex optimization. In: Proceedings of the 7th International Conference on Learning Representations, New Orleans, LA (2019)
da Silva, A.B., Gazeau, M.: A general system of differential equations to model first order adaptive algorithms (2018). https://arxiv.org/pdf/1810.13108
Gadat, S., Panloup, F., Saadane, S.: Stochastic heavy ball. Electron. J. Statist. 12(1), 461–529 (2018). https://doi.org/10.1214/18-EJS1395
Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA (2015)
Luenberger, D.G., Ye, Y.: Linear and Nonlinear Programming. ISOR, vol. 228, 4th edn. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-18842-3
Nesterov, J.E.: Introductory Lectures on Convex Optimization: A Basic Course, Applied Optimization, vol. APOP 87. Kluwer Acad. Publ., Boston (2004)
Reddi, S.J., Kale, S., Kumar, S.: On the convergence of Adam and beyond. In: 6th International Conference on Learning Representations, Vancouver, BC, Canada (2018)
Rubio, D.M.: Convergence analysis of an adaptive method of gradient descent. M.Sc. thesis, University of Oxford, Oxford (2017)
Acknowledgements
This paper presents results of the project “LeaP – Learning Poses” supported by the Bavarian Ministry of Science and Art under Kap. 15 49 TG 78.
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Bock, S., Weiß, M. (2019). Non-convergence and Limit Cycles in the Adam Optimizer. In: Tetko, I., Kůrková, V., Karpov, P., Theis, F. (eds.) Artificial Neural Networks and Machine Learning – ICANN 2019: Deep Learning. Lecture Notes in Computer Science, vol. 11728. Springer, Cham. https://doi.org/10.1007/978-3-030-30484-3_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30483-6
Online ISBN: 978-3-030-30484-3