Abstract
As mentioned in Chap. 3, gradient descent (GD) and its variants provide the core optimization methodology for machine learning problems. Given a \(C^1\) or \(C^2\) function \(f: \mathbb {R}^{n} \rightarrow \mathbb {R}\) with unconstrained variable \(x \in \mathbb {R}^{n}\), GD uses the following update rule:
\[x_{t+1} = x_t - h_t \nabla f(x_t),\]
where \(h_t\) is the step size, which may be either fixed or vary across iterations. When f is convex, \(h_t < \frac {2}{L}\) is a necessary and sufficient condition to guarantee the (worst-case) convergence of GD, where L is the Lipschitz constant of the gradient of f. By contrast, far less is understood about GD for non-convex problems: for general smooth non-convex problems, GD is only known to converge to a stationary point (i.e., a point with zero gradient).
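A minimal sketch of this update rule, assuming (purely for illustration, not from the chapter) a quadratic objective \(f(x) = \tfrac{1}{2} x^{\top} A x\) with symmetric positive semidefinite A, so that \(\nabla f(x) = Ax\) and L is the largest eigenvalue of A:

```python
import numpy as np

def gradient_descent(grad, x0, L, num_iters=200):
    """Gradient descent with a fixed step size h < 2/L (illustrative sketch)."""
    h = 1.0 / L                # any fixed h < 2/L satisfies the convergence condition for convex f
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - h * grad(x)    # x_{t+1} = x_t - h_t * grad f(x_t)
    return x

# Hypothetical example: f(x) = 0.5 * x^T A x, grad f(x) = A x,
# and L = largest eigenvalue of A (Lipschitz constant of grad f).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_f = lambda x: A @ x
L = np.linalg.eigvalsh(A).max()
print(gradient_descent(grad_f, x0=[1.0, -2.0], L=L))  # approaches the minimizer at the origin
```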
Part of this chapter appears in the paper titled "Gradient Descent Converges to Minimizers: Optimal and Adaptive Step Size Rules" by Bin Shi et al. (2018), currently under review for publication in the INFORMS Journal on Optimization.
Notes
- 1. For the purpose of this paper, strict saddle points include local maximizers.
- 2. \(f^{n}(x)\) denotes the n-fold application of f to x.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this chapter
Shi, B., Iyengar, S.S. (2020). Gradient Descent Converges to Minimizers: Optimal and Adaptive Step-Size Rules. In: Mathematical Theories of Machine Learning - Theory and Applications. Springer, Cham. https://doi.org/10.1007/978-3-030-17076-9_7
DOI: https://doi.org/10.1007/978-3-030-17076-9_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-17075-2
Online ISBN: 978-3-030-17076-9
eBook Packages: Engineering (R0)