Gradient Descent Converges to Minimizers: Optimal and Adaptive Step-Size Rules

  • Bin Shi
  • S. S. Iyengar


As mentioned in Chap. 3, gradient descent (GD) and its variants provide the core optimization methodology in machine learning problems. Given a \(C^1\) or \(C^2\) function \(f: \mathbb {R}^{n} \rightarrow \mathbb {R}\) with unconstrained variable \(x \in \mathbb {R}^{n}\), GD uses the following update rule:
$$\displaystyle x_{t+1} = x_{t} - h_t \nabla f\left (x_t\right ), $$
where \(h_t\) is the step size, which may be fixed or may vary across iterations. When \(f\) is convex, \(h_t < \frac {2}{L}\) is a necessary and sufficient condition to guarantee the (worst-case) convergence of GD, where \(L\) is the Lipschitz constant of the gradient of \(f\). GD is far less well understood for non-convex problems: for general smooth non-convex problems, it is only known to converge to a stationary point (i.e., a point with zero gradient).
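
The role of the \(2/L\) threshold can be illustrated with a minimal sketch. The quadratic test function, the fixed step sizes, and the names `gradient_descent`, `grad_f`, and `lipschitz` below are assumptions chosen for illustration, not taken from the chapter. For \(f(x) = \frac{1}{2} x^{\top} A x\) with \(A\) symmetric positive definite, the gradient is \(Ax\) and its Lipschitz constant is the largest eigenvalue of \(A\).

```python
import numpy as np


def gradient_descent(grad, x0, step_size, num_steps):
    """Iterate x_{t+1} = x_t - h * grad(x_t) with a fixed step size h."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - step_size * grad(x)
    return x


# Hypothetical convex test problem: f(x) = 0.5 * x^T A x.
A = np.diag([1.0, 4.0])
lipschitz = 4.0  # largest eigenvalue of A, i.e. the Lipschitz constant of grad f


def grad_f(x):
    """Gradient of the quadratic test function."""
    return A @ x


x0 = np.array([1.0, 1.0])

# Step size just below 2/L: the iterates contract toward the minimizer at 0.
x_small = gradient_descent(grad_f, x0, step_size=1.9 / lipschitz, num_steps=200)

# Step size just above 2/L: the component along the top eigenvector grows.
x_large = gradient_descent(grad_f, x0, step_size=2.1 / lipschitz, num_steps=200)

print(np.linalg.norm(x_small), np.linalg.norm(x_large))
```

In this example each coordinate is multiplied by \(1 - h a_i\) at every step, where \(a_i\) is the corresponding eigenvalue of \(A\). The iteration therefore contracts in every direction exactly when \(|1 - hL| < 1\), which for \(h > 0\) is the condition \(h < 2/L\).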


Keywords: Stable manifold theorem; Lipschitz constant; Cubic-regularization method; Lebesgue measure; Lindelöf’s lemma



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Bin Shi, University of California, Berkeley, USA
  • S. S. Iyengar, Florida International University, Miami, USA
