
Gradient Descent Converges to Minimizers: Optimal and Adaptive Step-Size Rules


Abstract

As mentioned in Chap. 3, gradient descent (GD) and its variants provide the core optimization methodology in machine learning problems. Given a \(C^1\) or \(C^2\) function \(f: \mathbb {R}^{n} \rightarrow \mathbb {R}\) with unconstrained variable \(x \in \mathbb {R}^{n}\), GD uses the following update rule:

$$\displaystyle x_{t+1} = x_{t} - h_t \nabla f\left (x_t\right ), $$

where \(h_t\) is the step size, which may be fixed or may vary across iterations. When \(f\) is convex, \(h_t < \frac {2}{L}\) is a necessary and sufficient condition for the (worst-case) convergence of GD, where \(L\) is the Lipschitz constant of the gradient of \(f\). By contrast, far less is understood about GD for non-convex problems. For general smooth non-convex problems, GD is only known to converge to a stationary point (i.e., a point with zero gradient).
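To make the update rule and the \(h_t < \frac {2}{L}\) condition concrete, here is a minimal sketch (not from the chapter) that runs GD with a fixed step size on a simple quadratic; the matrix A, the starting point, and the iteration count are illustrative assumptions.

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 * x^T A x with A symmetric positive definite,
# so grad f(x) = A x is Lipschitz with constant L = lambda_max(A).
A = np.array([[3.0, 0.0],
              [0.0, 1.0]])

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

L = np.linalg.eigvalsh(A).max()  # Lipschitz constant of the gradient
h = 1.0 / L                      # fixed step size; any h < 2 / L converges on this problem

x = np.array([5.0, -4.0])        # arbitrary starting point
for t in range(100):
    x = x - h * grad_f(x)        # GD update: x_{t+1} = x_t - h * grad f(x_t)

print(x, f(x))  # the iterates approach the unique minimizer at the origin
```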

Part of this chapter appears in the paper titled "Gradient Descent Converges to Minimizers: Optimal and Adaptive Step-Size Rules" by Bin Shi et al. (2018), presently under review for publication in the INFORMS Journal on Optimization.


Notes

  1. For the purpose of this chapter, strict saddle points include local maximizers.

  2. \(f^n(x)\) denotes the \(n\)-fold application of \(f\) to \(x\), i.e., \(f\) composed with itself \(n\) times; a short sketch of this notation follows.
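As a small illustration of this notation (an assumption for exposition, not from the text), iterating a map, for example the GD update, can be written as follows; the names iterate, gd_map, and grad_f are hypothetical.

```python
def iterate(g, x, n):
    """Return g^n(x): the map g applied to x repeatedly, n times."""
    for _ in range(n):
        x = g(x)
    return x

# Hypothetical usage with a GD map g(x) = x - h * grad_f(x):
#   x_n = iterate(lambda x: x - h * grad_f(x), x0, n)  # the n-th GD iterate
```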



Copyright information

© 2020 Springer Nature Switzerland AG


Cite this chapter

Shi, B., Iyengar, S.S. (2020). Gradient Descent Converges to Minimizers: Optimal and Adaptive Step-Size Rules. In: Mathematical Theories of Machine Learning - Theory and Applications. Springer, Cham. https://doi.org/10.1007/978-3-030-17076-9_7
