Mathematical Programming, Volume 176, Issue 1–2, pp 311–337

First-order methods almost always avoid strict saddle points

  • Jason D. Lee
  • Ioannis Panageas
  • Georgios Piliouras
  • Max Simchowitz
  • Michael I. Jordan
  • Benjamin Recht
Full Length Paper, Series B

Abstract

We establish that first-order methods avoid strict saddle points for almost all initializations. Our results apply to a wide variety of first-order methods, including (manifold) gradient descent, block coordinate descent, mirror descent and variants thereof. The connecting thread is that such algorithms can be studied from a dynamical systems perspective in which appropriate instantiations of the Stable Manifold Theorem allow for a global stability analysis. Thus, neither access to second-order derivative information nor randomness beyond initialization is necessary to provably avoid strict saddle points.
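The following minimal sketch (not taken from the paper; the function, step size, and iteration count are chosen purely for illustration) shows the phenomenon the abstract describes: gradient descent on f(x, y) = x^2 − y^2, whose only critical point is a strict saddle at the origin, escapes that saddle from a random initialization, while only the measure-zero set {y = 0} converges to it.

import numpy as np

def grad_f(z):
    # Gradient of f(x, y) = x^2 - y^2. The Hessian is diag(2, -2),
    # so the origin is a strict saddle (one strictly negative eigenvalue).
    x, y = z
    return np.array([2.0 * x, -2.0 * y])

rng = np.random.default_rng(0)
z = rng.normal(scale=1e-3, size=2)   # random initialization near the saddle
step = 0.1                           # step size below 1/L, with L = 2

for _ in range(50):
    z = z - step * grad_f(z)

# The x-coordinate contracts toward 0, while the y-coordinate grows along the
# negative-curvature direction, so the iterates move away from the saddle.
print(z)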

Keywords

Gradient descent · Smooth optimization · Saddle points · Local minimum · Dynamical systems

Mathematics Subject Classification

90C26 

Supplementary material

Supplementary material 1: 10107_2019_1374_MOESM1_ESM.pdf (PDF, 649 KB)
Supplementary material 2: 10107_2019_1374_MOESM2_ESM.pdf (PDF, 253 KB)


Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature and Mathematical Optimization Society 2019

Authors and Affiliations

  1. Data Sciences and Operations, University of Southern California, Los Angeles, USA
  2. Department of Information Systems, Singapore University of Technology, Tampines, Singapore
  3. Engineering Systems and Design Pillar, Singapore University of Technology and Design, Tampines, Singapore
  4. Department of Electrical Engineering and Computer Science, UC Berkeley, Berkeley, USA
