First-order methods almost always avoid strict saddle points

Lee, Jason D.; Panageas, Ioannis; Piliouras, Georgios; Simchowitz, Max; Jordan, Michael I.; Recht, Benjamin

doi:10.1007/s10107-019-01374-3

First-order methods almost always avoid strict saddle points

Full Length Paper
Series B
Published: 18 February 2019

Volume 176, pages 311–337, (2019)
Cite this article

Mathematical Programming Submit manuscript

Jason D. Lee ORCID: orcid.org/0000-0003-0064-7800¹,
Ioannis Panageas²,
Georgios Piliouras³,
Max Simchowitz⁴,
Michael I. Jordan⁴ &
…
Benjamin Recht⁴

4189 Accesses
78 Citations
Explore all metrics

Abstract

We establish that first-order methods avoid strict saddle points for almost all initializations. Our results apply to a wide variety of first-order methods, including (manifold) gradient descent, block coordinate descent, mirror descent and variants thereof. The connecting thread is that such algorithms can be studied from a dynamical systems perspective in which appropriate instantiations of the Stable Manifold Theorem allow for a global stability analysis. Thus, neither access to second-order derivative information nor randomness beyond initialization is necessary to provably avoid strict saddle points.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Zeroth-Order Nonconvex Stochastic Optimization: Handling Constraints, High Dimensionality, and Saddle Points

Article 19 March 2021

The Mirror Descent Algorithm

Almost Fixed Points

Notes

This line of work assumes that f is a random function with a specific distribution.
For the purposes of this paper, strict saddle points include local maximizers.
The determinant is invariant under similarity transformations, so is independent of the choice of basis.

References

Absil, P.A., Mahony, R., Andrews, B.: Convergence of the iterates of descent methods for analytic cost functions. SIAM J. Optim. 16(2), 531–547 (2005)
Article MathSciNet MATH Google Scholar
Absil, P.A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2009)
MATH Google Scholar
Absil, P.A., Mahony, R., Trumpf, J.: An extrinsic look at the Riemannian Hessian. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information, pp. 361–368. Springer, Berlin (2013)
Chapter Google Scholar
Absil, P.A., Malick, J.: Projection-like retractions on matrix manifolds. SIAM J. Optim. 22(1), 135–158 (2012)
Article MathSciNet MATH Google Scholar
Adler, R.J., Taylor, J.E.: Random Fields and Geometry. Springer, Berlin (2009)
MATH Google Scholar
Arora, S., Ge, R., Ma, T., Moitra, A.: Simple, efficient, and neural algorithms for sparse coding. In: Proceedings of The 28th Conference on Learning Theory, pp. 113–149 (2015)
Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka–Lojasiewicz inequality. Math. Oper. Res. 35(2), 438–457 (2010)
Article MathSciNet MATH Google Scholar
Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Math. Program. 137(1–2), 91–129 (2013)
Article MathSciNet MATH Google Scholar
Auffinger, A., Arous, G.B., Černỳ, J.: Random matrices and complexity of spin glasses. Commun. Pure Appl. Math. 66(2), 165–201 (2013)
Article MathSciNet MATH Google Scholar
Beck, A.: First-Order Methods in Optimization, vol. 25. SIAM, New Delhi (2017)
Book MATH Google Scholar
Belkin, M., Rademacher, L., Voss, J.: Basis learning as an algorithmic primitive. In: Conference on Learning Theory, pp. 446–487 (2016)
Bhojanapalli, S., Neyshabur, B., Srebro, N.: Global optimality of local search for low rank matrix recovery. In: Advances in Neural Information Processing Systems, pp. 3873–3881 (2016)
Bolte, J., Daniilidis, A., Ley, O., Mazet, L., et al.: Characterizations of Lojasiewicz inequalities: subgradient flows, talweg, convexity. Trans. Am. Math. Soc 362(6), 3319–3363 (2010)
Article MATH Google Scholar
Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization or nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)
Article MathSciNet MATH Google Scholar
Bolte, J., Sabach, S., Teboulle, M., Vaisbourd, Y.: First order methods beyond convexity and lipschitz gradient continuity with applications to quadratic inverse problems. SIAM J. Optim. 28(3), 2131–2151 (2018)
Article MathSciNet MATH Google Scholar
Brutzkus, A., Globerson, A.: Globally optimal gradient descent for a convnet with Gaussian inputs. arXiv preprint arXiv:1702.07966 (2017)
Cai, T.T., Li, X., Ma, Z., et al.: Optimal rates of convergence for noisy sparse phase retrieval via thresholded wirtinger flow. Ann. Stat. 44(5), 2221–2251 (2016)
Article MathSciNet MATH Google Scholar
Candes, E.J., Li, X., Soltanolkotabi, M.: Phase retrieval via Wirtinger flow: theory and algorithms. IEEE Trans. Inf. Theory 61(4), 1985–2007 (2015)
Article MathSciNet MATH Google Scholar
Choromanska, A., Henaff, M., Mathieu, M., Arous, G.B., LeCun, Y.: The loss surfaces of multilayer networks. In: Artificial Intelligence and Statistics, pp. 192–204 (2015)
Conn, A.R., Gould, N.I., Toint, P.L.: Trust Region Methods, vol. 1. SIAM, New Delhi (2000)
Book MATH Google Scholar
Dauphin, Y.N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., Bengio, Y.: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In: Advances in Neural Information Processing Systems, pp. 2933–2941 (2014)
Du, S.S., Jin, C., Lee, J.D., Jordan, M.I., Poczos, B., Singh, A.: Gradient descent can take exponential time to escape saddle points. arXiv preprint arXiv:1705.10412 (2017)
Du, S.S., Lee, J.D., Tian, Y.: When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129 (2017)
Ge, R., Huang, F., Jin, C., Yuan, Y.: Escaping from saddle points online stochastic gradient for tensor decomposition. In: Conference on Learning Theory, pp. 797–842 (2015)
Ge, R., Jin, C., Zheng, Y.: No spurious local minima in nonconvex low rank problems: a unified geometric analysis. arXiv preprint arXiv:1704.00708 (2017)
Ge, R., Lee, J.D., Ma, T.: Matrix completion has no spurious local minimum. In: Advances in Neural Information Processing Systems, pp. 2973–2981 (2016)
Gill, P.E., Murray, W.: Newton-type methods for unconstrained and linearly constrained optimization. Math. Program. 7(1), 311–350 (1974)
Article MathSciNet MATH Google Scholar
Jin, C., Ge, R., Netrapalli, P., Kakade, S.M., Jordan, M.I.: How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887 (2017)
Keshavan, R.H., Montanari, A., Oh, S.: Matrix completion from a few entries. IEEE Trans. Inf. Theory 56(6), 2980–2998 (2009)
Article MathSciNet MATH Google Scholar
Kleinberg, R., Piliouras, G., Tardos, E.: Multiplicative updates outperform generic no-regret learning in congestion games. In: Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pp. 533–542. ACM (2009)
Lange, K.: Optimization, vol. 95. Springer (2013)
Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent only converges to minimizers. In: Conference on Learning Theory, pp. 1246–1257 (2016)
Lewis, A.S., Malick, J.: Alternating projections on manifolds. Math. Oper. Res. 33(1), 216–234 (2008)
Article MathSciNet MATH Google Scholar
Liu, M., Yang, T.: On noisy negative curvature descent: competing with gradient descent for faster non-convex optimization. arXiv preprint arXiv:1709.08571 (2017)
Losert, V., Akin, E.: Dynamics of games and genes: discrete versus continuous time. J. Math. Biol. 17, 241–251 (1983)
Article MathSciNet MATH Google Scholar
Mikusinski, P., Taylor, M.: An Introduction to Multivariable Analysis From Vector to Manifold. Springer, Berlin (2012)
Google Scholar
Moré, J.J., Sorensen, D.C.: On the use of directions of negative curvature in a modified Newton method. Math. Program. 16(1), 1–20 (1979)
Article MathSciNet MATH Google Scholar
Murty, K.G., Kabadi, S.N.: Some NP-complete problems in quadratic and nonlinear programming. Math. Program. 39(2), 117–129 (1987)
Article MathSciNet MATH Google Scholar
Nesterov, Y.: Introductory Lectures on Convex Optimization, vol. 87. Springer, Berlin (2004)
Book MATH Google Scholar
Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Program. 108(1), 177–205 (2006)
Article MathSciNet MATH Google Scholar
ONeill, M., Wright, S.: Behavior of accelerated gradient methods near critical points of nonconvex problems. arXiv preprint arXiv:1706.07993 (2017)
Panageas, I., Piliouras, G.: Gradient descent only converges to minimizers: non-isolated critical points and invariant regions. In: Innovations of Theoretical Computer Science (ITCS) (2017)
Pascanu, R., Dauphin, Y.N., Ganguli, S., Bengio, Y.: On the saddle point problem for non-convex optimization. arXiv:1405.4604 (2014)
Pemantle, R.: Nonconvergence to unstable points in urn models and stochastic approximations. Ann. Probab. 18, 698–712 (1990)
Article MathSciNet MATH Google Scholar
Reddi, S.J., Zaheer, M., Sra, S., Poczos, B., Bach, F., Salakhutdinov, R., Smola, A.J.: A generic approach for escaping saddle points. arXiv preprint arXiv:1709.01434 (2017)
Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. arXiv preprint arXiv:1107.2848 (2011)
Royer, C.W., Wright, S.J.: Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. arXiv preprint arXiv:1706.03131 (2017)
Shub, M.: Global Stability of Dynamical Systems. Springer, Berlin (1987)
Book MATH Google Scholar
Soltanolkotabi, M., Javanmard, A., Lee, J.D.: Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926 (2017)
Sun, J., Qu, Q., Wright, J.: When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096 (2015)
Sun, J., Qu, Q., Wright, J.: A geometric analysis of phase retrieval. In: 2016 IEEE International Symposium on Information Theory (ISIT), pp. 2379–2383. IEEE (2016)
Sun, J., Qu, Q., Wright, J.: Complete dictionary recovery over the sphere I: overview and the geometric picture. IEEE Trans. Inf. Theory 63(2), 853–884 (2017)
Article MathSciNet MATH Google Scholar
Sun, J., Qu, Q., Wright, J.: Complete dictionary recovery over the sphere II: recovery by Riemannian trust-region method. IEEE Tran. Inf. Theory 63(2), 885–914 (2017)
Article MathSciNet MATH Google Scholar
Zhang, Y., Chen, X., Zhou, D., Jordan, M.I.: Spectral methods meet EM: a provably optimal algorithm for crowdsourcing. In: Advances in Neural Information Processing Systems, pp. 1260–1268 (2014)
Zhao, T., Wang, Z., Liu, H.: Nonconvex low rank matrix factorization via inexact first order oracle. Adv. Neural Inf. Process. Syst. 559–567 (2015)

Download references

Author information

Authors and Affiliations

Data Sciences and Operations, University of Southern California, Los Angeles, CA, USA
Jason D. Lee
Department of Information Systems, Singapore University of Technology, Tampines, Singapore
Ioannis Panageas
Engineering Systems and Design Pillar, Singapore University of Technology and Design, Tampines, Singapore
Georgios Piliouras
Department of Electrical Engineering and Computer Science, UC Berkeley, Berkeley, USA
Max Simchowitz, Michael I. Jordan & Benjamin Recht

Authors

Jason D. Lee
View author publications
You can also search for this author in PubMed Google Scholar
Ioannis Panageas
View author publications
You can also search for this author in PubMed Google Scholar
Georgios Piliouras
View author publications
You can also search for this author in PubMed Google Scholar
Max Simchowitz
View author publications
You can also search for this author in PubMed Google Scholar
Michael I. Jordan
View author publications
You can also search for this author in PubMed Google Scholar
Benjamin Recht
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jason D. Lee.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This paper extends upon the special case of gradient descent dynamics developed in the conference proceedings of the authors [32, 42].

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 649 KB)

Supplementary material 2 (pdf 253 KB)

Appendices

Proof of Claim 4

Proof

We assume that $\alpha |e_j^{T}Hz_j| < \delta \left\| z_j \right\| _2$ for all $j \in \{1,\ldots ,n\}$, for some $\delta $ to be chosen later. For the base case $j=2$, it holds that $\left\| y_t-z_2 \right\| _2 = \left\| z_1-z_2 \right\| _2 = \alpha |e_1^{T}Hz_1|< \delta \left\| z_1 \right\| _2 < 2\delta \left\| y_t \right\| _2$ and $\left\| z_2 \right\| _2 < (1+2\delta )\left\| y_t \right\| _2$. Suppose for $j \ge 2$ that $\left\| y_t-z_{j} \right\| _2 < 2(j-1)\delta \left\| y_t \right\| _2$ and thus $\left\| z_j \right\| _2 < [1+2(j-1)\delta ] \left\| y_t \right\| _2.$ Using induction and triangle inequality we get

$$\begin{aligned} \left\| y_t - z_{j+1} \right\| _2&\le \left\| y_t - z_{j} \right\| _2+ \left\| z_{j} -z_{j+1} \right\| _2\\&= 2(j-1)\delta \left\| y_t \right\| _2 + \alpha |e_j^{T}Hz_j| \\&< 2(j-1)\delta \left\| y_t \right\| _2 + \delta \left\| z_j \right\| _2 \\&< 2(j-1)\delta \left\| y_t \right\| _2+ \delta [1+2(j-1)\delta ] \left\| y_t \right\| _2\\&\le 2j\delta \left\| y_t \right\| _2, \end{aligned}$$

where we assume $\delta < \frac{1}{2n}$ so that $2(j-1)\delta <1$ for all $j \in [n]$. Using the above calculation,

$$\begin{aligned} \alpha |e_i^{T}Hy_t|&< \alpha |e_i^{T}Hz_i|+\alpha |e_i^{T}H(y_t - z_i)| \\&< \delta \left\| z_i \right\| _2+ \alpha \left\| He_i \right\| _2\left\| y_t - z_i \right\| _2\\&< \delta \big (1 +2(i-1)\delta \big ) \left\| y_t \right\| _2 + \alpha \left\| He_i \right\| _2 \big (2(i-1)\delta \big ) \left\| y_t \right\| _2\\&\le \delta \big ( 1+2n \delta + 2n \alpha L\big )\left\| y_t \right\| . \end{aligned}$$

Thus $\alpha \left\| Hy_t \right\| _2 < \sqrt{n}\delta \big ( 1+2n \delta + 2n \alpha L\big ) \left\| y_t \right\| _2$, and

$$\begin{aligned} \sigma _{\min ^+} (H) \left\| y_t \right\| _2 \le \left\| Hy_t \right\| _2 < \frac{\sqrt{n}}{\alpha }\delta \big ( 1+2n \delta + 2n \alpha L\big ) \left\| y_t \right\| _2, \end{aligned}$$

where $\sigma _{\min ^+}$ is the smallest non-zero singular value of H. Thus by choosing $\delta $ small enough such that

$$\begin{aligned} \sigma _{min^+} (H) \ge \frac{\sqrt{n}}{\alpha }\delta \big ( 1+2n\delta + 2n\alpha L\big ), \end{aligned}$$

we have obtained a contradiction. $\square $

Proof of Proposition 7

Proof

Let $H= \nabla ^2 f(x^*)$, $J =\mathrm {D}g(x^*) = \prod _{i=1}^b (I- \alpha P_{S_{b-i+1}} H)$, and $y_0$ be an eigenvector of the Hessian at $x^*$.

The proof technique is very similar to that of the proof of Proposition 5. We shall prove that $\left\| J^t y_0 \right\| _2 \ge c(1+\eta )^t $. Hence by Gelfand’s theorem J must have at least one eigenvalue with magnitude greater than one.

We fix some arbitrary iteration t and let $y_t = J^t y_0$. We will first show that there exists an $\epsilon >0$,

$$\begin{aligned} y_{t+1}^{T} Hy_{t+1} \le (1+\epsilon ) y_{t}^{T}Hy_{t}, \end{aligned}$$

(13)

for all $t\in {\mathbb {N}}$. Let $z_1 = y_t$ and $z_{i+1} = (I- \alpha P_{S_i} H)z_i = z_i - \alpha \sum _{j \in S_i}(e_j^{T}Hz_i)e_j $, so that $y_{t+1} = Jy_t = z_{b+1}$. We get that

$$\begin{aligned}&z_{i+1}^{T}Hz_{i+1} = \left( z_i^{T} - 2 \alpha \sum _{j \in S_i}(e_j^{T}Hz_i)e_j^{T}\right) H \left( z_i - \alpha \sum _{j \in S_i}(e_j^{T}Hz_i)e_j\right) \\&= z_i^{T}Hz_i -2\alpha \sum _{j \in S_i}(e_j^{T} H z_i)^2 + \alpha ^2 \left( \sum _{j \in S_i}(e_j^{T}Hz_i)e_j\right) ^{T} H \left( \sum _{j \in S_i}(e_j^{T}Hz_i)e_j \right) \\&< z_i^{T}Hz_i - 2\alpha \sum _{j \in S_i}(e_j ^{T} H z_i)^2 + \alpha ^2 L_{b} \left\| \sum _{j \in S_i}(e_j^{T}Hz_i)e_j \right\| _2^2 \quad { \text { (using } \left\| H_{S_i} \right\| _2 \le L_b)}\\&= z_i^{T}Hz_i - \alpha (2 -\alpha L_b )\left\| \sum _{j \in S_i}(e_j^{T}Hz_i)e_j \right\| _2^2 \\&\le z_i^{T}Hz_i - \alpha \left\| \sum _{j \in S_i}(e_j^{T}Hz_i)e_j \right\| _2^2 \quad { \text {(using } \alpha L_b <1)}. \end{aligned}$$

Thus $z_{i}^T H z_i $ is a decreasing (non-increasing) sequence.

We shall prove that there exists an $i \in [b]$ so that $z_{i+1}^{T}Hz_{i+1} \le (1+\delta ) z_{i}^{T}Hz_{i}$ for some global constant $\delta $ to be chosen later.

Claim

Let $y_t$ be in the range of H. There exists an $i \in [b]$ so that $\alpha \sum _{j \in S_i} \left| e_j^{T}Hz_i\right| \ge \delta \left\| z_i \right\| _2$ for some $\delta >0$.

To finish the proof of the lemma, suppose that Claim B applies. Then by Cauchy–Schwarz, there exists an index i such that

$$\begin{aligned} z_{i+1}^{T}Hz_{i+1}< & {} z_i^{T}Hz_i - \alpha \sum _{j \in S_i}(z_i^{T}He_j)^2 \\< & {} z_i^{T}Hz_i - \frac{\alpha }{n} \left( \sum _{j \in S_i} \left| z_i^{T}He_j\right| \right) ^2 < z_i ^{T}Hz_i - \frac{\delta ^2}{n\alpha } \left\| z_i \right\| ^2_2. \end{aligned}$$

However, $w^{T}Hw \ge \lambda _{\min }(H)\left\| w \right\| ^2_2 \ge - L \left\| w \right\| ^2_2$, hence we get that

$$\begin{aligned} z_{i+1}^{T}Hz_{i+1} < \left( 1+ \frac{\delta ^2}{\alpha L n}\right) z_i ^{T}Hz_i . \end{aligned}$$

(14)

By choosing $\epsilon =\frac{\delta ^2}{\alpha L n}$ we showed that $y_{t+1}^{T}Hy_{t+1} \le (1+\epsilon )y_t^{T}Hy_t$ as long as $y_t$ is in the range of H.

Assume that $y_t = y_{{\mathcal {N}}} + y_{{\mathcal {R}}}$. It is easy to see $y_t^{T}Hy_t = y_{{\mathcal {R}}}^{T}Hy_{{\mathcal {R}}}$ and also $y_{t+1} = Jy_t = y_{{\mathcal {N}}} + Jy_{{\mathcal {R}}}$, hence $y_{t+1}^{T}Hy_{t+1} =(Jy_{{\mathcal {R}}})^{T} H (Jy_{{\mathcal {R}}})$. Therefore from Inequality (14) proved above, if the starting vector is $y_{{\mathcal {R}}}$, which Claim B applies too, then $(Jy_{{\mathcal {R}}})^{T}HJy_{{\mathcal {R}}} \le (1+\epsilon )y_{{\mathcal {R}}}^{T}H y_{{\mathcal {R}}} = (1+\epsilon )y_t^{T}Hy_t$.

To sum up, we showed that $y_t^{T}Hy_t \le (1+\epsilon )^t y_0^{T} H y_0$ and since $y_0$ is an eigenvector of H (of norm one) with corresponding negative eigenvalue $\lambda $, it follows that $y_t^{T}Hy_t \le \lambda (1+\epsilon )^t.$ Finally using $y_t^{T}Hy_t \ge \lambda _{\min }(H)\left\| y_t \right\| ^2_2$, we get $\left\| y_t \right\| _2 \ge (1+\epsilon )^{t/2} \frac{\lambda }{\lambda _{\min }(H)}$. Observe that $\frac{\lambda }{\lambda _{\min }(H)}>0$ is a positive constant, $(1+\epsilon )^{t/2} \ge (1+\epsilon /4)^t$ (since $\epsilon \le 1/2$) and the proof follows (the parameters as claimed in the beginning will be $c = \frac{\lambda }{\lambda _{\min }(H)}$ and $\eta = \epsilon /4$). $\square $

Proof of Claim B

We assume that $\alpha \sum _{j \in S_i}\left| e_j^{T}Hz_i \right| < \delta \left\| z_i \right\| _2$ for all $i \in [b]$. For base case $i=2$, it holds that $\left\| y_t-z_2 \right\| _2 = \left\| z_1-z_2 \right\| _2 = \alpha |\sum _{j \in S_1}e_j^{T}Hz_1|< \alpha \sum _{j \in S_1} |e_j^{T}Hz_1|< \delta \left\| z_1 \right\| _2 < 2\delta \left\| y_t \right\| _2$ and $\left\| z_2 \right\| _2 < (1+2\delta )\left\| y_t \right\| _2$. Suppose for $i \ge 2$ that $\left\| y_t-z_{i} \right\| _2 < 2(i-1)\delta \left\| y_t \right\| _2$ and thus $\left\| z_i \right\| _2 < [1+2(i-1)\delta ] \left\| y_t \right\| _2.$ Using induction and triangle inequality we obtain

$$\begin{aligned} \left\| y_t - z_{i+1} \right\| _2&\le \left\| y_t - z_{i} \right\| _2+ \left\| z_{i} -z_{i+1} \right\| _2\\&= 2(i-1)\delta \left\| y_t \right\| _2 + \alpha \left| \sum _{j \in S_i}e_j^{T}Hz_i\right| \\&\le 2(i-1)\delta \left\| y_t \right\| _2 + \alpha \sum _{j \in S_i}\left| e_j^{T}Hz_i\right| \\&< 2(i-1)\delta \left\| y_t \right\| _2 + \delta \left\| z_i \right\| _2 \\&< 2(i-1)\delta \left\| y_t \right\| _2+ \delta [1+2(i-1)\delta ] \left\| y_t \right\| _2 \\&\le 2i\delta \left\| y_t \right\| _2, \end{aligned}$$

where we assume $\delta < \frac{1}{2b}$ so that $2(i-1)\delta <1$ for all $i \in [b]$. Using the above,

$$\begin{aligned} \alpha \sum _{j \in S_i} \left| e_j^{T}Hy_t\right|&< \alpha \sum _{j \in S_i}\left| e_j^{T}Hz_i \right| +\alpha \sum _{j \in S_i}\left| e_j^{T}H(y_t - z_i)\right| \\&< \delta \left\| z_i \right\| _2+ \alpha \left( \sum _{j \in S_i}\left\| He_j \right\| _2 \right) \left\| y_t - z_i \right\| _2 \\&< \delta [1+2(i-1)\delta ]\left\| y_t \right\| _2 + \alpha [ 2(i-1)\delta ] \left\| y_t \right\| _2 \left( \sum _{j \in S_i}\left\| He_j \right\| _2 \right) . \end{aligned}$$

Since $\left\| He_i \right\| _2 < \sigma _{\max }(H) \le L$, we get that $\alpha \sum _{j \in S_i} \left\| He_j \right\| _2 < |S_i| \le n$ and we conclude

$$\begin{aligned} \alpha \sum _{j \in S_i}\left| e_j^{T}Hy_t\right| < 2n^2 \delta \left\| y_t \right\| _2. \end{aligned}$$

(15)

Finally, using Inequality (15) it follows that $\alpha \left\| Hy_t \right\| _2 < 2n^2 \delta \sqrt{n} \left\| y_t \right\| _2$. Let $w \in \text {Im}(H)$ be a vector that is orthogonal to $\text {null}(H)$ (since H is symmetric). Then it holds that $\left\| Hw \right\| _2 \ge \sigma _{\min ^+}(H) \left\| w \right\| _2$ where $\sigma _{\min ^+}(H)$ denotes the smallest positive singular value of H (greater than zero). Assume that $y_t \in \text {Im}(H)$ and we get $\left\| Hy_t \right\| _2 < \frac{2n^2 \delta \sqrt{n}}{\alpha } \left\| y_t \right\| _2$. However, $\left\| Hy_t \right\| _2 \ge \sigma _{\min ^+}(H) \left\| y_t \right\| _2$ thus by choosing $\frac{2n^2 \sqrt{n}\delta }{\alpha } < \sigma _{\min ^+}(H)$ we reach a contradiction. $\square $

Rights and permissions

Reprints and permissions

About this article

Cite this article

Lee, J.D., Panageas, I., Piliouras, G. et al. First-order methods almost always avoid strict saddle points. Math. Program. 176, 311–337 (2019). https://doi.org/10.1007/s10107-019-01374-3

Download citation

Received: 19 October 2017
Accepted: 05 February 2019
Published: 18 February 2019
Issue Date: 01 July 2019
DOI: https://doi.org/10.1007/s10107-019-01374-3

Keywords

Mathematics Subject Classification

90C26

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

First-order methods almost always avoid strict saddle points

Abstract

Access this article

Similar content being viewed by others

Zeroth-Order Nonconvex Stochastic Optimization: Handling Constraints, High Dimensionality, and Saddle Points

The Mirror Descent Algorithm

Almost Fixed Points

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Electronic supplementary material

Supplementary material 1 (pdf 649 KB)

Supplementary material 2 (pdf 253 KB)

Appendices

Proof of Claim 4

Proof

Proof of Proposition 7

Proof

Claim

Proof of Claim B

Rights and permissions

About this article

Cite this article

Keywords

Mathematics Subject Classification

Navigation

First-order methods almost always avoid strict saddle points

Abstract

Access this article

Similar content being viewed by others

Zeroth-Order Nonconvex Stochastic Optimization: Handling Constraints, High Dimensionality, and Saddle Points

The Mirror Descent Algorithm

Almost Fixed Points

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Electronic supplementary material

Supplementary material 1 (pdf 649 KB)

Supplementary material 2 (pdf 253 KB)

Appendices

Proof of Claim 4

Proof

Proof of Proposition 7

Proof

Claim

Proof of Claim B

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification

Search

Navigation