Nearly Optimal First-Order Methods for Convex Optimization under Gradient Norm Measure: an Adaptive Regularization Approach

Abstract

In the development of first-order methods for smooth (resp., composite) convex optimization problems, where smooth functions with Lipschitz continuous gradients are minimized, the gradient (resp., gradient mapping) norm is a fundamental optimality measure. Under this measure, a fixed-iteration algorithm with the optimal iteration complexity for the smooth case is known, but determining the number of iterations needed to reach a desired accuracy requires prior knowledge of the distance from the initial point to the optimal solution set. In this paper, we report an adaptive regularization approach that attains the nearly optimal iteration complexity without knowing the distance to the optimal solution set. To obtain even faster convergence adaptively, we then apply this approach to construct a first-order method that is adaptive to the Hölderian error bound condition (or, equivalently, the Łojasiewicz gradient property), which covers moderately wide classes of applications. The proposed method attains nearly optimal iteration complexity with respect to the gradient mapping norm.


Data Availability

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.

Notes

  1.

    Suppose that \(\varphi \) is an L-smooth convex function satisfying (4) for some exponent \(\rho \in [1,2[\). For \(\rho =1\), Lemma 2.2 implies \(\left\| \nabla {\varphi }(x)\right\| \ge \kappa \) for \(x \not \in \mathop {\text {Argmin}}\varphi \). If \(\rho \in ]1,2[\), on the other hand, Lemmas 2.1 and 2.2 imply \({\frac{1}{2L}\left\| \nabla {\varphi }(x)\right\| ^2 \le \varphi (x)-\varphi ^* \le \kappa ^{-\frac{1}{\rho -1}}\left\| \nabla {\varphi }(x)\right\| ^{\frac{\rho }{\rho -1}}}\) for all x, which yields \(\left\| \nabla {\varphi }(x)\right\| \ge \text {const.}>0\) for all \(x \not \in \text {Argmin }\varphi \). This contradicts the continuity of \(\nabla {\varphi }\) at points in \(\text {Argmin }\varphi \).
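    Lemma 2.1 is not reproduced on this page, but the standard consequence of L-smoothness plus convexity that the chain above invokes, \(\frac{1}{2L}\left\| \nabla {\varphi }(x)\right\| ^2 \le \varphi (x)-\varphi ^*\), can be checked numerically. The sketch below is a standalone illustration on a convex quadratic (the matrix Q and sample points are arbitrary choices, not from the paper):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Convex quadratic phi(x) = 0.5 x^T Q x with Q positive semidefinite:
    # gradient Q x, minimum value phi* = 0, and L-Lipschitz gradient with
    # L = lambda_max(Q).
    A = rng.standard_normal((5, 5))
    Q = A.T @ A
    L = np.linalg.eigvalsh(Q).max()

    ok = True
    for _ in range(1000):
        x = rng.standard_normal(5)
        grad = Q @ x
        gap = 0.5 * x @ Q @ x  # phi(x) - phi*, since phi* = 0 here
        # (1/2L) ||grad phi(x)||^2 <= phi(x) - phi*
        ok &= (grad @ grad) / (2 * L) <= gap + 1e-10

    print(ok)
    ```

    The inequality holds here because \(Q^2 \preceq \lambda _{\max }(Q)\,Q\) for positive semidefinite Q, which is exactly the quadratic instance of the general bound.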

  2.

    By the strong convexity of \(\left\langle a,x\right\rangle +h(x)\) and \(\left\langle b,x\right\rangle +h(x)\), we have

    $$\begin{aligned} \frac{\mu }{2}\left\| x_a^*-x_b^*\right\| ^2 \le [\left\langle a,x_b^*\right\rangle +h(x_b^*)] - [\left\langle a,x_a^*\right\rangle +h(x_a^*)] ~~\text {and}~~ \frac{\mu }{2}\left\| x_a^*-x_b^*\right\| ^2 \le [\left\langle b,x_a^*\right\rangle +h(x_a^*)]-[\left\langle b,x_b^*\right\rangle +h(x_b^*)], \end{aligned}$$

    respectively. Adding them implies \( \mu \left\| x_a^*-x_b^*\right\| ^2 \le \left\langle a-b,x_b^*-x_a^*\right\rangle \le \left\| a-b\right\| \left\| x_b^*-x_a^*\right\| . \)
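    The resulting \((1/\mu )\)-Lipschitz dependence of \(x_c^* := \mathop {\text {argmin}}_x\, [\left\langle c,x\right\rangle +h(x)]\) on c can be illustrated numerically. The sketch below uses the hypothetical choice \(h(x) = \frac{\mu }{2}\left\| x\right\| ^2\) plus the indicator of the box \([0,1]^n\) (not a choice made in the paper), for which the minimizer has the closed form of a coordinatewise clipping:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    mu, n = 0.7, 8

    def argmin_linear_plus_h(c):
        # Minimizer of <c, x> + (mu/2)||x||^2 over the box [0, 1]^n:
        # the problem is separable, so clip the unconstrained minimizer
        # -c_i / mu to [0, 1] in each coordinate.
        return np.clip(-c / mu, 0.0, 1.0)

    ok = True
    for _ in range(1000):
        a, b = rng.standard_normal(n), rng.standard_normal(n)
        xa, xb = argmin_linear_plus_h(a), argmin_linear_plus_h(b)
        # Footnote bound: mu ||xa - xb||^2 <= ||a - b|| ||xa - xb||,
        # i.e., ||xa - xb|| <= ||a - b|| / mu.
        ok &= np.linalg.norm(xa - xb) <= np.linalg.norm(a - b) / mu + 1e-12

    print(ok)
    ```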

  3.

    In fact, since the derivative \(\log (1+x)\) of the function \(h(x):=(1+x)\log (1+x)-x\) is increasing and vanishes at \(x=0\), we have \(\min _{x>-1}h(x)=h(0)=0\).
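    This observation admits a quick numerical sanity check (a standalone illustration, not part of the paper):

    ```python
    import math

    def h(x):
        # h(x) = (1 + x) log(1 + x) - x, defined for x > -1.
        return (1.0 + x) * math.log1p(x) - x

    # h'(x) = log(1 + x) is increasing and vanishes at x = 0,
    # so h attains its minimum h(0) = 0 there and h >= 0 on (-1, inf).
    xs = [i / 100.0 for i in range(-99, 1000)]
    ok = all(h(x) >= 0.0 for x in xs) and h(0.0) == 0.0

    print(ok)
    ```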

  4.

    This assumption on \(L_f/L_{\min }\) means that we know a good estimate \(L_{\min }\) close to \(L_f\) in advance.

  5.

    Although (29) is asserted in the case \({x^{(t)}_+} \not \in X^*\), it trivially holds if \({x^{(t)}_+} \in X^*\) unless \(\rho = 1\).


Acknowledgements

The authors are thankful to an anonymous referee for valuable comments that improved this paper. This work was partially supported by the Grant-in-Aid for Young Scientists (B) (17K12645) and the Grant-in-Aid for Scientific Research (C) (18K11178) from Japan Society for the Promotion of Science.

Author information

Corresponding author

Correspondence to Masaru Ito.


Communicated by Lionel Thibault.


Cite this article

Ito, M., Fukuda, M. Nearly Optimal First-Order Methods for Convex Optimization under Gradient Norm Measure: an Adaptive Regularization Approach. J Optim Theory Appl 188, 770–804 (2021). https://doi.org/10.1007/s10957-020-01806-7


Keywords

  • Smooth/composite convex optimization
  • Accelerated proximal gradient methods
  • Hölderian error bound
  • Adaptive methods

Mathematics Subject Classification

  • 90C25
  • 68Q25
  • 49M37