Efficient first-order methods for convex minimization: a constructive approach

Abstract

We describe a novel constructive technique for devising efficient first-order methods for a wide range of large-scale convex minimization settings, including smooth, non-smooth, and strongly convex minimization. The technique builds upon a certain variant of the conjugate gradient method to construct a family of methods such that (a) all methods in the family share the same worst-case guarantee as the base conjugate gradient method, and (b) the family includes a fixed-step first-order method. We demonstrate the effectiveness of the approach by deriving optimal methods for the smooth and non-smooth cases, including new methods that forego knowledge of the problem parameters at the cost of a one-dimensional line search per iteration, and a universal method for the union of these classes that requires a three-dimensional search per iteration. In the strongly convex case, we show how numerical tools can be used to perform the construction, and show that the resulting method offers an improved worst-case bound compared to Nesterov’s celebrated fast gradient method.

References

  1. Arjevani, Y., Shalev-Shwartz, S., Shamir, O.: On lower and upper bounds in smooth and strongly convex optimization. J. Mach. Learn. Res. 17(126), 1–51 (2016)
  2. Beck, A.: Quadratic matrix programming. SIAM J. Optim. 17(4), 1224–1238 (2007)
  3. Beck, A., Drori, Y., Teboulle, M.: A new semidefinite programming relaxation scheme for a class of quadratic matrix problems. Oper. Res. Lett. 40(4), 298–302 (2012)
  4. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
  5. Bubeck, S., Lee, Y.T., Singh, M.: A geometric alternative to Nesterov’s accelerated gradient descent (2015). arXiv preprint arXiv:1506.08187
  6. De Klerk, E., Glineur, F., Taylor, A.B.: On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions. Optim. Lett. 11(7), 1185–1199 (2017)
  7. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems (NIPS), pp. 1646–1654 (2014)
  8. Devolder, O., Glineur, F., Nesterov, Y.: Intermediate gradient methods for smooth convex problems with inexact oracle. Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), Technical report (2013)
  9. Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146(1–2), 37–75 (2014)
  10. Diehl, M., Ferreau, H.J., Haverbeke, N.: Efficient numerical methods for nonlinear MPC and moving horizon estimation. Nonlinear Model Predict. Control 384, 391–417 (2009)
  11. Drori, Y.: Contributions to the complexity analysis of optimization algorithms. Ph.D. thesis, Tel-Aviv University (2014)
  12. Drori, Y.: The exact information-based complexity of smooth convex minimization. J. Complex. 39, 1–16 (2017)
  13. Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1–2), 451–482 (2014)
  14. Drori, Y., Teboulle, M.: An optimal variant of Kelley’s cutting-plane method. Math. Program. 160(1–2), 321–351 (2016)
  15. Drusvyatskiy, D., Fazel, M., Roy, S.: An optimal first order method based on optimal quadratic averaging. SIAM J. Optim. 28(1), 251–271 (2018)
  16. Fazlyab, M., Ribeiro, A., Morari, M., Preciado, V.M.: Analysis of optimization algorithms via integral quadratic constraints: nonstrongly convex problems. SIAM J. Optim. 28(3), 2654–2689 (2018)
  17. Grant, M., Boyd, S.: CVX: Matlab software for disciplined convex programming, version 2.0 beta. http://cvxr.com/cvx (2013)
  18. Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bureau Stand. 49(6), 409–436 (1952)
  19. Hu, B., Lessard, L.: Dissipativity theory for Nesterov’s accelerated method. In: International Conference on Machine Learning (ICML), pp. 1549–1557 (2017)
  20. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems (NIPS), pp. 315–323 (2013)
  21. Karimi, S., Vavasis, S.A.: A unified convergence bound for conjugate gradient and accelerated gradient (2016). arXiv preprint arXiv:1605.00320
  22. Kim, D., Fessler, J.A.: Optimized first-order methods for smooth convex minimization. Math. Program. 159(1–2), 81–107 (2016)
  23. Kim, D., Fessler, J.A.: On the convergence analysis of the optimized gradient method. J. Optim. Theory Appl. 172(1), 187–205 (2017)
  24. Le Roux, N., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Advances in Neural Information Processing Systems (NIPS), pp. 2663–2671 (2012)
  25. Lemaréchal, C., Sagastizábal, C.: Variable metric bundle methods: from conceptual to implementable forms. Math. Program. 76(3), 393–410 (1997)
  26. Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)
  27. Löfberg, J.: YALMIP: a toolbox for modeling and optimization in MATLAB. In: Proceedings of the CACSD Conference (2004)
  28. Mosek, A.: The MOSEK Optimization Software, vol. 54 (2010). http://www.mosek.com
  29. Narkiss, G., Zibulevsky, M.: Sequential subspace optimization method for large-scale unconstrained problems. In: Technion-IIT, Department of Electrical Engineering (2005)
  30. Nemirovski, A.: Orth-method for smooth convex optimization. Izvestia AN SSSR 2, 937–947 (1982). (in Russian)
  31. Nemirovski, A.: Information-based complexity of linear operator equations. J. Complex. 8(2), 153–175 (1992)
  32. Nemirovski, A.: Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15(1), 229–251 (2004)
  33. Nemirovski, A., Yudin, D.: Information-based complexity of mathematical programming. Izvestia AN SSSR, Ser. Tekhnicheskaya Kibernetika 1 (1983) (in Russian)
  34. Nemirovski, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, New York (1983)
  35. Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Soviet Mathematics Doklady 27, 372–376 (1983)
  36. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, London (2004)
  37. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
  38. Nesterov, Y.: Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
  39. Nesterov, Y., Shikhman, V.: Quasi-monotone subgradient methods for nonsmooth convex minimization. J. Optim. Theory Appl. 165(3), 917–940 (2015)
  40. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)
  41. Polyak, B.T.: Introduction to Optimization. Optimization Software, New York (1987)
  42. Ruszczyński, A.P.: Nonlinear Optimization, vol. 13. Princeton University Press, Princeton (2006)
  43. Ryu, E.K., Taylor, A.B., Bergeling, C., Giselsson, P.: Operator splitting performance estimation: tight contraction factors and optimal parameter selection (2018). arXiv preprint arXiv:1812.00146
  44. Schmidt, M., Le Roux, N., Bach, F.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: Advances in Neural Information Processing Systems (NIPS), pp. 1458–1466 (2011)
  45. Scieur, D., Roulet, V., Bach, F., d’Aspremont, A.: Integration methods and optimization algorithms. In: Advances in Neural Information Processing Systems (NIPS), pp. 1109–1118 (2017)
  46. Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. In: Advances in Neural Information Processing Systems (NIPS), pp. 2510–2518 (2014)
  47. Taylor, A.: Convex interpolation and performance estimation of first-order methods for convex optimization. Ph.D. thesis, Université catholique de Louvain (2017)
  48. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case performance of first-order methods for composite convex optimization. SIAM J. Optim. 27(3), 1283–1313 (2017)
  49. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Performance estimation toolbox (PESTO): automated worst-case analysis of first-order optimization methods. In: IEEE 56th Annual Conference on Decision and Control (CDC), pp. 1278–1283 (2017)
  50. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Program. 161(1–2), 307–345 (2017)
  51. Taylor, A.B., Hendrickx, J.M., Glineur, F.: Exact worst-case convergence rates of the proximal gradient method for composite convex minimization. J. Optim. Theory Appl. 178(2), 455–476 (2018)
  52. Van Scoy, B., Freeman, R.A., Lynch, K.M.: The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Syst. Lett. 2(1), 49–54 (2018)
  53. Wilson, A.C., Recht, B., Jordan, M.I.: A Lyapunov analysis of momentum methods in optimization (2016). arXiv preprint arXiv:1611.02635
  54. Wright, S.: Coordinate descent algorithms. Math. Program. 151(1), 3–34 (2015)
  55. Wright, S., Nocedal, J.: Numerical optimization. Science 35, 67–68 (1999)

Author information

Corresponding author

Correspondence to Adrien B. Taylor.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Adrien B. Taylor was supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant Agreement 724063).

Appendices

Appendix A: Proof of Lemma 1

We start the proof of Lemma 1 with the following technical lemma.

Lemma 5

Let \({\mathcal {F}}\) be a class of contraction-preserving c.c.p. functions (see Definition 3), and let \(S=\{(x_i,g_i,f_i)\}_{i\in I^*_N}\) be an \({\mathcal {F}}\)-interpolable set satisfying

$$\begin{aligned}&{\left\langle g_i, g_j\right\rangle }=0, \quad \text {for all } 0\le j<i=1,\ldots ,N,\end{aligned}$$
(23)
$$\begin{aligned}&{\left\langle g_i, x_j-x_0\right\rangle }=0,\quad \text {for all } 1\le j\le i=1,\ldots ,N, \end{aligned}$$
(24)

then there exists \(\{{\hat{x}}_i\}_{i\in I^*_N}\subset \mathbb {R}^d\) such that the set \({\hat{S}}=\{({\hat{x}}_i,g_i,f_i)\}_{i\in I^*_N}\) is \({\mathcal {F}}\)-interpolable, and

$$\begin{aligned}&{\left||{\hat{x}}_0 - {\hat{x}}_*\right||}\le {\left||x_0-x_*\right||}, \end{aligned}$$
(25)
$$\begin{aligned}&{\hat{x}}_i \in {\hat{x}}_0 + \mathrm {span}\{g_0,\ldots ,g_{i-1}\},\quad {i=0,\ldots , N}. \end{aligned}$$
(26)

Proof

By the orthogonal decomposition theorem there exist \(\{h_{i,j}\}_{0\le j<i\le N} \subset \mathbb {R}\) and \(\{v_i\}_{0\le i\le N} \subset \mathbb {R}^d\) with \({\left\langle g_k, v_i\right\rangle }=0\) for all \(0\le k<i \le N\) such that

$$\begin{aligned} x_i&=x_0-\sum _{j=0}^{i-1} h_{i,j}g_j +v_i, \quad { i=0,\ldots , N}, \end{aligned}$$

furthermore, there exist \(r_*\in \mathbb {R}^d\) satisfying \({\left\langle r_*, v_j\right\rangle }=0\) for all \(0\le j \le N\) and some \(\{\nu _{j}\}_{0\le j\le N}\subset \mathbb {R}\), such that

$$\begin{aligned} x_*=x_0 + \sum _{j=0}^N \nu _{j}v_j + r_*. \end{aligned}$$

By (23) and (24) it then follows that for all \(k\ge i\)

$$\begin{aligned} {\left\langle g_k, v_i\right\rangle } = {\left\langle g_k, x_i-x_0+\sum _{j=0}^{i-1} h_{i,j} g_j\right\rangle } = 0, \end{aligned}$$

hence, together with the definition of \(v_i\), we get

$$\begin{aligned} {\left\langle g_k, v_i\right\rangle }=0, \quad {i,k=0,\ldots ,N}. \end{aligned}$$
(27)

Let us now choose \(\{{\hat{x}}_i\}_{i\in I^*_N}\) as follows:

$$\begin{aligned}&{\hat{x}}_0:=x_0,\\&{\hat{x}}_i:=x_0-\sum _{j=0}^{i-1} h_{i,j} g_j, \quad { i =0,\ldots , N}, \\&{\hat{x}}_* := x_0+r_*. \end{aligned}$$

It follows immediately from this definition that (26) holds; it thus remains to show that \({\hat{S}}\) is \({\mathcal {F}}\)-interpolable and that (25) holds.

In order to establish that \({\hat{S}}\) is \({\mathcal {F}}\)-interpolable, from Definition 3 it is enough to show that the conditions in (4) are satisfied. This is indeed the case, as \({\left\langle g_j, {\hat{x}}_i - {\hat{x}}_0\right\rangle }={\left\langle g_j, x_i-x_0\right\rangle }\) follows directly from the definition of \(\{{\hat{x}}_i\}\) and (27), whereas \({\left||{\hat{x}}_i - {\hat{x}}_j\right||}\le {\left||x_i-x_j\right||}\) in the case \(i,j\ne *\) follows from

$$\begin{aligned} {\left||x_i-x_j\right||}^2&={\left||x_0-\sum _{k=0}^{i-1} h_{i,k} g_k+v_i-x_0+\sum _{k=0}^{j-1} h_{j,k} g_k-v_j\right||}^2\\&={\left||{\hat{x}}_i - {\hat{x}}_j\right||}^2+{\left||v_i-v_j\right||}^2\\&\ge {\left||{\hat{x}}_i - {\hat{x}}_j\right||}^2, \quad {i,j=0,\ldots , N}, \end{aligned}$$

and in the case \(j=*\), follows from

$$\begin{aligned} {\left||x_i-x_*\right||}^2&={\left||x_0-\sum _{k=0}^{i-1} h_{i,k} g_k+v_i-x_0 -\sum _{j=0}^N \nu _{j}v_j - r_*\right||}^2\\&={\left||{\hat{x}}_i - {\hat{x}}_*\right||}^2+{\left||v_i-\sum _{j=0}^N \nu _{j}v_j\right||}^2\\&\ge {\left||{\hat{x}}_i - {\hat{x}}_*\right||}^2, \quad {i=0,\ldots , N}, \end{aligned}$$

where for the second equality we used (27) together with \({\left\langle v_i, r_*\right\rangle }=0\). Taking \(i=0\), the last inequality also establishes (25), which completes the proof. \(\square \)
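The construction in this proof can be checked numerically. The following is a minimal sketch, assuming randomly generated data that satisfies (23) and (24) by construction; the helper `project` and the variable names (`x_hat`, `r_star`, and so on) are illustrative and do not come from the paper. It builds \({\hat{x}}_i\) by orthogonal projection onto \(\mathrm {span}\{g_0,\ldots ,g_{i-1}\}\) and then tests (27) and the two distance inequalities.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 12, 4

# Orthonormal basis of R^d: the first N+1 columns give the directions of
# g_0..g_N, the remaining columns span a subspace W orthogonal to every g_i.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
g = [rng.uniform(0.5, 2.0) * Q[:, i] for i in range(N + 1)]
W = Q[:, N + 1:]

# Points satisfying (24): x_i - x_0 lies in span{g_0..g_{i-1}} + W.
x0 = rng.standard_normal(d)
x = [x0]
for i in range(1, N + 1):
    in_span = sum(rng.standard_normal() * g[j] for j in range(i))
    x.append(x0 + in_span + W @ rng.standard_normal(W.shape[1]))
x_star = rng.standard_normal(d)


def project(basis, y):
    """Orthogonal projection of y onto the column space of `basis`."""
    if basis.shape[1] == 0:
        return np.zeros_like(y)
    coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
    return basis @ coef


# x_hat_i keeps the component of x_i - x_0 in span{g_0..g_{i-1}};
# v_i is the remainder.
x_hat, v = [], []
for i in range(N + 1):
    basis = np.column_stack(g[:i]) if i else np.zeros((d, 0))
    p = project(basis, x[i] - x0)
    x_hat.append(x0 + p)
    v.append(x[i] - x0 - p)

# r_* is the component of x_* - x_0 orthogonal to span{v_0..v_N}.
r_star = (x_star - x0) - project(np.column_stack(v), x_star - x0)
x_hat_star = x0 + r_star

tol = 1e-9
# (27): every v_i is orthogonal to every g_k.
max_gv = max(abs(gk @ vi) for gk in g for vi in v)
# Distances between the new points never exceed the original ones.
pairs_ok = all(
    np.linalg.norm(x_hat[i] - x_hat[j]) <= np.linalg.norm(x[i] - x[j]) + tol
    for i in range(N + 1) for j in range(N + 1)
)
star_ok = all(
    np.linalg.norm(x_hat[i] - x_hat_star) <= np.linalg.norm(x[i] - x_star) + tol
    for i in range(N + 1)
)
print(max_gv < tol, pairs_ok, star_ok)
```

The case \(i=0\) of the last check is exactly (25). Projection is used here only because it is a convenient way to realize the decomposition whose existence the proof asserts.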

Proof of Lemma 1

By the first-order necessary and sufficient optimality conditions (see e.g., [42, Theorem 3.5]), the iterates \(x_i\) and subgradients \(f'(x_i)\) defined in (5) and (6) can be equivalently characterized as a solution to the problem of finding \(x_i\in \mathbb {R}^d\) and \(f'(x_i)\in \partial f(x_i)\) (\(0\le i\le N\)) that satisfy:

$$\begin{aligned}&{\left\langle f'(x_i), f'(x_j)\right\rangle }=0, \quad \text {for all } 0\le j<i=1,\ldots ,N, \\&x_i\in x_0+\mathrm {span}\{f'(x_0),\ldots ,f'(x_{i-1})\}, \quad \text {for all } i=1,\ldots ,N, \end{aligned}$$

hence the problem (PEP) can be equivalently expressed as follows:

$$\begin{aligned} \sup _{ f, \left\{ x_i\right\} _{i \in I^*_N}, \{f'(x_i)\}_{i\in I^*_N}}&f(x_N)-f_*\nonumber \\ \text {subject to: }&f\in {\mathcal {F}}(\mathbb {R}^d),\ x_* \text { is a minimizer of } f, \nonumber \\&f'(x_i) \in \partial f(x_i), \quad \text {for all } i\in I^*_N, \nonumber \\&{\left||x_0-x_*\right||}\le R_x, \nonumber \\&{\left\langle f'(x_i), f'(x_j)\right\rangle }=0, \quad \text {for all } 0\le j<i=1,\ldots ,N, \nonumber \\&x_i\in x_0+\mathrm {span}\{f'(x_0),\ldots ,f'(x_{i-1})\}, \quad \text {for all } i=1,\ldots ,N. \end{aligned}$$
(28)

Now, since all constraints in (28) depend only on the first-order information of f at \(\{x_i\}_{i\in I^*_N}\), by taking advantage of Definition 2 we can denote \(f_i:=f(x_i)\) and \(g_i:=f'(x_i)\) and treat these as optimization variables, thereby reaching the following equivalent formulation

$$\begin{aligned} \sup _{\{(x_i,g_i,f_i)\}_{i\in I^*_N}}&\ f_N-f_* \nonumber \\ \text { subject to: }&\{(x_i,g_i,f_i)\}_{i\in I^*_N} \text { is }{\mathcal {F}}(\mathbb {R}^d)\text {-interpolable}, \nonumber \\&{\left||x_0-x_*\right||}\le R_x, \nonumber \\&g_*=0, \nonumber \\&{\left\langle g_i, g_j\right\rangle }= 0, \ \text {for all } 0\le j<i=1,\ldots N,\nonumber \\&x_i\in x_0+\mathrm {span}\{g_0,\ldots ,g_{i-1}\},\quad \text {for all } i=1,\ldots ,N. \end{aligned}$$
(29)

Since (PEP-GFOM) is a relaxation of (29), we get

$$\begin{aligned} f(x_N) - f_*\le {{\,\mathrm{val}\,}}\mathrm{(PEP)} \le {{\,\mathrm{val}\,}}\mathrm{(PEP-GFOM)}, \end{aligned}$$

which establishes the bound (13).

In order to establish the second part of the claim, let \(\varepsilon >0\). We proceed to show that there exists a valid input \((f, x_0)\) for GFOM such that \(f(\mathrm {GFOM}_N(f, x_0)) - f_*\ge {{\,\mathrm{val}\,}}\mathrm{(PEP-GFOM)}-\varepsilon \).

Indeed, by the definition of (PEP-GFOM), there exists a set \(S=\{(x_i,g_i,f_i)\}_{i\in I^*_N}\) that satisfies the constraints in (PEP-GFOM) and reaches an objective value \(f_N-f_* \ge {{\,\mathrm{val}\,}}\mathrm{(PEP-GFOM)}-\varepsilon \). Since S satisfies the requirements of Lemma 5 [as these requirements are constraints in (PEP-GFOM)], there exists a set of vectors \(\{{\hat{x}}_i\}_{i\in I^*_N}\) for which

$$\begin{aligned}&{\left||{\hat{x}}_0- {\hat{x}}_*\right||}\le R_x, \\&{\hat{x}}_i\in {\hat{x}}_0 + \mathrm {span}\{g_0,\ldots ,g_{i-1}\},\quad i=0,\dots ,N, \end{aligned}$$

hold, and in addition, \({\hat{S}}:=\{({\hat{x}}_i,g_i,f_i)\}_{i\in I^*_N}\) is \({\mathcal {F}}(\mathbb {R}^d)\)-interpolable. By definition of an \({\mathcal {F}}(\mathbb {R}^d)\)-interpolable set, it follows that there exists a function \({\hat{f}}\in {\mathcal {F}}(\mathbb {R}^d)\) such that \({\hat{f}}({\hat{x}}_i) = f_i\), \(g_i \in \partial {\hat{f}}({\hat{x}}_i)\), hence satisfying

$$\begin{aligned}&{\left\langle {\hat{f}}'({\hat{x}}_i), {\hat{f}}'({\hat{x}}_j)\right\rangle } = 0, \quad \text {for all } 0\le j<i=1,\ldots ,N, \\&{\hat{x}}_i\in {\hat{x}}_0+\mathrm {span}\{{\hat{f}}'({\hat{x}}_0),\ldots , {\hat{f}}'({\hat{x}}_{i-1})\}, \quad \text {for all } i=1,\ldots ,N. \end{aligned}$$

Furthermore, since \(g_*=0\) we have that \({\hat{x}}_*\) is an optimal solution of \({\hat{f}}\).

We conclude that the sequence \({\hat{x}}_0, \dots , {\hat{x}}_N\) forms a valid execution of GFOM on the input \(({\hat{f}}, {\hat{x}}_0)\), that the requirement \({\left||{\hat{x}}_0 - {\hat{x}}_*\right||}\le R_x\) is satisfied, and that the output of the method, \({\hat{x}}_N\), attains the absolute inaccuracy value of \({\hat{f}}({\hat{x}}_N) -{\hat{f}}({\hat{x}}_*) = f_N - f_* \ge {{\,\mathrm{val}\,}}\mathrm{(PEP-GFOM)}-\varepsilon \). \(\square \)
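The two conditions that characterize the iterates in this proof (mutually orthogonal subgradients and the span condition) can be observed on a simple smooth instance. The sketch below is an illustration under stated assumptions: it takes a hypothetical strongly convex quadratic and assumes that each iterate minimizes f over \(x_0+\mathrm {span}\{f'(x_0),\ldots ,f'(x_{i-1})\}\), which is how the optimality-condition argument above reads (5) and (6); those definitions are not reproduced in this appendix.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 8, 4

# Hypothetical test problem: f(x) = 0.5 x^T A x - b^T x, A positive definite.
B = rng.standard_normal((d, d))
A = B.T @ B + np.eye(d)
b = rng.standard_normal(d)


def grad(x):
    return A @ x - b


x0 = np.zeros(d)
xs, gs = [x0], [grad(x0)]
for i in range(1, N + 1):
    S = np.column_stack(gs)  # columns span{f'(x_0), ..., f'(x_{i-1})}
    # Minimize f(x0 + S c) over c: the normal equations are
    # (S^T A S) c = -S^T f'(x_0).
    c = np.linalg.solve(S.T @ A @ S, -S.T @ grad(x0))
    xs.append(x0 + S @ c)  # span condition holds by construction
    gs.append(grad(xs[-1]))

# Largest normalized inner product between distinct gradients.
max_cos = max(
    abs(gs[i] @ gs[j]) / (np.linalg.norm(gs[i]) * np.linalg.norm(gs[j]))
    for i in range(1, N + 1) for j in range(i)
)
print(max_cos)
```

Up to rounding error, `max_cos` vanishes, because the optimality condition of each subspace minimization makes the new gradient orthogonal to all previous ones.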

Appendix B: Proof of Theorem 3

Lemma 6

Suppose there exists a pair \((f,x_0)\) such that \(f\in {\mathcal {F}}\), \({\left||x_0-x_*\right||}\le R_x\), and \(\mathrm {GFOM}_{2N+1}(f, x_0)\) is not optimal for f. Then (sdp-PEP-GFOM) satisfies Slater’s condition. In particular, no duality gap occurs between the primal-dual pair (sdp-PEP-GFOM), (dual-PEP-GFOM), and the dual optimal value is attained.

Proof

Let \((f,x_0)\) be a pair satisfying the premise of the lemma and denote by \(\{x_i\}_{i\ge 0}\) the sequence generated according to GFOM and by \(\{f'(x_i)\}_{i\ge 0}\) the subgradients chosen at each iteration of the method, respectively. By the assumption that the optimal value is not obtained after \(2N+1\) iterations, we have \(f(x_{2N+1})>f_*\).

We show that the set \(\{({\tilde{x}}_i,{\tilde{g}}_i, {\tilde{f}}_i)\}_{i\in I^*_N}\) with

$$\begin{aligned}&{\tilde{x}}_i:=x_{2i}, \quad i=0,\ldots ,N, \\&{\tilde{x}}_*:=x_*, \\&{\tilde{g}}_i:=f'(x_{2i}), \quad i=0,\ldots ,N, \\&{\tilde{g}}_*:=0, \\&{\tilde{f}}_i:=f(x_{2i}), \quad i=0,\ldots ,N, \\&{\tilde{f}}_*:=f(x_*), \end{aligned}$$

corresponds to a Slater point for (sdp-PEP-GFOM).

In order to proceed, we consider the Gram matrix \({\tilde{G}}\) and the vector \({\tilde{F}}\) constructed from the set \(\{({\tilde{x}}_i, {\tilde{g}}_i, {\tilde{f}}_i)\}_{i\in I^*_N}\) as in Sect. 3.2. We then continue in two steps:

  (i) we show that \(({\tilde{G}}, {\tilde{F}})\) is feasible for (sdp-PEP-GFOM),

  (ii) we show that \({\tilde{G}}\succ 0\).

The proofs follow.

  (i) First, we note that the set \(\{({\tilde{x}}_i, {\tilde{g}}_i, {\tilde{f}}_i)\}_{i\in I^*_N}\) satisfies the interpolation conditions for \({\mathcal {F}}\), as it was obtained by taking the values and gradients of a function in \({\mathcal {F}}\). Furthermore, since \({\tilde{x}}_0 = x_0\) and \({\tilde{x}}_*=x_*\) we also get that the initial condition \({\left||{\tilde{x}}_0-{\tilde{x}}_*\right||}\le R_x\) is respected, and since \(\{x_i\}\) correspond to the iterates of GFOM, we also have by Lemma 5 that

    $$\begin{aligned}&{\left\langle {\tilde{g}}_i, {\tilde{g}}_j\right\rangle }= 0, \quad \text {for all } 0\le j<i=1,\ldots N, \\&{\left\langle {\tilde{g}}_i, {\tilde{x}}_j-{\tilde{x}}_0\right\rangle }= 0, \quad \text {for all } 1\le j \le i=1,\ldots N. \end{aligned}$$

    It then follows from the construction of \({\tilde{G}}\) and \({\tilde{F}}\) and by (10) that \({\tilde{G}}\) and \({\tilde{F}}\) satisfy the constraints of (sdp-PEP-GFOM).

  (ii) In order to establish that \({\tilde{G}}\succ 0\) it suffices to show that the vectors

    $$\begin{aligned} \{{\tilde{g}}_0,\ldots , {\tilde{g}}_N ; {\tilde{x}}_1- {\tilde{x}}_0,\ldots ,{\tilde{x}}_N- {\tilde{x}}_0 ; {\tilde{x}}_*- {\tilde{x}}_0 \} \end{aligned}$$

    are linearly independent. Indeed, this follows from Lemma 5, since these vectors are all non-zero, and since \({\tilde{x}}_*\) does not fall in the linear space spanned by \({\tilde{g}}_0,\ldots , {\tilde{g}}_N ; {\tilde{x}}_1- {\tilde{x}}_0,\ldots , {\tilde{x}}_N- {\tilde{x}}_0\) (as otherwise \(x_{2N+1}\) would be an optimal solution).

We conclude that \(({\tilde{G}}, {\tilde{F}})\) forms a Slater point for (sdp-PEP-GFOM).\(\square \)
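Step (ii) relies on the standard fact that a Gram matrix is positive definite exactly when the underlying vectors are linearly independent. The following small sketch illustrates that fact on hand-picked vectors; the vectors and the helper `gram` are illustrative assumptions and are unrelated to any particular GFOM run.

```python
import numpy as np


def gram(vectors):
    """Gram matrix with entries G[i, j] = <vectors[i], vectors[j]>."""
    P = np.column_stack(vectors)
    return P.T @ P


independent = [
    np.array([1.0, 0.0, 0.0, 0.0]),
    np.array([1.0, 1.0, 0.0, 0.0]),
    np.array([0.0, 1.0, 1.0, 0.0]),
]
# Appending a linear combination of the others destroys independence.
dependent = independent + [independent[0] - 2.0 * independent[2]]

min_eig_independent = np.linalg.eigvalsh(gram(independent)).min()
min_eig_dependent = np.linalg.eigvalsh(gram(dependent)).min()
print(min_eig_independent, min_eig_dependent)
```

The first Gram matrix has a strictly positive smallest eigenvalue, whereas the second is singular up to rounding, so it is positive semidefinite but not positive definite.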

Proof of Theorem 3

The bound follows directly from

$$\begin{aligned} f(\mathrm {GFOM}_{N}(f, x_0)) - f_*\le {{\,\mathrm{val}\,}}\mathrm{(PEP-GFOM)} \le {{\,\mathrm{val}\,}}\mathrm{(sdp-PEP-GFOM)}, \end{aligned}$$

established by Lemmas 1 and 2. The tightness claim follows from the tightness claims of Lemmas 1, 2 and 6. \(\square \)

Appendix C: Proof of Theorem 4

We begin the proof of Theorem 4 by recalling a well-known lemma on constraint aggregation, showing that it is possible to aggregate the constraints of a minimization problem while keeping the optimal value of the resulting program bounded from below.

Lemma 7

Consider the problem

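$$\begin{aligned} \min _{x\in \mathbb {R}^d}\ &f(x) \\ \text {subject to: }&h(x)=0, \\&g(x)\le 0, \end{aligned}$$
(P)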

where \(f:\mathbb {R}^d\rightarrow \mathbb {R}\), \(h:\mathbb {R}^d\rightarrow \mathbb {R}^n\), \(g:\mathbb {R}^d\rightarrow \mathbb {R}^m\) are some (not necessarily convex) functions, and suppose \(({\tilde{\alpha }}, {\tilde{\beta }})\in \mathbb {R}^{n}\times \mathbb {R}_+^{m}\) is a feasible point for the Lagrangian dual of (P) that attains the value \({\tilde{\omega }}\). Let \(k\in {\mathbb {N}}\), and let \(M\in \mathbb {R}^{n \times k}\) be a linear map such that \({\tilde{\alpha }} \in \mathrm {range}(M)\), then

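$$\begin{aligned} w' := \min _{x\in \mathbb {R}^d}\ &f(x) \\ \text {subject to: }&M^\top h(x)=0, \\&g(x)\le 0 \end{aligned}$$
(P\('\))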

is bounded from below by \({\tilde{\omega }}\).

Proof

Let

$$\begin{aligned} L(x, \alpha , \beta ) = f(x)+\alpha ^\top h(x) + \beta ^\top g(x) \end{aligned}$$

be the Lagrangian for the problem (P). By the assumption on \(({\tilde{\alpha }}, {\tilde{\beta }})\) we have \( \min _x L(x, {\tilde{\alpha }}, {\tilde{\beta }}) = {\tilde{\omega }}. \) Now, let \(u\in \mathbb {R}^k\) be some vector such that \(Mu = {\tilde{\alpha }}\); then for every x feasible for (P\('\))

$$\begin{aligned}&{\tilde{\alpha }}^\top h(x) = u^\top M^\top h(x) = 0, \\&{\tilde{\beta }}^\top g(x)\le 0, \end{aligned}$$

where the last inequality follows from the nonnegativity of \({\tilde{\beta }}\) and from \(g(x)\le 0\). We get

$$\begin{aligned} f(x) \ge f(x) + {\tilde{\alpha }}^\top h(x) + {\tilde{\beta }}^\top g(x) = L(x, {\tilde{\alpha }}, {\tilde{\beta }}) \ge {\tilde{\omega }}, \quad \forall x: M^\top h(x)=0, g(x)\le 0, \end{aligned}$$

and thus the desired result \(w'\ge {\tilde{\omega }}\) holds. \(\square \)
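A tiny worked instance may help to see why the range condition on M matters. The sketch below uses a hypothetical problem that is not taken from the paper: \(f(x)={\left||x\right||}^2\) on \(\mathbb {R}^2\), equality constraints \(h(x)=x-c\) with \(c=(1,1)\), and no inequality constraints. Both the dual value and the aggregated problem then have closed forms, which the helper functions `dual_value` and `aggregated_value` (illustrative names) evaluate.

```python
import numpy as np

# Hypothetical instance of (P): f(x) = ||x||^2, h(x) = x - c, no inequality g.
c = np.array([1.0, 1.0])


def dual_value(alpha):
    # min_x ||x||^2 + alpha^T (x - c) is attained at x = -alpha / 2.
    return -(alpha @ alpha) / 4.0 - alpha @ c


def aggregated_value(M):
    # min ||x||^2 s.t. M^T (x - c) = 0; the minimizer is the minimum-norm
    # solution of M^T x = M^T c, given by the pseudoinverse.
    x = np.linalg.pinv(M.T) @ (M.T @ c)
    return x @ x


alpha_tilde = np.array([-1.0, -1.0])  # a feasible (not optimal) dual point
omega_tilde = dual_value(alpha_tilde)

M_good = np.array([[1.0], [1.0]])  # alpha_tilde = M_good @ [-1.0]
M_bad = np.array([[1.0], [0.0]])   # alpha_tilde is not in range(M_bad)

w_good = aggregated_value(M_good)
w_bad = aggregated_value(M_bad)
print(omega_tilde, w_good, w_bad)
```

In this instance \({\tilde{\omega }}=1.5\). Aggregating with `M_good` gives the value 2, which respects the bound of Lemma 7, while aggregating with `M_bad` gives the value 1, which falls below \({\tilde{\omega }}\): the guarantee is lost once \({\tilde{\alpha }}\notin \mathrm {range}(M)\).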

Before proceeding with the proof of the main results, let us first formulate a performance estimation problem for the class of methods described by (14).

Lemma 8

Let \( R_x\ge 0\) and let \(\{\beta _{i,j}\}_{1\le i\le N, 0\le j\le i-1}\), \(\{\gamma _{i,j}\}_{1\le i\le N, 1\le j\le i}\) be given sets of real numbers. Then, for any pair \((f, x_0)\) such that \(f\in {\mathcal {F}}(\mathbb {R}^d)\) and \({\left||x_0-x_*\right||}\le R_x\) (where \(x_*\in {{\,\mathrm{argmin}\,}}_x f(x)\)), and for any sequence \(\{x_i\}_{1\le i\le N}\) that satisfies

$$\begin{aligned} {\left\langle f'(x_i), \sum _{j=0}^{i-1}\beta _{i,j} f'(x_j) + \sum _{j=1}^{i} \gamma _{i,j}(x_j-x_0)\right\rangle }=0, \quad i=1,\ldots ,N \end{aligned}$$
(30)

for some \(f'(x_i)\in \partial f(x_i)\), the following bound holds:

$$\begin{aligned}&f(x_N)-f_*\le \sup _{ F\in {\mathbb {R}}^{N+1}, G\in {\mathbb {R}}^{2N+2\times 2N+2}} F^\top \mathbf {f}_N - F^\top \mathbf {f}_* \\&\quad \begin{array}{lrl} \text {subject to: } &{}{{{\,\mathrm{Tr}\,}}\left( A^{\mathrm {ic}}_kG\right) }+(a^{\mathrm {ic}}_k)^\top F+b^{\mathrm {ic}}_k\le 0, &{} \quad \text {for all } k\in K_N,\\ &{}{\left\langle \mathbf {g}_i, \sum \limits _{j=0}^{i-1} \beta _{i,j}\mathbf {g}_j + \sum \limits _{j=1}^{i} \gamma _{i,j}(\mathbf {x}_j-\mathbf {x}_0)\right\rangle }_G = 0, &{}\quad \text {for all } i=1,\ldots N,\\ &{}{\left||\mathbf {x}_0-\mathbf {x}_*\right||}_G^2-R_x^2\le 0, &{}\\ &{} G\succeq 0. \end{array} \end{aligned}$$

We omit the proof since it follows exactly the same lines as for (sdp-PEP-GFOM) (cf. the derivations in [13, 50]).
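The reason a condition such as (30) becomes a linear constraint on G can be illustrated numerically. The sketch below assumes the usual Gram representation, in which G collects the inner products of the \(2N+2\) vectors \(g_0,\ldots ,g_N\), \(x_1-x_0,\ldots ,x_N-x_0\), \(x_*-x_0\); the column ordering, the helpers `g_coord` and `x_coord`, and the random coefficients are assumptions made for illustration, since Sect. 3.2 is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 10, 3

# Assumed column ordering: g_0..g_N, x_1-x_0..x_N-x_0, x_*-x_0.
P = rng.standard_normal((d, 2 * N + 2))
G = P.T @ P  # Gram matrix, positive semidefinite by construction

E = np.eye(2 * N + 2)


def g_coord(i):
    """Coordinate vector selecting g_i from the columns of P."""
    return E[i]


def x_coord(j):
    """Coordinate vector selecting x_j - x_0 (zero for j = 0)."""
    return E[N + j] if j >= 1 else np.zeros(2 * N + 2)


i = 2
beta = rng.standard_normal(i)   # hypothetical beta_{i,0}, ..., beta_{i,i-1}
gamma = rng.standard_normal(i)  # hypothetical gamma_{i,1}, ..., gamma_{i,i}

u = g_coord(i)
w = sum(beta[j] * g_coord(j) for j in range(i)) \
    + sum(gamma[j - 1] * x_coord(j) for j in range(1, i + 1))

# Symmetric matrix A with
# Tr(A G) = <g_i, sum_j beta_ij g_j + sum_j gamma_ij (x_j - x_0)>.
A = 0.5 * (np.outer(u, w) + np.outer(w, u))
via_trace = np.trace(A @ G)

# The same inner product computed directly from the vectors.
direct = (P @ u) @ (P @ w)
print(via_trace, direct)
```

The two numbers agree up to rounding, so for fixed coefficients the condition is of the form \({{\,\mathrm{Tr}\,}}(AG)=0\) and can be handled by a semidefinite solver.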

Proof of Theorem 4

The key observation underlying the proof is that by taking the PEP for GFOM (sdp-PEP-GFOM) and aggregating the constraints that define its iterates, we can reach a PEP for the class of methods (14). Furthermore, by Lemma 7, this aggregation can be done in a way that maintains the optimal value of the program, thereby reaching a specific method in this class whose corresponding PEP attains an optimal value that is at least as good as that of the PEP for GFOM.

We perform the aggregation of the constraints as follows: for all \(i=1,\dots ,N\) we aggregate the constraints which correspond to \(\{\beta _{i,j}\}_{0\le j<i}\), \(\{\gamma _{i,j}\}_{1\le j\le i}\) (weighted by \(\{{\tilde{\beta }}_{i,j}\}_{0\le j<i}\), \(\{{\tilde{\gamma }}_{i,j}\}_{1\le j\le i}\), respectively) into a single constraint, reaching

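$$\begin{aligned}&w'(N, {\mathcal {F}}({\mathbb {R}}^d),R_x) := \sup _{ F\in {\mathbb {R}}^{N+1}, G\in {\mathbb {R}}^{2N+2\times 2N+2}} F^\top \mathbf {f}_N - F^\top \mathbf {f}_* \\&\quad \begin{array}{lrl} \text {subject to: } &{}{{{\,\mathrm{Tr}\,}}\left( A^{\mathrm {ic}}_kG\right) }+(a^{\mathrm {ic}}_k)^\top F+b^{\mathrm {ic}}_k\le 0, &{} \quad \text {for all } k\in K_N,\\ &{}{\left\langle \mathbf {g}_i, \sum \limits _{j=0}^{i-1} {\tilde{\beta }}_{i,j}\mathbf {g}_j + \sum \limits _{j=1}^{i} {\tilde{\gamma }}_{i,j}(\mathbf {x}_j-\mathbf {x}_0)\right\rangle }_G = 0, &{}\quad \text {for all } i=1,\ldots ,N,\\ &{}{\left||\mathbf {x}_0-\mathbf {x}_*\right||}_G^2-R_x^2\le 0, &{}\\ &{} G\succeq 0. \end{array} \end{aligned}$$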

By Lemma 7 and the choice of weights \(\{{\tilde{\beta }}_{i,j}\}_{0\le j<i}\), \(\{{\tilde{\gamma }}_{i,j}\}_{1\le j\le i}\) it follows that

$$\begin{aligned} w'(N, {\mathcal {F}}({\mathbb {R}}^d),R_x) \le {\tilde{\omega }}. \end{aligned}$$

Finally, by Lemma 8, we conclude that \(w'(N, {\mathcal {F}}({\mathbb {R}}^d),R_x)\) forms an upper bound on the performance of the method (14), i.e., for any valid pair \((f, x_0)\) and any \(\{x_i\}_{i\ge 0}\) that satisfies (14) we have

$$\begin{aligned} f(x_N)-f_*\le w'(N, {\mathcal {F}}({\mathbb {R}}^d),R_x)\le {\tilde{\omega }}. \end{aligned}$$

\(\square \)

About this article

Cite this article

Drori, Y., Taylor, A.B. Efficient first-order methods for convex minimization: a constructive approach. Math. Program. 184, 183–220 (2020). https://doi.org/10.1007/s10107-019-01410-2

Mathematics Subject Classification

  • 90C60
  • 90C25
  • 90C22
  • 68Q25