# Introduction

Zhouchen Lin · Huan Li · Cong Fang

## Abstract

This chapter gives several examples of optimization problems in machine learning and briefly overviews the representative works on accelerated first-order algorithms. It also gives a brief introduction to the content of the monograph.

## Keywords

First-order accelerated algorithms · Machine learning · Classification/regression · Low-rank learning

Optimization is a supporting technology in many numerical-computation-related research fields, such as machine learning, signal processing, industrial design, and operations research. In particular, P. Domingos, an AAAI Fellow and a Professor at the University of Washington, proposed a celebrated formula:
$$\text{machine learning} = \text{representation} + \text{optimization} + \text{evaluation},$$
showing the importance of optimization in machine learning.

## 1.1 Examples of Optimization Problems in Machine Learning

Optimization problems arise throughout machine learning. We provide two representative examples here. The first one is classification/regression and the second one is low-rank learning.

Many classification/regression problems can be formulated as
$$\min_{\mathbf{w}\in\mathbb{R}^n}\ \frac{1}{m}\sum_{i=1}^m l(p(\mathbf{x}_i;\mathbf{w}),y_i) + \lambda R(\mathbf{w}), \tag{1.1}$$
where $\mathbf{w}$ consists of the parameters of a classification/regression system, $p(\mathbf{x};\mathbf{w})$ is the prediction function of the learning model, $l$ is the loss function that penalizes the discrepancy between the system's prediction and the ground truth, $(\mathbf{x}_i, y_i)$ is the $i$-th data sample with $\mathbf{x}_i$ being the datum/feature vector and $y_i$ the label for classification or the target value for regression, $R$ is a regularizer that enforces some special property on $\mathbf{w}$, and $\lambda \geq 0$ is a trade-off parameter. Typical examples of $l(p,y)$ include the squared loss $l(p,y)=\frac{1}{2}(p-y)^2$, the logistic loss $l(p,y)=\log(1+\exp(-py))$, and the hinge loss $l(p,y)=\max\{0,1-py\}$. Examples of $p(\mathbf{x};\mathbf{w})$ include $p(\mathbf{x};\mathbf{w}) = \mathbf{w}^T\mathbf{x} - b$ for linear classification/regression and $p(\mathbf{x};\mathbf{W}) = \phi(\mathbf{W}_n\phi(\mathbf{W}_{n-1}\cdots\phi(\mathbf{W}_1\mathbf{x})))$ for the forward propagation widely used in deep neural networks, where $\mathbf{W}$ is the collection of the weight matrices $\mathbf{W}_k$, $k = 1,\cdots,n$, and $\phi$ is an activation function. Representative examples of $R(\mathbf{w})$ include the $\ell_2$ regularizer $R(\mathbf{w})=\frac{1}{2}\|\mathbf{w}\|^2$ and the $\ell_1$ regularizer $R(\mathbf{w})=\|\mathbf{w}\|_1$.
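As a concrete instance of (1.1), the snippet below evaluates the regularized logistic regression objective with a linear prediction function; it is a minimal sketch of ours, and all function and variable names are illustrative, not from the text.

```python
import numpy as np

def logistic_loss(p, y):
    # l(p, y) = log(1 + exp(-p*y)); np.logaddexp avoids overflow for large -p*y
    return np.logaddexp(0.0, -p * y)

def objective(w, X, y, lam):
    # (1/m) * sum_i l(w^T x_i, y_i) + (lam/2) * ||w||^2, as in Eq. (1.1)
    preds = X @ w
    return logistic_loss(preds, y).mean() + 0.5 * lam * np.dot(w, w)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))        # m = 100 samples, n = 5 features
y = np.sign(rng.standard_normal(100))    # labels in {-1, +1}
w = np.zeros(5)
print(objective(w, X, y, lam=0.1))       # log(2) at w = 0
```

At $\mathbf{w}=\mathbf{0}$ every prediction is zero, so the average logistic loss equals $\log 2$ regardless of the data.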

The combinations of different loss functions, prediction functions, and regularizers lead to different machine learning models. For example, the hinge loss, linear classification function, and $\ell_2$ regularizer give the support vector machine (SVM) problem; the logistic loss, linear regression function, and $\ell_2$ regularizer give the regularized logistic regression problem; the squared loss, forward propagation function, and $R(\mathbf{W}) = 0$ give the multi-layer perceptron; and the squared loss, linear regression function, and $\ell_1$ regularizer give the LASSO problem.

There are also many problems investigated by the machine learning community that are not of the form of (1.1). For example, the matrix completion problem, which has wide applications in signal and data processing, can be written as
$$\begin{array}{cl} \min\limits_{\mathbf{X}\in\mathbb{R}^{m\times n}} & \|\mathbf{X}\|_*,\\ \mathrm{s.t.} & \mathbf{X}_{ij}=\mathbf{D}_{ij},\ \forall (i,j)\in\Omega, \end{array}$$
where $\Omega$ is the set of locations of the observed entries. The low-rank representation (LRR) problem, which is powerful in clustering data into subspaces, is cast as
$$\begin{array}{cl} \min\limits_{\mathbf{Z}\in\mathbb{R}^{n\times n},\,\mathbf{E}\in\mathbb{R}^{m\times n}} & \|\mathbf{Z}\|_*+\lambda\|\mathbf{E}\|_1,\\ \mathrm{s.t.} & \mathbf{D}=\mathbf{D}\mathbf{Z}+\mathbf{E}. \end{array}$$
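Proximal algorithms for such nuclear-norm models repeatedly apply the proximal mapping of $\|\cdot\|_*$, known as singular value thresholding. Below is a minimal sketch of ours (function names are illustrative): the singular values are soft-thresholded, which shrinks the rank.

```python
import numpy as np

def svt(X, tau):
    # proximal mapping of tau * ||X||_*: soft-threshold the singular values
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return U @ np.diag(s_shrunk) @ Vt

A = np.diag([3.0, 1.0, 0.2])
B = svt(A, 0.5)
# singular values 3, 1, 0.2 become 2.5, 0.5, 0, so the rank drops to 2
print(np.linalg.matrix_rank(B))
```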
To reduce the computational cost as well as the storage space, one can exploit the fact that a low-rank matrix can be factorized as a product of two much smaller matrices, i.e., $\mathbf{X} = \mathbf{U}\mathbf{V}^T$. Taking the matrix completion problem as an example, it can be reformulated as the following nonconvex problem:
$$\min_{\mathbf{U}\in\mathbb{R}^{m\times r},\mathbf{V}\in\mathbb{R}^{n\times r}}\ \frac{1}{2}\sum_{(i,j)\in\Omega}\left(\mathbf{U}_i\mathbf{V}_j^T-\mathbf{D}_{ij}\right)^2+\frac{\lambda}{2}\left(\|\mathbf{U}\|_F^2+\|\mathbf{V}\|_F^2\right).$$
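The factorized objective above can be evaluated directly: only the observed entries in $\Omega$ contribute to the data term. A minimal sketch of ours (names and the random instance are illustrative):

```python
import numpy as np

def loss(U, V, D, Omega, lam):
    # 0.5 * sum over observed (i, j) of (U_i V_j^T - D_ij)^2
    # + (lam/2) * (||U||_F^2 + ||V||_F^2)
    data = sum((U[i] @ V[j] - D[i, j]) ** 2 for (i, j) in Omega)
    return 0.5 * data + 0.5 * lam * (np.sum(U**2) + np.sum(V**2))

rng = np.random.default_rng(1)
m, n, r = 20, 15, 3
U_true = rng.standard_normal((m, r))
V_true = rng.standard_normal((n, r))
D = U_true @ V_true.T                    # ground-truth rank-r matrix
Omega = [(i, j) for i in range(m) for j in range(n) if rng.random() < 0.3]
print(loss(U_true, V_true, D, Omega, lam=0.0))   # 0 at the true factors
```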

For more examples of optimization problems in machine learning, one may refer to the survey by Gambella, Ghaddar, and Naoum-Sawaya in 2019.

## 1.2 First-Order Algorithms

In most machine learning models, a moderate numerical precision of the parameters already suffices. Moreover, each iteration needs to finish in a reasonable amount of time. Thus first-order optimization methods are the mainstream algorithms in the machine learning community. While "first-order" has a rigorous definition in the complexity theory of optimization, based on an oracle that only returns $f(\mathbf{x}_k)$ and $\nabla f(\mathbf{x}_k)$ when queried with $\mathbf{x}_k$, here we adopt a much more general sense: higher-order derivatives of the objective function are not used (thus allowing the closed-form solution of a subproblem, the use of the proximal mapping (Definition ), etc.). However, we do not aim to cover all first-order algorithms that are commonly used or actively investigated in the machine learning community, which is clearly beyond our capability due to the huge amount of literature. Some excellent reference books, preprints, and surveys include [7, 12, 13, 14, 34, 35, 37, 58, 60, 66]. Rather, we focus on accelerated first-order methods only, where "accelerated" means that the convergence rate is improved without much stronger assumptions and the techniques used are essentially exquisite interpolation and extrapolation.
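As an illustration of the proximal mapping just mentioned, $\mathrm{prox}_{\tau g}(\mathbf{x}) = \mathrm{argmin}_{\mathbf{u}}\, g(\mathbf{u}) + \frac{1}{2\tau}\|\mathbf{u}-\mathbf{x}\|^2$ has a closed form for $g=\|\cdot\|_1$: componentwise soft thresholding. A minimal sketch of ours (names are illustrative):

```python
import numpy as np

def prox_l1(x, tau):
    # closed-form proximal mapping of tau * ||x||_1: soft thresholding,
    # which shrinks each component toward zero by tau
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

x = np.array([2.0, -0.3, 0.7])
print(prox_l1(x, 0.5))   # [1.5, 0.0, 0.2]: the small component is zeroed out
```

This closed form is why first-order methods handle the nonsmooth $\ell_1$ regularizer so cheaply: no higher-order information about the objective is needed.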

## 1.3 Sketch of Representative Works on Accelerated Algorithms

In the above sense of acceleration, the first accelerated optimization algorithm may be Polyak's heavy-ball method. Consider a problem with an $L$-smooth (Definition ) and $\mu$-strongly convex (Definition ) objective, and let $\varepsilon$ be the error to the optimal solution. The heavy-ball method reduces the complexity $O\left(\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$ of the usual gradient descent to $O\left(\sqrt{\frac{L}{\mu}}\log\frac{1}{\varepsilon}\right)$. In 1983, Nesterov proposed his accelerated gradient descent (AGD) for $L$-smooth objective functions, reducing the complexity to $O\left(\frac{1}{\sqrt{\varepsilon}}\right)$ as compared with the $O\left(\frac{1}{\varepsilon}\right)$ of the usual gradient descent. Nesterov further proposed another accelerated algorithm for $L$-smooth objective functions in 1988, smoothing techniques for nonsmooth functions with acceleration tricks in 2005, and an accelerated algorithm for composite functions in 2007 (formally published later). Nesterov's seminal work did not catch much attention in the machine learning community at first, possibly because the objective functions in machine learning models are often nonsmooth, e.g., due to the adoption of sparse and low-rank regularizers, which are not differentiable. The accelerated proximal gradient (APG) method for composite functions by Beck and Teboulle, formally published in 2009, which extends an earlier method of Nesterov and is simpler than his algorithm for composite functions,1 gained great interest in the machine learning community, as it fits well the sparse and low-rank models that were hot topics at that time. Tseng further provided a unified analysis of existing acceleration techniques, and Bubeck et al. proposed a near-optimal method for highly smooth convex optimization.
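The speedup from $O\left(\frac{L}{\mu}\log\frac{1}{\varepsilon}\right)$ to $O\left(\sqrt{\frac{L}{\mu}}\log\frac{1}{\varepsilon}\right)$ can be seen on a toy problem. The sketch below compares gradient descent with an AGD-style momentum scheme on an ill-conditioned quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}$ (minimizer $\mathbf{x}^* = \mathbf{0}$); the step and momentum sizes follow the standard strongly convex theory, and the instance itself is ours, not from the text.

```python
import numpy as np

A = np.diag([100.0, 1.0])            # L = 100, mu = 1, condition number 100
grad = lambda x: A @ x
L, mu = 100.0, 1.0
x0 = np.array([1.0, 1.0])

# gradient descent with step size 1/L
x_gd = x0.copy()
for _ in range(50):
    x_gd = x_gd - grad(x_gd) / L

# AGD with momentum (sqrt(L) - sqrt(mu)) / (sqrt(L) + sqrt(mu))
beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
x_agd, y = x0.copy(), x0.copy()
for _ in range(50):
    x_new = y - grad(y) / L          # gradient step at the extrapolated point
    y = x_new + beta * (x_new - x_agd)
    x_agd = x_new

# after the same 50 iterations, AGD is far closer to the minimizer
print(np.linalg.norm(x_gd), np.linalg.norm(x_agd))
```

The only difference between the two loops is the extrapolation step; no extra gradient evaluations are used.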

Nesterov's AGD is not quite intuitive, and there have been some efforts to interpret his algorithm. Su et al. gave an interpretation from the viewpoint of differential equations, and Wibisono et al. further extended it to higher-order AGD. Fazlyab et al. proposed a Linear Matrix Inequality (LMI) approach using Integral Quadratic Constraints (IQCs) from robust control theory to interpret AGD. Allen-Zhu and Orecchia connected AGD to mirror descent via the linear coupling technique. On the other hand, some researchers have worked on designing other interpretable accelerated algorithms. Kim and Fessler designed an optimized first-order algorithm, whose complexity is only one half of that of Nesterov's accelerated gradient method, via the Performance Estimation Problem approach. Bubeck et al. proposed a geometric descent method inspired by the ellipsoid method, and Drusvyatskiy et al. showed that the same iterate sequence is generated by computing an optimal average of quadratic lower models of the function.

For linearly constrained convex problems, unlike the unconstrained case, both the error in the objective function value and the error in constraint satisfaction should be taken care of. Ideally, both errors should decrease at the same rate. A straightforward way to extend Nesterov's acceleration technique to constrained optimization is to solve the dual problem (Definition ) directly by AGD, which leads to the accelerated dual ascent and the accelerated augmented Lagrange multiplier method, both with the optimal convergence rate in the dual space. Lu and Li further analyzed the complexity in the primal space for the accelerated dual ascent and its variant. One disadvantage of dual-based methods is the need to solve a subproblem at each iteration. Linearization is an effective approach to overcome this shortcoming. Specifically, Li et al. proposed an accelerated linearized penalty method that increases the penalty along with the update of the variables, and Xu proposed an accelerated linearized augmented Lagrangian method. ADMM and the primal-dual method, as the most commonly used methods for constrained optimization, have also been accelerated for generally convex (Definition ) and smooth objectives, respectively. When strong convexity is assumed, ADMM and the primal-dual method can have faster convergence rates even if no acceleration techniques are used [19, 72].
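To make the dual approach concrete, here is a minimal, non-accelerated dual ascent sketch of ours for $\min \frac{1}{2}\|\mathbf{x}\|^2$ s.t. $\mathbf{A}\mathbf{x}=\mathbf{b}$: each iteration minimizes the Lagrangian over $\mathbf{x}$ in closed form, then takes a gradient ascent step on the dual variable using the constraint residual. Replacing the plain dual gradient step with AGD is the idea behind the accelerated dual ascent mentioned above. The instance and step size are illustrative.

```python
import numpy as np

A = np.array([[1.0, 2.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])

lam = np.zeros(2)
for _ in range(2000):
    # argmin_x  0.5*||x||^2 + lam^T (A x - b)  =>  x = -A^T lam
    x = -A.T @ lam
    # dual gradient step: the gradient of the dual function is A x - b
    lam += 0.1 * (A @ x - b)

print(A @ x - b)   # residual tends to 0, so x becomes feasible
```

Note the per-iteration subproblem here is trivial only because the objective is a simple quadratic; in general it is this subproblem that linearization avoids.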

Nesterov's AGD has also been extended to nonconvex problems. The first analysis of AGD for nonconvex optimization addressed a composite objective with a smooth (Definition ) nonconvex part and a nonsmooth convex (Definition ) part. Inspired by it, Li and Lin proposed AGD variants for minimizing the composition of a smooth nonconvex part and a nonsmooth nonconvex part. Both works [31, 43] studied the convergence to a first-order critical point (Definition ). Carmon et al. further gave an $O\left(\frac{1}{\varepsilon^{7/4}}\log\frac{1}{\varepsilon}\right)$ complexity analysis. For many famous machine learning problems, e.g., matrix sensing and matrix completion, there is no spurious local minimum [11, 30] and the only task is to escape strict saddle points (Definition ). The first accelerated method to find a second-order critical point alternates between two subroutines, negative curvature descent and Almost Convex AGD, and can be seen as a combination of accelerated gradient descent and the Lanczos method. Jin et al. further proposed a single-loop accelerated method. Agarwal et al. proposed a careful implementation of the Nesterov–Polyak method, using accelerated methods for fast approximate matrix inversion. The complexities established in [1, 18, 38] are all $O\left(\frac{1}{\varepsilon^{7/4}}\log\frac{1}{\varepsilon}\right)$.

As for stochastic algorithms, compared with deterministic ones, the main challenge is that the noise in the gradient does not vanish during the updates, which makes the famous stochastic gradient descent (SGD) converge only at a sublinear rate, even for strongly convex and smooth problems. Variance reduction (VR) is an efficient technique to reduce the negative effect of noise [22, 39, 52, 63]. Combining VR with the momentum technique, Allen-Zhu proposed the first truly accelerated stochastic algorithm, named Katyusha, which works in the primal space. Another way to accelerate stochastic algorithms is to solve the problem in the dual space, so that techniques such as stochastic coordinate descent (SCD) [27, 48, 56] and the stochastic primal-dual method [41, 74] can be used. On the other hand, in 2015 Lin et al. proposed a generic framework called Catalyst, which minimizes a convex objective function via an accelerated proximal point method and gains acceleration; the idea had appeared previously. Stochastic nonconvex optimization is also an important topic, and some excellent works include [3, 4, 5, 29, 62, 69, 73]. In particular, Fang et al. proposed the Stochastic Path-Integrated Differential Estimator (SPIDER) technique and attained a near-optimal convergence rate under certain conditions.
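The variance-reduction idea can be sketched in a few lines, in the style of the VR methods cited above [39]: the stochastic gradient at the current point is corrected by the gradient of the same sample at a snapshot, plus the full gradient at the snapshot, so the estimator's variance vanishes as the iterate approaches the solution. The least-squares instance and all names below are illustrative, not from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 200, 5
Amat = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)
b = Amat @ x_star                        # consistent system: minimum is x_star

def grad_i(x, i):
    # gradient of the i-th summand 0.5*(a_i^T x - b_i)^2
    return (Amat[i] @ x - b[i]) * Amat[i]

x = np.zeros(n)
eta = 0.005
for epoch in range(30):
    snapshot = x.copy()
    full_grad = (Amat.T @ (Amat @ snapshot - b)) / m
    for _ in range(m):
        i = rng.integers(m)
        # variance-reduced gradient estimate: unbiased, with vanishing variance
        v = grad_i(x, i) - grad_i(snapshot, i) + full_grad
        x = x - eta * v

print(np.linalg.norm(x - x_star))        # distance to the solution shrinks toward 0
```

Each inner step costs two single-sample gradients, while the full gradient is recomputed only once per epoch.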

The acceleration techniques are also applicable to parallel optimization. Parallel algorithms can be implemented in two fashions: asynchronous updates and synchronous updates. With asynchronous updates, no machine needs to wait for the others to finish computing. Representative works include asynchronous accelerated gradient descent (AAGD) and asynchronous accelerated coordinate descent (AACD). Depending on the topology, synchronous algorithms include centralized and decentralized distributed methods. Typical works for the former include the distributed ADMM, distributed dual coordinate ascent, and their extensions. One bottleneck of the centralized topology lies in the high communication cost at the central node. Although decentralized algorithms have been widely studied by the control community, the lower bound was not established until 2017, together with a distributed dual ascent achieving a matching upper bound. Motivated by the lower bound, Li et al. further analyzed distributed accelerated gradient descent, achieving both optimal communication and computation complexities up to a log factor.

In the previous section, we briefly introduced representative works on accelerated first-order algorithms. However, due to limited time we do not give the details of all of them in the subsequent chapters. Rather, we only introduce the results and proofs of some of them, based on our personal taste and familiarity. The algorithms are organized by their nature: deterministic algorithms for unconstrained convex problems (Chap. ), constrained convex problems (Chap. ), and (unconstrained) nonconvex problems (Chap. ), as well as stochastic algorithms for centralized optimization (Chap. ) and distributed optimization (Chap. ). To make the book self-contained, for each introduced algorithm we give the details of its proof. This book serves as a reference to part of the recent advances in optimization. It is appropriate for graduate students and researchers who are interested in machine learning and optimization. Nonetheless, the proofs for achieving critical points (Sect. ), escaping saddle points (Sect. ), and the decentralized topology (Sect. ) are highly non-trivial, so uninterested readers may skip them.

## Footnotes

1. In each iteration, the former uses only information from the last two iterations and makes one call to the proximal mapping, while the latter uses the entire history of previous iterations and makes two calls to the proximal mapping.

## References

1. N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, T. Ma, Finding approximate local minima for nonconvex optimization in linear time, in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, Montreal (2017), pp. 1195–1200
2. Z. Allen-Zhu, Katyusha: the first truly accelerated stochastic gradient method, in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, Montreal (2017), pp. 1200–1206
3. Z. Allen-Zhu, Natasha2: faster non-convex optimization than SGD, in Advances in Neural Information Processing Systems, Montreal, vol. 31 (2018), pp. 2675–2686
4. Z. Allen-Zhu, E. Hazan, Variance reduction for faster non-convex optimization, in Proceedings of the 33rd International Conference on Machine Learning, New York (2016), pp. 699–707
5. Z. Allen-Zhu, Y. Li, Neon2: finding local minima via first-order oracles, in Advances in Neural Information Processing Systems, Montreal, vol. 31 (2018), pp. 3716–3726
6. Z. Allen-Zhu, L. Orecchia, Linear coupling: an ultimate unification of gradient and mirror descent, in Proceedings of the 8th Innovations in Theoretical Computer Science, Berkeley (2017)
7. A. Beck, First-Order Methods in Optimization, vol. 25 (SIAM, Philadelphia, 2017)
8. A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)
9. A. Beck, M. Teboulle, A fast dual proximal gradient algorithm for convex minimization and applications. Oper. Res. Lett. 42(1), 1–6 (2014)
10. J. Berkson, Application of the logistic function to bio-assay. J. Am. Stat. Assoc. 39(227), 357–365 (1944)
11. S. Bhojanapalli, B. Neyshabur, N. Srebro, Global optimality of local search for low rank matrix recovery, in Advances in Neural Information Processing Systems, Barcelona, vol. 29 (2016), pp. 3873–3881
12. L. Bottou, F.E. Curtis, J. Nocedal, Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)
13. S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
14. S. Bubeck, Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015)
15. S. Bubeck, Y.T. Lee, M. Singh, A geometric alternative to Nesterov's accelerated gradient descent (2015). Preprint. arXiv:1506.08187
16. S. Bubeck, Q. Jiang, Y.T. Lee, Y. Li, A. Sidford, Near-optimal method for highly smooth convex optimization, in Proceedings of the 32nd Conference on Learning Theory, Phoenix (2019), pp. 492–507
17. Y. Carmon, J.C. Duchi, O. Hinder, A. Sidford, Convex until proven guilty: dimension-free acceleration of gradient descent on non-convex functions, in Proceedings of the 34th International Conference on Machine Learning, Sydney (2017), pp. 654–663
18. Y. Carmon, J.C. Duchi, O. Hinder, A. Sidford, Accelerated methods for nonconvex optimization. SIAM J. Optim. 28(2), 1751–1772 (2018)
19. A. Chambolle, T. Pock, A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imag. Vis. 40(1), 120–145 (2011)
20. Y. Chen, G. Lan, Y. Ouyang, Optimal primal-dual methods for a class of saddle point problems. SIAM J. Optim. 24(4), 1779–1814 (2014)
21. C. Cortes, V. Vapnik, Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
22. A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives, in Advances in Neural Information Processing Systems, Montreal, vol. 27 (2014), pp. 1646–1654
23. P.M. Domingos, A few useful things to know about machine learning. Commun. ACM 55(10), 78–87 (2012)
24. D. Drusvyatskiy, M. Fazel, S. Roy, An optimal first order method based on optimal quadratic averaging. SIAM J. Optim. 28(1), 251–271 (2018)
25. C. Fang, Y. Huang, Z. Lin, Accelerating asynchronous algorithms for convex optimization by momentum compensation (2018). Preprint. arXiv:1802.09747
26. C. Fang, C.J. Li, Z. Lin, T. Zhang, SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator, in Advances in Neural Information Processing Systems, Montreal, vol. 31 (2018), pp. 689–699
27. O. Fercoq, P. Richtárik, Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)
28. C. Gambella, B. Ghaddar, J. Naoum-Sawaya, Optimization models for machine learning: a survey (2019). Preprint. arXiv:1901.05331
29. R. Ge, F. Huang, C. Jin, Y. Yuan, Escaping from saddle points – online stochastic gradient for tensor decomposition, in Proceedings of the 28th Conference on Learning Theory, Paris (2015), pp. 797–842
30. R. Ge, J.D. Lee, T. Ma, Matrix completion has no spurious local minimum, in Advances in Neural Information Processing Systems, Barcelona, vol. 29 (2016), pp. 2973–2981
31. S. Ghadimi, G. Lan, Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. 156(1–2), 59–99 (2016)
32. R. Hannah, F. Feng, W. Yin, A2BCD: an asynchronous accelerated block coordinate descent algorithm with optimal complexity, in Proceedings of the 7th International Conference on Learning Representations, New Orleans (2019)
33. S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd edn. (Pearson Prentice Hall, Upper Saddle River, 1999)
34. E. Hazan, Introduction to online convex optimization. Found. Trends Optim. 2(3–4), 157–325 (2016)
35. E. Hazan, Optimization for machine learning. Technical report, Princeton University (2019)
36. B. He, X. Yuan, On the acceleration of augmented Lagrangian method for linearly constrained optimization (2010). Preprint. http://www.optimization-online.org/DB_FILE/2010/10/2760.pdf
37. P. Jain, P. Kar, Non-convex optimization for machine learning. Found. Trends Mach. Learn. 10(3–4), 142–336 (2017)
38. C. Jin, P. Netrapalli, M.I. Jordan, Accelerated gradient descent escapes saddle points faster than gradient descent, in Proceedings of the 31st Conference on Learning Theory, Stockholm (2018), pp. 1042–1085
39. R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, in Advances in Neural Information Processing Systems, Lake Tahoe, vol. 26 (2013), pp. 315–323
40. D. Kim, J.A. Fessler, Optimized first-order methods for smooth convex minimization. Math. Program. 159(1–2), 81–107 (2016)
41. G. Lan, Y. Zhou, An optimal randomized incremental gradient method. Math. Program. 171(1–2), 167–215 (2018)
42. L. Lessard, B. Recht, A. Packard, Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)
43. H. Li, Z. Lin, Accelerated proximal gradient methods for nonconvex programming, in Advances in Neural Information Processing Systems, Montreal, vol. 28 (2015), pp. 379–387
44. H. Li, Z. Lin, On the complexity analysis of the primal solutions for the accelerated randomized dual coordinate ascent. J. Mach. Learn. Res. (2020). http://jmlr.org/papers/v21/18-425.html
45. H. Li, C. Fang, Z. Lin, Convergence rates analysis of the quadratic penalty method and its applications to decentralized distributed optimization (2017). Preprint. arXiv:1711.10802
46. H. Li, C. Fang, W. Yin, Z. Lin, A sharp convergence rate analysis for distributed accelerated gradient methods (2018). Preprint. arXiv:1810.01053
47. X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, J. Liu, Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent, in Advances in Neural Information Processing Systems, Long Beach, vol. 30 (2017), pp. 5330–5340
48. Q. Lin, Z. Lu, L. Xiao, An accelerated proximal coordinate gradient method, in Advances in Neural Information Processing Systems, Montreal, vol. 27 (2014), pp. 3059–3067
49. H. Lin, J. Mairal, Z. Harchaoui, A universal catalyst for first-order optimization, in Advances in Neural Information Processing Systems, Montreal, vol. 28 (2015), pp. 3384–3392
50. G. Liu, Z. Lin, Y. Yu, Robust subspace segmentation by low-rank representation, in Proceedings of the 27th International Conference on Machine Learning, Haifa, vol. 1 (2010), pp. 663–670
51. J. Lu, M. Johansson, Convergence analysis of approximate primal solutions in dual first-order methods. SIAM J. Optim. 26(4), 2430–2467 (2016)
52. J. Mairal, Optimization with first-order surrogate functions, in Proceedings of the 30th International Conference on Machine Learning, Atlanta (2013), pp. 783–791
53. Y. Nesterov, On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonomika i Matematicheskie Metody 24(3), 509–517 (1988)
54. Y. Nesterov, Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
55. Y. Nesterov, Gradient methods for minimizing composite objective function. Technical Report Discussion Paper #2007/76, CORE (2007)
56. Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
57. Y. Nesterov, Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
58. Y. Nesterov, Lectures on Convex Optimization (Springer, New York, 2018)
59. Y. Ouyang, Y. Chen, G. Lan, E. Pasiliao Jr., An accelerated linearized alternating direction method of multipliers. SIAM J. Imag. Sci. 8(1), 644–681 (2015)
60. N. Parikh, S. Boyd, Proximal algorithms. Found. Trends Optim. 1(3), 127–239 (2014)
61. B.T. Polyak, Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)
62. S.J. Reddi, A. Hefny, S. Sra, B. Poczos, A. Smola, Stochastic variance reduction for nonconvex optimization, in Proceedings of the 33rd International Conference on Machine Learning, New York (2016), pp. 314–323
63. M. Schmidt, N. Le Roux, F. Bach, Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)
64. K. Seaman, F. Bach, S. Bubeck, Y.T. Lee, L. Massoulié, Optimal algorithms for smooth and strongly convex distributed optimization in networks, in Proceedings of the 34th International Conference on Machine Learning, Sydney (2017), pp. 3027–3036
65. S. Shalev-Shwartz, T. Zhang, Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization, in Proceedings of the 31st International Conference on Machine Learning, Beijing (2014), pp. 64–72
66. S. Sra, S. Nowozin, S.J. Wright (eds.), Optimization for Machine Learning (MIT Press, Cambridge, MA, 2012)
67. W. Su, S. Boyd, E. Candès, A differential equation for modeling Nesterov's accelerated gradient method: theory and insights, in Advances in Neural Information Processing Systems, Montreal, vol. 27 (2014), pp. 2510–2518
68. R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 58(1), 267–288 (1996)
69. N. Tripuraneni, M. Stern, C. Jin, J. Regier, M.I. Jordan, Stochastic cubic regularization for fast nonconvex optimization, in Advances in Neural Information Processing Systems, Montreal, vol. 31 (2018), pp. 2899–2908
70. P. Tseng, On accelerated proximal gradient methods for convex-concave optimization. Technical report, University of Washington, Seattle (2008)
71. A. Wibisono, A.C. Wilson, M.I. Jordan, A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. 113(47), 7351–7358 (2016)
72. Y. Xu, Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. SIAM J. Optim. 27(3), 1459–1484 (2017)
73. Y. Xu, J. Rong, T. Yang, First-order stochastic algorithms for escaping from saddle points in almost linear time, in Advances in Neural Information Processing Systems, Montreal, vol. 31 (2018), pp. 5530–5540
74. Y. Zhang, L. Xiao, Stochastic primal-dual coordinate method for regularized empirical risk minimization. J. Mach. Learn. Res. 18(1), 2939–2980 (2017)
75. S. Zheng, J. Wang, F. Xia, W. Xu, T. Zhang, A general distributed dual coordinate optimization framework for regularized loss minimization. J. Mach. Learn. Res. 18(115), 1–52 (2017)