# Introduction


## Abstract

This chapter gives several examples of optimization problems in machine learning and briefly overviews the representative works on accelerated first-order algorithms. It also gives a brief introduction to the content of the monograph.

## Keywords

First-order accelerated algorithms · Machine learning · Classification/regression · Low-rank learning

## 1.1 Examples of Optimization Problems in Machine Learning

Optimization problems arise throughout machine learning. We provide two representative examples here. The first one is classification/regression and the second one is low-rank learning.

In classification/regression, one learns the parameters by solving a problem of the following form:

\(\min_{\mathbf{w}}\ \frac{1}{m}\sum_{i=1}^{m} l\left(p(\mathbf{x}_i;\mathbf{w}), y_i\right) + \lambda R(\mathbf{w}),\)

where *m* is the number of training samples, **w** consists of the parameters of the classification/regression system, *p*(**x**; **w**) represents the prediction function of the learning model, *l* is the loss function that penalizes the discrepancy between the system prediction and the true value, (**x**_{i}, *y*_{i}) is the *i*-th data sample with **x**_{i} being the datum/feature vector and *y*_{i} the label for classification or the corresponding value for regression, *R* is a regularizer that enforces some special property on **w**, and *λ* ≥ 0 is a trade-off parameter. Typical examples of *l*(*p*, *y*) include the squared loss \(l(p,y)=\frac {1}{2}(p-y)^2\), the logistic loss \(l(p,y)=\log (1+\exp (-py))\), and the hinge loss \(l(p,y)=\max \{0,1-py\}\). Examples of *p*(**x**; **w**) include *p*(**x**; **w**) = **w**^{T}**x** − *b* for linear classification/regression and *p*(**x**; **W**) = *ϕ*(**W**_{n}*ϕ*(**W**_{n−1}⋯*ϕ*(**W**_{1}**x**)⋯ )) for the forward propagation widely used in deep neural networks, where **W** is the collection of the weight matrices **W**_{k}, *k* = 1, ⋯, *n*, and *ϕ* is an activation function. Representative examples of *R*(**w**) include the *ℓ*_{2} regularizer \(R(\mathbf {w})=\frac {1}{2}\|\mathbf {w}\|^2\) and the *ℓ*_{1} regularizer *R*(**w**) = ∥**w**∥_{1}.

The combinations of different loss functions, prediction functions, and regularizers lead to different machine learning models. For example, hinge loss, linear classification function, and *ℓ*_{2} regularizer give the support vector machine (SVM) problem [21]; logistic loss, linear regression function, and *ℓ*_{2} regularizer give the regularized logistic regression problem [10]; squared loss, forward propagation function, and *R*(**W**) = 0 give the multi-layer perceptron [33]; and squared loss, linear regression function, and *ℓ*_{1} regularizer give the LASSO problem [68].
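The loss functions, prediction functions, and regularizers above are straightforward to compute. The following sketch evaluates the regularized empirical risk for the SVM and LASSO combinations just mentioned; the toy data and all helper names are ours:

```python
import numpy as np

# Hypothetical toy data: 5 samples, 3 features, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
w = rng.standard_normal(3)
b = 0.0
lam = 0.1  # trade-off parameter lambda

def predict(X, w, b):
    """Linear prediction p(x; w) = w^T x - b."""
    return X @ w - b

def squared_loss(p, y):
    return 0.5 * (p - y) ** 2

def logistic_loss(p, y):
    return np.log1p(np.exp(-p * y))

def hinge_loss(p, y):
    return np.maximum(0.0, 1.0 - p * y)

l2 = lambda w: 0.5 * np.dot(w, w)  # ell_2 regularizer
l1 = lambda w: np.abs(w).sum()     # ell_1 regularizer

def objective(loss, reg):
    """Regularized empirical risk: mean loss + lambda * R(w)."""
    return loss(predict(X, w, b), y).mean() + lam * reg(w)

svm_obj = objective(hinge_loss, l2)      # SVM objective value
lasso_obj = objective(squared_loss, l1)  # LASSO objective value
```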

The second example is low-rank learning, where a low-rank matrix **X** is represented in the factored form **X** = **UV**^{T}. Take the matrix completion problem as an example: it can be reformulated as the following nonconvex problem,

\(\min_{\mathbf{U},\mathbf{V}}\ \frac{1}{2}\left\|\mathcal{P}_{\Omega}\left(\mathbf{U}\mathbf{V}^T-\mathbf{M}\right)\right\|_F^2,\)

where **M** is the partially observed matrix, *Ω* is the index set of the observed entries, and \(\mathcal{P}_{\Omega}\) is the operator that keeps the entries in *Ω* and sets the others to zero.
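A minimal sketch of gradient descent on the factorized matrix completion objective \(\frac{1}{2}\|\mathcal{P}_{\Omega}(\mathbf{U}\mathbf{V}^T-\mathbf{M})\|_F^2\); the toy sizes, step size, and iteration count are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 20, 15, 2
# Ground-truth low-rank matrix M and a random observation mask (Omega).
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
mask = rng.random((m, n)) < 0.5

U = 0.1 * rng.standard_normal((m, r))
V = 0.1 * rng.standard_normal((n, r))
step = 0.01

def obj(U, V):
    R = (U @ V.T - M) * mask  # P_Omega(U V^T - M)
    return 0.5 * np.sum(R ** 2)

start = obj(U, V)
for _ in range(500):
    R = (U @ V.T - M) * mask  # residual on the observed entries
    # Gradients: dU = R V, dV = R^T U (both evaluated at the old U, V).
    U, V = U - step * (R @ V), V - step * (R.T @ U)
end = obj(U, V)
```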

For more examples of optimization problems in machine learning, one may refer to the survey paper written by Gambella, Ghaddar, and Naoum-Sawaya in 2019 [28].

## 1.2 First-Order Algorithms

In most machine learning models, a moderate numerical precision of the parameters already suffices. Moreover, each iteration needs to finish in a reasonable amount of time. Thus, *first-order* optimization methods are the mainstream algorithms used in the machine learning community. While “first-order” has a rigorous definition in the complexity theory of optimization, based on an oracle that only returns *f*(**x**_{k}) and ∇*f*(**x**_{k}) when queried with **x**_{k}, here we adopt a much more general sense: higher-order derivatives of the objective function are not used (thus allowing the closed-form solution of a subproblem, the use of the proximal mapping (Definition ), etc.). However, we do not intend to write a book on all first-order algorithms that are commonly used or actively investigated in the machine learning community, which is clearly beyond our capability due to the huge amount of literature. Some excellent reference books, preprints, and surveys include [7, 12, 13, 14, 34, 35, 37, 58, 60, 66]. Rather, we focus on the *accelerated* first-order methods only, where “accelerated” means that the convergence rate is improved without much stronger assumptions, and the techniques used are essentially exquisite interpolation and extrapolation.
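One example of a proximal mapping with a closed-form solution is that of the \(\ell_1\) norm, which is the well-known soft-thresholding operator; a minimal sketch (the function name is ours):

```python
import numpy as np

def prox_l1(v, t):
    """Proximal mapping of t * ||.||_1:
    argmin_x 0.5 * ||x - v||^2 + t * ||x||_1,
    solved in closed form by soft-thresholding each entry."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Entries with magnitude <= t are set to zero; the rest shrink toward zero.
z = prox_l1(np.array([3.0, -0.5, 0.2, -2.0]), 1.0)
```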

## 1.3 Sketch of Representative Works on Accelerated Algorithms

In the above sense of acceleration, the first accelerated optimization algorithm may be Polyak’s heavy-ball method [61]. Consider a problem with an *L*-smooth (Definition ) and *μ*-strongly convex (Definition ) objective, and let *ε* be the error to the optimal solution. The heavy-ball method reduces the complexity \(O\left (\frac {L}{\mu }\log \frac {1}{\varepsilon }\right )\) of the usual gradient descent to \(O\left (\sqrt {\frac {L}{\mu }}\log \frac {1}{\varepsilon }\right )\). In 1983, Nesterov proposed his accelerated gradient descent (AGD) for *L*-smooth objective functions, where the complexity is reduced to \(O\left (\frac {1}{\sqrt {\varepsilon }}\right )\), as compared with the \(O\left (\frac {1}{\varepsilon }\right )\) complexity of the usual gradient descent. Nesterov further proposed another accelerated algorithm for *L*-smooth objective functions in 1988 [53], smoothing techniques for nonsmooth functions with acceleration tricks in 2005 [54], and an accelerated algorithm for composite functions in 2007 [55] (whose formal publication is [57]). Nesterov’s seminal work did not attract much attention in the machine learning community at first, possibly because the objective functions in machine learning models are often nonsmooth, e.g., due to the adoption of sparse and low-rank regularizers, which are not differentiable. The accelerated proximal gradient (APG) method for composite functions by Beck and Teboulle [8], which was formally published in 2009, extends [53] and is simpler than [55]; it gained great interest in the machine learning community as it fits well the sparse and low-rank models that were hot topics at that time. Tseng further provided a unified analysis of existing acceleration techniques [70], and Bubeck et al. proposed a near-optimal method for highly smooth convex optimization [16].
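The gap between the two rates is easy to observe numerically. The sketch below compares plain gradient descent with the constant-momentum form of Nesterov’s scheme for strongly convex functions on an ill-conditioned quadratic; the problem size and iteration count are illustrative choices of ours:

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 x^T A x, minimized at x* = 0,
# with mu = 1 and L = 100 (condition number kappa = 100).
rng = np.random.default_rng(2)
d = 50
eigs = np.linspace(1.0, 100.0, d)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(eigs) @ Q.T
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

L, mu = 100.0, 1.0
x0 = rng.standard_normal(d)
iters = 200

# Plain gradient descent with step size 1/L.
x = x0.copy()
for _ in range(iters):
    x = x - grad(x) / L
gd_val = f(x)

# AGD for strongly convex functions:
#   x_{k+1} = y_k - (1/L) grad f(y_k)
#   y_{k+1} = x_{k+1} + beta (x_{k+1} - x_k),
# with beta = (sqrt(kappa) - 1) / (sqrt(kappa) + 1).
beta = (np.sqrt(L / mu) - 1.0) / (np.sqrt(L / mu) + 1.0)
x_prev, yk = x0.copy(), x0.copy()
for _ in range(iters):
    x_new = yk - grad(yk) / L
    yk = x_new + beta * (x_new - x_prev)
    x_prev = x_new
agd_val = f(x_prev)
```

Consistent with the quoted rates, after the same number of iterations `agd_val` is far smaller than `gd_val`.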

Nesterov’s AGD is not quite intuitive, and there have been efforts to interpret it. Su et al. gave an interpretation from the viewpoint of differential equations [67], and Wibisono et al. further extended it to higher-order AGD [71]. Lessard et al. proposed a Linear Matrix Inequality (LMI) approach using the Integral Quadratic Constraints (IQCs) from robust control theory to analyze AGD [42]. Allen-Zhu and Orecchia connected AGD to mirror descent via the linear coupling technique [6]. On the other hand, some researchers have worked on designing other interpretable accelerated algorithms. Kim and Fessler designed an optimized first-order algorithm, whose complexity is only one half of that of Nesterov’s accelerated gradient method, via the Performance Estimation Problem approach [40]. Bubeck et al. proposed a geometric descent method inspired by the ellipsoid method [15], and Drusvyatskiy et al. showed that the same iterate sequence can be generated by computing an optimal average of quadratic lower-models of the function [24].

For linearly constrained convex problems, different from the unconstrained case, both the error in the objective function value and the violation of the constraint should be taken care of. Ideally, both errors should decrease at the same rate. A straightforward way to extend Nesterov’s acceleration technique to constrained optimization is to solve the dual problem (Definition ) using AGD directly, which leads to the accelerated dual ascent [9] and the accelerated augmented Lagrange multiplier method [36], both with the optimal convergence rate in the dual space. Lu and Johansson [51] and Li and Lin [44] further analyzed the complexity in the primal space for the accelerated dual ascent and its variant. One disadvantage of the dual-based methods is the need to solve a subproblem at each iteration. Linearization is an effective approach to overcome this shortcoming. Specifically, Li et al. proposed an accelerated linearized penalty method that increases the penalty along with the updates of the variables [45], and Xu proposed an accelerated linearized augmented Lagrangian method [72]. ADMM and the primal-dual method, as the most commonly used methods for constrained optimization, were also accelerated in [59] and [20] for generally convex (Definition ) and smooth objectives, respectively. When strong convexity is assumed, ADMM and the primal-dual method can achieve faster convergence rates even if no acceleration techniques are used [19, 72].
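The effect of increasing the penalty, the idea behind the linearized penalty method of [45], can be seen on a toy equality-constrained problem whose penalized subproblem has a closed form; the problem instance and all names below are illustrative choices of ours (and the sketch is not accelerated):

```python
import numpy as np

# Toy problem: min 0.5 ||x||^2  s.t.  A x = b,
# whose exact solution is x* = A^T (A A^T)^{-1} b.
rng = np.random.default_rng(3)
A = rng.standard_normal((3, 8))
b = rng.standard_normal(3)
x_star = A.T @ np.linalg.solve(A @ A.T, b)

def penalty_solution(beta):
    """Minimizer of 0.5||x||^2 + (beta/2)||Ax - b||^2, in closed form:
    (I + beta A^T A)^{-1} beta A^T b."""
    return np.linalg.solve(np.eye(8) + beta * A.T @ A, beta * A.T @ b)

# As the penalty beta grows, the penalized minimizer approaches x*.
errors = [np.linalg.norm(penalty_solution(beta) - x_star)
          for beta in (1.0, 10.0, 100.0, 1000.0)]
```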

Nesterov’s AGD has also been extended to nonconvex problems. The first analysis of AGD for nonconvex optimization appeared in [31], which considered minimizing a composite objective with a smooth (Definition ) nonconvex part and a nonsmooth convex (Definition ) part. Inspired by [31], Li and Lin proposed AGD variants for minimizing the composition of a smooth nonconvex part and a nonsmooth nonconvex part [43]. Both [31] and [43] studied the convergence to a first-order critical point (Definition ). Carmon et al. further gave an \(O\left (\frac {1}{\varepsilon ^{7/4}}\log \frac {1}{\varepsilon }\right )\) complexity analysis [17]. For many famous machine learning problems, e.g., matrix sensing and matrix completion, there is no spurious local minimum [11, 30], and the only task is to escape strict saddle points (Definition ). The first accelerated method to find a second-order critical point appeared in [18]; it alternates between two subroutines, negative curvature descent and Almost Convex AGD, and can be seen as a combination of accelerated gradient descent and the Lanczos method. Jin et al. further proposed a single-loop accelerated method [38]. Agarwal et al. proposed a careful implementation of the Nesterov–Polyak method, using accelerated methods for fast approximate matrix inversion [1]. The complexities established in [1, 18, 38] are all \(O\left (\frac {1}{\varepsilon ^{7/4}}\log \frac {1}{\varepsilon }\right )\).
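The difficulty with strict saddle points, and the perturbation idea used to escape them, can be illustrated on a toy function; the following is a simplified sketch of the perturb-when-the-gradient-is-small mechanism, not any specific algorithm from the cited works:

```python
import numpy as np

# f(x, y) = (x^2 - 1)^2 + y^2 has a strict saddle at the origin
# (Hessian diag(-4, 2) there) and global minima at (+1, 0) and (-1, 0).
def f(p):
    x, y = p
    return (x * x - 1.0) ** 2 + y * y

def grad(p):
    x, y = p
    return np.array([4.0 * x * (x * x - 1.0), 2.0 * y])

rng = np.random.default_rng(4)
step = 0.05

# Plain gradient descent started exactly at the saddle never moves.
p = np.zeros(2)
for _ in range(100):
    p = p - step * grad(p)
stuck_val = f(p)  # still f(0, 0) = 1

# Perturbed gradient descent: inject a small random perturbation
# whenever the gradient is tiny, then keep descending.
q = np.zeros(2)
for _ in range(500):
    g = grad(q)
    if np.linalg.norm(g) < 1e-3:
        q = q + 0.01 * rng.standard_normal(2)
    else:
        q = q - step * g
escaped_val = f(q)  # close to the minimal value 0
```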

As for stochastic algorithms, compared with deterministic ones the main challenge is that the noise in the gradient does not vanish along the iterations, which makes the famous stochastic gradient descent (SGD) converge at only a sublinear rate, even for strongly convex and smooth problems. Variance reduction (VR) is an efficient technique to reduce the negative effect of the noise [22, 39, 52, 63]. Combining VR with the momentum technique, Allen-Zhu proposed the first truly accelerated stochastic algorithm, named Katyusha [2], which works in the primal space. Another way to accelerate stochastic algorithms is to solve the problem in the dual space, so that techniques such as stochastic coordinate descent (SCD) [27, 48, 56] and the stochastic primal-dual method [41, 74] can be used. On the other hand, in 2015 Lin et al. proposed a generic framework, called Catalyst [49], that minimizes a convex objective function via an accelerated proximal point method and gains acceleration; the idea previously appeared in [65]. Stochastic nonconvex optimization is also an important topic, and some excellent works include [3, 4, 5, 29, 62, 69, 73]. In particular, Fang et al. proposed the Stochastic Path-Integrated Differential Estimator (SPIDER) technique and attained a near-optimal convergence rate under certain conditions [26].
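A minimal sketch of the variance reduction idea, in the style of SVRG [39], on a least-squares finite sum; the step size, epoch count, and toy data are illustrative choices of ours:

```python
import numpy as np

# Finite sum f(w) = (1/n) sum_i 0.5 (a_i^T w - b_i)^2.
rng = np.random.default_rng(5)
n, d = 100, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)
f = lambda w: 0.5 * np.mean((A @ w - b) ** 2)

def grad_i(w, i):
    """Stochastic gradient of the i-th summand."""
    return (A[i] @ w - b[i]) * A[i]

w_snap = np.zeros(d)
step = 0.01
for epoch in range(30):
    full_grad = A.T @ (A @ w_snap - b) / n  # full gradient at the snapshot
    w = w_snap.copy()
    for _ in range(n):
        i = rng.integers(n)
        # Variance-reduced gradient: unbiased, and its variance
        # vanishes as w and w_snap approach the optimum.
        g = grad_i(w, i) - grad_i(w_snap, i) + full_grad
        w = w - step * g
    w_snap = w
final_val = f(w_snap)
```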

The acceleration techniques are also applicable to parallel optimization. Parallel algorithms can be implemented in two fashions: asynchronous updates and synchronous updates. In asynchronous updates, no machine needs to wait for the others to finish their computation. Representative works include asynchronous accelerated gradient descent (AAGD) [25] and asynchronous accelerated coordinate descent (AACD) [32]. Depending on the network topology, synchronous algorithms include centralized and decentralized distributed methods. Typical works for the former include the distributed ADMM [13], distributed dual coordinate ascent [75], and their extensions. One bottleneck of the centralized topology is the high communication cost at the central node [47]. Although decentralized algorithms have been widely studied in the control community, the lower bound was not established until 2017 [64], where a distributed dual ascent with a matching upper bound was also given. Motivated by the lower bound, Li et al. further analyzed a distributed accelerated gradient descent method with both optimal communication and computation complexities, up to a log factor [46].

## 1.4 About the Book

In the previous section, we briefly introduced the representative works on accelerated first-order algorithms. However, due to limited time we do not give the details of all of them in the subsequent chapters. Rather, we only introduce the results and proofs of a selection of them, based on our personal taste and familiarity. The algorithms are organized by their nature: deterministic algorithms for unconstrained convex problems (Chap. 2), constrained convex problems (Chap. 3), and (unconstrained) nonconvex problems (Chap. 4), as well as stochastic algorithms for centralized optimization (Chap. 5) and distributed optimization (Chap. 6). To make the book self-contained, for each introduced algorithm we give the details of its proof. This book serves as a reference to part of the recent advances in optimization. It is appropriate for graduate students and researchers who are interested in machine learning and optimization. Nonetheless, the proofs for achieving critical points (Sect. 4.2), escaping saddle points (Sect. 4.3), and the decentralized topology (Sect. 6.2.2) are highly non-trivial, so readers not interested in those details may skip them.


## References

- 1. N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, T. Ma, Finding approximate local minima for nonconvex optimization in linear time, in *Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing*, Montreal (2017), pp. 1195–1200
- 2. Z. Allen-Zhu, Katyusha: the first truly accelerated stochastic gradient method, in *Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing*, Montreal (2017), pp. 1200–1206
- 3. Z. Allen-Zhu, Natasha2: faster non-convex optimization than SGD, in *Advances in Neural Information Processing Systems*, Montreal, vol. 31 (2018), pp. 2675–2686
- 4. Z. Allen-Zhu, E. Hazan, Variance reduction for faster non-convex optimization, in *Proceedings of the 33rd International Conference on Machine Learning*, New York (2016), pp. 699–707
- 5. Z. Allen-Zhu, Y. Li, Neon2: finding local minima via first-order oracles, in *Advances in Neural Information Processing Systems*, Montreal, vol. 31 (2018), pp. 3716–3726
- 6. Z. Allen-Zhu, L. Orecchia, Linear coupling: an ultimate unification of gradient and mirror descent, in *Proceedings of the 8th Innovations in Theoretical Computer Science*, Berkeley (2017)
- 7. A. Beck, *First-Order Methods in Optimization*, vol. 25 (SIAM, Philadelphia, 2017)
- 8. A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. **2**(1), 183–202 (2009)
- 9. A. Beck, M. Teboulle, A fast dual proximal gradient algorithm for convex minimization and applications. Oper. Res. Lett. **42**(1), 1–6 (2014)
- 10. J. Berkson, Application of the logistic function to bio-assay. J. Am. Stat. Assoc. **39**(227), 357–365 (1944)
- 11. S. Bhojanapalli, B. Neyshabur, N. Srebro, Global optimality of local search for low rank matrix recovery, in *Advances in Neural Information Processing Systems*, Barcelona, vol. 29 (2016), pp. 3873–3881
- 12. L. Bottou, F.E. Curtis, J. Nocedal, Optimization methods for large-scale machine learning. SIAM Rev. **60**(2), 223–311 (2018)
- 13. S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. **3**(1), 1–122 (2011)
- 14. S. Bubeck, Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. **8**(3–4), 231–357 (2015)
- 15. S. Bubeck, Y.T. Lee, M. Singh, A geometric alternative to Nesterov’s accelerated gradient descent (2015). Preprint. arXiv:1506.08187
- 16. S. Bubeck, Q. Jiang, Y.T. Lee, Y. Li, A. Sidford, Near-optimal method for highly smooth convex optimization, in *Proceedings of the 32nd Conference on Learning Theory*, Phoenix (2019), pp. 492–507
- 17. Y. Carmon, J.C. Duchi, O. Hinder, A. Sidford, Convex until proven guilty: dimension-free acceleration of gradient descent on non-convex functions, in *Proceedings of the 34th International Conference on Machine Learning*, Sydney (2017), pp. 654–663
- 18. Y. Carmon, J.C. Duchi, O. Hinder, A. Sidford, Accelerated methods for nonconvex optimization. SIAM J. Optim. **28**(2), 1751–1772 (2018)
- 19. A. Chambolle, T. Pock, A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imag. Vis. **40**(1), 120–145 (2011)
- 20. Y. Chen, G. Lan, Y. Ouyang, Optimal primal-dual methods for a class of saddle point problems. SIAM J. Optim. **24**(4), 1779–1814 (2014)
- 21. C. Cortes, V. Vapnik, Support-vector networks. Mach. Learn. **20**(3), 273–297 (1995)
- 22. A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives, in *Advances in Neural Information Processing Systems*, Montreal, vol. 27 (2014), pp. 1646–1654
- 23. P.M. Domingos, A few useful things to know about machine learning. Commun. ACM **55**(10), 78–87 (2012)
- 24. D. Drusvyatskiy, M. Fazel, S. Roy, An optimal first order method based on optimal quadratic averaging. SIAM J. Optim. **28**(1), 251–271 (2018)
- 25. C. Fang, Y. Huang, Z. Lin, Accelerating asynchronous algorithms for convex optimization by momentum compensation (2018). Preprint. arXiv:1802.09747
- 26. C. Fang, C.J. Li, Z. Lin, T. Zhang, SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator, in *Advances in Neural Information Processing Systems*, Montreal, vol. 31 (2018), pp. 689–699
- 27. O. Fercoq, P. Richtárik, Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. **25**(4), 1997–2023 (2015)
- 28. C. Gambella, B. Ghaddar, J. Naoum-Sawaya, Optimization models for machine learning: a survey (2019). Preprint. arXiv:1901.05331
- 29. R. Ge, F. Huang, C. Jin, Y. Yuan, Escaping from saddle points – online stochastic gradient for tensor decomposition, in *Proceedings of the 28th Conference on Learning Theory*, Paris (2015), pp. 797–842
- 30. R. Ge, J.D. Lee, T. Ma, Matrix completion has no spurious local minimum, in *Advances in Neural Information Processing Systems*, Barcelona, vol. 29 (2016), pp. 2973–2981
- 31. S. Ghadimi, G. Lan, Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. **156**(1–2), 59–99 (2016)
- 32. R. Hannah, F. Feng, W. Yin, A2BCD: an asynchronous accelerated block coordinate descent algorithm with optimal complexity, in *Proceedings of the 7th International Conference on Learning Representations*, New Orleans (2019)
- 33. S. Haykin, *Neural Networks: A Comprehensive Foundation*, 2nd edn. (Pearson Prentice Hall, Upper Saddle River, 1999)
- 34. E. Hazan, Introduction to online convex optimization. Found. Trends Optim. **2**(3–4), 157–325 (2016)
- 35. E. Hazan, Optimization for machine learning. Technical report, Princeton University (2019)
- 36. B. He, X. Yuan, On the acceleration of augmented Lagrangian method for linearly constrained optimization (2010). Preprint. http://www.optimization-online.org/DB_FILE/2010/10/2760.pdf
- 37. P. Jain, P. Kar, Non-convex optimization for machine learning. Found. Trends Mach. Learn. **10**(3–4), 142–336 (2017)
- 38. C. Jin, P. Netrapalli, M.I. Jordan, Accelerated gradient descent escapes saddle points faster than gradient descent, in *Proceedings of the 31st Conference on Learning Theory*, Stockholm (2018), pp. 1042–1085
- 39. R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, in *Advances in Neural Information Processing Systems*, Lake Tahoe, vol. 26 (2013), pp. 315–323
- 40. D. Kim, J.A. Fessler, Optimized first-order methods for smooth convex minimization. Math. Program. **159**(1–2), 81–107 (2016)
- 41. G. Lan, Y. Zhou, An optimal randomized incremental gradient method. Math. Program. **171**(1–2), 167–215 (2018)
- 42. L. Lessard, B. Recht, A. Packard, Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. **26**(1), 57–95 (2016)
- 43. H. Li, Z. Lin, Accelerated proximal gradient methods for nonconvex programming, in *Advances in Neural Information Processing Systems*, Montreal, vol. 28 (2015), pp. 379–387
- 44. H. Li, Z. Lin, On the complexity analysis of the primal solutions for the accelerated randomized dual coordinate ascent. J. Mach. Learn. Res. (2020). http://jmlr.org/papers/v21/18-425.html
- 45. H. Li, C. Fang, Z. Lin, Convergence rates analysis of the quadratic penalty method and its applications to decentralized distributed optimization (2017). Preprint. arXiv:1711.10802
- 46. H. Li, C. Fang, W. Yin, Z. Lin, A sharp convergence rate analysis for distributed accelerated gradient methods (2018). Preprint. arXiv:1810.01053
- 47. X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, J. Liu, Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent, in *Advances in Neural Information Processing Systems*, Long Beach, vol. 30 (2017), pp. 5330–5340
- 48. Q. Lin, Z. Lu, L. Xiao, An accelerated proximal coordinate gradient method, in *Advances in Neural Information Processing Systems*, Montreal, vol. 27 (2014), pp. 3059–3067
- 49. H. Lin, J. Mairal, Z. Harchaoui, A universal catalyst for first-order optimization, in *Advances in Neural Information Processing Systems*, Montreal, vol. 28 (2015), pp. 3384–3392
- 50. G. Liu, Z. Lin, Y. Yu, Robust subspace segmentation by low-rank representation, in *Proceedings of the 27th International Conference on Machine Learning*, Haifa, vol. 1 (2010), pp. 663–670
- 51. J. Lu, M. Johansson, Convergence analysis of approximate primal solutions in dual first-order methods. SIAM J. Optim. **26**(4), 2430–2467 (2016)
- 52. J. Mairal, Optimization with first-order surrogate functions, in *Proceedings of the 30th International Conference on Machine Learning*, Atlanta (2013), pp. 783–791
- 53. Y. Nesterov, On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonomika i Matematicheskie Metody **24**(3), 509–517 (1988)
- 54. Y. Nesterov, Smooth minimization of non-smooth functions. Math. Program. **103**(1), 127–152 (2005)
- 55. Y. Nesterov, Gradient methods for minimizing composite objective function. Technical Report Discussion Paper #2007/76, CORE (2007)
- 56. Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. **22**(2), 341–362 (2012)
- 57. Y. Nesterov, Gradient methods for minimizing composite functions. Math. Program. **140**(1), 125–161 (2013)
- 58. Y. Nesterov, *Lectures on Convex Optimization* (Springer, New York, 2018)
- 59. Y. Ouyang, Y. Chen, G. Lan, E. Pasiliao Jr., An accelerated linearized alternating direction method of multipliers. SIAM J. Imag. Sci. **8**(1), 644–681 (2015)
- 60. N. Parikh, S. Boyd, Proximal algorithms. Found. Trends Optim. **1**(3), 127–239 (2014)
- 61. B.T. Polyak, Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. **4**(5), 1–17 (1964)
- 62. S.J. Reddi, A. Hefny, S. Sra, B. Poczos, A. Smola, Stochastic variance reduction for nonconvex optimization, in *Proceedings of the 33rd International Conference on Machine Learning*, New York (2016), pp. 314–323
- 63. M. Schmidt, N. Le Roux, F. Bach, Minimizing finite sums with the stochastic average gradient. Math. Program. **162**(1–2), 83–112 (2017)
- 64. K. Seaman, F. Bach, S. Bubeck, Y.T. Lee, L. Massoulié, Optimal algorithms for smooth and strongly convex distributed optimization in networks, in *Proceedings of the 34th International Conference on Machine Learning*, Sydney (2017), pp. 3027–3036
- 65. S. Shalev-Shwartz, T. Zhang, Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization, in *Proceedings of the 31st International Conference on Machine Learning*, Beijing (2014), pp. 64–72
- 66. S. Sra, S. Nowozin, S.J. Wright (eds.), *Optimization for Machine Learning* (MIT Press, Cambridge, MA, 2012)
- 67. W. Su, S. Boyd, E. Candès, A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights, in *Advances in Neural Information Processing Systems*, Montreal, vol. 27 (2014), pp. 2510–2518
- 68. R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. **58**(1), 267–288 (1996)
- 69. N. Tripuraneni, M. Stern, C. Jin, J. Regier, M.I. Jordan, Stochastic cubic regularization for fast nonconvex optimization, in *Advances in Neural Information Processing Systems*, Montreal, vol. 31 (2018), pp. 2899–2908
- 70. P. Tseng, On accelerated proximal gradient methods for convex-concave optimization. Technical report, University of Washington, Seattle (2008)
- 71. A. Wibisono, A.C. Wilson, M.I. Jordan, A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. **113**(47), 7351–7358 (2016)
- 72. Y. Xu, Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. SIAM J. Optim. **27**(3), 1459–1484 (2017)
- 73. Y. Xu, J. Rong, T. Yang, First-order stochastic algorithms for escaping from saddle points in almost linear time, in *Advances in Neural Information Processing Systems*, Montreal, vol. 31 (2018), pp. 5530–5540
- 74. Y. Zhang, L. Xiao, Stochastic primal-dual coordinate method for regularized empirical risk minimization. J. Mach. Learn. Res. **18**(1), 2939–2980 (2017)
- 75. S. Zheng, J. Wang, F. Xia, W. Xu, T. Zhang, A general distributed dual coordinate optimization framework for regularized loss minimization. J. Mach. Learn. Res. **18**(115), 1–52 (2017)