• Zhouchen Lin
  • Huan Li
  • Cong Fang


This chapter gives several examples of optimization problems in machine learning and briefly overviews the representative works on accelerated first-order algorithms. It also gives a brief introduction to the content of the monograph.


Keywords: First-order accelerated algorithms · Machine learning · Classification/regression · Low-rank learning
Optimization is a supporting technology in many research fields that involve numerical computation, such as machine learning, signal processing, industrial design, and operations research. In particular, P. Domingos, an AAAI Fellow and a professor at the University of Washington, proposed a celebrated formula [23]:
$$\displaystyle \begin{aligned}\mbox{machine learning = representation + optimization + evaluation,}\end{aligned}$$
showing the importance of optimization in machine learning.

1.1 Examples of Optimization Problems in Machine Learning

Optimization problems arise throughout machine learning. We provide two representative examples here. The first one is classification/regression and the second one is low-rank learning.

Many classification/regression problems can be formulated as
$$\displaystyle \begin{aligned} \min\limits_{\mathbf{w}\in\mathbb{R}^n} \frac{1}{m}\sum_{i=1}^m l(p({\mathbf{x}}_i;\mathbf{w}),y_i) + \lambda R(\mathbf{w}), \end{aligned} $$
where w consists of the parameters of the classification/regression system, p(x; w) is the prediction function of the learning model, l is the loss function that penalizes the discrepancy between the system's prediction and the ground truth, (xi, yi) is the i-th data sample, with xi being the datum/feature vector and yi the label for classification or the target value for regression, R is a regularizer that enforces some special property on w, and λ ≥ 0 is a trade-off parameter. Typical examples of l(p, y) include the squared loss \(l(p,y)=\frac {1}{2}(p-y)^2\), the logistic loss \(l(p,y)=\log (1+\exp (-py))\), and the hinge loss \(l(p,y)=\max \{0,1-py\}\). Examples of p(x; w) include \(p(\mathbf{x};\mathbf{w}) = \mathbf{w}^T\mathbf{x} - b\) for linear classification/regression and \(p(\mathbf{x};\mathbf{W}) = \phi(\mathbf{W}_n\phi(\mathbf{W}_{n-1}\cdots\phi(\mathbf{W}_1\mathbf{x})))\) for the forward propagation widely used in deep neural networks, where W is the collection of the weight matrices \(\mathbf{W}_k\), k = 1, ⋯ , n, and ϕ is an activation function. Representative examples of R(w) include the \(\ell_2\) regularizer \(R(\mathbf {w})=\frac {1}{2}\|\mathbf {w}\|{}^2\) and the \(\ell_1\) regularizer \(R(\mathbf{w})=\|\mathbf{w}\|{}_1\).
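As a small illustration, the three losses and the linear prediction function mentioned above can be written down directly. This is a NumPy sketch; the function names are our own, not from the text.

```python
import numpy as np

def squared_loss(p, y):
    # l(p, y) = (1/2)(p - y)^2
    return 0.5 * (p - y) ** 2

def logistic_loss(p, y):
    # l(p, y) = log(1 + exp(-p y)); log1p is used for numerical stability
    return np.log1p(np.exp(-p * y))

def hinge_loss(p, y):
    # l(p, y) = max{0, 1 - p y}
    return np.maximum(0.0, 1.0 - p * y)

def linear_prediction(w, x, b=0.0):
    # p(x; w) = w^T x - b
    return w @ x - b
```

With a label y ∈ {−1, +1}, all three losses are small when the prediction p agrees in sign with y and large otherwise, which is exactly the "penalize the discrepancy" role described above.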

The combinations of different loss functions, prediction functions, and regularizers lead to different machine learning models. For example, the hinge loss, linear classification function, and \(\ell_2\) regularizer give the support vector machine (SVM) problem [21]; the logistic loss, linear regression function, and \(\ell_2\) regularizer give the regularized logistic regression problem [10]; the squared loss, forward propagation function, and R(W) = 0 give the multi-layer perceptron [33]; and the squared loss, linear regression function, and \(\ell_1\) regularizer give the LASSO problem [68].

There are also many problems investigated by the machine learning community that are not of the form of (1.1). For example, the matrix completion problem, which has wide applications in signal and data processing, can be written as:
$$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle \min\limits_{\mathbf{X}\in\mathbb{R}^{m\times n}} &\displaystyle \|\mathbf{X}\|{}_*,\\ &\displaystyle s.t.&\displaystyle {\mathbf{X}}_{ij}={\mathbf{D}}_{ij},\forall (i,j)\in\Omega, \end{array} \end{aligned} $$
where Ω is the set of locations of the observed entries. The low-rank representation (LRR) problem [50], which is powerful for clustering data into subspaces, is cast as:
$$\displaystyle \begin{aligned} \begin{array}{rcl} &\displaystyle \min\limits_{\mathbf{Z}\in\mathbb{R}^{n\times n},\mathbf{E}\in\mathbb{R}^{m\times n}} &\displaystyle \|\mathbf{Z}\|{}_*+\lambda\|\mathbf{E}\|{}_1,\\ &\displaystyle s.t.&\displaystyle \mathbf{D}=\mathbf{D}\mathbf{Z}+\mathbf{E}. \end{array} \end{aligned} $$
To reduce the computational cost as well as the storage space, one can exploit the fact that a low-rank matrix can be factorized as the product of two much smaller matrices, i.e., \(\mathbf{X}=\mathbf{U}\mathbf{V}^T\). Taking the matrix completion problem as an example, it can be reformulated as the following nonconvex problem:
$$\displaystyle \begin{aligned} \begin{array}{rcl} \min_{\mathbf{U}\in\mathbb{R}^{m\times r},\mathbf{V}\in\mathbb{R}^{n\times r}} \frac{1}{2}\sum_{(i,j)\in\Omega}\left\|{\mathbf{U}}_i{\mathbf{V}}_j^T-{\mathbf{D}}_{ij}\right\|{}_F^2+\frac{\lambda}{2}\left(\|\mathbf{U}\|{}_F^2+\|\mathbf{V}\|{}_F^2\right). \end{array} \end{aligned} $$
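The factorized model above can be attacked with plain gradient descent on (U, V). The following is a minimal sketch under our own illustrative choices of sizes, step size, observation ratio, and iteration count; it is not the algorithm of any particular reference.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, lam, step = 20, 15, 2, 1e-3, 0.01

# Ground-truth rank-r matrix D and a random observation set Omega (a mask).
D = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
mask = rng.random((m, n)) < 0.6          # True where (i, j) is observed

# Small random initialization of the two factors.
U = 0.1 * rng.standard_normal((m, r))
V = 0.1 * rng.standard_normal((n, r))
for _ in range(5000):
    R = mask * (U @ V.T - D)             # residual on observed entries only
    # Simultaneous gradient step on U and V (with the lambda/2 regularizer).
    U, V = U - step * (R @ V + lam * U), V - step * (R.T @ U + lam * V)
```

After the loop, U @ V.T approximately matches D on the observed entries; the nonconvexity means such behavior is empirical here, not guaranteed by this sketch.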

For more examples of optimization problems in machine learning, one may refer to the survey paper written by Gambella, Ghaddar, and Naoum-Sawaya in 2019 [28].

1.2 First-Order Algorithms

In most machine learning models, moderate numerical precision for the parameters already suffices. Moreover, each iteration needs to finish in a reasonable amount of time. Thus, first-order optimization methods are the mainstream algorithms in the machine learning community. While “first-order” has a rigorous definition in the complexity theory of optimization, based on an oracle that only returns f(xk) and ∇f(xk) when queried with xk, here we adopt a much more general sense: higher-order derivatives of the objective function are not used (thus allowing the closed-form solution of a subproblem, the use of the proximal mapping (Definition ), etc.). However, we do not intend to write a book on all first-order algorithms that are commonly used or actively investigated in the machine learning community, which is clearly beyond our capacity given the vast literature. Some excellent reference books, preprints, and surveys include [7, 12, 13, 14, 34, 35, 37, 58, 60, 66]. Rather, we focus on accelerated first-order methods only, where “accelerated” means that the convergence rate is improved without much stronger assumptions, and the techniques used are essentially exquisite interpolation and extrapolation.
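For instance, the proximal mapping just mentioned has a closed-form solution for the \(\ell_1\) regularizer: the well-known soft-thresholding operator. A NumPy sketch (the function name is ours):

```python
import numpy as np

def prox_l1(x, t):
    # prox_{t||.||_1}(x) = argmin_u  t*||u||_1 + (1/2)||u - x||^2
    #                    = sign(x) * max(|x| - t, 0)   (soft-thresholding)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
```

Such closed-form proximal steps are why nonsmooth regularizers like the \(\ell_1\) norm pose no obstacle to first-order methods in the general sense adopted here.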

1.3 Sketch of Representative Works on Accelerated Algorithms

In the above sense of acceleration, the first accelerated optimization algorithm may be Polyak’s heavy-ball method [61]. Consider a problem with an L-smooth (Definition ) and μ-strongly convex (Definition ) objective, and let ε be the error to the optimal solution. The heavy-ball method reduces the complexity \(O\left (\frac {L}{\mu }\log \frac {1}{\varepsilon }\right )\) of the usual gradient descent to \(O\left (\sqrt {\frac {L}{\mu }}\log \frac {1}{\varepsilon }\right )\). In 1983, Nesterov proposed his accelerated gradient descent (AGD) for L-smooth objective functions, reducing the complexity to \(O\left (\frac {1}{\sqrt {\varepsilon }}\right )\), compared with the \(O\left (\frac {1}{\varepsilon }\right )\) of the usual gradient descent. Nesterov further proposed another accelerated algorithm for L-smooth objective functions in 1988 [53], smoothing techniques for nonsmooth functions with acceleration tricks in 2005 [54], and an accelerated algorithm for composite functions in 2007 [55] (whose formal publication is [57]). Nesterov’s seminal work did not attract much attention in the machine learning community at first, possibly because the objective functions in machine learning models are often nonsmooth, e.g., due to the adoption of sparse and low-rank regularizers, which are not differentiable. The accelerated proximal gradient (APG) method for composite functions by Beck and Teboulle [8], which was formally published in 2009, extends [53], and is simpler than [55],1 gained great interest in the machine learning community, as it fits well with the sparse and low-rank models that were hot topics at that time. Tseng further provided a unified analysis of existing acceleration techniques [70], and Bubeck et al. proposed a near-optimal method for highly smooth convex optimization [16].
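The speed-up can be seen on a toy example. The sketch below compares plain gradient descent with a constant-momentum variant of Nesterov's AGD for L-smooth and μ-strongly convex problems, on a simple quadratic \(f(x)=\frac{1}{2}x^T A x\) with minimizer 0; the problem instance, step size, and iteration count are our own illustrative choices.

```python
import numpy as np

L, mu = 100.0, 1.0
diag = np.linspace(mu, L, 50)            # eigenvalues of a diagonal A in [mu, L]
grad = lambda x: diag * x                # gradient of f(x) = (1/2) x^T A x

x_gd = np.ones(50)                       # gradient descent iterate
x_ag = np.ones(50)                       # AGD iterate
y = np.ones(50)                          # AGD extrapolation point
beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
for _ in range(200):
    x_gd = x_gd - (1.0 / L) * grad(x_gd)     # usual gradient step
    x_new = y - (1.0 / L) * grad(y)          # AGD: gradient step at y
    y = x_new + beta * (x_new - x_ag)        # extrapolation (momentum)
    x_ag = x_new
```

After the same number of iterations, the AGD iterate is several orders of magnitude closer to the minimizer than the gradient descent iterate, mirroring the \(O(\sqrt{L/\mu}\log\frac{1}{\varepsilon})\) versus \(O(\frac{L}{\mu}\log\frac{1}{\varepsilon})\) gap.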

Nesterov’s AGD is not quite intuitive, and there have been several efforts to interpret it. Su et al. gave an interpretation from the viewpoint of differential equations [67], and Wibisono et al. further extended it to higher-order AGD [71]. Lessard et al. used Integral Quadratic Constraints (IQCs) from robust control theory, in the form of a linear matrix inequality (LMI), to analyze AGD [42]. Allen-Zhu and Orecchia connected AGD to mirror descent via the linear coupling technique [6]. On the other hand, some researchers have worked on designing other interpretable accelerated algorithms. Kim and Fessler designed an optimized first-order algorithm, whose complexity bound is only one half of that of Nesterov’s accelerated gradient method, via the Performance Estimation Problem approach [40]. Bubeck et al. proposed a geometric descent method inspired by the ellipsoid method [15], and Drusvyatskiy et al. showed that the same iterate sequence can be generated by computing an optimal average of quadratic lower-models of the function [24].

For linearly constrained convex problems, different from the unconstrained case, both the error in the objective function value and the violation of the constraint should be controlled. Ideally, both should decrease at the same rate. A straightforward way to extend Nesterov’s acceleration technique to constrained optimization is to solve the dual problem (Definition ) by AGD directly, which leads to the accelerated dual ascent [9] and the accelerated augmented Lagrange multiplier method [36], both with the optimal convergence rate in the dual space. Lu [51] and Li [44] further analyzed the complexity in the primal space for the accelerated dual ascent and its variant. One disadvantage of dual-based methods is the need to solve a subproblem at each iteration, and linearization is an effective way to overcome this shortcoming. Specifically, Li et al. proposed an accelerated linearized penalty method that increases the penalty along with the update of the variables [45], and Xu proposed an accelerated linearized augmented Lagrangian method [72]. ADMM and the primal-dual method, as the most commonly used methods for constrained optimization, were also accelerated in [59] and [20] for generally convex (Definition ) and smooth objectives, respectively. When strong convexity is assumed, ADMM and the primal-dual method can achieve faster convergence rates even without acceleration techniques [19, 72].

Nesterov’s AGD has also been extended to nonconvex problems. The first analysis of AGD for nonconvex optimization appeared in [31], which considers minimizing a composite objective with a smooth (Definition ) nonconvex part and a nonsmooth convex (Definition ) part. Inspired by [31], Li and Lin proposed AGD variants for minimizing the composition of a smooth nonconvex part and a nonsmooth nonconvex part [43]. Both [31] and [43] studied the convergence to a first-order critical point (Definition ). Carmon et al. further gave an \(O\left (\frac {1}{\varepsilon ^{7/4}}\log \frac {1}{\varepsilon }\right )\) complexity analysis [17]. For many famous machine learning problems, e.g., matrix sensing and matrix completion, there is no spurious local minimum [11, 30] and the only task is to escape strict saddle points (Definition ). The first accelerated method to find a second-order critical point appeared in [18]; it alternates between two subroutines, negative curvature descent and Almost Convex AGD, and can be seen as a combination of accelerated gradient descent and the Lanczos method. Jin et al. further proposed a single-loop accelerated method [38]. Agarwal et al. proposed a careful implementation of the Nesterov–Polyak method, using accelerated methods for fast approximate matrix inversion [1]. The complexities established in [1, 18, 38] are all \(O\left (\frac {1}{\varepsilon ^{7/4}}\log \frac {1}{\varepsilon }\right )\).

As for stochastic algorithms, compared with deterministic ones the main challenge is that the noise in the stochastic gradient does not vanish during the updates, which makes the famous stochastic gradient descent (SGD) converge at only a sublinear rate even for strongly convex and smooth problems. Variance reduction (VR) is an efficient technique for reducing the negative effect of the noise [22, 39, 52, 63]. Combining VR with the momentum technique, Allen-Zhu proposed the first truly accelerated stochastic algorithm, named Katyusha [2], which works in the primal space. Another way to accelerate stochastic algorithms is to solve the problem in the dual space, where one can use techniques such as stochastic coordinate descent (SCD) [27, 48, 56] and the stochastic primal-dual method [41, 74]. On the other hand, in 2015 Lin et al. proposed a generic framework, called Catalyst [49], which minimizes a convex objective function via an accelerated proximal point method and thereby gains acceleration; the idea previously appeared in [65]. Stochastic nonconvex optimization is also an important topic, and some excellent works include [3, 4, 5, 29, 62, 69, 73]. In particular, Fang et al. proposed the Stochastic Path-Integrated Differential Estimator (SPIDER) technique and attained a near-optimal convergence rate under certain conditions [26].
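The variance reduction idea can be sketched on a toy least-squares finite sum, in the style of [39]: each inner stochastic gradient is corrected by subtracting its value at a snapshot point and adding the full gradient at that snapshot, so the correction's noise shrinks as the iterates converge. The problem instance, step size, and loop lengths below are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(w, i):
    # gradient of the i-th component f_i(w) = (1/2)(A[i]^T w - b[i])^2
    return A[i] * (A[i] @ w - b[i])

def full_grad(w):
    # gradient of f(w) = (1/n) sum_i f_i(w)
    return A.T @ (A @ w - b) / n

w, step = np.zeros(d), 0.01
for _ in range(30):                    # outer loop: take a snapshot
    w_snap, g_snap = w.copy(), full_grad(w)
    for _ in range(2 * n):             # inner loop: corrected stochastic steps
        i = rng.integers(n)
        v = grad_i(w, i) - grad_i(w_snap, i) + g_snap   # VR gradient estimate
        w = w - step * v
```

Since E[v] equals the full gradient while Var[v] → 0 near the solution, the iterates converge linearly to the least-squares solution with no noise floor, unlike plain SGD with a constant step size.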

The acceleration techniques are also applicable to parallel optimization. Parallel algorithms can be implemented in two fashions: asynchronous updates and synchronous updates. With asynchronous updates, no machine needs to wait for the others to finish computing. Representative works include asynchronous accelerated gradient descent (AAGD) [25] and asynchronous accelerated coordinate descent (AACD) [32]. Depending on the topology, synchronous algorithms include centralized and decentralized distributed methods. Typical works for the former include the distributed ADMM [13], distributed dual coordinate ascent [75], and their extensions. One bottleneck of the centralized topology lies in the high communication cost at the central node [47]. Although decentralized algorithms have been widely studied by the control community, the lower bound was not established until 2017 [64], where a distributed dual ascent with a matching upper bound is also given. Motivated by the lower bound, Li et al. further analyzed distributed accelerated gradient descent, which attains both optimal communication and computation complexities up to a log factor [46].

1.4 About the Book

In the previous section, we briefly introduced the representative works on accelerated first-order algorithms. However, due to limited time we do not give the details of all of them in the subsequent chapters. Rather, we only introduce the results and proofs of a part of them, based on our personal taste and familiarity. The algorithms are organized by their nature: deterministic algorithms for unconstrained convex problems (Chap. 2), constrained convex problems (Chap. 3), and (unconstrained) nonconvex problems (Chap. 4), as well as stochastic algorithms for centralized optimization (Chap. 5) and distributed optimization (Chap. 6). To make the book self-contained, for each introduced algorithm we give the details of its proof. This book serves as a reference to part of the recent advances in optimization. It is appropriate for graduate students and researchers who are interested in machine learning and optimization. Nonetheless, the proofs for achieving critical points (Sect. 4.2), escaping saddle points (Sect. 4.3), and the decentralized topology (Sect. 6.2.2) are highly non-trivial, so uninterested readers may skip them.


1. In each iteration, [8] uses only information from the last two iterations and makes one call to the proximal mapping, while [55] uses the entire history of previous iterations and makes two calls to the proximal mapping.


References

1. N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, T. Ma, Finding approximate local minima for nonconvex optimization in linear time, in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, Montreal (2017), pp. 1195–1200
2. Z. Allen-Zhu, Katyusha: the first truly accelerated stochastic gradient method, in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, Montreal (2017), pp. 1200–1206
3. Z. Allen-Zhu, Natasha2: faster non-convex optimization than SGD, in Advances in Neural Information Processing Systems, Montreal, vol. 31 (2018), pp. 2675–2686
4. Z. Allen-Zhu, E. Hazan, Variance reduction for faster non-convex optimization, in Proceedings of the 33rd International Conference on Machine Learning, New York (2016), pp. 699–707
5. Z. Allen-Zhu, Y. Li, Neon2: finding local minima via first-order oracles, in Advances in Neural Information Processing Systems, Montreal, vol. 31 (2018), pp. 3716–3726
6. Z. Allen-Zhu, L. Orecchia, Linear coupling: an ultimate unification of gradient and mirror descent, in Proceedings of the 8th Innovations in Theoretical Computer Science, Berkeley (2017)
7. A. Beck, First-Order Methods in Optimization, vol. 25 (SIAM, Philadelphia, 2017)
8. A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)
9. A. Beck, M. Teboulle, A fast dual proximal gradient algorithm for convex minimization and applications. Oper. Res. Lett. 42(1), 1–6 (2014)
10. J. Berkson, Application of the logistic function to bio-assay. J. Am. Stat. Assoc. 39(227), 357–365 (1944)
11. S. Bhojanapalli, B. Neyshabur, N. Srebro, Global optimality of local search for low rank matrix recovery, in Advances in Neural Information Processing Systems, Barcelona, vol. 29 (2016), pp. 3873–3881
12. L. Bottou, F.E. Curtis, J. Nocedal, Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)
13. S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
14. S. Bubeck, Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015)
15. S. Bubeck, Y.T. Lee, M. Singh, A geometric alternative to Nesterov’s accelerated gradient descent (2015). Preprint. arXiv:1506.08187
16. S. Bubeck, Q. Jiang, Y.T. Lee, Y. Li, A. Sidford, Near-optimal method for highly smooth convex optimization, in Proceedings of the 32nd Conference on Learning Theory, Phoenix (2019), pp. 492–507
17. Y. Carmon, J.C. Duchi, O. Hinder, A. Sidford, Convex until proven guilty: dimension-free acceleration of gradient descent on non-convex functions, in Proceedings of the 34th International Conference on Machine Learning, Sydney (2017), pp. 654–663
18. Y. Carmon, J.C. Duchi, O. Hinder, A. Sidford, Accelerated methods for nonconvex optimization. SIAM J. Optim. 28(2), 1751–1772 (2018)
19. A. Chambolle, T. Pock, A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imag. Vis. 40(1), 120–145 (2011)
20. Y. Chen, G. Lan, Y. Ouyang, Optimal primal-dual methods for a class of saddle point problems. SIAM J. Optim. 24(4), 1779–1814 (2014)
21. C. Cortes, V. Vapnik, Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
22. A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives, in Advances in Neural Information Processing Systems, Montreal, vol. 27 (2014), pp. 1646–1654
23. P.M. Domingos, A few useful things to know about machine learning. Commun. ACM 55(10), 78–87 (2012)
24. D. Drusvyatskiy, M. Fazel, S. Roy, An optimal first order method based on optimal quadratic averaging. SIAM J. Optim. 28(1), 251–271 (2018)
25. C. Fang, Y. Huang, Z. Lin, Accelerating asynchronous algorithms for convex optimization by momentum compensation (2018). Preprint. arXiv:1802.09747
26. C. Fang, C.J. Li, Z. Lin, T. Zhang, SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator, in Advances in Neural Information Processing Systems, Montreal, vol. 31 (2018), pp. 689–699
27. O. Fercoq, P. Richtárik, Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)
28. C. Gambella, B. Ghaddar, J. Naoum-Sawaya, Optimization models for machine learning: a survey (2019). Preprint. arXiv:1901.05331
29. R. Ge, F. Huang, C. Jin, Y. Yuan, Escaping from saddle points – online stochastic gradient for tensor decomposition, in Proceedings of the 28th Conference on Learning Theory, Paris (2015), pp. 797–842
30. R. Ge, J.D. Lee, T. Ma, Matrix completion has no spurious local minimum, in Advances in Neural Information Processing Systems, Barcelona, vol. 29 (2016), pp. 2973–2981
31. S. Ghadimi, G. Lan, Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. 156(1–2), 59–99 (2016)
32. R. Hannah, F. Feng, W. Yin, A2BCD: an asynchronous accelerated block coordinate descent algorithm with optimal complexity, in Proceedings of the 7th International Conference on Learning Representations, New Orleans (2019)
33. S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd edn. (Pearson Prentice Hall, Upper Saddle River, 1999)
34. E. Hazan, Introduction to online convex optimization. Found. Trends Optim. 2(3–4), 157–325 (2016)
35. E. Hazan, Optimization for machine learning. Technical report, Princeton University (2019)
36. B. He, X. Yuan, On the acceleration of augmented Lagrangian method for linearly constrained optimization (2010). Preprint
37. P. Jain, P. Kar, Non-convex optimization for machine learning. Found. Trends Mach. Learn. 10(3–4), 142–336 (2017)
38. C. Jin, P. Netrapalli, M.I. Jordan, Accelerated gradient descent escapes saddle points faster than gradient descent, in Proceedings of the 31st Conference on Learning Theory, Stockholm (2018), pp. 1042–1085
39. R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, in Advances in Neural Information Processing Systems, Lake Tahoe, vol. 26 (2013), pp. 315–323
40. D. Kim, J.A. Fessler, Optimized first-order methods for smooth convex minimization. Math. Program. 159(1–2), 81–107 (2016)
41. G. Lan, Y. Zhou, An optimal randomized incremental gradient method. Math. Program. 171(1–2), 167–215 (2018)
42. L. Lessard, B. Recht, A. Packard, Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)
43. H. Li, Z. Lin, Accelerated proximal gradient methods for nonconvex programming, in Advances in Neural Information Processing Systems, Montreal, vol. 28 (2015), pp. 379–387
44. H. Li, Z. Lin, On the complexity analysis of the primal solutions for the accelerated randomized dual coordinate ascent. J. Mach. Learn. Res. (2020)
45. H. Li, C. Fang, Z. Lin, Convergence rates analysis of the quadratic penalty method and its applications to decentralized distributed optimization (2017). Preprint. arXiv:1711.10802
46. H. Li, C. Fang, W. Yin, Z. Lin, A sharp convergence rate analysis for distributed accelerated gradient methods (2018). Preprint. arXiv:1810.01053
47. X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, J. Liu, Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent, in Advances in Neural Information Processing Systems, Long Beach, vol. 30 (2017), pp. 5330–5340
48. Q. Lin, Z. Lu, L. Xiao, An accelerated proximal coordinate gradient method, in Advances in Neural Information Processing Systems, Montreal, vol. 27 (2014), pp. 3059–3067
49. H. Lin, J. Mairal, Z. Harchaoui, A universal catalyst for first-order optimization, in Advances in Neural Information Processing Systems, Montreal, vol. 28 (2015), pp. 3384–3392
50. G. Liu, Z. Lin, Y. Yu, Robust subspace segmentation by low-rank representation, in Proceedings of the 27th International Conference on Machine Learning, Haifa, vol. 1 (2010), pp. 663–670
51. J. Lu, M. Johansson, Convergence analysis of approximate primal solutions in dual first-order methods. SIAM J. Optim. 26(4), 2430–2467 (2016)
52. J. Mairal, Optimization with first-order surrogate functions, in Proceedings of the 30th International Conference on Machine Learning, Atlanta (2013), pp. 783–791
53. Y. Nesterov, On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonomika i Matematicheskie Metody 24(3), 509–517 (1988)
54. Y. Nesterov, Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
55. Y. Nesterov, Gradient methods for minimizing composite objective function. Technical Report Discussion Paper #2007/76, CORE (2007)
56. Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
57. Y. Nesterov, Gradient methods for minimizing composite functions. Math. Program. 140(1), 125–161 (2013)
58. Y. Nesterov, Lectures on Convex Optimization (Springer, New York, 2018)
59. Y. Ouyang, Y. Chen, G. Lan, E. Pasiliao Jr., An accelerated linearized alternating direction method of multipliers. SIAM J. Imag. Sci. 8(1), 644–681 (2015)
60. N. Parikh, S. Boyd, Proximal algorithms. Found. Trends Optim. 1(3), 127–239 (2014)
61. B.T. Polyak, Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)
62. S.J. Reddi, A. Hefny, S. Sra, B. Poczos, A. Smola, Stochastic variance reduction for nonconvex optimization, in Proceedings of the 33rd International Conference on Machine Learning, New York (2016), pp. 314–323
63. M. Schmidt, N. Le Roux, F. Bach, Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)
64. K. Seaman, F. Bach, S. Bubeck, Y.T. Lee, L. Massoulié, Optimal algorithms for smooth and strongly convex distributed optimization in networks, in Proceedings of the 34th International Conference on Machine Learning, Sydney (2017), pp. 3027–3036
65. S. Shalev-Shwartz, T. Zhang, Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization, in Proceedings of the 31st International Conference on Machine Learning, Beijing (2014), pp. 64–72
66. S. Sra, S. Nowozin, S.J. Wright (eds.), Optimization for Machine Learning (MIT Press, Cambridge, MA, 2012)
67. W. Su, S. Boyd, E. Candès, A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights, in Advances in Neural Information Processing Systems, Montreal, vol. 27 (2014), pp. 2510–2518
68. R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 58(1), 267–288 (1996)
69. N. Tripuraneni, M. Stern, C. Jin, J. Regier, M.I. Jordan, Stochastic cubic regularization for fast nonconvex optimization, in Advances in Neural Information Processing Systems, Montreal, vol. 31 (2018), pp. 2899–2908
70. P. Tseng, On accelerated proximal gradient methods for convex-concave optimization. Technical report, University of Washington, Seattle (2008)
71. A. Wibisono, A.C. Wilson, M.I. Jordan, A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. 113(47), 7351–7358 (2016)
72. Y. Xu, Accelerated first-order primal-dual proximal methods for linearly constrained composite convex programming. SIAM J. Optim. 27(3), 1459–1484 (2017)
73. Y. Xu, J. Rong, T. Yang, First-order stochastic algorithms for escaping from saddle points in almost linear time, in Advances in Neural Information Processing Systems, Montreal, vol. 31 (2018), pp. 5530–5540
74. Y. Zhang, L. Xiao, Stochastic primal-dual coordinate method for regularized empirical risk minimization. J. Mach. Learn. Res. 18(1), 2939–2980 (2017)
75. S. Zheng, J. Wang, F. Xia, W. Xu, T. Zhang, A general distributed dual coordinate optimization framework for regularized loss minimization. J. Mach. Learn. Res. 18(115), 1–52 (2017)

Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  1. Key Lab. of Machine Perception, School of EECS, Peking University, Beijing, China
  2. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
  3. School of Engineering and Applied Science, Princeton University, Princeton, USA
