Abstract
We consider distributed convex optimization problems originating from sample average approximation of stochastic optimization, or empirical risk minimization in machine learning. We assume that each machine in the distributed computing system has access to a local empirical loss function, constructed with i.i.d. data sampled from a common distribution. We propose a communication-efficient distributed algorithm to minimize the overall empirical loss, which is the average of the local empirical losses. The algorithm is based on an inexact damped Newton method, where the inexact Newton steps are computed by a distributed preconditioned conjugate gradient method. We analyze its iteration complexity and communication efficiency for minimizing self-concordant empirical loss functions, and discuss the results for ridge regression, logistic regression and binary classification with a smoothed hinge loss. In a standard setting for supervised learning where the condition number of the problem grows with the square root of the sample size, the required number of communication rounds of the algorithm does not increase with the sample size, and only grows slowly with the number of machines.
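To make the structure concrete, below is a minimal single-process Python sketch of an inexact damped Newton loop in which each Newton direction is obtained approximately by conjugate gradient. The helpers grad and hess_vec, the forcing-term constant beta, and the stopping rules are illustrative placeholders, and no communication is modeled; this is a sketch of the general scheme, not the chapter's algorithms or their distributed implementation.

```python
import numpy as np

def inexact_damped_newton(w0, grad, hess_vec, L, lam, beta=0.1, max_iter=50, tol=1e-8):
    """Sketch: inexact damped Newton, where the Newton system H v = g is solved
    approximately by conjugate gradient up to a forcing tolerance eps_k."""
    w = w0.copy()
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) <= tol:
            break
        eps_k = beta * np.sqrt(lam / L) * np.linalg.norm(g)   # illustrative forcing term
        # --- approximate Newton direction via (unpreconditioned) CG ---
        v = np.zeros_like(w)
        r, p = g.copy(), g.copy()                 # residual of H v = g at v = 0
        for _ in range(1000):                     # CG iterations (capped for safety)
            if np.linalg.norm(r) <= eps_k:
                break
            Hp = hess_vec(w, p)
            alpha = (r @ r) / (p @ Hp)
            v += alpha * p
            r_new = r - alpha * Hp
            p = r_new + ((r_new @ r_new) / (r @ r)) * p
            r = r_new
        # --- damped step, scaled by the approximate Newton decrement ---
        delta = np.sqrt(v @ hess_vec(w, v))
        w = w - v / (1.0 + delta)
    return w
```

In the distributed setting of the chapter, the gradient and Hessian-vector products would be formed by averaging local quantities across machines (one communication round each), and the plain CG loop above would be replaced by the distributed preconditioned CG method analyzed in the text.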
References
A. Agarwal, J.C. Duchi, Distributed delayed stochastic optimization, in Advances in Neural Information Processing Systems (NIPS) 24 (2011), pp. 873–881
Y. Arjevani, O. Shamir, Communication complexity of distributed convex learning and optimization, in Advances in Neural Information Processing Systems (NIPS) 28 (2015), pp. 1756–1764
M. Avriel, Nonlinear Programming: Analysis and Methods (Prentice-Hall, Upper Saddle River, 1976)
F. Bach, Self-concordant analysis for logistic regression. Electron. J. Stat. 4, 384–414 (2010)
R. Bekkerman, M. Bilenko, J. Langford, Scaling Up Machine Learning: Parallel and Distributed Approaches (Cambridge University Press, Cambridge, 2011)
D.P. Bertsekas, J.N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods (Prentice-Hall, Upper Saddle River, 1989)
R.H. Byrd, G.M. Chin, W. Neveitt, J. Nocedal, On the use of stochastic Hessian information in optimization methods for machine learning. SIAM J. Optim. 21(3), 977–995 (2011)
J.A. Blackard, D.J. Dean, C.W. Anderson, Covertype data set, in UCI Machine Learning Repository, ed. by K. Bache, M. Lichman (School of Information and Computer Sciences, University of California, Irvine, 2013). http://archive.ics.uci.edu/ml
O. Bousquet, A. Elisseeff, Stability and generalization. J. Mach. Learn. Res. 2, 499–526 (2002)
S. Boyd, L. Vandenberghe, Convex Optimization (Cambridge University Press, Cambridge, 2004)
S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2010)
C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 27 (2011)
C. Cortes, V. Vapnik, Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives, in Advances in Neural Information Processing Systems (NIPS) 27 (2014), pp. 1646–1654
O. Dekel, R. Gilad-Bachrach, O. Shamir, L. Xiao, Optimal distributed online prediction using mini-batches. J. Mach. Learn. Res. 13(1), 165–202 (2012)
R.S. Dembo, S.C. Eisenstat, T. Steihaug, Inexact Newton methods. SIAM J. Numer. Anal. 19(2), 400–408 (1982)
W. Deng, W. Yin, On the global and linear convergence of the generalized alternating direction method of multipliers. J. Sci. Comput. 66(3), 889–916 (2016)
J.E. Dennis, J.J. Moré, A characterization of superlinear convergence and its application to quasi-Newton methods. Math. Comput. 28(126), 549–560 (1974)
J.C. Duchi, A. Agarwal, M.J. Wainwright, Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Trans. Autom. Control 57(3), 592–606 (2012)
G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd edn. (The Johns Hopkins University Press, Baltimore, 1996)
R.A. Horn, C.R. Johnson, Matrix Analysis (Cambridge University Press, Cambridge, 1985)
S.S. Keerthi, D. DeCoste, A modified finite Newton method for fast solution of large scale linear SVMs. J. Mach. Learn. Res. 6, 341–361 (2005)
K. Lang, Newsweeder: learning to filter netnews, in Proceedings of the Twelfth International Conference on Machine Learning (ICML) (1995), pp. 331–339
N. Le Roux, M. Schmidt, F. Bach, A stochastic gradient method with an exponential convergence rate for finite training sets, in Advances in Neural Information Processing Systems (NIPS) 25 (2012), pp. 2672–2680
J.D. Lee, Y. Sun, M. Saunders, Proximal Newton-type methods for minimizing composite functions. SIAM J. Optim. 24(3), 1420–1443 (2014)
D.D. Lewis, Y. Yang, T. Rose, F. Li, RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004)
Q. Lin, L. Xiao, An adaptive accelerated proximal gradient method and its homotopy continuation for sparse optimization. Comput. Optim. Appl. 60(3), 633–674 (2015)
C.-Y. Lin, C.-H. Tsai, C.-P. Lee, C.-J. Lin, Large-scale logistic regression and linear support vector machines using Spark, in Proceedings of the IEEE Conference on Big Data, Washington, 2014
Q. Lin, Z. Lu, L. Xiao, An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM J. Optim. 25(4), 2244–2273 (2015)
Z. Lu, Randomized block proximal damped Newton method for composite self-concordant minimization. SIAM J. Optim. 27(3), 1910–1942 (2017)
D.G. Luenberger, Introduction to Linear and Nonlinear Programming (Addison-Wesley, New York, 1973)
L. Mackey, M.I. Jordan, R.Y. Chen, B. Farrell, J.A. Tropp, Matrix concentration inequalities via the method of exchangeable pairs. Ann. Probab. 42(3), 906–945 (2014)
D. Mahajan, N. Agrawal, S.S. Keerthi, S. Sundararajan, L. Bottou, An efficient distributed learning algorithm based on effective local functional approximation. arXiv:1310.8418
MPI Forum, MPI: a message-passing interface standard, version 3.0 (2012), http://www.mpi-forum.org
Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Kluwer, Boston, 2004)
Y. Nesterov, Gradient methods for minimizing composite functions. Math. Program. Ser. B 140, 125–161 (2013)
Y. Nesterov, A. Nemirovski, Interior Point Polynomial Time Methods in Convex Programming (SIAM, Philadelphia, 1994)
J. Nocedal, S.J. Wright, Numerical Optimization, 2nd edn. (Springer, New York, 2006)
M. Pilanci, M.J. Wainwright, Iterative Hessian sketch: fast and accurate solution approximation for constrained least-squares. J. Mach. Learn. Res. 17(53), 1–38 (2016)
S.S. Ram, A. Nedić, V.V. Veeravalli, Distributed stochastic subgradient projection algorithms for convex optimization. J. Optim. Theory Appl. 147(3), 516–545 (2010)
B. Recht, C. Re, S. Wright, F. Niu, Hogwild: a lock-free approach to parallelizing stochastic gradient descent, in Advances in Neural Information Processing Systems (2011), pp. 693–701
K. Scaman, F. Bach, S. Bubeck, Y.T. Lee, L. Massoulié, Optimal algorithms for smooth and strongly convex distributed optimization in networks, in Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney (2017), pp. 3027–3036
M. Schmidt, N. Le Roux, F. Bach, Minimizing finite sums with the stochastic average gradient. Math. Program. 162, 83–112 (2017)
S. Shalev-Shwartz, T. Zhang, Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)
S. Shalev-Shwartz, T. Zhang, Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Math. Program. 155(1), 105–145 (2015)
S. Shalev-Shwartz, O. Shamir, N. Srebro, K. Sridharan, Stochastic convex optimization, in Proceedings of the 22nd Annual Conference on Learning Theory (COLT) (2009)
J. Shalf, S. Dosanjh, J. Morrison, Exascale computing technology challenges, in Proceedings of the 9th International Conference on High Performance Computing for Computational Science, VECPAR’10, Berkeley (Springer, Berlin, 2011), pp. 1–25
O. Shamir, N. Srebro, On distributed stochastic optimization and learning, in Proceedings of the 52nd Annual Allerton Conference on Communication, Control, and Computing (2014)
O. Shamir, N. Srebro, T. Zhang, Communication efficient distributed optimization using an approximate Newton-type method, in Proceedings of the 31st International Conference on Machine Learning (ICML), JMLR: W&CP, vol. 32 (2014)
A. Shapiro, D. Dentcheva, A. Ruszczyński, Lectures on Stochastic Programming: Modeling and Theory. MPS-SIAM Series on Optimization (SIAM-MPS, Philadelphia, 2009)
R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 58, 267–288 (1996)
Q. Tran-Dinh, A. Kyrillidis, V. Cevher, Composite self-concordant minimization. J. Mach. Learn. Res. 16, 371–416 (2015)
L. Xiao, T. Zhang, A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4), 2057–2075 (2014)
Y. Zhang, L. Xiao, Stochastic primal-dual coordinate method for regularized empirical risk minimization. J. Mach. Learn. Res. 18(84), 1–42 (2017)
Y. Zhang, J.C. Duchi, M.J. Wainwright, Communication-efficient algorithms for statistical optimization. J. Mach. Learn. Res. 14, 3321–3363 (2013)
Y. Zhuang, W.-S. Chin, Y.-C. Juan, C.-J. Lin, Distributed Newton method for regularized logistic regression. Technical Report, Department of Computer Science, National Taiwan University (2014)
Appendices
Appendix 1: Proof of Theorem 1
First, we notice that Step 2 of Algorithm 1 is equivalent to
which implies
When inequality (11.44) holds, it follows from Nesterov [36, Theorem 4.1.8] that
Here we recall the definitions of the pair of conjugate functions ω(t) = t − log(1 + t) and ω*(t) = −t − log(1 − t), where the latter is defined for 0 ≤ t < 1.
Using the definitions of ω and ω*, and after some algebraic manipulation, we obtain
By the second-order mean-value theorem, we have
for some t satisfying
Using the inequality (11.11), we can upper bound the second derivative ω″(t) as
Therefore,
In addition, we have
Combining the two inequalities above, and using the relation t²∕(1 + t) ≤ 2ω(t) for all t ≥ 0, we obtain
In the last inequality above, we used the fact that for any t ≥ 0 we have
which follows from the convexity of ω(t) and ω(0) = 0. Substituting the above upper bound into inequality (11.45) yields
With inequality (11.46), we are ready to prove the conclusions of Theorem 1. In particular, Part (a) of Theorem 1 holds for any 0 ≤ β ≤ 1∕10.
For part (b), we assume that \(\|{{\widetilde u}_k}\|{ }_2 \leq 1/6\). According to [36, Theorem 4.1.13], when \(\|{{\widetilde u}_k}\|{ }_2 < 1\), it holds that for every k ≥ 0,
Combining this sandwich inequality with inequality (11.46), we have
It is easy to verify that ω*(t) − ω(t) ≤ 0.26 ω(t) for all t ≤ 1∕6, and
Applying these two inequalities to inequality (11.48) completes the proof.
It should be clear that other combinations of the value of β and bound on \(\|{{\widetilde u}_k}\|{ }_2\) are also possible. For example, for β = 1∕10 and \(\|{{\widetilde u}_k}\|{ }_2\leq 1/10\), we have \(\omega (\|{{\widetilde u}_{k+1}}\|{ }_2)\leq 0.65\, \omega (\|{{\widetilde u}_k}\|{ }_2)\).
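The scalar facts invoked above can be checked numerically. The snippet below assumes the standard conjugate pair from self-concordant analysis, ω(t) = t − log(1 + t) and ω*(t) = −t − log(1 − t), which is consistent with the derivative formulas used elsewhere in these appendices; the grids are arbitrary.

```python
import numpy as np

# Standard conjugate pair from self-concordant analysis (assumed definitions):
omega      = lambda t: t - np.log1p(t)        # omega(t)  = t - log(1 + t)
omega_star = lambda t: -t - np.log1p(-t)      # omega*(t) = -t - log(1 - t), for t < 1

t = np.linspace(1e-6, 1/6, 100_000)
assert np.all(omega_star(t) - omega(t) <= 0.26 * omega(t))   # fact used with (11.48)

s = np.linspace(1e-6, 50, 100_000)
assert np.all(s**2 / (1 + s) <= 2 * omega(s))                # t^2/(1+t) <= 2*omega(t)
```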
Appendix 2: Proof of Theorem 2 and Corollary 2
First we prove Theorem 2. We start with the inequality (11.45), and upper bound the last two terms on its right-hand side. Since ω′(t) = t∕(1 + t) < 1, we have
In addition, we have
Applying these two bounds to (11.45), we obtain
Next we bound \(\|{{\widetilde u}_k-{\widetilde v}_k}\|{ }_2\) using the approximation tolerance \(\epsilon_k\) specified in (11.14),
Combining the above inequality with (11.49), and using \(r_k = L^{-1/2}\|{f'(w_k)}\|{ }_2 \leq \|{{\widetilde u}_k}\|{ }_2\) with the monotonicity of ω, we arrive at
Part (a) of the theorem follows immediately from inequality (11.50).
For part (b), we assume that \(\|{{\widetilde u}_k}\|{ }_2\leq 1/8\). Combining (11.47) with (11.50), we have
Let h(t) = ω ∗(t) − ω(t) and consider only t ≥ 0. Notice that h(0) = 0 and \(h'(t)=\frac {2t^2}{1-t^2}<\frac {128}{63}t^2\) for t ≤ 1∕8. Thus, we conclude that \(h(t)\leq \frac {128}{189}t^3\) for t ≤ 1∕8. We also notice that ω(0) = 0 and \(\omega '(t)=\frac {t}{1+t}\geq \frac {8}{9}t\) for t ≤ 1∕8. Thus, we have \(\omega (t)\geq \frac {4}{9}t^2\) for t ≤ 1∕8. Combining these results, we obtain
Applying this inequality to the right-hand side of (11.51) completes the proof.
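In the same spirit, the elementary bounds used in this step can be verified on a grid (again assuming the standard definitions of ω and ω*):

```python
import numpy as np

omega      = lambda t: t - np.log1p(t)        # assumed standard definitions (as above)
omega_star = lambda t: -t - np.log1p(-t)
h = lambda t: omega_star(t) - omega(t)

t = np.linspace(1e-6, 1/8, 100_000)
assert np.all(2 * t**2 / (1 - t**2) <= (128/63) * t**2)   # bound on h'(t) for t <= 1/8
assert np.all(h(t) <= (128/189) * t**3)                   # integrated bound on h(t)
assert np.all(omega(t) >= (4/9) * t**2)                   # lower bound on omega(t)
```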
Next we prove Corollary 2. By part (a) of Theorem 2, if \(\omega (\|{{\widetilde u}_k}\|{ }_2)\geq 1/8\), then each iteration of Algorithm 1 decreases the function value at least by the constant \(\frac {1}{2}\omega (1/8)\). So within at most \(K_1=\left \lceil \frac {2(f(w_0)-f(w_\star ))}{\omega (1/8)}\right \rceil \) iterations, we are guaranteed to have \(\|{{\widetilde u}_k}\|{ }_2\leq 1/8\). Part (b) of Theorem 2 implies \(6\,\omega (\|{{\widetilde u}_{k+1}}\|{ }_2) \leq \left ( 6\,\omega (\|{{\widetilde u}_k}\|{ }_2)\right )^{3/2}\) when \(\|{{\widetilde u}_k}\|{ }_2\leq 1/8\), and hence
Note that both sides of the above inequality are negative. Therefore, after \(k\geq K_1 + \frac {\log \log (1/(3\epsilon ))}{\log (3/2)}\) iterations (assuming 𝜖 ≤ 1∕(3e)), we have
which implies \(\omega (\|{{\widetilde u}_k}\|{ }_2) \leq \epsilon /2\). Finally, using (11.47) and the fact that ω*(t) ≤ 2ω(t) for t ≤ 1∕8, we obtain
This completes the proof of Corollary 2.
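To illustrate the second phase of Corollary 2, the short loop below iterates the recursion 6ω_{k+1} ≤ (6ω_k)^{3/2} starting from the worst case allowed by part (a), namely 6ω(1/8), and counts how many steps are needed to reach ω ≤ 𝜖/2; the target accuracy is an arbitrary example value, and the printed bound is the log log term from the corollary.

```python
import numpy as np

omega = lambda t: t - np.log1p(t)
eps = 1e-10                              # illustrative target accuracy (eps <= 1/(3e))
a = 6 * omega(1/8)                       # value of 6*omega(||u_k||_2) after the first phase
k = 0
while a > 3 * eps:                       # i.e. until omega(||u_k||_2) <= eps/2
    a = a ** 1.5                         # the recursion 6*omega_{k+1} <= (6*omega_k)^{3/2}
    k += 1
print(k, np.log(np.log(1 / (3 * eps))) / np.log(1.5))   # observed count vs. the bound
```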
Appendix 3: Proof of Lemma 4
It suffices to show that the algorithm terminates at iteration \(t \leq T_\mu - 1\), because when the algorithm terminates, it outputs a vector \(v_k\) which satisfies
Denote by \(v_* = H^{-1} f'(w_k)\) the solution of the linear system \(H v_k = f'(w_k)\). By the classical analysis of the preconditioned conjugate gradient method (see, e.g., [3, 32]), Algorithm 2 has the following convergence property:
where κ = 1 + 2μ∕λ is the condition number of \(P^{-1}H\) given in (11.25). For the left-hand side of inequality (11.52), we have
For the right-hand side of inequality (11.52), we have
Combining the above two inequalities with inequality (11.52), we obtain
To guarantee that \(\|r^{(t)}\|_2 \leq \epsilon_k\), it suffices to have
where in the last inequality we used \(-\log (1-x) \geq x\) for 0 < x < 1. Comparing with the definition of \(T_\mu\) in (11.26), this is the desired result.
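For concreteness, a generic preconditioned conjugate gradient routine with the residual stopping rule \(\|r^{(t)}\|_2 \leq \epsilon_k\) is sketched below; the callbacks hess_vec (multiplication by H) and precond_solve (application of P⁻¹) are placeholders, and this is not Algorithm 2 verbatim. With κ = 1 + 2μ∕λ as above, the classical analysis cited in the proof gives an iteration count on the order of √κ times a logarithmic factor, which is the role played by \(T_\mu\) in (11.26).

```python
import numpy as np

def pcg(hess_vec, precond_solve, b, eps, max_iter=1000):
    """Generic preconditioned CG for H v = b; stops when ||r^(t)||_2 <= eps,
    matching the stopping rule analyzed in Lemma 4 (a sketch, not Algorithm 2)."""
    v = np.zeros_like(b)
    r = b.copy()                      # residual b - H v with v = 0
    s = precond_solve(r)              # s = P^{-1} r
    p = s.copy()
    rs = r @ s
    t = 0
    while np.linalg.norm(r) > eps and t < max_iter:
        Hp = hess_vec(p)
        alpha = rs / (p @ Hp)
        v += alpha * p
        r -= alpha * Hp
        s = precond_solve(r)
        rs_new = r @ s
        p = s + (rs_new / rs) * p
        rs, t = rs_new, t + 1
    return v, t
```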
Appendix 4: Proof of Lemma 5
First, we prove Inequality (11.31). Recall that \(w_\star\) and \({\widehat w}_i\) minimize f(w) and \(f_i(w) + (\rho /2)\|{w}\|{ }_2^2\), respectively. Since both functions are λ-strongly convex, we have
where we also used Assumption 2(a) in the first inequality on both lines. These two inequalities imply \(\|{w_\star }\|{ }_2 \leq \sqrt {2V_0/\lambda }\) and \(\|{{\widehat w}_i}\|{ }_2 \leq \sqrt {2V_0/\lambda }\). Then the inequality (11.31) follows since \(w_0\) is the average of \({\widehat w}_i\) for i = 1, …, m.
Next we prove inequality (11.32). Let z be a random variable in \(\mathcal {Z}\subset \mathbb {R}^p\) with an unknown probability distribution. We define a regularized population risk:
Let S be a set of n i.i.d. samples in \(\mathcal {Z}\) from the same distribution. We define a regularized empirical risk
and its minimizer
The following lemma states that the population risk of \({\widehat w}_S\) is very close to its empirical risk. The proof is based on the notion of stability of regularized ERM [9].
Lemma 7
Suppose Assumption 2 holds and S is a set of n i.i.d. samples in \(\mathcal {Z}\) . Then
Proof
Let \(S = \{z_1, \ldots, z_n\}\). For any k ∈ {1, …, n}, we define a modified training set \(S^{(k)}\) by replacing \(z_k\) with another sample \({\widetilde z}_k\), which is drawn from the same distribution and is independent of S. The empirical risk on \(S^{(k)}\) is defined as
Let \({\widehat w}_S^{(k)}= \arg \min _w r_S^{(k)}(w)\). Since both \(r_S\) and \(r_S^{(k)}\) are ρ-strongly convex, we have
Summing the above two inequalities, and noticing that
we have
By Assumption 2(b) and the facts \(\|{{\widehat w}_S}\|{ }_2\leq \sqrt {2V_0/\lambda }\) and \(\|{{\widehat w}_S^{(k)}}\|{ }_2\leq \sqrt {2V_0/\lambda }\), we have
Combining the above Lipschitz condition with (11.53), we obtain
As a consequence, we have \(\|{{\widehat w}_S^{(k)} - {\widehat w}_S}\|{ }_2 \leq \frac {2 G}{\rho n}\), and therefore
In the terminology of learning theory, this means that empirical minimization over the regularized loss \(r_S\) has uniform stability 2G²∕(ρn) with respect to the loss function ϕ; see [9].
For any fixed k ∈{1, …, n}, since \({\widetilde z}_k\) is independent of S, we have
where the second equality used the fact that \(\mathbb {E}_S[\phi ({\widehat w}_S,z_j)]\) has the same value for all j = 1, …, n, and the third equality used the symmetry between the pairs \((S, z_k)\) and \((S^{(k)}, {\widetilde z}_k)\) (also known as the renaming trick; see [9, Lemma 7]). Combining the above equality with (11.54) yields the desired result. □
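The stability argument behind Lemma 7 can be sanity-checked by simulation. The sketch below fits a ρ-regularized logistic-loss ERM, replaces one training example by an independent copy, and compares the shift of the minimizer against the bound 2G∕(ρn) used above; the model, data, dimensions, and solver are illustrative choices, with feature norms clipped so that G = 1 is a valid Lipschitz constant for the loss.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, rho = 200, 5, 0.1

def sample(m):
    X = rng.normal(size=(m, d))
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)   # clip ||x||_2 <= 1
    return X, rng.choice([-1.0, 1.0], size=m)

def erm(X, y):
    # minimizer of the rho-regularized empirical logistic loss (the role of w_hat_S)
    obj = lambda w: np.mean(np.log1p(np.exp(-y * (X @ w)))) + 0.5 * rho * (w @ w)
    return minimize(obj, np.zeros(d), method="L-BFGS-B").x

X, y = sample(n)
w_S = erm(X, y)

k = 7                                   # replace the k-th example by an independent one
Xk, yk = X.copy(), y.copy()
x_new, y_new = sample(1)
Xk[k], yk[k] = x_new[0], y_new[0]
w_Sk = erm(Xk, yk)

G = 1.0                                 # ||x||_2 <= 1, so G = 1 bounds the loss gradient
print(np.linalg.norm(w_Sk - w_S), "<=", 2 * G / (rho * n))
```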
Next, we consider a distributed system with m machines, where each machine has a local dataset \(S_i\) of size n, for i = 1, …, m. To simplify notation, we denote the local regularized empirical loss function and its minimizer by \(r_i\) and \({\widehat w}_i\), respectively. We would like to bound the excess error when applying \({\widehat w}_i\) to a different dataset \(S_j\). Notice that
where \({\widehat w}_R\) is the minimizer of R(w). Since \(S_i\) and \(S_j\) are independent, we have
where the inequality is due to Lemma 7. For the second term in (11.55), we have
It remains to bound the third term \(v_3\). We first use the strong convexity of \(r_j\) to obtain (see, e.g., [36, Theorem 2.1.10])
where \(r^{\prime }_j({\widehat w}_R)\) denotes the gradient of \(r_j\) at \({\widehat w}_R\). If we index the elements of \(S_j\) by \(z_1, \ldots, z_n\), then
By the optimality condition of \({\widehat w}_R=\arg \min _w R(w)\), we have for any k ∈{1, …, n},
Therefore, according to (11.57), the gradient \(r^{\prime }_j({\widehat w}_R)\) is the average of n independent and zero-mean random vectors. Combining (11.56) and (11.57) with the definition of \(v_3\) in (11.55), we have
In the equality above, we used the fact that \(\phi '({\widehat w}_R,z_k)+(\lambda +\rho ){\widehat w}_R\) are i.i.d. zero-mean random variables, so the variance of their sum equals the sum of their variances. The last inequality above is due to Assumption 2(b) and the fact that \(\|{{\widehat w}_R}\|{ }_2\leq \sqrt {2V_0/(\lambda +\rho )}\leq \sqrt {2V_0/\lambda }\). Combining the upper bounds for \(v_1\), \(v_2\) and \(v_3\), we have
Recall the definition of f as
where \(z_{i,k}\) denotes the kth sample at machine i. Let \(r(w) = (1/m)\sum _{j=1}^m r_j(w)\), then
We compare the value \(r({\widehat w}_i)\), for any i ∈{1, …, m}, with the minimum of r(w):
Taking expectation with respect to all the random data sets \(S_1, \ldots, S_m\), we obtain
where the last inequality is due to (11.58). Finally, we bound the expected value of \(f({\widehat w}_i)\):
where the first inequality holds because of (11.59), the second inequality is due to (11.60), and the last inequality follows from the assumption that \(\mathbb {E}[\|{w_\star }\|{ }_2^2]\leq D^2\). Choosing \(\rho = \sqrt {6 G^2/(n D^2)}\) results in
Since \(w_0=(1/m)\sum _{i=1}^m {\widehat w}_i\), we can use the convexity of the function f to conclude that \(\mathbb {E}[f(w_0) - f(w_\star )] \leq \sqrt {6}G D/\sqrt {n}\), which is the desired result.
Appendix 5: Proof of Lemma 6
We consider the regularized empirical loss functions \(f_i\) defined in (11.30). For any two vectors \(u,w\in \mathbb {R}^d\) satisfying \(\|u - w\|_2 \leq \varepsilon\), Assumption 2(d) implies
Let B(0, r) be the ball in \(\mathbb {R}^d\) with radius r, centered at the origin. Let \(N_\varepsilon ^{\mathrm {cov}}(B(0,r))\) be the covering number of B(0, r) by balls of radius ε, i.e., the minimum number of balls of radius ε required to cover B(0, r). We also define \(N_\varepsilon ^{\mathrm {pac}}(B(0,r))\) as the packing number of B(0, r), i.e., the maximum number of disjoint balls whose centers belong to B(0, r). It is easy to verify that
Therefore, there exists a set of points \(U\subseteq \mathbb {R}^d\) with cardinality at most \((1 + 2r/\varepsilon)^d\), such that for any vector w ∈ B(0, r), we have
We consider an arbitrary point u ∈ U and the associated Hessian matrices for the functions \(f_i\) defined in (11.30). We have
The components of the above sum are i.i.d. matrices that are upper bounded by LI. By the matrix Hoeffding inequality [33, Corollary 4.2], we have
Note that \(\mathbb {E}[f^{\prime \prime }_1(w)] = \mathbb {E}[f^{\prime \prime }(w)]\) for any w ∈ B(0, r). Using the triangle inequality and inequality (11.61), we obtain
Applying the union bound, we have with probability at least
the inequality \(\|{f^{\prime \prime }_i(u) - \mathbb {E}[f^{\prime \prime }_i(u)]}\|{ }_2 \leq t\) holds for every i ∈{1, …, m} and every u ∈ U. Combining this probability bound with inequality (11.62), we have
As the final step, we choose \(\varepsilon = \frac {\sqrt {2}L}{\sqrt {n}M}\) and then choose t to make the right-hand side of inequality (11.63) equal to δ. This yields the desired result.
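As a rough empirical illustration of the concentration phenomenon behind Lemma 6 (not of the matrix Hoeffding bound itself), the snippet below averages n i.i.d. logistic-loss Hessian terms at a fixed point and checks that the spectral-norm deviation from a large-sample proxy of the expectation decays roughly like 1/√n; the model, dimensions, and sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d, lam, trials = 10, 1e-3, 20
w = rng.normal(size=d)

def local_hessian(n):
    # empirical Hessian of a regularized logistic loss at w, from n fresh samples
    X = rng.normal(size=(n, d)) / np.sqrt(d)          # ||x||_2 ~ 1, so the terms are bounded
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return (X * (p * (1.0 - p))[:, None]).T @ X / n + lam * np.eye(d)

H_ref = local_hessian(400_000)                        # large-sample proxy for E[f_i''(w)]
for n in (1_000, 4_000, 16_000):
    dev = np.mean([np.linalg.norm(local_hessian(n) - H_ref, 2) for _ in range(trials)])
    print(n, dev, dev * np.sqrt(n))                   # last column should stay roughly flat
```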
Appendix 6: Proof of Theorem 5
Suppose Algorithm 3 terminates in K iterations. Let \(t_k\) be the number of conjugate gradient steps in each call of Algorithm 2, for k = 0, 1, …, K − 1. For any given μ > 0, we define \(T_\mu\) as in (11.26). Let \(\mathcal {A}\) denote the event that \(t_k \leq T_\mu\) for all k ∈ {0, …, K − 1}. Let \({\mathcal {A}_{\mathrm {c}}}\) be the complement of \(\mathcal {A}\), i.e., the event that \(t_k > T_\mu\) for some k ∈ {0, …, K − 1}. In addition, let the probabilities of the events \(\mathcal {A}\) and \({\mathcal {A}_{\mathrm {c}}}\) be 1 − δ and δ respectively. By the law of total expectation, we have
When the event \(\mathcal {A}\) happens, we have \(T \leq 1 + K(T_\mu + 1)\) where \(T_\mu\) is given in (11.26); otherwise we have \(T \leq 1 + K(T_L + 1)\), where
bounds the number of PCG iterations in Algorithm 2 when the event \({\mathcal {A}_{\mathrm {c}}}\) happens. Since Algorithm 2 always ensures \(\|f''(w_k)v_k - f'(w_k)\|_2 \leq \epsilon_k\), the outer iteration count K shares the same bound in (11.12), which depends on \(f(w_0) - f(w_\star)\). Notice that \(f(w_0) - f(w_\star)\) is a random variable depending on the random generation of the datasets. However, \(T_\mu\) and \(T_L\) are deterministic constants. So we have
Next we bound \(\mathbb {E}[K|\mathcal {A}]\) and \(\mathbb {E}[K|{\mathcal {A}_{\mathrm {c}}}]\) separately. To bound \(\mathbb {E}[K|\mathcal {A}]\), we use
to obtain
In order to bound \(\mathbb {E}[K|{\mathcal {A}_{\mathrm {c}}}]\), we derive a deterministic bound on \(f(w_0) - f(w_\star)\). By Lemma 5, we have \(\|{w_0}\|{ }_2\leq \sqrt {2V_0/\lambda }\), which together with Assumption 2(b) yields
Combining with the strong convexity of f, we obtain
Therefore by Corollary 1,
where the additional 1 compensates for removing one ⌈⋅⌉ operator in (11.12).
Using inequality (11.65), the bound on \(\mathbb {E}[K|\mathcal {A}]\) in (11.66) and the bound on \(\mathbb {E}[K|{\mathcal {A}_{\mathrm {c}}}]\) in (11.67), we obtain
Now we can bound \(\mathbb {E}[K]\) by Corollary 1 and Lemma 5. More specifically,
where \(C_0=1+\left \lceil \log _2(2\omega (1/6)/\epsilon ) \right \rceil \). With the choice of δ in (11.34) and the definition of \(T_L\) in (11.64), we have
where \(C_2=\log (2L/(\beta \lambda ))\). Putting everything together, we have
Replacing \(T_\mu\) by its expression in (11.26) and applying Corollary 4, we obtain the desired result.