Communication-Efficient Distributed Optimization of Self-concordant Empirical Loss

Large-Scale and Distributed Optimization

Part of the book series: Lecture Notes in Mathematics (LNM, volume 2227)

Abstract

We consider distributed convex optimization problems originating from sample average approximation of stochastic optimization, or empirical risk minimization in machine learning. We assume that each machine in the distributed computing system has access to a local empirical loss function, constructed with i.i.d. data sampled from a common distribution. We propose a communication-efficient distributed algorithm to minimize the overall empirical loss, which is the average of the local empirical losses. The algorithm is based on an inexact damped Newton method, where the inexact Newton steps are computed by a distributed preconditioned conjugate gradient method. We analyze its iteration complexity and communication efficiency for minimizing self-concordant empirical loss functions, and discuss the results for ridge regression, logistic regression and binary classification with a smoothed hinge loss. In a standard setting for supervised learning where the condition number of the problem grows with the square root of the sample size, the required number of communication rounds of the algorithm does not increase with the sample size, and only grows slowly with the number of machines.

References

  1. A. Agarwal, J.C. Duchi, Distributed delayed stochastic optimization, in Advances in Neural Information Processing Systems (NIPS) 24 (2011), pp. 873–881

  2. Y. Arjevani, O. Shamir, Communication complexity of distributed convex learning and optimization, in Advances in Neural Information Processing Systems (NIPS) 28 (2015), pp. 1756–1764

  3. M. Avriel, Nonlinear Programming: Analysis and Methods (Prentice-Hall, Upper Saddle River, 1976)

  4. F. Bach, Self-concordant analysis for logistic regression. Electron. J. Stat. 4, 384–414 (2010)

  5. R. Bekkerman, M. Bilenko, J. Langford, Scaling Up Machine Learning: Parallel and Distributed Approaches (Cambridge University Press, Cambridge, 2011)

  6. D.P. Bertsekas, J.N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods (Prentice-Hall, Upper Saddle River, 1989)

  7. R.H. Byrd, G.M. Chin, W. Neveitt, J. Nocedal, On the use of stochastic Hessian information in optimization methods for machine learning. SIAM J. Optim. 21(3), 977–995 (2011)

  8. J.A. Blackard, D.J. Dean, C.W. Anderson, Covertype data set, in UCI Machine Learning Repository, ed. by K. Bache, M. Lichman (School of Information and Computer Sciences, University of California, Irvine, 2013). http://archive.ics.uci.edu/ml

  9. O. Bousquet, A. Elisseeff, Stability and generalization. J. Mach. Learn. Res. 2, 499–526 (2002)

  10. S. Boyd, L. Vandenberghe, Convex Optimization (Cambridge University Press, Cambridge, 2004)

  11. S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2010)

  12. C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 27 (2011)

  13. C. Cortes, V. Vapnik, Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)

  14. J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

  15. A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives, in Advances in Neural Information Processing Systems (NIPS) 27 (2014), pp. 1646–1654

  16. O. Dekel, R. Gilad-Bachrach, O. Shamir, L. Xiao, Optimal distributed online prediction using mini-batches. J. Mach. Learn. Res. 13(1), 165–202 (2012)

  17. R.S. Dembo, S.C. Eisenstat, T. Steihaug, Inexact Newton methods. SIAM J. Numer. Anal. 19(2), 400–408 (1982)

  18. W. Deng, W. Yin, On the global and linear convergence of the generalized alternating direction method of multipliers. J. Sci. Comput. 66(3), 889–916 (2016)

  19. J.E. Dennis, J.J. Moré, A characterization of superlinear convergence and its application to quasi-Newton methods. Math. Comput. 28(126), 549–560 (1974)

  20. J.C. Duchi, A. Agarwal, M.J. Wainwright, Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Trans. Autom. Control 57(3), 592–606 (2012)

  21. G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd edn. (The Johns Hopkins University Press, Baltimore, 1996)

  22. R.A. Horn, C.R. Johnson, Matrix Analysis (Cambridge University Press, Cambridge, 1985)

  23. S.S. Keerthi, D. DeCoste, A modified finite Newton method for fast solution of large scale linear SVMs. J. Mach. Learn. Res. 6, 341–361 (2005)

  24. K. Lang, Newsweeder: learning to filter netnews, in Proceedings of the Twelfth International Conference on Machine Learning (ICML) (1995), pp. 331–339

  25. N. Le Roux, M. Schmidt, F. Bach, A stochastic gradient method with an exponential convergence rate for finite training sets, in Advances in Neural Information Processing Systems (NIPS) 25 (2012), pp. 2672–2680

  26. J.D. Lee, Y. Sun, M. Saunders, Proximal Newton-type methods for minimizing composite functions. SIAM J. Optim. 24(3), 1420–1443 (2014)

  27. D.D. Lewis, Y. Yang, T. Rose, F. Li, RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004)

  28. Q. Lin, L. Xiao, An adaptive accelerated proximal gradient method and its homotopy continuation for sparse optimization. Comput. Optim. Appl. 60(3), 633–674 (2015)

  29. C.-Y. Lin, C.-H. Tsai, C.-P. Lee, C.-J. Lin, Large-scale logistic regression and linear support vector machines using Spark, in Proceedings of the IEEE Conference on Big Data, Washington, 2014

  30. Q. Lin, Z. Lu, L. Xiao, An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM J. Optim. 25(4), 2244–2273 (2015)

  31. Z. Lu, Randomized block proximal damped Newton method for composite self-concordant minimization. SIAM J. Optim. 27(3), 1910–1942 (2017)

  32. D.G. Luenberger, Introduction to Linear and Nonlinear Programming (Addison-Wesley, New York, 1973)

  33. L. Mackey, M.I. Jordan, R.Y. Chen, B. Farrell, J.A. Tropp et al., Matrix concentration inequalities via the method of exchangeable pairs. Ann. Probab. 42(3), 906–945 (2014)

  34. D. Mahajan, N. Agrawal, S.S. Keerthi, S. Sundararajan, L. Bottou, An efficient distributed learning algorithm based on effective local functional approximation. arXiv:1310.8418

  35. MPI Forum, MPI: a message-passing interface standard, version 3.0 (2012), http://www.mpi-forum.org

  36. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Kluwer, Boston, 2004)

  37. Y. Nesterov, Gradient methods for minimizing composite functions. Math. Program. Ser. B 140, 125–161 (2013)

  38. Y. Nesterov, A. Nemirovski, Interior Point Polynomial Time Methods in Convex Programming (SIAM, Philadelphia, 1994)

  39. J. Nocedal, S.J. Wright, Numerical Optimization, 2nd edn. (Springer, New York, 2006)

  40. M. Pilanci, M.J. Wainwright, Iterative Hessian sketch: fast and accurate solution approximation for constrained least-squares. J. Mach. Learn. Res. 17(53), 1–38 (2016)

  41. S.S. Ram, A. Nedić, V.V. Veeravalli, Distributed stochastic subgradient projection algorithms for convex optimization. J. Optim. Theory Appl. 147(3), 516–545 (2010)

  42. B. Recht, C. Re, S. Wright, F. Niu, Hogwild: a lock-free approach to parallelizing stochastic gradient descent, in Advances in Neural Information Processing Systems (2011), pp. 693–701

  43. K. Scaman, F. Bach, S. Bubeck, Y.T. Lee, L. Massoulié, Optimal algorithms for smooth and strongly convex distributed optimization in networks, in Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney (2017), pp. 3027–3036

  44. M. Schmidt, N. Le Roux, F. Bach, Minimizing finite sums with the stochastic average gradient. Math. Program. 162, 83–112 (2017)

  45. S. Shalev-Shwartz, T. Zhang, Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)

  46. S. Shalev-Shwartz, T. Zhang, Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Math. Program. 155(1), 105–145 (2015)

  47. S. Shalev-Shwartz, O. Shamir, N. Srebro, K. Sridharan, Stochastic convex optimization, in Proceedings of the 22nd Annual Conference on Learning Theory (COLT) (2009)

  48. J. Shalf, S. Dosanjh, J. Morrison, Exascale computing technology challenges, in Proceedings of the 9th International Conference on High Performance Computing for Computational Science, VECPAR’10, Berkeley (Springer, Berlin, 2011), pp. 1–25

  49. O. Shamir, N. Srebro, On distributed stochastic optimization and learning, in Proceedings of the 52nd Annual Allerton Conference on Communication, Control, and Computing (2014)

  50. O. Shamir, N. Srebro, T. Zhang, Communication efficient distributed optimization using an approximate Newton-type method, in Proceedings of the 31st International Conference on Machine Learning (ICML), JMLR: W&CP, vol. 32 (2014)

  51. A. Shapiro, D. Dentcheva, A. Ruszczyński, Lectures on Stochastic Programming: Modeling and Theory. MPS-SIAM Series on Optimization (SIAM-MPS, Philadelphia, 2009)

  52. R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 58, 267–288 (1996)

  53. Q. Tran-Dinh, A. Kyrillidis, V. Cevher, Composite self-concordant minimization. J. Mach. Learn. Res. 16, 371–416 (2015)

  54. L. Xiao, T. Zhang, A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4), 2057–2075 (2014)

  55. Y. Zhang, L. Xiao, Stochastic primal-dual coordinate method for regularized empirical risk minimization. J. Mach. Learn. Res. 18(84), 1–42 (2017)

  56. Y. Zhang, J.C. Duchi, M.J. Wainwright, Communication-efficient algorithms for statistical optimization. J. Mach. Learn. Res. 14, 3321–3363 (2013)

  57. Y. Zhuang, W.-S. Chin, Y.-C. Juan, C.-J. Lin, Distributed Newton method for regularized logistic regression. Technical Report, Department of Computer Science, National Taiwan University (2014)

Author information

Correspondence to Lin Xiao.

Appendices

Appendix 1: Proof of Theorem 1

First, we notice that Step 2 of Algorithm 1 is equivalent to

$$\displaystyle \begin{aligned} w_{k+1} - w_k = \frac{v_k}{1+\delta_k} = \frac{v_k}{1+\|{{\widetilde v}_k}\|{}_2}, \end{aligned}$$

which implies

$$\displaystyle \begin{aligned} \|{[f^{\prime\prime}(w_k)]^{1/2}(w_{k+1} - w_k)}\|{}_2 = \frac{\|{{\widetilde v}_k}\|{}_2}{1 + \|{{\widetilde v}_k}\|{}_2} < 1. \end{aligned} $$
(11.44)

When inequality (11.44) holds, Nesterov [36, Theorem 4.1.8] has shown that

$$\displaystyle \begin{aligned} f(w_{k+1}) \leq f(w_k) + \langle f'(w_k), w_{k+1} - w_k \rangle + \omega_*\bigl(\|{[f^{\prime\prime}(w_k)]^{1/2}(w_{k+1} - w_k)}\|{}_2\bigr) . \end{aligned} $$

Here we recall the definitions of the pair of conjugate functions

$$\displaystyle \begin{aligned} \omega(t) & = t - \log(1+t), \qquad t\geq 0, \\ \omega_*(t) & = -t - \log(1-t), \qquad 0\leq t < 1. \end{aligned} $$
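
As a quick aside, the two functions are conjugate in the Fenchel sense: ω_*(t) = sup_{s≥0}{st − ω(s)} for 0 ≤ t < 1. The following minimal sketch (not part of the chapter; it assumes NumPy is available) defines the pair and spot-checks this relation on a grid.

```python
# A minimal numerical sketch (not from the chapter; assumes NumPy): the pair
# omega(t) = t - log(1+t) and omega_*(t) = -t - log(1-t), together with a grid
# check that omega_* coincides with the Fenchel conjugate of omega over s >= 0.
import numpy as np

def omega(t):
    return t - np.log1p(t)            # defined for t >= 0

def omega_star(t):
    return -t - np.log1p(-t)          # defined for 0 <= t < 1

s = np.linspace(0.0, 50.0, 200001)    # dense grid for the sup over s >= 0
for t in np.linspace(0.0, 0.9, 10):
    sup_val = np.max(t * s - omega(s))            # sup_s { s*t - omega(s) }
    assert abs(sup_val - omega_star(t)) < 1e-3
print("omega_* matches the Fenchel conjugate of omega on the test grid")
```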

Using the definitions of ω and ω_* and some algebraic operations, we obtain

$$\displaystyle \begin{aligned} f(w_{k+1}) &\leq f(w_k) - \frac{\langle {\widetilde u}_k, {\widetilde v}_k \rangle}{1 + \|{{\widetilde v}_k}\|{}_2} - \frac{\|{{\widetilde v}_k}\|{}_2}{1 + \|{{\widetilde v}_k}\|{}_2} + \log(1 + \|{{\widetilde v}_k}\|{}_2 ) \\ &= f(w_k) - \omega(\|{{\widetilde u}_k}\|{}_2) + \big( \omega(\|{{\widetilde u}_k}\|{}_2) - \omega(\|{{\widetilde v}_k}\|{}_2)\big) + \frac{\langle {\widetilde v}_k - {\widetilde u}_k, {\widetilde v}_k \rangle}{1 + \|{{\widetilde v}_k}\|{}_2}. \end{aligned} $$
(11.45)

By the second-order mean-value theorem, we have

$$\displaystyle \begin{aligned} \omega(\|{{\widetilde u}_k}\|{}_2) - \omega(\|{{\widetilde v}_k}\|{}_2) = \omega'(\|{{\widetilde v}_k}\|{}_2)(\|{{\widetilde u}_k}\|{}_2-\|{{\widetilde v}_k}\|{}_2) +\frac{1}{2}\omega^{\prime\prime}(t)\left(\|{{\widetilde u}_k}\|{}_2-\|{{\widetilde v}_k}\|{}_2\right)^2 \end{aligned}$$

for some t satisfying

$$\displaystyle \begin{aligned} \min\{\|{{\widetilde u}_k}\|{}_2,\|{{\widetilde v}_k}\|{}_2\} \leq t \leq \max\{\|{{\widetilde u}_k}\|{}_2,\|{{\widetilde v}_k}\|{}_2\} . \end{aligned}$$

Using inequality (11.11), we can upper bound the second derivative ω″(t) as

$$\displaystyle \begin{aligned} \omega^{\prime\prime}(t) = \frac{1}{(1+t)^2} \leq \frac{1}{1+t} \leq \frac{1}{1+\min\{\|{{\widetilde u}_k}\|{}_2,\|{{\widetilde v}_k}\|{}_2\} } \leq \frac{1}{1+(1-\beta)\|{{\widetilde u}_k}\|{}_2}. \end{aligned}$$

Therefore,

$$\displaystyle \begin{aligned} \omega(\|{{\widetilde u}_k}\|{}_2) - \omega(\|{{\widetilde v}_k}\|{}_2) &=\frac{(\|{{\widetilde u}_k}\|{}_2-\|{{\widetilde v}_k}\|{}_2)\|{{\widetilde v}_k}\|{}_2}{1+\|{{\widetilde v}_k}\|{}_2} +\frac{1}{2}\omega^{\prime\prime}(t)\left(\|{{\widetilde u}_k}\|{}_2-\|{{\widetilde v}_k}\|{}_2\right)^2\\ &\leq \frac{\|{{\widetilde u}_k-{\widetilde v}_k}\|{}_2 \|{{\widetilde v}_k}\|{}_2}{1+(1-\beta)\|{{\widetilde u}_k}\|{}_2} + \frac{(1/2)\|{{\widetilde u}_k-{\widetilde v}_k}\|{}_2^2}{1+(1-\beta)\|{{\widetilde u}_k}\|{}_2}\\ &\leq \frac{\beta(1+\beta)\|{{\widetilde u}_k}\|{}_2^2+(1/2)\beta^2\|{{\widetilde u}_k}\|{}_2^2}{1+(1-\beta)\|{{\widetilde u}_k}\|{}_2} . \end{aligned} $$

In addition, we have

$$\displaystyle \begin{aligned} \frac{\langle {\widetilde v}_k - {\widetilde u}_k, {\widetilde v}_k \rangle}{1 + \|{{\widetilde v}_k}\|{}_2} \leq \frac{\|{{\widetilde u}_k-{\widetilde v}_k}\|{}_2 \|{{\widetilde v}_k}\|{}_2}{1+\|{{\widetilde v}_k}\|{}_2} \leq \frac{\beta(1+\beta)\|{{\widetilde u}_k}\|{}_2^2}{1+(1-\beta)\|{{\widetilde u}_k}\|{}_2}. \end{aligned}$$

Combining the two inequalities above, and using the relation t²∕(1 + t) ≤ 2ω(t) for all t ≥ 0, we obtain

$$\displaystyle \begin{aligned} \omega(\|{{\widetilde u}_k}\|{}_2) - \omega(\|{{\widetilde v}_k}\|{}_2) +\frac{\langle {\widetilde v}_k - {\widetilde u}_k, {\widetilde v}_k \rangle}{1 + \|{{\widetilde v}_k}\|{}_2} &\leq \left(2\beta(1+\beta)+(1/2)\beta^2\right) \\ &\quad \frac{\|{{\widetilde u}_k}\|{}_2^2}{1+(1-\beta)\|{{\widetilde u}_k}\|{}_2} \\ &= \left(\frac{2\beta+(5/2)\beta^2}{(1-\beta)^2}\right) \frac{(1-\beta)^2\|{{\widetilde u}_k}\|{}_2^2}{1+(1-\beta)\|{{\widetilde u}_k}\|{}_2} \\ &\leq \left(\frac{2\beta+(5/2)\beta^2}{(1-\beta)^2}\right) 2 \omega\bigl( (1-\beta)\|{{\widetilde u}_k}\|{}_2 \bigr) \\ &\leq \left(\frac{4\beta+5\beta^2}{1-\beta}\right) \omega\bigl( \|{{\widetilde u}_k}\|{}_2 \bigr) . \end{aligned} $$

In the last inequality above, we used the fact that for any t ≥ 0 we have

$$\displaystyle \begin{aligned} \omega((1-\beta)t) \leq (1-\beta)\omega(t), \end{aligned}$$

which is the result of convexity of ω(t) and ω(0) = 0. Substituting the above upper bound into inequality (11.45) yields

$$\displaystyle \begin{aligned} f(w_{k+1}) \leq f(w_k) - \left(1 - \frac{4\beta+5\beta^2}{1-\beta} \right) \omega(\|{{\widetilde u}_k}\|{}_2). \end{aligned} $$
(11.46)

With inequality (11.46), we are ready to prove the conclusions of Theorem 1. In particular, Part (a) of Theorem 1 holds for any 0 ≤ β ≤ 1∕10.

For part (b), we assume that \(\|{{\widetilde u}_k}\|{ }_2 \leq 1/6\). According to [36, Theorem 4.1.13], when \(\|{{\widetilde u}_k}\|{ }_2 < 1\), it holds that for every k ≥ 0,

$$\displaystyle \begin{aligned} \omega(\|{{\widetilde u}_k}\|{}_2) \leq f(w_k) - f(w_\star) \leq \omega_*(\|{{\widetilde u}_k}\|{}_2) . \end{aligned} $$
(11.47)

Combining this sandwich inequality with inequality (11.46), we have

$$\displaystyle \begin{aligned} \omega(\|{{\widetilde u}_{k+1}}\|{}_2) &\leq f(w_{k+1}) - f(w_\star) \\ &\leq f(w_k) - f(w_\star) - \omega(\|{{\widetilde u}_k}\|{}_2) + \frac{4\beta+5\beta^2}{1-\beta} \omega(\|{{\widetilde u}_k}\|{}_2)\\ &\leq \omega_*(\|{{\widetilde u}_k}\|{}_2) - \omega(\|{{\widetilde u}_k}\|{}_2) + \frac{4\beta+5\beta^2}{1-\beta}\omega(\|{{\widetilde u}_k}\|{}_2) . {} \end{aligned} $$
(11.48)

It is easy to verify that ω_*(t) − ω(t) ≤ 0.26 ω(t) for all t ≤ 1∕6, and

$$\displaystyle \begin{aligned} (4\beta+5\beta^2)/(1-\beta) \leq 0.23, \qquad \mbox{if}\quad \beta\leq 1/20. \end{aligned}$$

Applying these two inequalities to inequality (11.48) completes the proof.

It should be clear that other combinations of the value of β and bound on \(\|{{\widetilde u}_k}\|{ }_2\) are also possible. For example, for β = 1∕10 and \(\|{{\widetilde u}_k}\|{ }_2\leq 1/10\), we have \(\omega (\|{{\widetilde u}_{k+1}}\|{ }_2)\leq 0.65\, \omega (\|{{\widetilde u}_k}\|{ }_2)\).
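
The elementary constants used above are easy to confirm numerically. The sketch below (NumPy assumed, with ω and ω_* as defined earlier) checks ω_*(t) − ω(t) ≤ 0.26 ω(t) on (0, 1∕6], the bound (4β + 5β²)∕(1 − β) ≤ 0.23 at β = 1∕20, the relation t²∕(1 + t) ≤ 2ω(t), and the contraction factor 0.65 quoted for β = 1∕10 and ∥ũ_k∥_2 ≤ 1∕10.

```python
# Numerical spot-checks of the constants used in the proof of Theorem 1
# (a sketch only; NumPy assumed).
import numpy as np

omega = lambda t: t - np.log1p(t)
omega_star = lambda t: -t - np.log1p(-t)

t = np.linspace(1e-6, 1/6, 100000)
assert np.all(omega_star(t) - omega(t) <= 0.26 * omega(t))       # for t <= 1/6

beta = 1/20
assert (4*beta + 5*beta**2) / (1 - beta) <= 0.23                  # for beta = 1/20

s = np.linspace(0.0, 100.0, 100000)
assert np.all(s**2 / (1 + s) <= 2 * omega(s) + 1e-12)             # t^2/(1+t) <= 2*omega(t)

# The alternative combination mentioned above: beta = 1/10 and ||u_k||_2 <= 1/10
# give a contraction factor of at most 0.65 in inequality (11.48).
beta = 1/10
t = np.linspace(1e-6, 1/10, 100000)
factor = (omega_star(t) - omega(t)) / omega(t) + (4*beta + 5*beta**2) / (1 - beta)
assert np.all(factor <= 0.65)
print("all constant checks for Theorem 1 passed")
```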

Appendix 2: Proof of Theorem 2 and Corollary 2

First we prove Theorem 2. We start with the inequality (11.45), and upper bound the last two terms on its right-hand side. Since ω′(t) = t∕(1 + t) < 1, we have

$$\displaystyle \begin{aligned} \omega(\|{{\widetilde u}_k}\|{}_2)-\omega(\|{{\widetilde v}_k}\|{}_2) \leq \big| \|{{\widetilde u}_k}\|{}_2-\|{{\widetilde v}_k}\|{}_2\big| \leq \|{{\widetilde u}_k-{\widetilde v}_k}\|{}_2 .\end{aligned} $$

In addition, we have

$$\displaystyle \begin{aligned} \frac{\langle {\widetilde v}_k-{\widetilde u}_k, {\widetilde v}_k\rangle}{1+\|{{\widetilde v}_k}\|{}_2} \leq \frac{\|{{\widetilde v}_k}\|{}_2}{1+\|{{\widetilde v}_k}\|{}_2} \|{{\widetilde u}_k-{\widetilde v}_k}\|{}_2 \leq \|{{\widetilde u}_k-{\widetilde v}_k}\|{}_2 .\end{aligned} $$

Applying these two bounds to (11.45), we obtain

$$\displaystyle \begin{aligned} f(w_{k+1}) \leq f(w_k) - \omega(\|{{\widetilde u}_k}\|{}_2) + 2\|{{\widetilde u}_k-{\widetilde v}_k}\|{}_2 .\end{aligned} $$
(11.49)

Next we bound \(\|{{\widetilde u}_k-{\widetilde v}_k}\|{ }_2\) using the approximation tolerance 𝜖 k specified in (11.14),

$$\displaystyle \begin{aligned} \|{{\widetilde u}_k-{\widetilde v}_k}\|{}_2 & = \left\| [f^{\prime\prime}(w_k)]^{-1/2}f'(w_k) - [f^{\prime\prime}(w_k)]^{1/2} v_k \right\|{}_2 \\ & = \left\| [f^{\prime\prime}(w_k)]^{-1/2} \bigl( f^{\prime\prime}(w_k) v_k - f'(w_k) \bigr)\right\|{}_2\\ &\leq \lambda^{-1/2} \left\| f^{\prime\prime}(w_k) v_k - f'(w_k) \right\|{}_2\\ &\leq \lambda^{-1/2} \epsilon_k \\ &= \frac{1}{2} \min\left\{ \frac{\omega(r_k)}{2}, \frac{\omega^{3/2}(r_k)}{10} \right\}.\end{aligned} $$

Combining the above inequality with (11.49), and using \(r_k = L^{-1/2}\|{f'(w_k)}\|{ }_2 \leq \|{{\widetilde u}_k}\|{ }_2\) with the monotonicity of ω, we arrive at

$$\displaystyle \begin{aligned} f(w_{k+1}) \leq f(w_k) - \omega(\|{{\widetilde u}_k}\|{}_2) + \min\left\{ \frac{\omega(\|{{\widetilde u}_k}\|{}_2)}{2}, \frac{\omega^{3/2}(\|{{\widetilde u}_k}\|{}_2)}{10} \right\}. \end{aligned} $$
(11.50)

Part (a) of the theorem follows immediately from inequality (11.50).

For part (b), we assume that \(\|{{\widetilde u}_k}\|{ }_2\leq 1/8\). Combining (11.47) with (11.50), we have

$$\displaystyle \begin{aligned} \omega(\|{{\widetilde u}_{k+1}}\|{}_2) & \leq f(w_{k+1}) - f(w_\star) \leq f(w_k)-f(w_\star) - \omega(\|{{\widetilde u}_k}\|{}_2) + \frac{\omega^{3/2}(\|{{\widetilde u}_k}\|{}_2)}{10} \\ & \leq \omega_*(\|{{\widetilde u}_k}\|{}_2) - \omega(\|{{\widetilde u}_k}\|{}_2) + \frac{\omega^{3/2}(\|{{\widetilde u}_k}\|{}_2)}{10}. {} \end{aligned} $$
(11.51)

Let h(t) = ω_*(t) − ω(t) and consider only t ≥ 0. Notice that h(0) = 0 and \(h'(t)=\frac {2t^2}{1-t^2}\leq \frac {128}{63}t^2\) for t ≤ 1∕8. Thus, we conclude that \(h(t)\leq \frac {128}{189}t^3\) for t ≤ 1∕8. We also notice that ω(0) = 0 and \(\omega '(t)=\frac {t}{1+t}\geq \frac {8}{9}t\) for t ≤ 1∕8. Thus, we have \(\omega (t)\geq \frac {4}{9}t^2\) for t ≤ 1∕8. Combining these results, we obtain

$$\displaystyle \begin{aligned} \omega_*(t)-\omega(t) \leq \frac{128}{189}t^3 = \frac{128}{189}(t^2)^{3/2} \leq \frac{128}{189}\left(\frac{9}{4}\omega(t)\right)^{3/2} \leq \left(\sqrt{6}-\frac{1}{10}\right) \omega^{3/2}(t) . \end{aligned}$$

Applying this inequality to the right-hand side of (11.51) completes the proof.
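
Again, the numerical constants in this argument can be verified directly; the sketch below (NumPy assumed) checks h(t) ≤ (128∕189)t³, ω(t) ≥ (4∕9)t², the comparison (128∕189)(9∕4)^{3∕2} ≤ √6 − 1∕10, and the resulting inequality ω_*(t) − ω(t) ≤ (√6 − 1∕10) ω^{3∕2}(t) on (0, 1∕8].

```python
# Numerical spot-checks of the constants in the proof of Theorem 2, part (b)
# (a sketch only; NumPy assumed; t is restricted to (0, 1/8]).
import numpy as np

omega = lambda t: t - np.log1p(t)
omega_star = lambda t: -t - np.log1p(-t)

t = np.linspace(1e-4, 1/8, 100000)
h = omega_star(t) - omega(t)

assert np.all(h <= (128/189) * t**3)                          # h(t) <= (128/189) t^3
assert np.all(omega(t) >= (4/9) * t**2)                       # omega(t) >= (4/9) t^2
assert (128/189) * (9/4)**1.5 <= np.sqrt(6) - 0.1             # comparison of constants
assert np.all(h <= (np.sqrt(6) - 0.1) * omega(t)**1.5)        # the displayed inequality
print("all constant checks for Theorem 2(b) passed")
```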

Next we prove Corollary 2. By part (a) of Theorem 2, if \(\|{{\widetilde u}_k}\|{ }_2\geq 1/8\), then each iteration of Algorithm 1 decreases the function value by at least the constant \(\frac {1}{2}\omega (1/8)\). So within at most \(K_1=\left \lceil \frac {2(f(w_0)-f(w_\star ))}{\omega (1/8)}\right \rceil \) iterations, we are guaranteed to have \(\|{{\widetilde u}_k}\|{ }_2\leq 1/8\). Part (b) of Theorem 2 implies \(6\,\omega (\|{{\widetilde u}_{k+1}}\|{ }_2) \leq \left ( 6\,\omega (\|{{\widetilde u}_k}\|{ }_2)\right )^{3/2}\) when \(\|{{\widetilde u}_k}\|{ }_2\leq 1/8\), and hence

$$\displaystyle \begin{aligned} \log\bigl(6\,\omega(\|{{\widetilde u}_k}\|{}_2)\bigr) \leq\left(\frac{3}{2}\right)^{k-K_1} \log\left(6\,\omega(1/8)\right), \qquad k \geq K_1. \end{aligned}$$

Note that both sides of the above inequality are negative. Therefore, after \(k\geq K_1 + \frac {\log \log (1/(3\epsilon ))}{\log (3/2)}\) iterations (assuming 𝜖 ≤ 1∕(3e)), we have

$$\displaystyle \begin{aligned} \log\bigl(6\,\omega(\|{{\widetilde u}_k}\|{}_2)\bigr) \leq \log(1/(3\epsilon)) \log(6\,\omega(1/8)) \leq -\log(1/(3\epsilon)), \end{aligned}$$

which implies \(\omega (\|{{\widetilde u}_k}\|{ }_2) \leq \epsilon /2\). Finally, using (11.47) and the fact that ω_*(t) ≤ 2 ω(t) for t ≤ 1∕8, we obtain

$$\displaystyle \begin{aligned} f(w_k)-f(w_\star) \leq \omega_*(\|{{\widetilde u}_k}\|{}_2) \leq 2\,\omega(\|{{\widetilde u}_k}\|{}_2) \leq \epsilon. \end{aligned}$$

This completes the proof of Corollary 2.
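
The two-phase behaviour behind Corollary 2 can be seen by iterating the contraction from part (b) directly. The sketch below (NumPy assumed; the target accuracy ε is a hypothetical value chosen here for illustration) tracks a_k = 6ω(∥ũ_k∥_2), applies a_{k+1} = a_k^{3∕2} starting from a = 6ω(1∕8), and confirms that the number of iterations until ω(∥ũ_k∥_2) ≤ ε∕2 stays within the ⌈log log(1∕(3ε))∕log(3∕2)⌉ bound from the proof.

```python
# A sketch of the superlinearly convergent phase in the proof of Corollary 2
# (NumPy assumed; epsilon is a hypothetical target accuracy).
import numpy as np

omega = lambda t: t - np.log1p(t)
epsilon = 1e-10                      # hypothetical accuracy, epsilon <= 1/(3e)

a = 6 * omega(1/8)                   # a_k = 6*omega(||u_k||_2) once ||u_k||_2 <= 1/8
iters = 0
while a / 6 > epsilon / 2:           # stop once omega(||u_k||_2) <= epsilon/2
    a = a ** 1.5                     # part (b) of Theorem 2: a_{k+1} <= a_k^{3/2}
    iters += 1

bound = int(np.ceil(np.log(np.log(1 / (3 * epsilon))) / np.log(1.5)))
print(f"observed iterations: {iters}, bound from the proof: {bound}")
assert iters <= bound
```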

Appendix 3: Proof of Lemma 4

It suffices to show that the algorithm terminates at iteration t ≤ T_μ − 1, because when the algorithm terminates, it outputs a vector v_k which satisfies

$$\displaystyle \begin{aligned} \|{H v_k - f'(w_k)}\|{}_2 = \|{r^{(t+1)}}\|{}_2 \leq \epsilon_k. \end{aligned}$$

Denote by v^* = H^{−1}f′(w_k) the solution of the linear system Hv_k = f′(w_k). By the classical analysis of the preconditioned conjugate gradient method (see, e.g., [3, 32]), Algorithm 2 has the following convergence property:

$$\displaystyle \begin{aligned} (v^{(t)} - v^*)^T H (v^{(t)} - v^*) \leq 4 \left( \frac{\sqrt{\kappa} -1 }{\sqrt{\kappa}+1}\right)^{2t} (v^*)^T H v^*, \end{aligned} $$
(11.52)

where κ = 1 + 2μ∕λ is the condition number of P^{−1}H given in (11.25). For the left-hand side of inequality (11.52), we have

$$\displaystyle \begin{aligned} (v^{(t)} - v^*)^T H (v^{(t)} - v^*) = (r^{(t)})^T H^{-1} r^{(t)} \geq \frac{ \|{r^{(t)}}\|{}_2^2}{L}. \end{aligned} $$

For the right-hand side of inequality (11.52), we have

$$\displaystyle \begin{aligned} (v^*)^T H v^* & = (f'(w_k))^T H^{-1} f'(w_k) \leq \frac{\|{f'(w_k)}\|{}_2^2}{\lambda} . \end{aligned} $$

Combining the above two inequalities with inequality (11.52), we obtain

$$\displaystyle \begin{aligned} \|{r^{(t)}}\|{}_2 \leq 2 \sqrt{\frac{L}{\lambda}} \left( \frac{\sqrt{\kappa} -1 }{\sqrt{\kappa}+1}\right)^{t} \|{f'(w_k)}\|{}_2 \leq 2 \sqrt{\frac{L}{\lambda}} \left(1 - \sqrt{\frac{\lambda}{\lambda + 2\mu}}\right)^{t}\|{f'(w_k)}\|{}_2 . \end{aligned} $$

To guarantee that ∥r^{(t)}∥_2 ≤ 𝜖_k, it suffices to have

$$\displaystyle \begin{aligned} t ~\geq~ \frac{\log\Big(\frac{2 \sqrt{L/\lambda} \|{f'(w_k)}\|{}_2}{\epsilon_k}\Big)}{- \log\left(1 - \sqrt{\frac{\lambda}{\lambda + 2\mu}}\right)} ~\geq~ \sqrt{1+\frac{2\mu}{\lambda}}\, \log\bigg(\frac{2 \sqrt{L/\lambda} \|{f'(w_k)}\|{}_2}{\epsilon_k}\bigg), \end{aligned} $$

where in the last inequality we used \(-\log (1-x) \geq x\) for 0 < x < 1. Comparing with the definition of T_μ in (11.26), this is the desired result.
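
As an illustration of Lemma 4, the self-contained sketch below (NumPy assumed; the matrix H, the parameters λ, L, μ, and the simplified preconditioner P = H + μI are all synthetic stand-ins, not the chapter's construction in (11.25)) runs preconditioned conjugate gradient on Hv = f′(w_k) and checks that the residuals stay below the bound 2√(L∕λ)(1 − √(λ∕(λ + 2μ)))^t ∥f′(w_k)∥_2 derived above. Note that for P = H + μI the condition number of P^{−1}H is still at most 1 + 2μ∕λ, so the bound applies.

```python
# A synthetic sketch of the preconditioned CG argument in Lemma 4 (NumPy assumed;
# H, P and all parameters below are made up for illustration; the preconditioner
# P = H + mu*I is a simplified stand-in whose condition number kappa(P^{-1}H)
# is still at most 1 + 2*mu/lam).
import numpy as np

rng = np.random.default_rng(0)
d, lam, L, mu = 50, 0.1, 10.0, 1.0

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag(np.linspace(lam, L, d)) @ Q.T   # lam*I <= H <= L*I
P = H + mu * np.eye(d)                          # stand-in preconditioner
b = rng.standard_normal(d)                      # plays the role of f'(w_k)

# Preconditioned conjugate gradient for H v = b, starting from v = 0.
v = np.zeros(d)
r = b - H @ v
z = np.linalg.solve(P, r)
p = z.copy()
rate = 1.0 - np.sqrt(lam / (lam + 2 * mu))      # contraction factor from Lemma 4
for t in range(1, 26):
    Hp = H @ p
    alpha = (r @ z) / (p @ Hp)
    v += alpha * p
    r_new = r - alpha * Hp
    z_new = np.linalg.solve(P, r_new)
    p = z_new + ((r_new @ z_new) / (r @ z)) * p
    r, z = r_new, z_new
    bound = 2 * np.sqrt(L / lam) * rate**t * np.linalg.norm(b)
    assert np.linalg.norm(r) <= bound           # residual bound from the proof
print("PCG residuals respect the Lemma 4 bound on this synthetic instance")
```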

Appendix 4: Proof of Lemma 5

First, we prove inequality (11.31). Recall that w_⋆ and \({\widehat w}_i\) minimize f(w) and \(f_i(w) + (\rho /2)\|{w}\|{ }_2^2\), respectively. Since both functions are λ-strongly convex, we have

$$\displaystyle \begin{aligned} \frac{\lambda}{2}\|{w_\star}\|{}_2^2 &\leq f(w_\star) \leq f(0) \leq V_0,\\ \frac{\lambda}{2}\|{{\widehat w}_i}\|{}_2^2 &\leq f_i({\widehat w}_i) + \frac{\rho}{2}\|{{\widehat w}_i}\|{}_2^2 \leq f_i(0) \leq V_0, \end{aligned} $$

where we also used Assumption 2(a) in the first inequality on both lines. These two inequalities imply \(\|{w_\star }\|{ }_2 \leq \sqrt {2V_0/\lambda }\) and \(\|{{\widehat w}_i}\|{ }_2 \leq \sqrt {2V_0/\lambda }\). Then the inequality (11.31) follows since w_0 is the average of \({\widehat w}_i\) for i = 1, …, m.

Next we prove inequality (11.32). Let z be a random variable in \(\mathcal {Z}\subset \mathbb {R}^p\) with an unknown probability distribution. We define a regularized population risk:

$$\displaystyle \begin{aligned} R(w) = \mathbb{E}_z[\phi(w, z)] + \frac{\lambda+\rho}{2}\|{w}\|{}_2^2. \end{aligned}$$

Let S be a set of n i.i.d. samples in \(\mathcal {Z}\) from the same distribution. We define a regularized empirical risk

$$\displaystyle \begin{aligned} r_S(w) = \frac{1}{n}\sum_{z\in S}\phi(w,z) + \frac{\lambda+\rho}{2}\|{w}\|{}_2^2, \end{aligned}$$

and its minimizer

$$\displaystyle \begin{aligned} {\widehat w}_S = \arg\min_w ~r_S(w). \end{aligned}$$

The following lemma states that the population risk of \({\widehat w}_S\) is very close to its empirical risk. The proof is based on the notion of stability of regularized ERM [9].

Lemma 7

Suppose Assumption 2 holds and S is a set of n i.i.d. samples in \(\mathcal {Z}\) . Then

$$\displaystyle \begin{aligned} \mathbb{E}_S\bigl[R({\widehat w}_S) - r_S({\widehat w}_S)\bigr] \leq \frac{2G^2}{\rho n}. \end{aligned}$$

Proof

Let S = {z_1, …, z_n}. For any k ∈ {1, …, n}, we define a modified training set S^{(k)} by replacing z_k with another sample \({\widetilde z}_k\), which is drawn from the same distribution and is independent of S. The empirical risk on S^{(k)} is defined as

$$\displaystyle \begin{aligned} r_S^{(k)}(w) = \frac{1}{n}\sum_{z\in S^{(k)}} \phi(w,z) + \frac{\lambda+\rho}{2}\|{w}\|{}_2^2. \end{aligned}$$

Let \({\widehat w}_S^{(k)}= \arg \min _w r_S^{(k)}(w)\). Since both r_S and \(r_S^{(k)}\) are ρ-strongly convex, we have

$$\displaystyle \begin{aligned} r_S({\widehat w}_S^{(k)}) - r_S({\widehat w}_S) &\geq \frac{\rho}{2} \|{{\widehat w}_S^{(k)} - {\widehat w}_S}\|{}_2^2, \\ r_S^{(k)}({\widehat w}_S) - r_S^{(k)}({\widehat w}_S^{(k)}) &\geq \frac{\rho}{2} \|{{\widehat w}_S^{(k)} - {\widehat w}_S}\|{}_2^2. \end{aligned} $$

Summing the above two inequalities, and noticing that

$$\displaystyle \begin{aligned} r_S(w) - r_S^{(k)}(w) = \frac{1}{n}(\phi(w,z_k) - \phi(w,{\widetilde z}_k)), \end{aligned}$$

we have

$$\displaystyle \begin{aligned} \|{{\widehat w}_S^{(k)}- {\widehat w}_S}\|{}_2^2 \leq \frac{1}{\rho n}\left( \phi({\widehat w}_S^{(k)},z_k) - \phi({\widehat w}_S^{(k)},{\widetilde z}_k) - \phi({\widehat w}_S,z_k) + \phi({\widehat w}_S, {\widetilde z}_k)\right) . \end{aligned} $$
(11.53)

By Assumption 2(b) and the facts \(\|{{\widehat w}_S}\|{ }_2\leq \sqrt {2V_0/\lambda }\) and \(\|{{\widehat w}_S^{(k)}}\|{ }_2\leq \sqrt {2V_0/\lambda }\), we have

$$\displaystyle \begin{aligned} \big|\phi({\widehat w}_S^{(k)},z) - \phi({\widehat w}_S,z)\big| \leq G \|{{\widehat w}_S^{(k)} - {\widehat w}_S}\|{}_2, \qquad \forall\, z\in \mathcal{Z}. \end{aligned}$$

Combining the above Lipschitz condition with (11.53), we obtain

$$\displaystyle \begin{aligned} \|{{\widehat w}_S^{(k)}- {\widehat w}_S}\|{}_2^2 \leq \frac{2 G}{\rho n} \|{{\widehat w}_S^{(k)} - {\widehat w}_S}\|{}_2. \end{aligned}$$

As a consequence, we have \(\|{{\widehat w}_S^{(k)} - {\widehat w}_S}\|{ }_2 \leq \frac {2 G}{\rho n}\), and therefore

$$\displaystyle \begin{aligned} \big|\phi({\widehat w}_S^{(k)},z) - \phi({\widehat w}_S,z)\big| \leq \frac{2 G^2}{\rho n}, \qquad \forall\, z\in \mathcal{Z}. \end{aligned} $$
(11.54)

In the terminology of learning theory, this means that empirical minimization over the regularized loss r_S has uniform stability 2G²∕(ρn) with respect to the loss function ϕ; see [9].

For any fixed k ∈{1, …, n}, since \({\widetilde z}_k\) is independent of S, we have

$$\displaystyle \begin{aligned} \mathbb{E}_S\bigl[R({\widehat w}_S) - r_S({\widehat w}_S)\bigr] &= \mathbb{E}_S \bigg[ \mathbb{E}_{{\widetilde z}_k}[\phi({\widehat w}_S,{\widetilde z}_k)] - \frac{1}{n}\sum_{j=1}^n \phi({\widehat w}_S, z_j) \bigg] \\ &= \mathbb{E}_{S,{\widetilde z}_k}\bigl[\phi({\widehat w}_S,{\widetilde z}_k)-\phi({\widehat w}_S,z_k)\bigr] \\ &= \mathbb{E}_{S,{\widetilde z}_k}\bigl[\phi({\widehat w}_S,{\widetilde z}_k)-\phi({\widehat w}_S^{(k)},{\widetilde z}_k)\bigr], \end{aligned} $$

where the second equality used the fact that \(\mathbb {E}_S[\phi ({\widehat w}_S,z_j)]\) has the same value for all j = 1, …, n, and the third equality used the symmetry between the pairs (S, z_k) and \((S^{(k)}, {\widetilde z}_k)\) (also known as the renaming trick; see [9, Lemma 7]). Combining the above equality with (11.54) yields the desired result. □
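
A small numerical illustration of this stability bound (a sketch, with assumptions made here for convenience rather than taken from the chapter: the logistic loss φ(w, z) = log(1 + exp(−y xᵀw)), synthetic Gaussian data, and SciPy's L-BFGS-B solver) replaces one training label and checks ∥ŵ_S^{(k)} − ŵ_S∥_2 ≤ 2G∕(ρn), with G = max_i ∥x_i∥_2 serving as the Lipschitz constant.

```python
# A numerical illustration of the uniform-stability argument in Lemma 7
# (a sketch: the logistic loss, synthetic data, and SciPy's solver are
# assumptions for illustration, not choices made in the chapter).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, d, lam, rho = 200, 5, 1e-3, 1e-2
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
G = np.max(np.linalg.norm(X, axis=1))     # Lipschitz constant of the logistic loss

def reg_erm(ys):
    """Minimize r_S(w) = (1/n) sum_i log(1+exp(-y_i x_i'w)) + (lam+rho)/2 ||w||^2."""
    def fg(w):
        m = ys * (X @ w)
        f = np.mean(np.logaddexp(0.0, -m)) + 0.5 * (lam + rho) * (w @ w)
        g = X.T @ (-ys / (1.0 + np.exp(m))) / n + (lam + rho) * w
        return f, g
    return minimize(fg, np.zeros(d), jac=True, method="L-BFGS-B", tol=1e-12).x

w_S = reg_erm(y)
y_mod = y.copy()
y_mod[0] = -y[0]                          # replace z_1 by the same point with a flipped label
w_Sk = reg_erm(y_mod)

print(np.linalg.norm(w_Sk - w_S), "<=", 2 * G / (rho * n))
assert np.linalg.norm(w_Sk - w_S) <= 2 * G / (rho * n)
```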

Next, we consider a distributed system with m machines, where each machine has a local dataset S_i of size n, for i = 1, …, m. To simplify notation, we denote the local regularized empirical loss function and its minimizer by r_i and \({\widehat w}_i\), respectively. We would like to bound the excess error when applying \({\widehat w}_i\) to a different dataset S_j. Notice that

$$\displaystyle \begin{aligned} \mathbb{E}_{S_i,S_j}\bigl[r_j({\widehat w}_i) - r_j({\widehat w}_j)\bigr] &= \underbrace{\mathbb{E}_{S_i,S_j}\bigl[r_j({\widehat w}_i)-r_i({\widehat w}_i)\bigr]}_{v_1} + \underbrace{\mathbb{E}_{S_i,S_j}\bigl[r_i({\widehat w}_i)-r_j({\widehat w}_R)\bigr]}_{v_2} \\ &\quad + \underbrace{\mathbb{E}_{S_j}\bigl[r_j({\widehat w}_R)-r_j({\widehat w}_j)\bigr]}_{v_3}, {} \end{aligned} $$
(11.55)

where \({\widehat w}_R\) is the minimizer of R(w). Since S_i and S_j are independent, we have

$$\displaystyle \begin{aligned} v_1 = \mathbb{E}_{S_i}\big[\mathbb{E}_{S_j}[r_j({\widehat w}_i)] - r_i({\widehat w}_i)\big] = \mathbb{E}_{S_i}\big[R({\widehat w}_i) - r_i({\widehat w}_i)] \leq \frac{2 G^2}{\rho n} , \end{aligned} $$

where the inequality is due to Lemma 7. For the second term in (11.55), we have

$$\displaystyle \begin{aligned} v_2 = \mathbb{E}_{S_i}\bigl[r_i({\widehat w}_i) - \mathbb{E}_{S_j}[r_j({\widehat w}_R)]\bigr] = \mathbb{E}_{S_i}\bigl[r_i({\widehat w}_i) - r_i({\widehat w}_R)\bigr] \leq 0. \end{aligned} $$

It remains to bound the third term v_3. We first use the strong convexity of r_j to obtain (see, e.g., [36, Theorem 2.1.10])

$$\displaystyle \begin{aligned} r_j({\widehat w}_R) - r_j({\widehat w}_j) \leq \frac{\|{r_j^{\prime}({\widehat w}_R)}\|{}_2^2}{2\rho}, \end{aligned} $$
(11.56)

where \(r^{\prime }_j({\widehat w}_R)\) denotes the gradient of r_j at \({\widehat w}_R\). If we index the elements of S_j by z_1, …, z_n, then

$$\displaystyle \begin{aligned} r_j^{\prime}({\widehat w}_R) = \frac{1}{n} \sum_{k=1}^n \left(\phi'({\widehat w}_R, z_k) + (\lambda+\rho) {\widehat w}_R \right). \end{aligned} $$
(11.57)

By the optimality condition of \({\widehat w}_R=\arg \min _w R(w)\), we have for any k ∈{1, …, n},

$$\displaystyle \begin{aligned} \mathbb{E}_{z_k}\bigl[ \phi'({\widehat w}_R,z_k) + (\lambda+\rho){\widehat w}_R\bigr] = 0. \end{aligned}$$

Therefore, according to (11.57), the gradient \(r_j^{\prime}({\widehat w}_R)\) is the average of n independent, zero-mean random vectors. Combining (11.56) and (11.57) with the definition of v_3 in (11.55), we have

$$\displaystyle \begin{aligned} v_3 \leq \frac{\mathbb{E}_{S_j}\bigl[\|{r_j^{\prime}({\widehat w}_R)}\|{}_2^2\bigr]}{2\rho} = \frac{1}{2\rho n^2} \sum_{k=1}^n \mathbb{E}_{z_k}\Bigl[ \bigl\|{\phi'({\widehat w}_R, z_k) + (\lambda+\rho){\widehat w}_R}\bigr\|{}_2^2 \Bigr] \leq \frac{G^2}{\rho n}. \end{aligned}$$

In the equality above, we used the fact that \(\phi '({\widehat w}_R,z_k)+(\lambda +\rho ){\widehat w}_R\) are i.i.d. zero-mean random vectors, so the variance of their sum equals the sum of their variances. The last inequality above is due to Assumption 2(b) and the fact that \(\|{{\widehat w}_R}\|{ }_2\leq \sqrt {2V_0/(\lambda +\rho )}\leq \sqrt {2V_0/\lambda }\). Combining the upper bounds for v_1, v_2 and v_3, we have

$$\displaystyle \begin{aligned} \mathbb{E}_{S_i,S_j} \left[r_j({\widehat w}_i) - r_j({\widehat w}_j)\right] \leq \frac{3 G^2}{\rho n}.\end{aligned} $$
(11.58)

Recall the definition of f as

$$\displaystyle \begin{aligned} f(w) = \frac{1}{mn}\sum_{i=1}^m\sum_{k=1}^n \phi(w, z_{i,k})+\frac{\lambda}{2}\|{w}\|{}_2^2, \end{aligned}$$

where z_{i,k} denotes the kth sample at machine i. Let \(r(w) = (1/m)\sum _{j=1}^m r_j(w)\); then

$$\displaystyle \begin{aligned} r(w) = f(w) + \frac{\rho}{2} \|{w}\|{}_2^2 . \end{aligned} $$
(11.59)

We compare the value \(r({\widehat w}_i)\), for any i ∈{1, …, m}, with the minimum of r(w):

$$\displaystyle \begin{aligned} r({\widehat w}_i) -\min_w r(w) & = \frac{1}{m}\sum_{j=1}^m r_j({\widehat w}_i) - \min_w\frac{1}{m}\sum_{j=1}^m r_j(w) \\ & \leq \frac{1}{m}\sum_{j=1}^m r_j({\widehat w}_i) - \frac{1}{m}\sum_{j=1}^m \min_w r_j(w) \\ & = \frac{1}{m}\sum_{j=1}^m \left( r_j({\widehat w}_i) - r_j({\widehat w}_j)\right) . \end{aligned} $$

Taking expectation with respect to all the random datasets S_1, …, S_m, we obtain

$$\displaystyle \begin{aligned} \mathbb{E}[ r({\widehat w}_i) - \min_w r(w)] \leq \frac{1}{m} \sum_{j=1}^m \mathbb{E}[ r_j({\widehat w}_i)- r_j({\widehat w}_j)] \leq \frac{3 G^2}{\rho n}, \end{aligned} $$
(11.60)

where the last inequality is due to (11.58). Finally, we bound the expected value of \(f({\widehat w}_i)\):

$$\displaystyle \begin{aligned} \mathbb{E}[f({\widehat w}_i)] &\leq \mathbb{E}[r({\widehat w}_i)] \leq \mathbb{E}\left[\min_w r(w)\right] + \frac{3G^2}{\rho n} \\ &\leq \mathbb{E}\left[f(w_\star)+\frac{\rho}{2}\|{w_\star}\|{}_2^2\right]+ \frac{3G^2}{\rho n} \\ & \leq \mathbb{E}\left[f(w_\star)\right]+\frac{\rho D^2}{2}+ \frac{3G^2}{\rho n}, \end{aligned} $$

where the first inequality holds because of (11.59), the second inequality is due to (11.60), and the last inequality follows from the assumption that \(\mathbb {E}[\|{w_\star }\|{ }_2^2]\leq D^2\). Choosing \(\rho = \sqrt {6 G^2/(n D^2)}\) results in

$$\displaystyle \begin{aligned} \mathbb{E}[f( {\widehat w}_i) - f(w_\star)] \leq \frac{\sqrt{6}G D}{\sqrt{n}}, \qquad i=1,\ldots,m. \end{aligned}$$

Since \(w_0=(1/m)\sum _{i=1}^m {\widehat w}_i\), we can use the convexity of the function f to conclude that \(\mathbb {E}[f(w_0) - f(w_\star )] \leq \sqrt {6}G D/\sqrt {n}\), which is the desired result.
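
The specific value of ρ is simply the minimizer of the upper bound ρD²∕2 + 3G²∕(ρn) over ρ > 0, and the minimal value is √6 GD∕√n. A quick check (NumPy assumed; the values of G, D and n below are hypothetical):

```python
# Sanity check of the choice rho = sqrt(6 G^2 / (n D^2)) in the proof of Lemma 5
# (NumPy assumed; G, D, n are hypothetical values).
import numpy as np

G, D, n = 1.0, 10.0, 10000
rho = np.linspace(1e-4, 1.0, 200000)
upper = rho * D**2 / 2 + 3 * G**2 / (rho * n)     # rho*D^2/2 + 3*G^2/(rho*n)

rho_star = np.sqrt(6 * G**2 / (n * D**2))
print("grid minimizer:", rho[np.argmin(upper)], " closed form:", rho_star)
assert abs(np.min(upper) - np.sqrt(6) * G * D / np.sqrt(n)) < 1e-6
```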

Appendix 5: Proof of Lemma 6

We consider the regularized empirical loss functions f_i defined in (11.30). For any two vectors \(u,w\in \mathbb {R}^d\) satisfying ∥u − w∥_2 ≤ ε, Assumption 2(d) implies

$$\displaystyle \begin{aligned} \|{f^{\prime\prime}_i(u) - f^{\prime\prime}_i(w)}\|{}_2 \leq M \varepsilon. \end{aligned} $$

Let B(0, r) be the ball in \(\mathbb {R}^d\) with radius r, centered at the origin. Let \(N_\varepsilon ^{\mathrm {cov}}(B(0,r))\) be the covering number of B(0, r) by balls of radius ε, i.e., the minimum number of balls of radius ε required to cover B(0, r). We also define \(N_\varepsilon ^{\mathrm {pac}}(B(0,r))\) as the packing number of B(0, r), i.e., the maximum number of disjoint balls of radius ε whose centers belong to B(0, r). It is easy to verify that

$$\displaystyle \begin{aligned} N_{\varepsilon}^{\mathrm{cov}}(B(0,r)) \leq N_{\varepsilon/2}^{\mathrm{pac}}(B(0,r)) \leq \left( 1 + {2r}/{\varepsilon}\right)^d. \end{aligned} $$

Therefore, there exists a set of points \(U\subseteq \mathbb {R}^d\) with cardinality at most (1 + 2r∕ε)^d, such that for any vector w ∈ B(0, r), we have

$$\displaystyle \begin{aligned} \min_{u\in U} \|{f^{\prime\prime}_i(w) - f^{\prime\prime}_i(u)}\|{}_2 \leq M\varepsilon. \end{aligned} $$
(11.61)

We consider an arbitrary point u ∈ U and the associated Hessian matrices for the functions f_i defined in (11.30). We have

$$\displaystyle \begin{aligned} f^{\prime\prime}_i(u)=\frac{1}{n} \sum_{j=1}^n \left(\phi^{\prime\prime}(u, z_{i,j}) + \lambda I\right), \qquad i=1,\ldots,m. \end{aligned}$$

The components of the above sum are i.i.d. matrices that are upper bounded by LI. By the matrix Hoeffding inequality [33, Corollary 4.2], we have

$$\displaystyle \begin{aligned} \mathbb{P}\left[\|{f^{\prime\prime}_i(u) - \mathbb{E}[f^{\prime\prime}_i(u)]}\|{}_2 > t\right] \leq d \cdot e^{- \frac{n t^2}{2L^2}} . \end{aligned} $$

Note that \(\mathbb {E}[f^{\prime \prime }_1(w)] = \mathbb {E}[f^{\prime \prime }(w)]\) for any w ∈ B(0, r). Using the triangle inequality and inequality (11.61), we obtain

$$\displaystyle \begin{aligned} \|{f^{\prime\prime}_1(w) - f^{\prime\prime}(w)}\|{}_2 &\leq \|{f^{\prime\prime}_1(w) - \mathbb{E}[f^{\prime\prime}_1(w)]}\|{}_2 + \|{f^{\prime\prime}(w) - \mathbb{E}[f^{\prime\prime}(w)]}\|{}_2 \\ {} & \leq 2\max_{i\in\{1,\ldots,m\}} \|{f^{\prime\prime}_i(w) - \mathbb{E}[f^{\prime\prime}_i(w)]}\|{}_2 \\ &\leq 2\max_{i\in\{1,\ldots,m\}} \Big(\max_{u\in U}\|{f^{\prime\prime}_i(u) - \mathbb{E}[f^{\prime\prime}_i(u)]}\|{}_2 + M \varepsilon \Big).{} \end{aligned} $$
(11.62)

Applying the union bound, we have with probability at least

$$\displaystyle \begin{aligned} 1 - m d ( 1 + {2r}/{\varepsilon})^d\cdot e^{- \frac{n t^2}{2L^2}}, \end{aligned}$$

the inequality \(\|{f^{\prime \prime }_i(u) - \mathbb {E}[f^{\prime \prime }_i(u)]}\|{ }_2 \leq t\) holds for every i ∈{1, …, m} and every u ∈ U. Combining this probability bound with inequality (11.62), we have

$$\displaystyle \begin{aligned} \mathbb{P}\Big[\sup_{w\in B(0,r)}\|{f^{\prime\prime}_1(w)-f^{\prime\prime}(w)}\|{}_2 > 2 t+2 M\varepsilon\Big] \leq m d \left( 1 + {2r}/{\varepsilon}\right)^d \cdot e^{- \frac{n t^2}{2L^2}}. \end{aligned} $$
(11.63)

As the final step, we choose \(\varepsilon = \frac {\sqrt {2}L}{\sqrt {n}M}\) and then choose t to make the right-hand side of inequality (11.63) equal to δ. This yields the desired result.
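
For a sense of the resulting deviation level, the short computation below (NumPy assumed; the values of m, n, d, L, M, r and δ are hypothetical) sets the right-hand side of (11.63) equal to δ, solves for t at the stated choice of ε, and prints the bound 2t + 2Mε that then holds with probability at least 1 − δ.

```python
# Solving the right-hand side of (11.63) for t at eps = sqrt(2)*L/(sqrt(n)*M)
# (NumPy assumed; all parameter values below are hypothetical).
import numpy as np

m, n, d = 16, 100000, 100                 # machines, local sample size, dimension
L, M, r, delta = 1.0, 1.0, 10.0, 1e-2

eps = np.sqrt(2) * L / (np.sqrt(n) * M)
# m * d * (1 + 2r/eps)^d * exp(-n t^2 / (2 L^2)) = delta  =>  solve for t:
t = L * np.sqrt(2.0 / n * (np.log(m * d / delta) + d * np.log1p(2 * r / eps)))

print("with probability at least 1 - delta,")
print("sup_w ||f_1''(w) - f''(w)||_2 <=", 2 * t + 2 * M * eps)
```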

Appendix 6: Proof of Theorem 5

Suppose Algorithm 3 terminates in K iterations. Let t_k be the number of conjugate gradient steps in each call of Algorithm 2, for k = 0, 1, …, K − 1. For any given μ > 0, we define T_μ as in (11.26). Let \(\mathcal {A}\) denote the event that t_k ≤ T_μ for all k ∈ {0, …, K − 1}. Let \({\mathcal {A}_{\mathrm {c}}}\) be the complement of \(\mathcal {A}\), i.e., the event that t_k > T_μ for some k ∈ {0, …, K − 1}. In addition, let the probabilities of the events \(\mathcal {A}\) and \({\mathcal {A}_{\mathrm {c}}}\) be 1 − δ and δ, respectively. By the law of total expectation, we have

$$\displaystyle \begin{aligned} \mathbb{E}[T] = \mathbb{E}[T|\mathcal{A}] \mathbb{P}(\mathcal{A}) + \mathbb{E}[T|{\mathcal{A}_{\mathrm{c}}}] \mathbb{P}({\mathcal{A}_{\mathrm{c}}}) = (1-\delta)\mathbb{E}[T|\mathcal{A}] + \delta\,\mathbb{E}[T|{\mathcal{A}_{\mathrm{c}}}] . \end{aligned} $$

When the event \(\mathcal {A}\) happens, we have T ≤ 1 + K(T_μ + 1), where T_μ is given in (11.26); otherwise we have T ≤ 1 + K(T_L + 1), where

$$\displaystyle \begin{aligned} T_L = \sqrt{2+\frac{2L}{\lambda}}\log\left(\frac{2L}{\beta\lambda}\right) \end{aligned} $$
(11.64)

bounds the number of PCG iterations in Algorithm 2 when the event \({\mathcal {A}_{\mathrm {c}}}\) happens. Since Algorithm 2 always ensures ∥f″(w_k)v_k − f′(w_k)∥_2 ≤ 𝜖_k, the outer iteration count K shares the same bound in (11.12), which depends on f(w_0) − f(w_⋆). Notice that f(w_0) − f(w_⋆) is a random variable depending on the random generation of the datasets. However, T_μ and T_L are deterministic constants. So we have

$$\displaystyle \begin{aligned} \mathbb{E}[T] &\leq 1 + (1-\delta) \mathbb{E}[K(T_{\mu}+1)|\mathcal{A}] + \delta\,\mathbb{E}[K(T_{L}+1)|{\mathcal{A}_{\mathrm{c}}}] \\ &= 1 + (1-\delta) (T_{\mu}+1)\mathbb{E}[K|\mathcal{A}] + \delta(T_{L}+1)\mathbb{E}[K|{\mathcal{A}_{\mathrm{c}}}] . {} \end{aligned} $$
(11.65)

Next we bound \(\mathbb {E}[K|\mathcal {A}]\) and \(\mathbb {E}[K|{\mathcal {A}_{\mathrm {c}}}]\) separately. To bound \(\mathbb {E}[K|\mathcal {A}]\), we use

$$\displaystyle \begin{aligned} \mathbb{E}[K] = (1-\delta) \mathbb{E}[K|\mathcal{A}] + \delta\,\mathbb{E}[K|{\mathcal{A}_{\mathrm{c}}}] \geq (1-\delta) \mathbb{E}[K|\mathcal{A}] \end{aligned}$$

to obtain

$$\displaystyle \begin{aligned} \mathbb{E}[K|\mathcal{A}]\leq\mathbb{E}[K]/(1-\delta). \end{aligned} $$
(11.66)

In order to bound \(\mathbb {E}[K|{\mathcal {A}_{\mathrm {c}}}]\), we derive a deterministic bound on f(w_0) − f(w_⋆). By Lemma 5, we have \(\|{w_0}\|{ }_2\leq \sqrt {2V_0/\lambda }\), which together with Assumption 2(b) yields

$$\displaystyle \begin{aligned} \|{f'(w)}\|{}_2 \leq G + \lambda\|w\|{}_2 \leq G + \sqrt{2\lambda V_0}. \end{aligned}$$

Combining with the strong convexity of f, we obtain

$$\displaystyle \begin{aligned} f(w_0)-f(w_\star) \leq \frac{1}{2\lambda}\|{f'(w_0)}\|{}_2^2 \leq\frac{1}{2\lambda}\left(G+\sqrt{2\lambda V_0}\right)^2 \leq 2V_0 + \frac{G^2}{\lambda}. \end{aligned} $$

Therefore by Corollary 1,

$$\displaystyle \begin{aligned} K \leq K_{\mathrm{max}} = 1 + \frac{4V_0+2G^2/\lambda}{\omega(1/6)} + \left\lceil \log_2\left(\frac{2\omega(1/6)}{\epsilon}\right)\right\rceil , \end{aligned} $$
(11.67)

where the additional 1 compensates for removing one ⌈⋅⌉ operator in (11.12).

Using inequality (11.65), the bound on \(\mathbb {E}[K|\mathcal {A}]\) in (11.66) and the bound on \(\mathbb {E}[K|{\mathcal {A}_{\mathrm {c}}}]\) in (11.67), we obtain

$$\displaystyle \begin{aligned} \mathbb{E}[T] \leq 1 + (T_{\mu}+1) \mathbb{E}[K] + \delta (T_{L}+1) K_{\mathrm{max}} . \end{aligned}$$

Now we can bound \(\mathbb {E}[K]\) by Corollary 1 and Lemma 5. More specifically,

$$\displaystyle \begin{aligned} \mathbb{E}[ K ] \leq \frac{\mathbb{E}[2(f(w_0) - f(w_\star))]}{\omega(1/6)} + \left\lceil \log_2 \Big( \frac{ 2 \omega(1/6)}{\epsilon}\Big) \right\rceil + 1 \leq C_0 + \frac{2\sqrt{6}}{\omega(1/6)} \cdot \frac{G D}{\sqrt{n}}, \end{aligned}$$

where \(C_0=1+\left \lceil \log _2(2\omega (1/6)/\epsilon ) \right \rceil \). With the choice of δ in (11.34) and the definition of T L in (11.64), we have

where \(C_2=\log (2L/(\beta \lambda ))\). Putting everything together, we have

$$\displaystyle \begin{aligned} \mathbb{E}[T] &\leq 1 + \left(C_0+\frac{C_0}{\sqrt{n}}\cdot\frac{GD}{4V_0+2G^2/\lambda}+\frac{2\sqrt{6}+1}{\omega(1/6)}\cdot \frac{GD}{\sqrt{n}} \right) (T_\mu+1)\\ &\leq 1 + \left(C_1 + \frac{6}{\omega(1/6)}\cdot \frac{GD}{\sqrt{n}} \right) (T_\mu+1) . \end{aligned} $$

Replacing T_μ by its expression in (11.26) and applying Corollary 4, we obtain the desired result.
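
For a feel of the magnitudes entering this bound, the snippet below (NumPy assumed; the parameter values are hypothetical) simply evaluates T_L from (11.64) and the worst-case outer-iteration bound K_max from (11.67).

```python
# Evaluating T_L from (11.64) and K_max from (11.67) at hypothetical parameters
# (NumPy assumed; this only plugs numbers into the two displayed formulas).
import numpy as np

L, lam, beta = 1.0, 1e-4, 1/20
V0, G, eps = 1.0, 1.0, 1e-6

omega = lambda t: t - np.log1p(t)

T_L = np.sqrt(2 + 2 * L / lam) * np.log(2 * L / (beta * lam))
K_max = 1 + (4 * V0 + 2 * G**2 / lam) / omega(1/6) + np.ceil(np.log2(2 * omega(1/6) / eps))

print(f"T_L ~ {T_L:.0f} PCG iterations per call, K_max ~ {K_max:.0f} outer iterations")
```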

Copyright information

© 2018 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Zhang, Y., Xiao, L. (2018). Communication-Efficient Distributed Optimization of Self-concordant Empirical Loss. In: Giselsson, P., Rantzer, A. (eds) Large-Scale and Distributed Optimization. Lecture Notes in Mathematics, vol 2227. Springer, Cham. https://doi.org/10.1007/978-3-319-97478-1_11
