Communication-Efficient Distributed Optimization of Self-concordant Empirical Loss

Large-Scale and Distributed Optimization

Part of the book series: Lecture Notes in Mathematics (LNM, volume 2227)

Abstract

We consider distributed convex optimization problems originating from sample average approximation of stochastic optimization, or empirical risk minimization in machine learning. We assume that each machine in the distributed computing system has access to a local empirical loss function, constructed with i.i.d. data sampled from a common distribution. We propose a communication-efficient distributed algorithm to minimize the overall empirical loss, which is the average of the local empirical losses. The algorithm is based on an inexact damped Newton method, where the inexact Newton steps are computed by a distributed preconditioned conjugate gradient method. We analyze its iteration complexity and communication efficiency for minimizing self-concordant empirical loss functions, and discuss the results for ridge regression, logistic regression and binary classification with a smoothed hinge loss. In a standard setting for supervised learning where the condition number of the problem grows with the square root of the sample size, the required number of communication rounds of the algorithm does not increase with the sample size, and only grows slowly with the number of machines.

References

  1. A. Agarwal, J.C. Duchi, Distributed delayed stochastic optimization, in Advances in Neural Information Processing Systems (NIPS) 24 (2011), pp. 873–881

  2. Y. Arjevani, O. Shamir, Communication complexity of distributed convex learning and optimization, in Advances in Neural Information Processing Systems (NIPS) 28 (2015), pp. 1756–1764

  3. M. Avriel, Nonlinear Programming: Analysis and Methods (Prentice-Hall, Upper Saddle River, 1976)

  4. F. Bach, Self-concordant analysis for logistic regression. Electron. J. Stat. 4, 384–414 (2010)

  5. R. Bekkerman, M. Bilenko, J. Langford, Scaling Up Machine Learning: Parallel and Distributed Approaches (Cambridge University Press, Cambridge, 2011)

  6. D.P. Bertsekas, J.N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods (Prentice-Hall, Upper Saddle River, 1989)

  7. R.H. Byrd, G.M. Chin, W. Neveitt, J. Nocedal, On the use of stochastic Hessian information in optimization methods for machine learning. SIAM J. Optim. 21(3), 977–995 (2011)

  8. J.A. Blackard, D.J. Dean, C.W. Anderson, Covertype data set, in UCI Machine Learning Repository, ed. by K. Bache, M. Lichman (School of Information and Computer Sciences, University of California, Irvine, 2013). http://archive.ics.uci.edu/ml

  9. O. Bousquet, A. Elisseeff, Stability and generalization. J. Mach. Learn. Res. 2, 499–526 (2002)

  10. S. Boyd, L. Vandenberghe, Convex Optimization (Cambridge University Press, Cambridge, 2004)

  11. S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2010)

  12. C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 27 (2011)

  13. C. Cortes, V. Vapnik, Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)

  14. J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

  15. A. Defazio, F. Bach, S. Lacoste-Julien, SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives, in Advances in Neural Information Processing Systems (NIPS) 27 (2014), pp. 1646–1654

  16. O. Dekel, R. Gilad-Bachrach, O. Shamir, L. Xiao, Optimal distributed online prediction using mini-batches. J. Mach. Learn. Res. 13(1), 165–202 (2012)

  17. R.S. Dembo, S.C. Eisenstat, T. Steihaug, Inexact Newton methods. SIAM J. Numer. Anal. 19(2), 400–408 (1982)

  18. W. Deng, W. Yin, On the global and linear convergence of the generalized alternating direction method of multipliers. J. Sci. Comput. 66(3), 889–916 (2016)

  19. J.E. Dennis, J.J. Moré, A characterization of superlinear convergence and its application to quasi-Newton methods. Math. Comput. 28(126), 549–560 (1974)

  20. J.C. Duchi, A. Agarwal, M.J. Wainwright, Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Trans. Autom. Control 57(3), 592–606 (2012)

  21. G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd edn. (The Johns Hopkins University Press, Baltimore, 1996)

  22. R.A. Horn, C.R. Johnson, Matrix Analysis (Cambridge University Press, Cambridge, 1985)

  23. S.S. Keerthi, D. DeCoste, A modified finite Newton method for fast solution of large scale linear SVMs. J. Mach. Learn. Res. 6, 341–361 (2005)

  24. K. Lang, Newsweeder: learning to filter netnews, in Proceedings of the Twelfth International Conference on Machine Learning (ICML) (1995), pp. 331–339

  25. N. Le Roux, M. Schmidt, F. Bach, A stochastic gradient method with an exponential convergence rate for finite training sets, in Advances in Neural Information Processing Systems (NIPS) 25 (2012), pp. 2672–2680

  26. J.D. Lee, Y. Sun, M. Saunders, Proximal Newton-type methods for minimizing composite functions. SIAM J. Optim. 24(3), 1420–1443 (2014)

  27. D.D. Lewis, Y. Yang, T. Rose, F. Li, RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004)

  28. Q. Lin, L. Xiao, An adaptive accelerated proximal gradient method and its homotopy continuation for sparse optimization. Comput. Optim. Appl. 60(3), 633–674 (2015)

  29. C.-Y. Lin, C.-H. Tsai, C.-P. Lee, C.-J. Lin, Large-scale logistic regression and linear support vector machines using Spark, in Proceedings of the IEEE Conference on Big Data, Washington, 2014

  30. Q. Lin, Z. Lu, L. Xiao, An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM J. Optim. 25(4), 2244–2273 (2015)

  31. Z. Lu, Randomized block proximal damped Newton method for composite self-concordant minimization. SIAM J. Optim. 27(3), 1910–1942 (2017)

  32. D.G. Luenberger, Introduction to Linear and Nonlinear Programming (Addison-Wesley, New York, 1973)

  33. L. Mackey, M.I. Jordan, R.Y. Chen, B. Farrell, J.A. Tropp et al., Matrix concentration inequalities via the method of exchangeable pairs. Ann. Probab. 42(3), 906–945 (2014)

  34. D. Mahajan, N. Agrawal, S.S. Keerthi, S. Sundararajan, L. Bottou, An efficient distributed learning algorithm based on effective local functional approximation. arXiv:1310.8418

  35. MPI Forum, MPI: a message-passing interface standard, version 3.0 (2012), http://www.mpi-forum.org

  36. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Kluwer, Boston, 2004)

  37. Y. Nesterov, Gradient methods for minimizing composite functions. Math. Program. Ser. B 140, 125–161 (2013)

  38. Y. Nesterov, A. Nemirovski, Interior Point Polynomial Time Methods in Convex Programming (SIAM, Philadelphia, 1994)

  39. J. Nocedal, S.J. Wright, Numerical Optimization, 2nd edn. (Springer, New York, 2006)

  40. M. Pilanci, M.J. Wainwright, Iterative Hessian sketch: fast and accurate solution approximation for constrained least-squares. J. Mach. Learn. Res. 17(53), 1–38 (2016)

  41. S.S. Ram, A. Nedić, V.V. Veeravalli, Distributed stochastic subgradient projection algorithms for convex optimization. J. Optim. Theory Appl. 147(3), 516–545 (2010)

  42. B. Recht, C. Re, S. Wright, F. Niu, Hogwild: a lock-free approach to parallelizing stochastic gradient descent, in Advances in Neural Information Processing Systems (2011), pp. 693–701

  43. K. Scaman, F. Bach, S. Bubeck, Y.T. Lee, L. Massoulié, Optimal algorithms for smooth and strongly convex distributed optimization in networks, in Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney (2017), pp. 3027–3036

  44. M. Schmidt, N. Le Roux, F. Bach, Minimizing finite sums with the stochastic average gradient. Math. Program. 162, 83–112 (2017)

  45. S. Shalev-Shwartz, T. Zhang, Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)

  46. S. Shalev-Shwartz, T. Zhang, Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Math. Program. 155(1), 105–145 (2015)

  47. S. Shalev-Shwartz, O. Shamir, N. Srebro, K. Sridharan, Stochastic convex optimization, in Proceedings of the 22nd Annual Conference on Learning Theory (COLT) (2009)

  48. J. Shalf, S. Dosanjh, J. Morrison, Exascale computing technology challenges, in Proceedings of the 9th International Conference on High Performance Computing for Computational Science, VECPAR’10, Berkeley (Springer, Berlin, 2011), pp. 1–25

  49. O. Shamir, N. Srebro, On distributed stochastic optimization and learning, in Proceedings of the 52nd Annual Allerton Conference on Communication, Control, and Computing (2014)

  50. O. Shamir, N. Srebro, T. Zhang, Communication efficient distributed optimization using an approximate Newton-type method, in Proceedings of the 31st International Conference on Machine Learning (ICML), JMLR: W&CP, vol. 32 (2014)

  51. A. Shapiro, D. Dentcheva, A. Ruszczyński, Lectures on Stochastic Programming: Modeling and Theory. MPS-SIAM Series on Optimization (SIAM-MPS, Philadelphia, 2009)

  52. R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 58, 267–288 (1996)

  53. Q. Tran-Dinh, A. Kyrillidis, V. Cevher, Composite self-concordant minimization. J. Mach. Learn. Res. 16, 371–416 (2015)

  54. L. Xiao, T. Zhang, A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4), 2057–2075 (2014)

  55. Y. Zhang, L. Xiao, Stochastic primal-dual coordinate method for regularized empirical risk minimization. J. Mach. Learn. Res. 18(84), 1–42 (2017)

  56. Y. Zhang, J.C. Duchi, M.J. Wainwright, Communication-efficient algorithms for statistical optimization. J. Mach. Learn. Res. 14, 3321–3363 (2013)

  57. Y. Zhuang, W.-S. Chin, Y.-C. Juan, C.-J. Lin, Distributed Newton method for regularized logistic regression. Technical Report, Department of Computer Science, National Taiwan University (2014)

Author information

Correspondence to Lin Xiao.

Appendices

Appendix 1: Proof of Theorem 1

First, we notice that Step 2 of Algorithm 1 is equivalent to

$$\displaystyle \begin{aligned} w_{k+1} - w_k = \frac{v_k}{1+\delta_k} = \frac{v_k}{1+\|{{\widetilde v}_k}\|{}_2}, \end{aligned}$$

which implies

$$\displaystyle \begin{aligned} \|{[f^{\prime\prime}(w_k)]^{1/2}(w_{k+1} - w_k)}\|{}_2 = \frac{\|{{\widetilde v}_k}\|{}_2}{1 + \|{{\widetilde v}_k}\|{}_2} < 1. \end{aligned} $$
(11.44)

When inequality (11.44) holds, Nesterov [36, Theorem 4.1.8] has shown that

$$\displaystyle \begin{aligned} f(w_{k+1}) \leq f(w_k) + \langle f'(w_k), w_{k+1} - w_k \rangle + \omega_*\bigl(\|{[f^{\prime\prime}(w_k)]^{1/2}(w_{k+1} - w_k)}\|{}_2\bigr) . \end{aligned} $$

Here we recall the definitions of the pair of conjugate functions

$$\displaystyle \begin{aligned} \omega(t) & = t - \log(1+t), \qquad t\geq 0, \\ \omega_*(t) & = -t - \log(1-t), \qquad 0\leq t < 1. \end{aligned} $$
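
As a quick aside, the two functions are conjugate in the Fenchel sense: ω_*(t) = sup_{s≥0}{st − ω(s)} for 0 ≤ t < 1. The following minimal sketch (not part of the chapter; it assumes NumPy is available) defines the pair and spot-checks this relation on a grid.

```python
# A minimal numerical sketch (not from the chapter; assumes NumPy): the pair
# omega(t) = t - log(1+t) and omega_*(t) = -t - log(1-t), together with a grid
# check that omega_* coincides with the Fenchel conjugate of omega over s >= 0.
import numpy as np

def omega(t):
    return t - np.log1p(t)            # defined for t >= 0

def omega_star(t):
    return -t - np.log1p(-t)          # defined for 0 <= t < 1

s = np.linspace(0.0, 50.0, 200001)    # dense grid for the sup over s >= 0
for t in np.linspace(0.0, 0.9, 10):
    sup_val = np.max(t * s - omega(s))            # sup_s { s*t - omega(s) }
    assert abs(sup_val - omega_star(t)) < 1e-3
print("omega_* matches the Fenchel conjugate of omega on the test grid")
```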

Using the definitions of ω and ω_* and some algebraic operations, we obtain

$$\displaystyle \begin{aligned} f(w_{k+1}) &\leq f(w_k) - \frac{\langle {\widetilde u}_k, {\widetilde v}_k \rangle}{1 + \|{{\widetilde v}_k}\|{}_2} - \frac{\|{{\widetilde v}_k}\|{}_2}{1 + \|{{\widetilde v}_k}\|{}_2} + \log(1 + \|{{\widetilde v}_k}\|{}_2 ) \\ &= f(w_k) - \omega(\|{{\widetilde u}_k}\|{}_2) + \big( \omega(\|{{\widetilde u}_k}\|{}_2) - \omega(\|{{\widetilde v}_k}\|{}_2)\big) + \frac{\langle {\widetilde v}_k - {\widetilde u}_k, {\widetilde v}_k \rangle}{1 + \|{{\widetilde v}_k}\|{}_2}. \end{aligned} $$
(11.45)

By the second-order mean-value theorem, we have

$$\displaystyle \begin{aligned} \omega(\|{{\widetilde u}_k}\|{}_2) - \omega(\|{{\widetilde v}_k}\|{}_2) = \omega'(\|{{\widetilde v}_k}\|{}_2)(\|{{\widetilde u}_k}\|{}_2-\|{{\widetilde v}_k}\|{}_2) +\frac{1}{2}\omega^{\prime\prime}(t)\left(\|{{\widetilde u}_k}\|{}_2-\|{{\widetilde v}_k}\|{}_2\right)^2 \end{aligned}$$

for some t satisfying

$$\displaystyle \begin{aligned} \min\{\|{{\widetilde u}_k}\|{}_2,\|{{\widetilde v}_k}\|{}_2\} \leq t \leq \max\{\|{{\widetilde u}_k}\|{}_2,\|{{\widetilde v}_k}\|{}_2\} . \end{aligned}$$

Using inequality (11.11), we can upper bound the second derivative ω″(t) as

$$\displaystyle \begin{aligned} \omega^{\prime\prime}(t) = \frac{1}{(1+t)^2} \leq \frac{1}{1+t} \leq \frac{1}{1+\min\{\|{{\widetilde u}_k}\|{}_2,\|{{\widetilde v}_k}\|{}_2\} } \leq \frac{1}{1+(1-\beta)\|{{\widetilde u}_k}\|{}_2}. \end{aligned}$$

Therefore,

$$\displaystyle \begin{aligned} \omega(\|{{\widetilde u}_k}\|{}_2) - \omega(\|{{\widetilde v}_k}\|{}_2) &=\frac{(\|{{\widetilde u}_k}\|{}_2-\|{{\widetilde v}_k}\|{}_2)\|{{\widetilde v}_k}\|{}_2}{1+\|{{\widetilde v}_k}\|{}_2} +\frac{1}{2}\omega^{\prime\prime}(t)\left(\|{{\widetilde u}_k}\|{}_2-\|{{\widetilde v}_k}\|{}_2\right)^2\\ &\leq \frac{\|{{\widetilde u}_k-{\widetilde v}_k}\|{}_2 \|{{\widetilde v}_k}\|{}_2}{1+(1-\beta)\|{{\widetilde u}_k}\|{}_2} + \frac{(1/2)\|{{\widetilde u}_k-{\widetilde v}_k}\|{}_2^2}{1+(1-\beta)\|{{\widetilde u}_k}\|{}_2}\\ &\leq \frac{\beta(1+\beta)\|{{\widetilde u}_k}\|{}_2^2+(1/2)\beta^2\|{{\widetilde u}_k}\|{}_2^2}{1+(1-\beta)\|{{\widetilde u}_k}\|{}_2} . \end{aligned} $$

In addition, we have

$$\displaystyle \begin{aligned} \frac{\langle {\widetilde v}_k - {\widetilde u}_k, {\widetilde v}_k \rangle}{1 + \|{{\widetilde v}_k}\|{}_2} \leq \frac{\|{{\widetilde u}_k-{\widetilde v}_k}\|{}_2 \|{{\widetilde v}_k}\|{}_2}{1+\|{{\widetilde v}_k}\|{}_2} \leq \frac{\beta(1+\beta)\|{{\widetilde u}_k}\|{}_2^2}{1+(1-\beta)\|{{\widetilde u}_k}\|{}_2}. \end{aligned}$$

Combining the two inequalities above, and using the relation t²∕(1 + t) ≤ 2ω(t) for all t ≥ 0, we obtain

$$\displaystyle \begin{aligned} \omega(\|{{\widetilde u}_k}\|{}_2) - \omega(\|{{\widetilde v}_k}\|{}_2) +\frac{\langle {\widetilde v}_k - {\widetilde u}_k, {\widetilde v}_k \rangle}{1 + \|{{\widetilde v}_k}\|{}_2} &\leq \left(2\beta(1+\beta)+(1/2)\beta^2\right) \\ &\quad \frac{\|{{\widetilde u}_k}\|{}_2^2}{1+(1-\beta)\|{{\widetilde u}_k}\|{}_2} \\ &= \left(\frac{2\beta+(5/2)\beta^2}{(1-\beta)^2}\right) \frac{(1-\beta)^2\|{{\widetilde u}_k}\|{}_2^2}{1+(1-\beta)\|{{\widetilde u}_k}\|{}_2} \\ &\leq \left(\frac{2\beta+(5/2)\beta^2}{(1-\beta)^2}\right) 2 \omega\bigl( (1-\beta)\|{{\widetilde u}_k}\|{}_2 \bigr) \\ &\leq \left(\frac{4\beta+5\beta^2}{1-\beta}\right) \omega\bigl( \|{{\widetilde u}_k}\|{}_2 \bigr) . \end{aligned} $$

In the last inequality above, we used the fact that for any t ≥ 0 we have

$$\displaystyle \begin{aligned} \omega((1-\beta)t) \leq (1-\beta)\omega(t), \end{aligned}$$

which is the result of convexity of ω(t) and ω(0) = 0. Substituting the above upper bound into inequality (11.45) yields

$$\displaystyle \begin{aligned} f(w_{k+1}) \leq f(w_k) - \left(1 - \frac{4\beta+5\beta^2}{1-\beta} \right) \omega(\|{{\widetilde u}_k}\|{}_2). \end{aligned} $$
(11.46)

With inequality (11.46), we are ready to prove the conclusions of Theorem 1. In particular, Part (a) of Theorem 1 holds for any 0 ≤ β ≤ 1∕10.

For part (b), we assume that \(\|{{\widetilde u}_k}\|{ }_2 \leq 1/6\). According to [36, Theorem 4.1.13], when \(\|{{\widetilde u}_k}\|{ }_2 < 1\), it holds that for every k ≥ 0,

$$\displaystyle \begin{aligned} \omega(\|{{\widetilde u}_k}\|{}_2) \leq f(w_k) - f(w_\star) \leq \omega_*(\|{{\widetilde u}_k}\|{}_2) . \end{aligned} $$
(11.47)

Combining this sandwich inequality with inequality (11.46), we have

$$\displaystyle \begin{aligned} \omega(\|{{\widetilde u}_{k+1}}\|{}_2) &\leq f(w_{k+1}) - f(w_\star) \\ &\leq f(w_k) - f(w_\star) - \omega(\|{{\widetilde u}_k}\|{}_2) + \frac{4\beta+5\beta^2}{1-\beta} \omega(\|{{\widetilde u}_k}\|{}_2)\\ &\leq \omega_*(\|{{\widetilde u}_k}\|{}_2) - \omega(\|{{\widetilde u}_k}\|{}_2) + \frac{4\beta+5\beta^2}{1-\beta}\omega(\|{{\widetilde u}_k}\|{}_2) . {} \end{aligned} $$
(11.48)

It is easy to verify that ω_*(t) − ω(t) ≤ 0.26 ω(t) for all t ≤ 1∕6, and

$$\displaystyle \begin{aligned} (4\beta+5\beta^2)/(1-\beta) \leq 0.23, \qquad \mbox{if}\quad \beta\leq 1/20. \end{aligned}$$

Applying these two inequalities to inequality (11.48) completes the proof.

It should be clear that other combinations of the value of β and bound on \(\|{{\widetilde u}_k}\|{ }_2\) are also possible. For example, for β = 1∕10 and \(\|{{\widetilde u}_k}\|{ }_2\leq 1/10\), we have \(\omega (\|{{\widetilde u}_{k+1}}\|{ }_2)\leq 0.65\, \omega (\|{{\widetilde u}_k}\|{ }_2)\).
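
The elementary constants used above are easy to confirm numerically. The sketch below (NumPy assumed, with ω and ω_* as defined earlier) checks ω_*(t) − ω(t) ≤ 0.26 ω(t) on (0, 1∕6], the bound (4β + 5β²)∕(1 − β) ≤ 0.23 at β = 1∕20, the relation t²∕(1 + t) ≤ 2ω(t), and the contraction factor 0.65 quoted for β = 1∕10 and ∥ũ_k∥_2 ≤ 1∕10.

```python
# Numerical spot-checks of the constants used in the proof of Theorem 1
# (a sketch only; NumPy assumed).
import numpy as np

omega = lambda t: t - np.log1p(t)
omega_star = lambda t: -t - np.log1p(-t)

t = np.linspace(1e-6, 1/6, 100000)
assert np.all(omega_star(t) - omega(t) <= 0.26 * omega(t))       # for t <= 1/6

beta = 1/20
assert (4*beta + 5*beta**2) / (1 - beta) <= 0.23                  # for beta = 1/20

s = np.linspace(0.0, 100.0, 100000)
assert np.all(s**2 / (1 + s) <= 2 * omega(s) + 1e-12)             # t^2/(1+t) <= 2*omega(t)

# The alternative combination mentioned above: beta = 1/10 and ||u_k||_2 <= 1/10
# give a contraction factor of at most 0.65 in inequality (11.48).
beta = 1/10
t = np.linspace(1e-6, 1/10, 100000)
factor = (omega_star(t) - omega(t)) / omega(t) + (4*beta + 5*beta**2) / (1 - beta)
assert np.all(factor <= 0.65)
print("all constant checks for Theorem 1 passed")
```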

Appendix 2: Proof of Theorem 2 and Corollary 2

First we prove Theorem 2. We start with the inequality (11.45), and upper bound the last two terms on its right-hand side. Since ω′(t) = t∕(1 + t) < 1, we have

$$\displaystyle \begin{aligned} \omega(\|{{\widetilde u}_k}\|{}_2)-\omega(\|{{\widetilde v}_k}\|{}_2) \leq \big| \|{{\widetilde u}_k}\|{}_2-\|{{\widetilde v}_k}\|{}_2\big| \leq \|{{\widetilde u}_k-{\widetilde v}_k}\|{}_2 .\end{aligned} $$

In addition, we have

$$\displaystyle \begin{aligned} \frac{\langle {\widetilde v}_k-{\widetilde u}_k, {\widetilde v}_k\rangle}{1+\|{{\widetilde v}_k}\|{}_2} \leq \frac{\|{{\widetilde v}_k}\|{}_2}{1+\|{{\widetilde v}_k}\|{}_2} \|{{\widetilde u}_k-{\widetilde v}_k}\|{}_2 \leq \|{{\widetilde u}_k-{\widetilde v}_k}\|{}_2 .\end{aligned} $$

Applying these two bounds to (11.45), we obtain

$$\displaystyle \begin{aligned} f(w_{k+1}) \leq f(w_k) - \omega(\|{{\widetilde u}_k}\|{}_2) + 2\|{{\widetilde u}_k-{\widetilde v}_k}\|{}_2 .\end{aligned} $$
(11.49)

Next we bound \(\|{{\widetilde u}_k-{\widetilde v}_k}\|{ }_2\) using the approximation tolerance 𝜖 k specified in (11.14),

$$\displaystyle \begin{aligned} \|{{\widetilde u}_k-{\widetilde v}_k}\|{}_2 & = \left\| [f^{\prime\prime}(w_k)]^{-1/2}f'(w_k) - [f^{\prime\prime}(w_k)]^{1/2} v_k \right\|{}_2 \\ & = \left\| [f^{\prime\prime}(w_k)]^{-1/2} \bigl( f^{\prime\prime}(w_k) v_k - f'(w_k) \bigr)\right\|{}_2\\ &\leq \lambda^{-1/2} \left\| f^{\prime\prime}(w_k) v_k - f'(w_k) \right\|{}_2\\ &\leq \lambda^{-1/2} \epsilon_k \\ &= \frac{1}{2} \min\left\{ \frac{\omega(r_k)}{2}, \frac{\omega^{3/2}(r_k)}{10} \right\}.\end{aligned} $$

Combining the above inequality with (11.49), and using \(r_k = L^{-1/2}\|{f'(w_k)}\|{ }_2 \leq \|{{\widetilde u}_k}\|{ }_2\) with the monotonicity of ω, we arrive at

$$\displaystyle \begin{aligned} f(w_{k+1}) \leq f(w_k) - \omega(\|{{\widetilde u}_k}\|{}_2) + \min\left\{ \frac{\omega(\|{{\widetilde u}_k}\|{}_2)}{2}, \frac{\omega^{3/2}(\|{{\widetilde u}_k}\|{}_2)}{10} \right\}. \end{aligned} $$
(11.50)

Part (a) of the theorem follows immediately from inequality (11.50).

For part (b), we assume that \(\|{{\widetilde u}_k}\|{ }_2\leq 1/8\). Combining (11.47) with (11.50), we have

$$\displaystyle \begin{aligned} \omega(\|{{\widetilde u}_{k+1}}\|{}_2) & \leq f(w_{k+1}) - f(w_\star) \leq f(w_k)-f(w_\star) - \omega(\|{{\widetilde u}_k}\|{}_2) + \frac{\omega^{3/2}(\|{{\widetilde u}_k}\|{}_2)}{10} \\ & \leq \omega_*(\|{{\widetilde u}_k}\|{}_2) - \omega(\|{{\widetilde u}_k}\|{}_2) + \frac{\omega^{3/2}(\|{{\widetilde u}_k}\|{}_2)}{10}. {} \end{aligned} $$
(11.51)

Let h(t) = ω_*(t) − ω(t) and consider only t ≥ 0. Notice that h(0) = 0 and \(h'(t)=\frac {2t^2}{1-t^2}\leq \frac {128}{63}t^2\) for t ≤ 1∕8. Thus, we conclude that \(h(t)\leq \frac {128}{189}t^3\) for t ≤ 1∕8. We also notice that ω(0) = 0 and \(\omega '(t)=\frac {t}{1+t}\geq \frac {8}{9}t\) for t ≤ 1∕8. Thus, we have \(\omega (t)\geq \frac {4}{9}t^2\) for t ≤ 1∕8. Combining these results, we obtain

$$\displaystyle \begin{aligned} \omega_*(t)-\omega(t) \leq \frac{128}{189}t^3 = \frac{128}{189}(t^2)^{3/2} \leq \frac{128}{189}\left(\frac{9}{4}\omega(t)\right)^{3/2} \leq \left(\sqrt{6}-\frac{1}{10}\right) \omega^{3/2}(t) . \end{aligned}$$

Applying this inequality to the right-hand side of (11.51) completes the proof.
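
Again, the numerical constants in this argument can be verified directly; the sketch below (NumPy assumed) checks h(t) ≤ (128∕189)t³, ω(t) ≥ (4∕9)t², the comparison (128∕189)(9∕4)^{3∕2} ≤ √6 − 1∕10, and the resulting inequality ω_*(t) − ω(t) ≤ (√6 − 1∕10) ω^{3∕2}(t) on (0, 1∕8].

```python
# Numerical spot-checks of the constants in the proof of Theorem 2, part (b)
# (a sketch only; NumPy assumed; t is restricted to (0, 1/8]).
import numpy as np

omega = lambda t: t - np.log1p(t)
omega_star = lambda t: -t - np.log1p(-t)

t = np.linspace(1e-4, 1/8, 100000)
h = omega_star(t) - omega(t)

assert np.all(h <= (128/189) * t**3)                          # h(t) <= (128/189) t^3
assert np.all(omega(t) >= (4/9) * t**2)                       # omega(t) >= (4/9) t^2
assert (128/189) * (9/4)**1.5 <= np.sqrt(6) - 0.1             # comparison of constants
assert np.all(h <= (np.sqrt(6) - 0.1) * omega(t)**1.5)        # the displayed inequality
print("all constant checks for Theorem 2(b) passed")
```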

Next we prove Corollary 2. By part (a) of Theorem 2, if \(\|{{\widetilde u}_k}\|{ }_2\geq 1/8\), then each iteration of Algorithm 1 decreases the function value by at least the constant \(\frac {1}{2}\omega (1/8)\). So within at most \(K_1=\left \lceil \frac {2(f(w_0)-f(w_\star ))}{\omega (1/8)}\right \rceil \) iterations, we are guaranteed to have \(\|{{\widetilde u}_k}\|{ }_2\leq 1/8\). Part (b) of Theorem 2 implies \(6\,\omega (\|{{\widetilde u}_{k+1}}\|{ }_2) \leq \left ( 6\,\omega (\|{{\widetilde u}_k}\|{ }_2)\right )^{3/2}\) when \(\|{{\widetilde u}_k}\|{ }_2\leq 1/8\), and hence

$$\displaystyle \begin{aligned} \log\bigl(6\,\omega(\|{{\widetilde u}_k}\|{}_2)\bigr) \leq\left(\frac{3}{2}\right)^{k-K_1} \log\left(6\,\omega(1/8)\right), \qquad k \geq K_1. \end{aligned}$$

Note that both sides of the above inequality are negative. Therefore, after \(k\geq K_1 + \frac {\log \log (1/(3\epsilon ))}{\log (3/2)}\) iterations (assuming 𝜖 ≤ 1∕(3e)), we have

$$\displaystyle \begin{aligned} \log\bigl(6\,\omega(\|{{\widetilde u}_k}\|{}_2)\bigr) \leq \log(1/(3\epsilon)) \log(6\,\omega(1/8)) \leq -\log(1/(3\epsilon)), \end{aligned}$$

which implies \(\omega (\|{{\widetilde u}_k}\|{ }_2) \leq \epsilon /2\). Finally, using (11.47) and the fact that ω_*(t) ≤ 2 ω(t) for t ≤ 1∕8, we obtain

$$\displaystyle \begin{aligned} f(w_k)-f(w_\star) \leq \omega_*(\|{{\widetilde u}_k}\|{}_2) \leq 2\,\omega(\|{{\widetilde u}_k}\|{}_2) \leq \epsilon. \end{aligned}$$

This completes the proof of Corollary 2.
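
The two-phase behaviour behind Corollary 2 can be seen by iterating the contraction from part (b) directly. The sketch below (NumPy assumed; the target accuracy ε is a hypothetical value chosen here for illustration) tracks a_k = 6ω(∥ũ_k∥_2), applies a_{k+1} = a_k^{3∕2} starting from a = 6ω(1∕8), and confirms that the number of iterations until ω(∥ũ_k∥_2) ≤ ε∕2 stays within the ⌈log log(1∕(3ε))∕log(3∕2)⌉ bound from the proof.

```python
# A sketch of the superlinearly convergent phase in the proof of Corollary 2
# (NumPy assumed; epsilon is a hypothetical target accuracy).
import numpy as np

omega = lambda t: t - np.log1p(t)
epsilon = 1e-10                      # hypothetical accuracy, epsilon <= 1/(3e)

a = 6 * omega(1/8)                   # a_k = 6*omega(||u_k||_2) once ||u_k||_2 <= 1/8
iters = 0
while a / 6 > epsilon / 2:           # stop once omega(||u_k||_2) <= epsilon/2
    a = a ** 1.5                     # part (b) of Theorem 2: a_{k+1} <= a_k^{3/2}
    iters += 1

bound = int(np.ceil(np.log(np.log(1 / (3 * epsilon))) / np.log(1.5)))
print(f"observed iterations: {iters}, bound from the proof: {bound}")
assert iters <= bound
```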

Appendix 3: Proof of Lemma 4

It suffices to show that the algorithm terminates at iteration t ≤ T_μ − 1, because when the algorithm terminates, it outputs a vector v_k which satisfies

$$\displaystyle \begin{aligned} \|{H v_k - f'(w_k)}\|{}_2 = \|{r^{(t+1)}}\|{}_2 \leq \epsilon_k. \end{aligned}$$

Denote by v^* = H^{−1}f′(w_k) the solution of the linear system Hv_k = f′(w_k). By the classical analysis of the preconditioned conjugate gradient method (see, e.g., [3, 32]), Algorithm 2 has the following convergence property:

$$\displaystyle \begin{aligned} (v^{(t)} - v^*)^T H (v^{(t)} - v^*) \leq 4 \left( \frac{\sqrt{\kappa} -1 }{\sqrt{\kappa}+1}\right)^{2t} (v^*)^T H v^*, \end{aligned} $$
(11.52)

where κ = 1 + 2μ∕λ is the condition number of P^{−1}H given in (11.25). For the left-hand side of inequality (11.52), we have

$$\displaystyle \begin{aligned} (v^{(t)} - v^*)^T H (v^{(t)} - v^*) = (r^{(t)})^T H^{-1} r^{(t)} \geq \frac{ \|{r^{(t)}}\|{}_2^2}{L}. \end{aligned} $$

For the right-hand side of inequality (11.52), we have

$$\displaystyle \begin{aligned} (v^*)^T H v^* & = (f'(w_k))^T H^{-1} f'(w_k) \leq \frac{\|{f'(w_k)}\|{}_2^2}{\lambda} . \end{aligned} $$

Combining the above two inequalities with inequality (11.52), we obtain

$$\displaystyle \begin{aligned} \|{r^{(t)}}\|{}_2 \leq 2 \sqrt{\frac{L}{\lambda}} \left( \frac{\sqrt{\kappa} -1 }{\sqrt{\kappa}+1}\right)^{t} \|{f'(w_k)}\|{}_2 \leq 2 \sqrt{\frac{L}{\lambda}} \left(1 - \sqrt{\frac{\lambda}{\lambda + 2\mu}}\right)^{t}\|{f'(w_k)}\|{}_2 . \end{aligned} $$

To guarantee that ∥r^{(t)}∥_2 ≤ 𝜖_k, it suffices to have

$$\displaystyle \begin{aligned} t ~\geq~ \frac{\log\Big(\frac{2 \sqrt{L/\lambda} \|{f'(w_k)}\|{}_2}{\epsilon_k}\Big)}{- \log\left(1 - \sqrt{\frac{\lambda}{\lambda + 2\mu}}\right)} ~\geq~ \sqrt{1+\frac{2\mu}{\lambda}}\, \log\bigg(\frac{2 \sqrt{L/\lambda} \|{f'(w_k)}\|{}_2}{\epsilon_k}\bigg), \end{aligned} $$

where in the last inequality we used \(-\log (1-x) \geq x\) for 0 < x < 1. Comparing with the definition of T_μ in (11.26), this is the desired result.
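
As an illustration of Lemma 4, the self-contained sketch below (NumPy assumed; the matrix H, the parameters λ, L, μ, and the simplified preconditioner P = H + μI are all synthetic stand-ins, not the chapter's construction in (11.25)) runs preconditioned conjugate gradient on Hv = f′(w_k) and checks that the residuals stay below the bound 2√(L∕λ)(1 − √(λ∕(λ + 2μ)))^t ∥f′(w_k)∥_2 derived above. Note that for P = H + μI the condition number of P^{−1}H is still at most 1 + 2μ∕λ, so the bound applies.

```python
# A synthetic sketch of the preconditioned CG argument in Lemma 4 (NumPy assumed;
# H, P and all parameters below are made up for illustration; the preconditioner
# P = H + mu*I is a simplified stand-in whose condition number kappa(P^{-1}H)
# is still at most 1 + 2*mu/lam).
import numpy as np

rng = np.random.default_rng(0)
d, lam, L, mu = 50, 0.1, 10.0, 1.0

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag(np.linspace(lam, L, d)) @ Q.T   # lam*I <= H <= L*I
P = H + mu * np.eye(d)                          # stand-in preconditioner
b = rng.standard_normal(d)                      # plays the role of f'(w_k)

# Preconditioned conjugate gradient for H v = b, starting from v = 0.
v = np.zeros(d)
r = b - H @ v
z = np.linalg.solve(P, r)
p = z.copy()
rate = 1.0 - np.sqrt(lam / (lam + 2 * mu))      # contraction factor from Lemma 4
for t in range(1, 26):
    Hp = H @ p
    alpha = (r @ z) / (p @ Hp)
    v += alpha * p
    r_new = r - alpha * Hp
    z_new = np.linalg.solve(P, r_new)
    p = z_new + ((r_new @ z_new) / (r @ z)) * p
    r, z = r_new, z_new
    bound = 2 * np.sqrt(L / lam) * rate**t * np.linalg.norm(b)
    assert np.linalg.norm(r) <= bound           # residual bound from the proof
print("PCG residuals respect the Lemma 4 bound on this synthetic instance")
```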

Appendix 4: Proof of Lemma 5

First, we prove inequality (11.31). Recall that w_⋆ and \({\widehat w}_i\) minimize f(w) and \(f_i(w) + (\rho /2)\|{w}\|{ }_2^2\), respectively. Since both functions are λ-strongly convex, we have

$$\displaystyle \begin{aligned} \frac{\lambda}{2}\|{w_\star}\|{}_2^2 &\leq f(w_\star) \leq f(0) \leq V_0,\\ \frac{\lambda}{2}\|{{\widehat w}_i}\|{}_2^2 &\leq f_i({\widehat w}_i) + \frac{\rho}{2}\|{{\widehat w}_i}\|{}_2^2 \leq f_i(0) \leq V_0, \end{aligned} $$

where we also used Assumption 2(a) in the first inequality on both lines. These two inequalities imply \(\|{w_\star }\|{ }_2 \leq \sqrt {2V_0/\lambda }\) and \(\|{{\widehat w}_i}\|{ }_2 \leq \sqrt {2V_0/\lambda }\). Then the inequality (11.31) follows since w_0 is the average of \({\widehat w}_i\) for i = 1, …, m.

Next we prove inequality (11.32). Let z be a random variable in \(\mathcal {Z}\subset \mathbb {R}^p\) with an unknown probability distribution. We define a regularized population risk:

$$\displaystyle \begin{aligned} R(w) = \mathbb{E}_z[\phi(w, z)] + \frac{\lambda+\rho}{2}\|{w}\|{}_2^2. \end{aligned}$$

Let S be a set of n i.i.d. samples in \(\mathcal {Z}\) from the same distribution. We define a regularized empirical risk

$$\displaystyle \begin{aligned} r_S(w) = \frac{1}{n}\sum_{z\in S}\phi(w,z) + \frac{\lambda+\rho}{2}\|{w}\|{}_2^2, \end{aligned}$$

and its minimizer

$$\displaystyle \begin{aligned} {\widehat w}_S = \arg\min_w ~r_S(w). \end{aligned}$$

The following lemma states that the population risk of \({\widehat w}_S\) is very close to its empirical risk. The proof is based on the notion of stability of regularized ERM [9].

Lemma 7

Suppose Assumption 2 holds and S is a set of n i.i.d. samples in \(\mathcal {Z}\) . Then

$$\displaystyle \begin{aligned} \mathbb{E}_S\bigl[R({\widehat w}_S) - r_S({\widehat w}_S)\bigr] \leq \frac{2G^2}{\rho n}. \end{aligned}$$

Proof

Let S = {z_1, …, z_n}. For any k ∈ {1, …, n}, we define a modified training set S^{(k)} by replacing z_k with another sample \({\widetilde z}_k\), which is drawn from the same distribution and is independent of S. The empirical risk on S^{(k)} is defined as

$$\displaystyle \begin{aligned} r_S^{(k)}(w) = \frac{1}{n}\sum_{z\in S^{(k)}} \phi(w,z) + \frac{\lambda+\rho}{2}\|{w}\|{}_2^2. \end{aligned}$$

Let \({\widehat w}_S^{(k)}= \arg \min _w r_S^{(k)}(w)\). Since both r_S and \(r_S^{(k)}\) are ρ-strongly convex, we have

$$\displaystyle \begin{aligned} r_S({\widehat w}_S^{(k)}) - r_S({\widehat w}_S) &\geq \frac{\rho}{2} \|{{\widehat w}_S^{(k)} - {\widehat w}_S}\|{}_2^2, \\ r_S^{(k)}({\widehat w}_S) - r_S^{(k)}({\widehat w}_S^{(k)}) &\geq \frac{\rho}{2} \|{{\widehat w}_S^{(k)} - {\widehat w}_S}\|{}_2^2. \end{aligned} $$

Summing the above two inequalities, and noticing that

$$\displaystyle \begin{aligned} r_S(w) - r_S^{(k)}(w) = \frac{1}{n}(\phi(w,z_k) - \phi(w,{\widetilde z}_k)), \end{aligned}$$

we have

$$\displaystyle \begin{aligned} \|{{\widehat w}_S^{(k)}- {\widehat w}_S}\|{}_2^2 \leq \frac{1}{\rho n}\left( \phi({\widehat w}_S^{(k)},z_k) - \phi({\widehat w}_S^{(k)},{\widetilde z}_k) - \phi({\widehat w}_S,z_k) + \phi({\widehat w}_S, {\widetilde z}_k)\right) . \end{aligned} $$
(11.53)

By Assumption 2(b) and the facts \(\|{{\widehat w}_S}\|{ }_2\leq \sqrt {2V_0/\lambda }\) and \(\|{{\widehat w}_S^{(k)}}\|{ }_2\leq \sqrt {2V_0/\lambda }\), we have

$$\displaystyle \begin{aligned} \big|\phi({\widehat w}_S^{(k)},z) - \phi({\widehat w}_S,z)\big| \leq G \|{{\widehat w}_S^{(k)} - {\widehat w}_S}\|{}_2, \qquad \forall\, z\in \mathcal{Z}. \end{aligned}$$

Combining the above Lipschitz condition with (11.53), we obtain

$$\displaystyle \begin{aligned} \|{{\widehat w}_S^{(k)}- {\widehat w}_S}\|{}_2^2 \leq \frac{2 G}{\rho n} \|{{\widehat w}_S^{(k)} - {\widehat w}_S}\|{}_2. \end{aligned}$$

As a consequence, we have \(\|{{\widehat w}_S^{(k)} - {\widehat w}_S}\|{ }_2 \leq \frac {2 G}{\rho n}\), and therefore

$$\displaystyle \begin{aligned} \big|\phi({\widehat w}_S^{(k)},z) - \phi({\widehat w}_S,z)\big| \leq \frac{2 G^2}{\rho n}, \qquad \forall\, z\in \mathcal{Z}. \end{aligned} $$
(11.54)

In the terminology of learning theory, this means that empirical minimization over the regularized loss r_S has uniform stability 2G²∕(ρn) with respect to the loss function ϕ; see [9].

For any fixed k ∈{1, …, n}, since \({\widetilde z}_k\) is independent of S, we have

$$\displaystyle \begin{aligned} \mathbb{E}_S\bigl[R({\widehat w}_S) - r_S({\widehat w}_S)\bigr] &= \mathbb{E}_S \bigg[ \mathbb{E}_{{\widetilde z}_k}[\phi({\widehat w}_S,{\widetilde z}_k)] - \frac{1}{n}\sum_{j=1}^n \phi({\widehat w}_S, z_j) \bigg] \\ &= \mathbb{E}_{S,{\widetilde z}_k}\bigl[\phi({\widehat w}_S,{\widetilde z}_k)-\phi({\widehat w}_S,z_k)\bigr] \\ &= \mathbb{E}_{S,{\widetilde z}_k}\bigl[\phi({\widehat w}_S,{\widetilde z}_k)-\phi({\widehat w}_S^{(k)},{\widetilde z}_k)\bigr], \end{aligned} $$

where the second equality used the fact that \(\mathbb {E}_S[\phi ({\widehat w}_S,z_j)]\) has the same value for all j = 1, …, n, and the third equality used the symmetry between the pairs (S, z_k) and \((S^{(k)}, {\widetilde z}_k)\) (also known as the renaming trick; see [9, Lemma 7]). Combining the above equality with (11.54) yields the desired result. □
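
A small numerical illustration of this stability bound (a sketch, with assumptions made here for convenience rather than taken from the chapter: the logistic loss φ(w, z) = log(1 + exp(−y xᵀw)), synthetic Gaussian data, and SciPy's L-BFGS-B solver) replaces one training label and checks ∥ŵ_S^{(k)} − ŵ_S∥_2 ≤ 2G∕(ρn), with G = max_i ∥x_i∥_2 serving as the Lipschitz constant.

```python
# A numerical illustration of the uniform-stability argument in Lemma 7
# (a sketch: the logistic loss, synthetic data, and SciPy's solver are
# assumptions for illustration, not choices made in the chapter).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, d, lam, rho = 200, 5, 1e-3, 1e-2
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
G = np.max(np.linalg.norm(X, axis=1))     # Lipschitz constant of the logistic loss

def reg_erm(ys):
    """Minimize r_S(w) = (1/n) sum_i log(1+exp(-y_i x_i'w)) + (lam+rho)/2 ||w||^2."""
    def fg(w):
        m = ys * (X @ w)
        f = np.mean(np.logaddexp(0.0, -m)) + 0.5 * (lam + rho) * (w @ w)
        g = X.T @ (-ys / (1.0 + np.exp(m))) / n + (lam + rho) * w
        return f, g
    return minimize(fg, np.zeros(d), jac=True, method="L-BFGS-B", tol=1e-12).x

w_S = reg_erm(y)
y_mod = y.copy()
y_mod[0] = -y[0]                          # replace z_1 by the same point with a flipped label
w_Sk = reg_erm(y_mod)

print(np.linalg.norm(w_Sk - w_S), "<=", 2 * G / (rho * n))
assert np.linalg.norm(w_Sk - w_S) <= 2 * G / (rho * n)
```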

Next, we consider a distributed system with m machines, where each machine has a local dataset S_i of size n, for i = 1, …, m. To simplify notation, we denote the local regularized empirical loss function and its minimizer by r_i and \({\widehat w}_i\), respectively. We would like to bound the excess error when applying \({\widehat w}_i\) to a different dataset S_j. Notice that

$$\displaystyle \begin{aligned} \mathbb{E}_{S_i,S_j}\bigl[r_j({\widehat w}_i) - r_j({\widehat w}_j)\bigr] &= \underbrace{\mathbb{E}_{S_i,S_j}\bigl[r_j({\widehat w}_i)-r_i({\widehat w}_i)\bigr]}_{v_1} + \underbrace{\mathbb{E}_{S_i,S_j}\bigl[r_i({\widehat w}_i)-r_j({\widehat w}_R)\bigr]}_{v_2} \\ &\quad + \underbrace{\mathbb{E}_{S_j}\bigl[r_j({\widehat w}_R)-r_j({\widehat w}_j)\bigr]}_{v_3}, {} \end{aligned} $$
(11.55)

where \({\widehat w}_R\) is the minimizer of R(w). Since S_i and S_j are independent, we have

$$\displaystyle \begin{aligned} v_1 = \mathbb{E}_{S_i}\big[\mathbb{E}_{S_j}[r_j({\widehat w}_i)] - r_i({\widehat w}_i)\big] = \mathbb{E}_{S_i}\big[R({\widehat w}_i) - r_i({\widehat w}_i)] \leq \frac{2 G^2}{\rho n} , \end{aligned} $$

where the inequality is due to Lemma 7. For the second term in (11.55), we have

$$\displaystyle \begin{aligned} v_2 = \mathbb{E}_{S_i}\bigl[r_i({\widehat w}_i) - \mathbb{E}_{S_j}[r_j({\widehat w}_R)]\bigr] = \mathbb{E}_{S_i}\bigl[r_i({\widehat w}_i) - r_i({\widehat w}_R)\bigr] \leq 0. \end{aligned} $$

It remains to bound the third term v_3. We first use the strong convexity of r_j to obtain (see, e.g., [36, Theorem 2.1.10])

$$\displaystyle \begin{aligned} r_j({\widehat w}_R) - r_j({\widehat w}_j) \leq \frac{\|{r_j^{\prime}({\widehat w}_R)}\|{}_2^2}{2\rho}, \end{aligned} $$
(11.56)

where \(r^{\prime }_j({\widehat w}_R)\) denotes the gradient of r_j at \({\widehat w}_R\). If we index the elements of S_j by z_1, …, z_n, then

$$\displaystyle \begin{aligned} r_j^{\prime}({\widehat w}_R) = \frac{1}{n} \sum_{k=1}^n \left(\phi'({\widehat w}_R, z_k) + (\lambda+\rho) {\widehat w}_R \right). \end{aligned} $$
(11.57)

By the optimality condition of \({\widehat w}_R=\arg \min _w R(w)\), we have for any k ∈{1, …, n},

$$\displaystyle \begin{aligned} \mathbb{E}_{z_k}\bigl[ \phi'({\widehat w}_R,z_k) + (\lambda+\rho){\widehat w}_R\bigr] = 0. \end{aligned}$$

Therefore, according to (11.57), the gradient \(r_j^{\prime}({\widehat w}_R)\) is the average of n independent, zero-mean random vectors. Combining (11.56) and (11.57) with the definition of v_3 in (11.55), we have

$$\displaystyle \begin{aligned} v_3 \leq \frac{\mathbb{E}_{S_j}\bigl[\|{r_j^{\prime}({\widehat w}_R)}\|{}_2^2\bigr]}{2\rho} = \frac{1}{2\rho n^2} \sum_{k=1}^n \mathbb{E}_{z_k}\Bigl[ \bigl\|{\phi'({\widehat w}_R, z_k) + (\lambda+\rho){\widehat w}_R}\bigr\|{}_2^2 \Bigr] \leq \frac{G^2}{\rho n}. \end{aligned}$$

In the equality above, we used the fact that \(\phi '({\widehat w}_R,z_k)+(\lambda +\rho ){\widehat w}_R\) are i.i.d. zero-mean random vectors, so the variance of their sum equals the sum of their variances. The last inequality above is due to Assumption 2(b) and the fact that \(\|{{\widehat w}_R}\|{ }_2\leq \sqrt {2V_0/(\lambda +\rho )}\leq \sqrt {2V_0/\lambda }\). Combining the upper bounds for v_1, v_2 and v_3, we have

$$\displaystyle \begin{aligned} \mathbb{E}_{S_i,S_j} \left[r_j({\widehat w}_i) - r_j({\widehat w}_j)\right] \leq \frac{3 G^2}{\rho n}.\end{aligned} $$
(11.58)

Recall the definition of f as

$$\displaystyle \begin{aligned} f(w) = \frac{1}{mn}\sum_{i=1}^m\sum_{k=1}^n \phi(w, z_{i,k})+\frac{\lambda}{2}\|{w}\|{}_2^2, \end{aligned}$$

where z_{i,k} denotes the kth sample at machine i. Let \(r(w) = (1/m)\sum _{j=1}^m r_j(w)\); then

$$\displaystyle \begin{aligned} r(w) = f(w) + \frac{\rho}{2} \|{w}\|{}_2^2 . \end{aligned} $$
(11.59)

We compare the value \(r({\widehat w}_i)\), for any i ∈{1, …, m}, with the minimum of r(w):

$$\displaystyle \begin{aligned} r({\widehat w}_i) -\min_w r(w) & = \frac{1}{m}\sum_{j=1}^m r_j({\widehat w}_i) - \min_w\frac{1}{m}\sum_{j=1}^m r_j(w) \\ & \leq \frac{1}{m}\sum_{j=1}^m r_j({\widehat w}_i) - \frac{1}{m}\sum_{j=1}^m \min_w r_j(w) \\ & = \frac{1}{m}\sum_{j=1}^m \left( r_j({\widehat w}_i) - r_j({\widehat w}_j)\right) . \end{aligned} $$

Taking expectation with respect to all the random datasets S_1, …, S_m, we obtain

$$\displaystyle \begin{aligned} \mathbb{E}[ r({\widehat w}_i) - \min_w r(w)] \leq \frac{1}{m} \sum_{j=1}^m \mathbb{E}[ r_j({\widehat w}_i)- r_j({\widehat w}_j)] \leq \frac{3 G^2}{\rho n}, \end{aligned} $$
(11.60)

where the last inequality is due to (11.58). Finally, we bound the expected value of \(f({\widehat w}_i)\):

$$\displaystyle \begin{aligned} \mathbb{E}[f({\widehat w}_i)] &\leq \mathbb{E}[r({\widehat w}_i)] \leq \mathbb{E}\left[\min_w r(w)\right] + \frac{3G^2}{\rho n} \\ &\leq \mathbb{E}\left[f(w_\star)+\frac{\rho}{2}\|{w_\star}\|{}_2^2\right]+ \frac{3G^2}{\rho n} \\ & \leq \mathbb{E}\left[f(w_\star)\right]+\frac{\rho D^2}{2}+ \frac{3G^2}{\rho n}, \end{aligned} $$

where the first inequality holds because of (11.59), the second inequality is due to (11.60), and the last inequality follows from the assumption that \(\mathbb {E}[\|{w_\star }\|{ }_2^2]\leq D^2\). Choosing \(\rho = \sqrt {6 G^2/(n D^2)}\) results in

$$\displaystyle \begin{aligned} \mathbb{E}[f( {\widehat w}_i) - f(w_\star)] \leq \frac{\sqrt{6}G D}{\sqrt{n}}, \qquad i=1,\ldots,m. \end{aligned}$$

Since \(w_0=(1/m)\sum _{i=1}^m {\widehat w}_i\), we can use the convexity of the function f to conclude that \(\mathbb {E}[f(w_0) - f(w_\star )] \leq \sqrt {6}G D/\sqrt {n}\), which is the desired result.
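
The specific value of ρ is simply the minimizer of the upper bound ρD²∕2 + 3G²∕(ρn) over ρ > 0, and the minimal value is √6 GD∕√n. A quick check (NumPy assumed; the values of G, D and n below are hypothetical):

```python
# Sanity check of the choice rho = sqrt(6 G^2 / (n D^2)) in the proof of Lemma 5
# (NumPy assumed; G, D, n are hypothetical values).
import numpy as np

G, D, n = 1.0, 10.0, 10000
rho = np.linspace(1e-4, 1.0, 200000)
upper = rho * D**2 / 2 + 3 * G**2 / (rho * n)     # rho*D^2/2 + 3*G^2/(rho*n)

rho_star = np.sqrt(6 * G**2 / (n * D**2))
print("grid minimizer:", rho[np.argmin(upper)], " closed form:", rho_star)
assert abs(np.min(upper) - np.sqrt(6) * G * D / np.sqrt(n)) < 1e-6
```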

Appendix 5: Proof of Lemma 6

We consider the regularized empirical loss functions f_i defined in (11.30). For any two vectors \(u,w\in \mathbb {R}^d\) satisfying ∥u − w∥_2 ≤ ε, Assumption 2(d) implies

$$\displaystyle \begin{aligned} \|{f^{\prime\prime}_i(u) - f^{\prime\prime}_i(w)}\|{}_2 \leq M \varepsilon. \end{aligned} $$

Let B(0, r) be the ball in \(\mathbb {R}^d\) with radius r, centered at the origin. Let \(N_\varepsilon ^{\mathrm {cov}}(B(0,r))\) be the covering number of B(0, r) by balls of radius ε, i.e., the minimum number of balls of radius ε required to cover B(0, r). We also define \(N_\varepsilon ^{\mathrm {pac}}(B(0,r))\) as the packing number of B(0, r), i.e., the maximum number of disjoint balls of radius ε whose centers belong to B(0, r). It is easy to verify that

$$\displaystyle \begin{aligned} N_{\varepsilon}^{\mathrm{cov}}(B(0,r)) \leq N_{\varepsilon/2}^{\mathrm{pac}}(B(0,r)) \leq \left( 1 + {2r}/{\varepsilon}\right)^d. \end{aligned} $$

Therefore, there exists a set of points \(U\subseteq \mathbb {R}^d\) with cardinality at most (1 + 2r∕ε)^d, such that for any vector w ∈ B(0, r), we have

$$\displaystyle \begin{aligned} \min_{u\in U} \|{f^{\prime\prime}_i(w) - f^{\prime\prime}_i(u)}\|{}_2 \leq M\varepsilon. \end{aligned} $$
(11.61)

We consider an arbitrary point u ∈ U and the associated Hessian matrices for the functions f_i defined in (11.30). We have

$$\displaystyle \begin{aligned} f^{\prime\prime}_i(u)=\frac{1}{n} \sum_{j=1}^n \left(\phi^{\prime\prime}(u, z_{i,j}) + \lambda I\right), \qquad i=1,\ldots,m. \end{aligned}$$

The components of the above sum are i.i.d. matrices that are upper bounded by LI. By the matrix Hoeffding inequality [33, Corollary 4.2], we have

$$\displaystyle \begin{aligned} \mathbb{P}\left[\|{f^{\prime\prime}_i(u) - \mathbb{E}[f^{\prime\prime}_i(u)]}\|{}_2 > t\right] \leq d \cdot e^{- \frac{n t^2}{2L^2}} . \end{aligned} $$

Note that \(\mathbb {E}[f^{\prime \prime }_1(w)] = \mathbb {E}[f^{\prime \prime }(w)]\) for any w ∈ B(0, r). Using the triangle inequality and inequality (11.61), we obtain

$$\displaystyle \begin{aligned} \|{f^{\prime\prime}_1(w) - f^{\prime\prime}(w)}\|{}_2 &\leq \|{f^{\prime\prime}_1(w) - \mathbb{E}[f^{\prime\prime}_1(w)]}\|{}_2 + \|{f^{\prime\prime}(w) - \mathbb{E}[f^{\prime\prime}(w)]}\|{}_2 \\ {} & \leq 2\max_{i\in\{1,\ldots,m\}} \|{f^{\prime\prime}_i(w) - \mathbb{E}[f^{\prime\prime}_i(w)]}\|{}_2 \\ &\leq 2\max_{i\in\{1,\ldots,m\}} \Big(\max_{u\in U}\|{f^{\prime\prime}_i(u) - \mathbb{E}[f^{\prime\prime}_i(u)]}\|{}_2 + M \varepsilon \Big).{} \end{aligned} $$
(11.62)

Applying the union bound, we have with probability at least

$$\displaystyle \begin{aligned} 1 - m d ( 1 + {2r}/{\varepsilon})^d\cdot e^{- \frac{n t^2}{2L^2}}, \end{aligned}$$

the inequality \(\|{f^{\prime \prime }_i(u) - \mathbb {E}[f^{\prime \prime }_i(u)]}\|{ }_2 \leq t\) holds for every i ∈{1, …, m} and every u ∈ U. Combining this probability bound with inequality (11.62), we have

$$\displaystyle \begin{aligned} \mathbb{P}\Big[\sup_{w\in B(0,r)}\|{f^{\prime\prime}_1(w)-f^{\prime\prime}(w)}\|{}_2 > 2 t+2 M\varepsilon\Big] \leq m d \left( 1 + {2r}/{\varepsilon}\right)^d \cdot e^{- \frac{n t^2}{2L^2}}. \end{aligned} $$
(11.63)

As the final step, we choose \(\varepsilon = \frac {\sqrt {2}L}{\sqrt {n}M}\) and then choose t to make the right-hand side of inequality (11.63) equal to δ. This yields the desired result.
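
For a sense of the resulting deviation level, the short computation below (NumPy assumed; the values of m, n, d, L, M, r and δ are hypothetical) sets the right-hand side of (11.63) equal to δ, solves for t at the stated choice of ε, and prints the bound 2t + 2Mε that then holds with probability at least 1 − δ.

```python
# Solving the right-hand side of (11.63) for t at eps = sqrt(2)*L/(sqrt(n)*M)
# (NumPy assumed; all parameter values below are hypothetical).
import numpy as np

m, n, d = 16, 100000, 100                 # machines, local sample size, dimension
L, M, r, delta = 1.0, 1.0, 10.0, 1e-2

eps = np.sqrt(2) * L / (np.sqrt(n) * M)
# m * d * (1 + 2r/eps)^d * exp(-n t^2 / (2 L^2)) = delta  =>  solve for t:
t = L * np.sqrt(2.0 / n * (np.log(m * d / delta) + d * np.log1p(2 * r / eps)))

print("with probability at least 1 - delta,")
print("sup_w ||f_1''(w) - f''(w)||_2 <=", 2 * t + 2 * M * eps)
```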

Appendix 6: Proof of Theorem 5

Suppose Algorithm 3 terminates in K iterations. Let t_k be the number of conjugate gradient steps in each call of Algorithm 2, for k = 0, 1, …, K − 1. For any given μ > 0, we define T_μ as in (11.26). Let \(\mathcal {A}\) denote the event that t_k ≤ T_μ for all k ∈ {0, …, K − 1}. Let \({\mathcal {A}_{\mathrm {c}}}\) be the complement of \(\mathcal {A}\), i.e., the event that t_k > T_μ for some k ∈ {0, …, K − 1}. In addition, let the probabilities of the events \(\mathcal {A}\) and \({\mathcal {A}_{\mathrm {c}}}\) be 1 − δ and δ, respectively. By the law of total expectation, we have

$$\displaystyle \begin{aligned} \mathbb{E}[T] = \mathbb{E}[T|\mathcal{A}] \mathbb{P}(\mathcal{A}) + \mathbb{E}[T|{\mathcal{A}_{\mathrm{c}}}] \mathbb{P}({\mathcal{A}_{\mathrm{c}}}) = (1-\delta)\mathbb{E}[T|\mathcal{A}] + \delta\,\mathbb{E}[T|{\mathcal{A}_{\mathrm{c}}}] . \end{aligned} $$

When the event \(\mathcal {A}\) happens, we have T ≤ 1 + K(T_μ + 1), where T_μ is given in (11.26); otherwise we have T ≤ 1 + K(T_L + 1), where

$$\displaystyle \begin{aligned} T_L = \sqrt{2+\frac{2L}{\lambda}}\log\left(\frac{2L}{\beta\lambda}\right) \end{aligned} $$
(11.64)

bounds the number of PCG iterations in Algorithm 2 when the event \({\mathcal {A}_{\mathrm {c}}}\) happens. Since Algorithm 2 always ensures ∥f″(w_k)v_k − f′(w_k)∥_2 ≤ 𝜖_k, the outer iteration count K shares the same bound in (11.12), which depends on f(w_0) − f(w_⋆). Notice that f(w_0) − f(w_⋆) is a random variable depending on the random generation of the datasets. However, T_μ and T_L are deterministic constants. So we have

$$\displaystyle \begin{aligned} \mathbb{E}[T] &\leq 1 + (1-\delta) \mathbb{E}[K(T_{\mu}+1)|\mathcal{A}] + \delta\,\mathbb{E}[K(T_{L}+1)|{\mathcal{A}_{\mathrm{c}}}] \\ &= 1 + (1-\delta) (T_{\mu}+1)\mathbb{E}[K|\mathcal{A}] + \delta(T_{L}+1)\mathbb{E}[K|{\mathcal{A}_{\mathrm{c}}}] . {} \end{aligned} $$
(11.65)

Next we bound \(\mathbb {E}[K|\mathcal {A}]\) and \(\mathbb {E}[K|{\mathcal {A}_{\mathrm {c}}}]\) separately. To bound \(\mathbb {E}[K|\mathcal {A}]\), we use

$$\displaystyle \begin{aligned} \mathbb{E}[K] = (1-\delta) \mathbb{E}[K|\mathcal{A}] + \delta\,\mathbb{E}[K|{\mathcal{A}_{\mathrm{c}}}] \geq (1-\delta) \mathbb{E}[K|\mathcal{A}] \end{aligned}$$

to obtain

$$\displaystyle \begin{aligned} \mathbb{E}[K|\mathcal{A}]\leq\mathbb{E}[K]/(1-\delta). \end{aligned} $$
(11.66)

In order to bound \(\mathbb {E}[K|{\mathcal {A}_{\mathrm {c}}}]\), we derive a deterministic bound on f(w_0) − f(w_⋆). By Lemma 5, we have \(\|{w_0}\|{ }_2\leq \sqrt {2V_0/\lambda }\), which together with Assumption 2(b) yields

$$\displaystyle \begin{aligned} \|{f'(w)}\|{}_2 \leq G + \lambda\|w\|{}_2 \leq G + \sqrt{2\lambda V_0}. \end{aligned}$$

Combining with the strong convexity of f, we obtain

$$\displaystyle \begin{aligned} f(w_0)-f(w_\star) \leq \frac{1}{2\lambda}\|{f'(w_0)}\|{}_2^2 \leq\frac{1}{2\lambda}\left(G+\sqrt{2\lambda V_0}\right)^2 \leq 2V_0 + \frac{G^2}{\lambda}. \end{aligned} $$

Therefore by Corollary 1,

$$\displaystyle \begin{aligned} K \leq K_{\mathrm{max}} = 1 + \frac{4V_0+2G^2/\lambda}{\omega(1/6)} + \left\lceil \log_2\left(\frac{2\omega(1/6)}{\epsilon}\right)\right\rceil , \end{aligned} $$
(11.67)

where the additional 1 compensates for removing one ⌈⋅⌉ operator in (11.12).

Using inequality (11.65), the bound on \(\mathbb {E}[K|\mathcal {A}]\) in (11.66) and the bound on \(\mathbb {E}[K|{\mathcal {A}_{\mathrm {c}}}]\) in (11.67), we obtain

$$\displaystyle \begin{aligned} \mathbb{E}[T] \leq 1 + (T_{\mu}+1) \mathbb{E}[K] + \delta (T_{L}+1) K_{\mathrm{max}} . \end{aligned}$$

Now we can bound \(\mathbb {E}[K]\) by Corollary 1 and Lemma 5. More specifically,

$$\displaystyle \begin{aligned} \mathbb{E}[ K ] \leq \frac{\mathbb{E}[2(f(w_0) - f(w_\star))]}{\omega(1/6)} + \left\lceil \log_2 \Big( \frac{ 2 \omega(1/6)}{\epsilon}\Big) \right\rceil + 1 \leq C_0 + \frac{2\sqrt{6}}{\omega(1/6)} \cdot \frac{G D}{\sqrt{n}}, \end{aligned}$$

where \(C_0=1+\left \lceil \log _2(2\omega (1/6)/\epsilon ) \right \rceil \). With the choice of δ in (11.34) and the definition of T L in (11.64), we have

where \(C_2=\log (2L/(\beta \lambda ))\). Putting everything together, we have

$$\displaystyle \begin{aligned} \mathbb{E}[T] &\leq 1 + \left(C_0+\frac{C_0}{\sqrt{n}}\cdot\frac{GD}{4V_0+2G^2/\lambda}+\frac{2\sqrt{6}+1}{\omega(1/6)}\cdot \frac{GD}{\sqrt{n}} \right) (T_\mu+1)\\ &\leq 1 + \left(C_1 + \frac{6}{\omega(1/6)}\cdot \frac{GD}{\sqrt{n}} \right) (T_\mu+1) . \end{aligned} $$

Replacing T_μ by its expression in (11.26) and applying Corollary 4, we obtain the desired result.
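
For a feel of the magnitudes entering this bound, the snippet below (NumPy assumed; the parameter values are hypothetical) simply evaluates T_L from (11.64) and the worst-case outer-iteration bound K_max from (11.67).

```python
# Evaluating T_L from (11.64) and K_max from (11.67) at hypothetical parameters
# (NumPy assumed; this only plugs numbers into the two displayed formulas).
import numpy as np

L, lam, beta = 1.0, 1e-4, 1/20
V0, G, eps = 1.0, 1.0, 1e-6

omega = lambda t: t - np.log1p(t)

T_L = np.sqrt(2 + 2 * L / lam) * np.log(2 * L / (beta * lam))
K_max = 1 + (4 * V0 + 2 * G**2 / lam) / omega(1/6) + np.ceil(np.log2(2 * omega(1/6) / eps))

print(f"T_L ~ {T_L:.0f} PCG iterations per call, K_max ~ {K_max:.0f} outer iterations")
```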

Copyright information

© 2018 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Zhang, Y., Xiao, L. (2018). Communication-Efficient Distributed Optimization of Self-concordant Empirical Loss. In: Giselsson, P., Rantzer, A. (eds) Large-Scale and Distributed Optimization. Lecture Notes in Mathematics, vol 2227. Springer, Cham. https://doi.org/10.1007/978-3-319-97478-1_11
