A framework for parallel second order incremental optimization algorithms for solving partially separable problems


Abstract

We propose Hessian Approximated Multiple Subsets Iteration (HAMSI), a generic second-order incremental algorithm for solving large-scale partially separable convex and nonconvex optimization problems. The algorithm is based on a local quadratic approximation and hence allows incorporating curvature information to speed up convergence. HAMSI is inherently parallel and scales well with the number of processors. We prove the convergence properties of our algorithm when the subset selection step is deterministic. Combined with techniques for effectively utilizing modern parallel computer architectures, we illustrate that a particular implementation of the proposed method based on L-BFGS updates converges more rapidly than a parallel gradient descent when both methods are used to solve large-scale matrix factorization problems. This performance gain comes only at the expense of using memory that scales linearly with the total size of the optimization variables. We conclude that HAMSI may be considered a viable alternative in many large-scale problems where first order methods based on variants of gradient descent are applicable.
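For concreteness, the following is a minimal Python sketch of one outer iteration of an incremental second-order sweep in the spirit of HAMSI. The function and argument names are illustrative placeholders, and the damping sequence and curvature model are supplied by the caller; the actual implementation (see the repository linked in Note 1) relies on L-BFGS updates and shared-memory parallelism rather than the explicit dense solve used here.

```python
import numpy as np

def incremental_quadratic_sweep(x, subsets, grad_f_subset, H, beta_t):
    """One outer iteration: visit each subset and take a damped Newton-like step.

    x             : current iterate x^{(t)} (1-D numpy array)
    subsets       : the subsets S_[1], ..., S_[K+1] visited in this sweep
    grad_f_subset : callable (S, x) -> gradient of the partial sum f_S at x
    H             : symmetric curvature approximation H^{(t)} (n x n array)
    beta_t        : damping parameter added to H at this outer iteration
    """
    A = H + beta_t * np.eye(x.size)      # H^{(t)} + beta_t * I
    for S in subsets:                    # inner iterations l = 1, ..., K+1
        g = grad_f_subset(S, x)          # gradient of the subset objective
        x = x - np.linalg.solve(A, g)    # x^{(t,l)} = x^{(t,l-1)} - (H^{(t)} + beta_t I)^{-1} g
    return x                             # becomes x^{(t+1)}
```

In the parallel setting, subsets that share no variables (as determined by the coloring discussed in the paper, cf. Note 2) can be processed concurrently within one inner step.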

Notes

  1. https://github.com/spartensor/hamsi-mf.

  2. A color that is already in use is chosen at random if one is admissible; otherwise, a new color is introduced (a minimal sketch of this rule follows these notes).

  3. cf. http://www.7-cpu.com/.
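Note 2 above refers to the color selection step of the greedy coloring heuristic. The following is a minimal Python sketch of that selection rule only, under assumed data structures (a set of forbidden colors and a list of colors used so far); it makes no claim to reproduce the paper's distance-2 coloring code.

```python
import random

def pick_color(forbidden, used_colors):
    """Pick a color for the current vertex following the rule in Note 2:
    reuse a randomly chosen, already used color when one is admissible,
    otherwise introduce a new color.

    forbidden   : set of colors taken by conflicting (neighboring) vertices
    used_colors : list of all colors introduced so far (mutated in place)
    """
    admissible = [c for c in used_colors if c not in forbidden]
    if admissible:
        return random.choice(admissible)   # reuse an existing color
    new_color = len(used_colors)           # open a fresh color
    used_colors.append(new_color)
    return new_color
```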

References

  1. Berahas, A.S., Nocedal, J., Takáč, M.: A multi-batch L-BFGS method for machine learning. In: Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5–10, 2016, Barcelona, Spain, pp. 1055–1063 (2016)

  2. Bertsekas, D.P.: Incremental least squares methods and the extended Kalman filter. SIAM J. Optim. 6(3), 807–822 (1996)

  3. Bertsekas, D.P.: Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. Optim. Mach. Learn. 1–38, 2011 (2010)

  4. Blatt, D., Hero, A.O., Gauchman, H.: A convergent incremental gradient method with a constant step size. SIAM J. Optim. 18(1), 29–51 (2007)

  5. Bozdağ, D., Çatalyürek, Ü.V., Gebremedhin, A.H., Manne, F., Boman, E.G., Özgüner, F.: Distributed-memory parallel algorithms for distance-2 coloring and related problems in derivative computation. SIAM J. Sci. Comput. 32(4), 2418–2446 (2010)

  6. Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26(2), 1008–1031 (2016)

  7. Byrd, R.H., Nocedal, J., Schnabel, R.B.: Representations of quasi-Newton matrices and their use in limited memory methods. Math. Program. 63(1–3), 129–156 (1994)

  8. Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.: Nonnegative Matrix and Tensor Factorization. Wiley, New York (2009)

  9. Daneshmand, A., Facchinei, F., Kungurtsev, V., Scutari, G.: Hybrid random/deterministic parallel algorithms for convex and nonconvex big data optimization. IEEE Trans. Signal Process. 63(15), 3914–3929 (2015)

  10. Facchinei, F., Scutari, G., Sagratella, S.: Parallel selective algorithms for nonconvex big data optimization. IEEE Trans. Signal Process. 63(7), 1874–1889 (2015)

  11. Gebremedhin, A.H., Manne, F., Pothen, A.: Parallel distance-k coloring algorithms for numerical optimization. In: Euro-Par 2002 Parallel Processing—8th International Conference, pp. 912–921 (2002)

  12. Gebremedhin, A.H., Manne, F., Pothen, A.: What color is your Jacobian? Graph coloring for computing derivatives. SIAM Rev. 47(4), 629–705 (2005)

  13. Gebremedhin, A.H., Nguyen, D., Patwary, Md.M.A., Pothen, A.: ColPack: software for graph coloring and related problems in scientific computing. ACM Trans. Math. Softw. 40(1), 1:1–1:31 (2013)

  14. Gemulla, R., Nijkamp, E., Haas, P.J., Sismanis, Y.: Large-scale matrix factorization with distributed stochastic gradient descent. In: ACM SIGKDD (2011)

  15. Gower, R.M., Goldfarb, D., Richtárik, P.: Stochastic block BFGS: squeezing more curvature out of data. In: Balcan, M.F., Weinberger, K.Q. (eds.) Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, New York, USA, 20–22 June 2016. PMLR, pp. 1869–1878

  16. Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.: A globally convergent incremental Newton method. Math. Program. 151(1), 283–313 (2015)

  17. Harper, F.M., Konstan, J.A.: The MovieLens datasets: history and context. ACM Trans. Interact. Intell. Syst. 5(4), 19:1–19:19 (2015)

  18. Kschischang, F.R., Frey, B.J., Loeliger, H.-A.: Factor graphs and the sum–product algorithm. IEEE Trans. Inf. Theory 47(2), 498–519 (2001)

  19. Lian, X., Huang, Y., Li, Y., Liu, J.: Asynchronous parallel stochastic gradient for nonconvex optimization. In: Advances in Neural Information Processing Systems, pp. 2737–2745 (2015)

  20. Liu, J., Wright, S.J., Ré, C., Bittorf, V., Sridhar, S.: An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn. Res. 16(1), 285–322 (2015)

  21. Mangasarian, O.L., Solodov, M.V.: Serial and parallel backpropagation convergence via nonmonotone perturbed minimization. Optim. Methods Softw. 4, 103–116 (1994)

  22. Mareček, J., Richtárik, P., Takáč, M.: Distributed block coordinate descent for minimizing partially separable functions. In: Al-Baali, M., Grandinetti, L., Purnama, A. (eds.) Numerical Analysis and Optimization, pp. 261–288. Springer, Berlin (2015)

  23. Matula, D.W.: A min-max theorem for graphs with application to graph coloring. SIAM Rev. 10, 481–482 (1968)

  24. Mokhtari, A., Eisen, M., Ribeiro, A.: IQN: an incremental quasi-Newton method with local superlinear convergence rate. arXiv preprint arXiv:1702.00709 (2017)

  25. Moritz, P., Nishihara, R., Jordan, M.I.: A linearly-convergent stochastic L-BFGS algorithm. In: Artificial Intelligence and Statistics, pp. 249–258 (2016)

  26. Mota, J.F.C., Xavier, J.M.F., Aguiar, P.M.Q., Püschel, M.: D-ADMM: a communication-efficient distributed algorithm for separable optimization. IEEE Trans. Signal Process. 61(10), 2718–2723 (2013)

  27. Pan, X., Lam, M., Tu, S., Papailiopoulos, D., Zhang, S., Jordan, M.I., Ramchandran, K., Ré, C.: Cyclades: conflict-free asynchronous machine learning. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29, pp. 2568–2576. Curran Associates Inc, Red Hook (2016)

  28. Recht, B., Ré, C., Wright, S., Niu, F.: HOGWILD!: a lock-free approach to parallelizing stochastic gradient descent. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 24, pp. 693–701. Curran Associates Inc., Red Hook (2011)

  29. Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math. Program. 156(1–2), 433–484 (2016)

  30. Roux, N.L., Schmidt, M., Bach, F.R.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, pp. 2663–2671. Curran Associates Inc., Red Hook (2012)

  31. Scherrer, C., Halappanavar, M., Tewari, A., Haglin, D.: Scaling up coordinate descent algorithms for large \(\ell _1\) regularization problems. In: Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26–July 1 (2012)

  32. Schraudolph, N.N., Yu, J., Gunter, S.: A stochastic quasi-Newton method for online convex optimization. In: Proceedings of the 11th International Conference Artificial Intelligence and Statistics (AISTATS), pp. 433–440 (2007)

  33. Shamir, O., Srebro, N., Zhang, T.: Communication efficient distributed optimization using an approximate Newton-type method. In: International Conference on Machine Learning (ICML) (2014)

  34. Singh, A.P., Gordon, G.J.: A unified view of matrix factorization models. In: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Part II, number 5212, pp. 358–373. Springer, Berlin (2008)

  35. Sohl-Dickstein, J., Poole, B., Ganguli, S.: Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. In: Proceedings of the 31st International Conference on Machine Learning (ICML), pp. 604–612 (2014)

  36. Solodov, M.V.: Incremental gradient algorithms with stepsizes bounded away from zero. Comput. Optim. Appl. 11(1), 23–35 (1998)

  37. Tseng, P.: An incremental gradient (-projection) method with momentum term and adaptive stepsize rule. SIAM J. Optim. 8(2), 506–531 (1998)

  38. Wang, X., Ma, S., Goldfarb, D., Liu, W.: Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM J. Optim. 27(2), 927–956 (2017)

  39. Yousefian, F., Nedić, A., Shanbhag, U.V.: Stochastic quasi-Newton methods for non-strongly convex problems: convergence and rate analysis. In: 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 4496–4503. IEEE (2016)

  40. Zuckerman, D.: Linear degree extractors and the inapproximability of max clique and chromatic number. Theory Comput. 3, 103–128 (2007)

Acknowledgements

This work is supported by the Scientific and Technological Research Council of Turkey (TUBITAK) Grant No. 113M492.

Author information

Corresponding author

Correspondence to Ş. İlker Birbil.

Appendix A: Omitted proofs

1.1 Proof of Lemma 1

By using Assumption A.2, we have

$$\begin{aligned} \left\| \nabla f_{S_{[\ell]}}(x^{(t,\ell-1)}) - \nabla f_{S_{[\ell]}}(x^{(t)}) \right\| &\le L\left\| x^{(t,\ell-1)} - x^{(t)} \right\| \\ &= L\left\| x^{(t,\ell-1)} - x^{(t,\ell-2)} + x^{(t,\ell-2)} - x^{(t,\ell-3)} + \dots + x^{(t,1)} - x^{(t)}\right\| \\ &\le L \sum_{j=1}^{\ell-1}\left\| x^{(t,j)} - x^{(t,j-1)}\right\| . \end{aligned}$$

Note for \(j=1, \dots , \ell -1\) that

$$\begin{aligned} \left\| x^{(t,j)} - x^{(t, j-1)}\right\| = \left\| x^{(t,j-1)} - (H^{(t)} + \beta _tI)^{-1}\nabla f_{S_{[j]}}(x^{(t,j-1)}) - x^{(t, j-1)}\right\| \le M_t C, \end{aligned}$$

where the last inequality holds by Assumption A.4. Therefore, we have

$$\begin{aligned} \left\| \nabla f_{S_{[\ell]}}(x^{(t,\ell-1)}) - \nabla f_{S_{[\ell]}}(x^{(t)}) \right\| \le L M_t C(\ell - 1). \end{aligned}$$
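The chain of inequalities above is just the Lipschitz property of the subset gradients combined with a telescoping sum. As a quick numerical sanity check (not part of the paper), the bound can be verified on a toy model of separable quadratics, where \(L\), \(M_t\), and \(C\) are computed exactly as they are used in the proof; all names and the toy subset objectives below are illustrative assumptions.

```python
import numpy as np

# Toy model: f_{S_[l]}(x) = 0.5 * x^T A_l x, so each subset gradient is A_l x.
rng = np.random.default_rng(0)
n, K = 5, 4
A_S = [np.diag(rng.uniform(0.5, 2.0, n)) for _ in range(K + 1)]   # toy subset Hessians
H, beta_t = np.eye(n), 1.0
Ainv = np.linalg.inv(H + beta_t * np.eye(n))                      # (H^{(t)} + beta_t I)^{-1}

x0 = rng.standard_normal(n)                                       # x^{(t)}
L = max(np.linalg.norm(A, 2) for A in A_S)                        # Lipschitz constant of subset gradients
M_t = np.linalg.norm(Ainv, 2)                                     # bound on ||(H^{(t)} + beta_t I)^{-1}||

x, iterates, grad_norms = x0.copy(), [x0.copy()], []
for A in A_S:                                                     # inner sweep x^{(t,j)}, j = 1, ..., K+1
    g = A @ x
    grad_norms.append(np.linalg.norm(g))
    x = x - Ainv @ g
    iterates.append(x.copy())
C = max(grad_norms)                                               # bound on the subset gradient norms

for l in range(1, K + 2):                                         # check the Lemma 1 inequality
    lhs = np.linalg.norm(A_S[l - 1] @ (iterates[l - 1] - x0))
    assert lhs <= L * M_t * C * (l - 1) + 1e-12
print("Lemma 1 bound holds on this toy instance.")
```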

1.2 Proof of Lemma 2

At iteration \(t + 1\), we have

$$\begin{aligned} x^{(t+1)} &= x^{(t)} - \sum_{\ell=1}^{K+1} \left( H^{(t)} + \beta_t I\right)^{-1}\nabla f_{S_{[\ell]}}(x^{(t,\ell-1)}) \\ &= x^{(t)} - (H^{(t)} + \beta_t I)^{-1}\nabla f(x^{(t)}) + (H^{(t)} + \beta_t I)^{-1}\sum_{\ell=1}^{K+1} \left( \nabla f_{S_{[\ell]}}(x^{(t)}) - \nabla f_{S_{[\ell]}}(x^{(t,\ell-1)})\right) . \end{aligned}$$

This shows that

$$\begin{aligned} x^{(t+1)} - x^{(t)} = \Delta _t - (H^{(t)} + \beta _tI)^{-1}\nabla f(x^{(t)}), \end{aligned}$$
(11)

where

$$\begin{aligned} \Delta _t\equiv (H^{(t)} + \beta _tI)^{-1}\sum _{\ell =1}^{K+1} \left( \nabla f_{S_{[\ell ]}}(x^{(t)}) - \nabla f_{S_{[\ell ]}}(x^{(t,\ell -1)})\right) . \end{aligned}$$

Using now (9) together with Lemma 1 implies

$$\begin{aligned} \Vert \Delta_t\Vert &\le M_t\sum_{\ell=1}^{K+1} \left\| \nabla f_{S_{[\ell]}}(x^{(t)}) - \nabla f_{S_{[\ell]}}(x^{(t,\ell-1)})\right\| \\ &\le M_t\sum_{\ell=1}^{K+1} L M_t C(\ell-1) = \frac{1}{2}L M_t^2 C K(K+1) = B M_t^2. \end{aligned}$$
(12)

Then, we obtain

$$\begin{aligned} \Vert x^{(t+1)} - x^{(t)}\Vert &= \Vert \Delta_t - (H^{(t)} + \beta_t I)^{-1}\nabla f(x^{(t)})\Vert \\ &\le \Vert \Delta_t\Vert + \Vert (H^{(t)} + \beta_t I)^{-1}\Vert \, \Vert \nabla f(x^{(t)})\Vert \\ &\le B M_t^2 + C M_t \le \frac{B + C(M+1)}{M+1}M_t. \end{aligned}$$

1.3 Proof of Theorem 3

As f is a twice differentiable function, we have

$$\begin{aligned} f(x^{(t+1)}) - f(x^{(t)}) \le {\nabla f(x^{(t)})}^{T} (x^{(t+1)} - x^{(t)}) + \frac{LK}{2} \Vert x^{(t+1)} - x^{(t)}\Vert ^2. \end{aligned}$$

Using now Lemma 2 along with (11) and (12), we obtain

$$\begin{aligned} f(x^{(t+1)}) - f(x^{(t)}) &\le \nabla f(x^{(t)})^T \Delta_t - \nabla f(x^{(t)})^T (H^{(t)} + \beta_t I)^{-1}\nabla f(x^{(t)}) + \frac{LK}{2}\Vert x^{(t+1)} - x^{(t)}\Vert^2 \\ &\le \Vert \nabla f(x^{(t)})\Vert \, \Vert \Delta_t\Vert - U_t\Vert \nabla f(x^{(t)}) \Vert^2 + \frac{LK}{2}\left( \frac{B + C(M+1)}{M+1}\right)^2 M_t^2 \\ &\le -\, U_t\Vert \nabla f(x^{(t)}) \Vert^2 + \bar{B} M_t^2, \end{aligned}$$
(13)

where \(\bar{B} \equiv CB + \frac{LK}{2}\left( \frac{B + C(M+1)}{M+1}\right) ^2\). Due to Assumption A.1, we can write \(\inf _{x \in \mathbb {R}^n} f(x) = f^* > -\infty \). Thus, we obtain

$$\begin{aligned} 0 \le f(x^{(t+1)}) - f^* \le f(x^{(t)}) - f^* + \bar{B}M^2_t. \end{aligned}$$

Relation (10) and Lemma 2.2 in Mangasarian and Solodov [21] together show that the sequence \(\{f(x^{(t)})\}\) converges. By using (13), we further have

$$\begin{aligned} f(x^{(1)}) - f^* &\ge f(x^{(1)}) - f(x^{(t)}) = \sum_{j=1}^{t-1}\left( f(x^{(j)}) - f(x^{(j+1)})\right) \\ &\ge \sum_{j=1}^{t-1}U_j \Vert \nabla f(x^{(j)}) \Vert^2 - \bar{B}\sum_{j=1}^{t-1}M_j^2 \\ &\ge \inf_{1 \le j \le t-1} \Vert \nabla f(x^{(j)}) \Vert^2 \sum_{j=1}^{t-1}U_j - \bar{B}\sum_{j=1}^{t-1}M_j^2. \end{aligned}$$

Letting now \(t \rightarrow \infty \), we obtain

$$\begin{aligned} f(x^{(1)}) - f^* \ge \inf _{j \ge 1}\Vert \nabla f(x^{(j)}) \Vert ^2 \sum _{j=1}^{\infty }U_j - \bar{B}\sum _{j=1}^{\infty }M^2_j. \end{aligned}$$
(14)

Using again Assumption A.1 and condition (10), we obtain

$$\begin{aligned} \inf _{t \ge 1}\Vert \nabla f(x^{(t)}) \Vert = 0. \end{aligned}$$
(15)

Now, suppose for contradiction that the sequence \(\{\nabla f(x^{(t)})\}\) does not converge to zero. Then, there exist an \(\varepsilon > 0\) and an increasing sequence of integers \(\{t_\tau \}\) such that \(\Vert \nabla f(x^{(t_\tau )})\Vert \ge \varepsilon \) for all \(\tau \). On the other hand, relation (15) implies that there exists some \(j > t_\tau \) such that \(\Vert \nabla f(x^{(j)})\Vert \le \frac{\varepsilon }{2}\). For each \(\tau \), let \(j_\tau \) be the least such integer. Then, we have

$$\begin{aligned} \frac{\varepsilon}{2} &\le \Vert \nabla f(x^{(t_\tau)})\Vert - \Vert \nabla f(x^{(j_\tau)})\Vert \\ &\le \Vert \nabla f(x^{(t_\tau)}) - \nabla f(x^{(j_\tau)})\Vert \\ &\le LK\Vert x^{(t_\tau)} - x^{(j_\tau)}\Vert \le LK\,\frac{B+C(M+1)}{M+1}\sum_{k=t_\tau}^{j_\tau-1} M_k, \end{aligned}$$

where the last inequality follows from Lemma 2. Since \(0 < M \le U\), there exists \(0 < \zeta \le \frac{M}{U} \le 1\) such that

$$\begin{aligned} M + \beta _k \ge \zeta U + \beta _k \ge \zeta (U + \beta _k) \implies M_k \le \frac{1}{\zeta } U_k. \end{aligned}$$

Thus, we obtain

$$\begin{aligned} 0 < \hat{B} \equiv \frac{\varepsilon (M+1)\zeta }{2LK(B+C(M+1))} \le \sum _{k=t_\tau }^{j_\tau - 1} U_k. \end{aligned}$$

Then, using this together with inequality (13), we obtain

$$\begin{aligned} f(x^{(t_\tau)}) - f(x^{(j_\tau)}) &\ge \sum_{k=t_\tau}^{j_\tau-1} U_k\Vert \nabla f(x^{(k)})\Vert^2 - \bar{B}\sum_{k=t_\tau}^{j_\tau-1}M_k^2 \\ &\ge \hat{B} \inf_{t_\tau \le k \le j_\tau-1} \Vert \nabla f(x^{(k)})\Vert^2 - \bar{B}\sum_{k=t_\tau}^{\infty}M_k^2. \end{aligned}$$

Since the sequence \(\{f(x^{(t)})\}\) converges, the left-hand side of this inequality tends to zero; together with condition (10), which makes the tail sum \(\sum_{k=t_\tau}^{\infty}M_k^2\) vanish as \(\tau \rightarrow \infty \), this yields

$$\begin{aligned} \lim _{\tau \uparrow \infty }~\underset{t_\tau \le k \le j_\tau -1}{\inf } \Vert \nabla f(x^{(k)})\Vert ^2 = 0. \end{aligned}$$
(16)

But our choice of \(t_\tau \) and \(j_\tau \) guarantees \(\Vert \nabla f(x^{(k)})\Vert > \frac{\varepsilon }{2}\) for all \(t_\tau \le k \le j_\tau - 1\), and hence, we arrive at a contradiction with (16). Therefore, \(\{\nabla f(x^{(t)})\}\) converges to zero, and by the continuity of the gradient, we conclude that \(\nabla f(x^*) = 0\) holds for each accumulation point \(x^*\) of the sequence \(\{x^{(t)}\}\).
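For intuition only, the following toy run illustrates the conclusion \(\inf_{t}\Vert \nabla f(x^{(t)})\Vert = 0\); it is an assumption-laden sketch rather than the paper's experiment. The toy subset objectives and all names are made up, and \(\beta_t = t\) with \(H^{(t)} = I\) is chosen so that the induced step scales \(1/(1+t)\) are square-summable while their sum diverges, consistent with how condition (10) is used in (14).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
A_S = [np.diag(rng.uniform(0.5, 2.0, n)) for _ in range(4)]   # toy subset Hessians
x = rng.standard_normal(n)                                    # x^{(1)}
H = np.eye(n)                                                 # fixed curvature model

min_grad_norm = np.inf
for t in range(1, 2001):
    beta_t = float(t)                                         # growing damping
    Ainv = np.linalg.inv(H + beta_t * np.eye(n))
    full_grad = sum(A @ x for A in A_S)                       # gradient of f = sum of subset objectives
    min_grad_norm = min(min_grad_norm, np.linalg.norm(full_grad))
    for A in A_S:                                             # one incremental sweep over the subsets
        x = x - Ainv @ (A @ x)
print(f"min over t of ||grad f(x^(t))||: {min_grad_norm:.3e}")
```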

Cite this article

Kaya, K., Öztoprak, F., Birbil, Ş.İ. et al. A framework for parallel second order incremental optimization algorithms for solving partially separable problems. Comput Optim Appl 72, 675–705 (2019). https://doi.org/10.1007/s10589-018-00057-7
