Abstract
In this paper we study nonconvex and nonsmooth multi-block optimization over Euclidean embedded (smooth) Riemannian submanifolds with coupled linear constraints. Such optimization problems arise naturally in machine learning, statistical learning, compressive sensing, image processing, and tensor PCA, among others. By utilizing the embedding structure, we develop an ADMM-like primal-dual approach based on decoupled solvable subroutines such as linearized proximal mappings, where the duality is with respect to the embedded Euclidean spaces. First, we introduce the optimality conditions for the aforementioned optimization models. Then, the notion of \(\epsilon \)-stationary solutions is introduced as a result. The main part of the paper shows that the proposed algorithms possess an iteration complexity of \(O(1/\epsilon ^2)\) to reach an \(\epsilon \)-stationary solution. For prohibitively large tensor or machine learning models, we present a sampling-based stochastic algorithm with the same iteration complexity bound in expectation. When the subproblems are not analytically solvable, a feasible curvilinear line-search variant of the algorithm based on retraction operators is proposed. Finally, we show specifically how the algorithms can be implemented to solve a variety of practical problems such as the NP-hard maximum bisection problem, the \(\ell _q\) regularized sparse tensor principal component analysis and the community detection problem. Our preliminary numerical results show the great potential of the proposed methods.
Acknowledgements
The authors would like to thank the associate editor and two anonymous reviewers for insightful and constructive comments that helped improve the presentation of this paper. The work of S. Ma was supported in part by a startup package in the Department of Mathematics at University of California, Davis. The work of S. Zhang was supported in part by the National Science Foundation under Grant CMMI-1462408 and in part by the Shenzhen Fundamental Research Fund under Grant KQTD2015033114415450.
Proofs of the technical lemmas
1.1 Proof of Lemma 3.5
Proof
By the global optimality for the subproblems in Step 1 of Algorithm 1, we have
By Step 2 of Algorithm 1 we have
By Step 3, directly substituting \(\lambda ^{k+1}\) into the augmented Lagrangian gives
Summing (70), (71) and (72) and applying Lemma 3.4, we obtain the following inequality,
which further indicates
To ensure that the right hand side of (21) is negative, we need to choose \(H_i\succ \frac{6L^2}{\beta }I\), and ensure that
This can be proved by first viewing it as a quadratic function of \(z = \frac{1}{\gamma }\). To find some \(z>0\) such that
we need the discriminant to be positive, i.e.
It is easy to verify that (19) suffices to guarantee (76). Solving \(p(z) = 0\), we find two positive roots
Note that \(\gamma \) defined in (20) satisfies \(\frac{1}{z_2}<\gamma <\frac{1}{z_1}\) and thus guarantees (75). This completes the proof. \(\square \)
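As a minimal numerical sketch of the root argument above, assume the quadratic has the generic form \(p(z) = az^2+bz+c\) with \(a>0\) and that the required inequality is \(p(z)<0\). The actual coefficients are determined by (75) and are not reproduced here, so `a`, `b`, `c` and the function name `gamma_interval` are hypothetical. If the discriminant is positive and both roots \(z_1<z_2\) are positive, then \(p(z)<0\) exactly for \(z\in (z_1,z_2)\), i.e. for \(\gamma \in (1/z_2, 1/z_1)\).

```python
import math

def gamma_interval(a, b, c):
    """Return (1/z2, 1/z1) for p(z) = a z^2 + b z + c, or None if no interval exists.

    Assumes a > 0, so that p(z) < 0 strictly between the two real roots.
    """
    if a <= 0:
        raise ValueError("this sketch assumes a > 0")
    disc = b * b - 4.0 * a * c
    if disc <= 0:
        return None  # no two distinct real roots
    sqrt_disc = math.sqrt(disc)
    z1 = (-b - sqrt_disc) / (2.0 * a)  # smaller root
    z2 = (-b + sqrt_disc) / (2.0 * a)  # larger root
    if z1 <= 0:
        return None  # the argument needs two positive roots
    return (1.0 / z2, 1.0 / z1)

# Hypothetical coefficients: p(z) = z^2 - 3z + 2 = (z - 1)(z - 2).
lo, hi = gamma_interval(1.0, -3.0, 2.0)
gamma = 0.5 * (lo + hi)   # any gamma strictly inside the interval
z = 1.0 / gamma
p_value = z * z - 3.0 * z + 2.0
```

Any \(\gamma \) strictly inside the returned interval makes `p_value` negative, which mirrors the role of (20) in the proof.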
1.2 Proof of Lemma 3.8
Proof
For the subproblem in Step 1 of Algorithm 2, since \(x_i^{k+1}\) is the global minimizer, we have
By the L-Lipschitz continuity of \(\nabla _i f\), we have
Combining the above two inequalities and using the definition of \(\mathcal {L}_{\beta }\) in (16), we have
Summing (77) over \(i=1,\ldots ,N-1\), we have the following inequality, which is the counterpart of (70):
Besides, since (71) and (72) still hold, by combining (78), (71) and (72) and applying Lemma 3.4, we establish the following two inequalities, which are respectively the counterparts of (73) and (21):
and
From the proof of Lemma 3.5, it is easy to see that the right hand side of the above inequality is negative, if \(H_i\succ \left( \frac{6L^2}{\beta }+L\right) I\) and \(\beta \) and \(\gamma \) are chosen according to (19) and (20). \(\square \)
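The matrix condition \(H_i\succ \left( \frac{6L^2}{\beta }+L\right) I\) can be checked numerically through the smallest eigenvalue of \(H_i\). The following is an illustrative sketch, assuming \(H_i\) is stored as a symmetric NumPy array; the helper name `satisfies_prox_condition` and the sample values are hypothetical.

```python
import numpy as np

def satisfies_prox_condition(H, L, beta):
    """Check H > (6 L^2 / beta + L) I in the positive definite order.

    For symmetric H this is equivalent to lambda_min(H) > 6 L^2 / beta + L.
    """
    threshold = 6.0 * L**2 / beta + L
    min_eig = np.linalg.eigvalsh(H).min()  # eigvalsh assumes a symmetric matrix
    return bool(min_eig > threshold)

# Hypothetical values: L = 1, beta = 6 gives the threshold 6/6 + 1 = 2.
ok = satisfies_prox_condition(3.0 * np.eye(2), L=1.0, beta=6.0)
not_ok = satisfies_prox_condition(np.diag([3.0, 1.5]), L=1.0, beta=6.0)
```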
1.3 Proof of Lemma 3.11
Proof
For ease of notation, we denote
Note that \(\delta _i^k\) is a zero-mean random variable. By Steps 2 and 3 of Algorithm 3 we obtain
Applying (81) for k and \(k+1\), and using (81), we get
Taking expectation with respect to all random variables on both sides and using \(\textsf {E} [\langle \delta _N^k,\delta _N^{k-1}\rangle ] = 0\) completes the proof. \(\square \)
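The vanishing cross term can be illustrated by simulation. The sketch below assumes, purely for illustration, that the two noise vectors are independent standard Gaussian samples; the actual \(\delta _N^k\) in Algorithm 3 is a stochastic-gradient error, of which only the zero-mean property is used. The function name `mean_cross_term` is hypothetical.

```python
import numpy as np

def mean_cross_term(num_samples, dim, seed=0):
    """Monte Carlo estimate of E[<delta_prev, delta_curr>] for independent zero-mean noise."""
    rng = np.random.default_rng(seed)
    delta_prev = rng.standard_normal((num_samples, dim))
    delta_curr = rng.standard_normal((num_samples, dim))
    inner = np.sum(delta_prev * delta_curr, axis=1)  # one inner product per sample
    return float(inner.mean())

estimate = mean_cross_term(num_samples=200_000, dim=5)
```

The estimate fluctuates around zero with a standard error of order \(\sqrt{\text{dim}/\text{num\_samples}}\), consistent with the cross term dropping out in expectation.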
1.4 Proof of Lemma 3.12
Proof
Similarly to (77), by further incorporating (80), we have
Taking expectation with respect to all random variables on both sides and summing over \(i=1,\ldots ,N-1\), and using (36), we obtain
Note that by Step 2 of Algorithm 3 and the descent lemma we have
Taking the expectation with respect to all random variables yields
The following equality holds trivially from Step 3 of Algorithm 3:
Combining (82), (83), (84) and (38), we obtain
Choosing \(\beta \) and \(\gamma \) according to (40) and (41), and using arguments similar to those in the proof of Lemma 3.5, it is easy to verify that
By further choosing \(H_i\succ \left( \frac{8L^2}{\beta }+L+1\right) I\), we know that the right hand side of (85) is negative, and this completes the proof. \(\square \)
1.5 Proof of Lemma 3.13
Proof
From (81) and (15), we have that
where the first inequality is obtained by applying \(\langle a, b\rangle \le \frac{1}{2}(\frac{1}{\eta }\Vert a\Vert ^2+\eta \Vert b\Vert ^2)\) to the terms \(\langle \sum _{i = 1}^NA_ix^{k+1}_i-b, \left( \beta -\frac{1}{\gamma }\right) (x^k_N-x^{k+1}_N)\rangle \), \(\langle \sum _{i = 1}^NA_ix^{k+1}_i-b, \nabla _Nf(x^{k+1}_1,\ldots ,x^{k+1}_{N-1},x^k_N) - \nabla _Nf(x^{k+1})\rangle \) and \(\langle \sum _{i = 1}^NA_ix^{k+1}_i-b,\delta _N^k\rangle \), respectively, with \(\eta = \frac{8}{\beta }, \frac{8}{\beta }\) and \(\frac{4}{\beta }\). Note that \(\beta >2L\) according to (40), so \(\frac{\beta }{2}-\frac{\beta }{8}-\frac{\beta }{8}-\frac{L}{2} = \frac{\beta }{4}-\frac{L}{2}>0\) and the last inequality holds. Rearranging the terms and taking expectation with respect to all random variables completes the proof. \(\square \)
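The two elementary facts used in this proof can be checked numerically. The sketch below verifies the inequality \(\langle a, b\rangle \le \frac{1}{2}(\frac{1}{\eta }\Vert a\Vert ^2+\eta \Vert b\Vert ^2)\) on random vectors for \(\eta = \frac{8}{\beta }\) and \(\eta = \frac{4}{\beta }\), and evaluates the coefficient \(\frac{\beta }{2}-\frac{\beta }{8}-\frac{\beta }{8}-\frac{L}{2}\). The values of `beta` and `L` are hypothetical, chosen only so that \(\beta >2L\), and the helper name `young_upper_bound` is an assumption of this sketch.

```python
import numpy as np

def young_upper_bound(a, b, eta):
    """Right-hand side (1/2) * (||a||^2 / eta + eta * ||b||^2)."""
    return 0.5 * (np.dot(a, a) / eta + eta * np.dot(b, b))

beta, L = 5.0, 2.0  # hypothetical values with beta > 2L
rng = np.random.default_rng(0)

all_hold = True
for eta in (8.0 / beta, 4.0 / beta):
    for _ in range(1000):
        a = rng.standard_normal(4)
        b = rng.standard_normal(4)
        # The bound follows from ||a / sqrt(eta) - sqrt(eta) * b||^2 >= 0.
        all_hold = all_hold and (np.dot(a, b) <= young_upper_bound(a, b, eta) + 1e-12)

# Coefficient that must be positive for the last inequality of the proof.
coeff = beta / 2 - beta / 8 - beta / 8 - L / 2  # equals beta / 4 - L / 2
```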
1.6 Proof of Theorem 3.19
Proof
Through a similar argument, one can easily obtain
where \(\theta _k = \sum _{i=1}^N(\Vert t_i^{k+1}g_i^{k+1}\Vert ^2+\Vert t_i^kg_i^k\Vert ^2+\Vert t_i^{k-1}g_i^{k-1}\Vert ^2)\). The only remaining task is to guarantee an \(\epsilon \) version of (48). First let us prove that
Denote \(h_i(x_i) = \mathcal {L}_{\beta }(x^{k+2}_1,\ldots ,x^{k+2}_{i-1},x_i,x^{k+1}_{i+1},\ldots ,x^{k+1}_N,\lambda ^{k+1})\) and \(Y_i(t) = R(x^{k+1}_i,-tg_i^{k+1})\), then it is not hard to see that \(\nabla h_i(x_i)\) is Lipschitz continuous with parameter \(L+\beta \Vert A_i\Vert _2^2 \le L_3:=L+\beta A_{\max }^2\). Consequently, it yields
where the last equality is due to \(\langle \nabla h_i(Y_i(0)),Y_i'(0)\rangle = -\langle Y_i'(0),Y_i'(0)\rangle \). Also note the relationship
Note that \(\left\| \sum _{i=1}^{N-1}A_ix^{k+1}_i{+}x^{k+1}_N{-}b\right\| \le \sqrt{\kappa _1\theta _k}{\le } \sqrt{\frac{\kappa _1}{\tau }(\Psi _G(x_1^1,\ldots ,x_N^1,\lambda ^1,x_N^0){-}f^*)}.\) Because \(\mathcal {M}_i, i = 1,\ldots ,N-1\) are all compact submanifolds, \(x^{k+1}_i, i = 1,\ldots ,N-1\) are all bounded. Hence the whole sequence \(\{x_N^{k}\}\) is also bounded. By (27) (which also holds in this case),
By the boundedness of \(\{(x^k_1,\ldots ,x^k_N)\}\) and the continuity of \(\nabla f(\cdot )\), the second term is bounded. Combining this with the boundedness of \(\{\theta _k\}\), we know that the whole sequence \(\{\lambda ^k\}\) is bounded. Consequently, there exists a constant \(C>0\) such that \(\Vert \nabla h_i(Y_i(0))\Vert \le C,\) where
Note that this constant C depends only on the first two iterates \(\{x_1^t,\ldots ,x_N^t,\lambda ^t\}, t = 0,1,\) except for the absolute constants such as \(\Vert A_i\Vert _2,i = 1,\ldots ,N\). Therefore, when
it holds that
Since \(\sigma >\frac{2\alpha }{s}\), by the terminating rule of the line-search step we have
Then by noting
we have (86).
Now let us discuss the issue of (48). By definition,
Consequently, we obtain
\(\square \)
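To make the curve \(Y_i(t) = R(x^{k+1}_i,-tg_i^{k+1})\) concrete, the following sketch performs a backtracking search along a retraction curve on the unit sphere. Everything specific here is an assumption for illustration: the sphere with the normalization retraction stands in for a generic \(\mathcal {M}_i\), the acceptance test is a standard Armijo-type rule whose constants may differ from the terminating rule of the line-search step in the paper, and the names `sphere_retraction`, `riemannian_grad`, `curvilinear_step` and the parameters `s`, `alpha`, `c` are hypothetical.

```python
import numpy as np

def sphere_retraction(x, v):
    """Retraction R(x, v) = (x + v) / ||x + v|| onto the unit sphere."""
    y = x + v
    return y / np.linalg.norm(y)

def riemannian_grad(x, euclid_grad):
    """Project a Euclidean gradient onto the tangent space of the sphere at x."""
    return euclid_grad - np.dot(x, euclid_grad) * x

def curvilinear_step(h, grad_h, x, s=1.0, alpha=0.5, c=1e-4, max_backtracks=50):
    """One backtracking step along Y(t) = R(x, -t g); returns (new point, step size)."""
    g = riemannian_grad(x, grad_h(x))
    hx = h(x)
    t = s
    for _ in range(max_backtracks):
        y = sphere_retraction(x, -t * g)
        # Armijo-type sufficient decrease test (assumed form).
        if h(y) <= hx - c * t * np.dot(g, g):
            return y, t
        t *= alpha  # shrink the step and try again
    return x, 0.0  # no acceptable step found; stay at x

# Hypothetical test problem: h(x) = -x^T A x on the unit sphere.
A = np.diag([3.0, 2.0, 1.0])
h = lambda x: -float(x @ A @ x)
grad_h = lambda x: -2.0 * (A @ x)

x = np.ones(3) / np.sqrt(3.0)
values = [h(x)]
for _ in range(100):
    x, t = curvilinear_step(h, grad_h, x)
    values.append(h(x))
```

Every iterate stays on the sphere by construction, which is the sense in which the method is feasible, and the recorded values of \(h\) are non-increasing because a step is accepted only when the decrease test holds.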
1.7 Proof of inequality (60)
Proof
First, we need to figure out the Lipschitz constant of \(\bar{f}_{\beta }\).
So we define \(\hat{L} = L+\beta N\max _{1\le i\le N}\Vert A_i\Vert _2^2 \) as the Lipschitz constant for function \(\bar{f}_{\beta }.\) The global optimality of the subproblem (59) yields
By the descent lemma we have
Combining the above two inequalities yields (60). \(\square \)
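As a small illustration of the two ingredients of this proof, the sketch below computes \(\hat{L} = L+\beta N\max _{1\le i\le N}\Vert A_i\Vert _2^2\) from a list of matrices and checks the descent lemma on a simple quadratic. The matrices, the quadratic test function and the helper names `lipschitz_constant_hat` and `descent_lemma_gap` are hypothetical and are not taken from the paper's experiments.

```python
import numpy as np

def lipschitz_constant_hat(L, beta, A_list):
    """L_hat = L + beta * N * max_i ||A_i||_2^2, following the definition in the text."""
    N = len(A_list)
    max_sq_norm = max(np.linalg.norm(A, 2) ** 2 for A in A_list)  # spectral norms
    return L + beta * N * max_sq_norm

def descent_lemma_gap(f, grad_f, L_f, x, y):
    """f(x) + <grad f(x), y - x> + (L_f / 2) ||y - x||^2 - f(y).

    This quantity is nonnegative whenever grad_f is L_f-Lipschitz.
    """
    d = y - x
    upper = f(x) + np.dot(grad_f(x), d) + 0.5 * L_f * np.dot(d, d)
    return upper - f(y)

# Hypothetical data: two blocks with spectral norms 1 and 2.
A_list = [np.eye(2), 2.0 * np.eye(2)]
L_hat = lipschitz_constant_hat(L=1.0, beta=0.5, A_list=A_list)

# Quadratic test function whose gradient is Lipschitz with constant 4.
Q = np.diag([1.0, 4.0])
f = lambda x: 0.5 * float(x @ Q @ x)
grad_f = lambda x: Q @ x
rng = np.random.default_rng(0)
gaps = [
    descent_lemma_gap(f, grad_f, 4.0, rng.standard_normal(2), rng.standard_normal(2))
    for _ in range(100)
]
```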
Zhang, J., Ma, S. & Zhang, S. Primal-dual optimization algorithms over Riemannian manifolds: an iteration complexity analysis. Math. Program. 184, 445–490 (2020). https://doi.org/10.1007/s10107-019-01418-8
DOI: https://doi.org/10.1007/s10107-019-01418-8
Keywords
- Nonconvex and nonsmooth optimization
- Riemannian manifold
- \(\epsilon \)-Stationary solution
- ADMM
- Iteration complexity