Abstract
While the optimization problem associated with LRR is convex and easy to solve, achieving high efficiency remains a real challenge, especially in large-scale settings. In this chapter we therefore address the problem of solving nuclear norm regularized optimization problems (NNROPs), a category of problems that includes LRR. Based on the fact that the optimal solution matrix of an NNROP is often low-rank, we revisit the classic mechanism of low-rank matrix factorization and present an active subspace algorithm that solves NNROPs efficiently by transforming large-scale NNROPs into small-scale problems. The transformation is achieved by factorizing the large-size solution matrix into the product of a small-size orthonormal matrix (the active subspace) and another small-size matrix. Although such a transformation generally leads to non-convex problems, we show that a suboptimal solution can be found by the augmented Lagrange alternating direction method. For the robust PCA (RPCA) [7] problem, a typical example of an NNROP, theoretical results verify the sub-optimality of the solution produced by our algorithm. For general NNROPs, we empirically show that our algorithm significantly reduces the computational complexity without loss of optimality.
The contents of this chapter have been published in Neural Computation [18]. \(\copyright \) [2014] MIT Press. Reprinted, with permission, from MIT Press Journals.
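To make the factorization idea concrete, the following is a minimal NumPy sketch of how an NNROP of the RPCA type, \(\min_{Q,J,E}\Vert J\Vert_*+\lambda\Vert E\Vert_1\) s.t. \(D=QJ+E\), \(Q^TQ=\mathtt{I}\), can be attacked by augmented Lagrange alternating directions. It is an illustration under assumptions, not the chapter's Algorithm 1 verbatim: the function names, the parameter defaults (`rho`, `mu`, `tol`, `max_iter`), the random initialization of \(Q\), and the stopping test are choices made here for readability.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def shrink(M, tau):
    """Entrywise soft thresholding: proximal operator of the l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def active_subspace_rpca(D, r, lam=None, rho=1.5, mu=1e-3, tol=1e-7, max_iter=500):
    """Sketch of min ||J||_* + lam*||E||_1  s.t.  D = Q J + E,  Q^T Q = I."""
    m, n = D.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam
    Q = np.linalg.qr(np.random.randn(m, r))[0]      # column-orthonormal active subspace
    J, E, Y = np.zeros((r, n)), np.zeros((m, n)), np.zeros((m, n))
    for _ in range(max_iter):
        B = D - E + Y / mu
        # Q-step: orthogonal Procrustes fit, argmin_{Q^T Q = I} ||B - Q J||_F
        U, _, Vt = np.linalg.svd(B @ J.T, full_matrices=False)
        Q = U @ Vt
        # J-step: singular value thresholding on a small r x n matrix
        J = svt(Q.T @ B, 1.0 / mu)
        # E-step: entrywise l1 shrinkage
        E = shrink(D - Q @ J + Y / mu, lam / mu)
        # Dual update and capped penalty growth (cf. note 3 below on choosing rho)
        R = D - Q @ J - E
        Y = Y + mu * R
        mu = min(1e6, rho * mu)
        if np.abs(R).max() < tol * max(1.0, np.abs(D).max()):
            break
    return Q, J, E
```

Since \(Q\) is \(m\times r\) and \(J\) is \(r\times n\), every SVD inside the loop is computed on an \(m\times r\) or \(r\times n\) matrix rather than on the full \(m\times n\) matrix required by plain singular value thresholding, which is where the reduction in computational cost comes from. For example, given a corrupted observation `D` and an over-estimated rank `r`, `Q, J, E = active_subspace_rpca(D, r)` returns the low-rank part as `Q @ J` and the sparse part as `E`.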
Notes
- 1.
More generally, NNROPs are expressed as \(\min _{X}\Vert X\Vert _*+\lambda {}f(X)\), where \(f(X)\) is a convex function. In this work, we are particularly interested in the form (1), which covers a wide range of problems.
- 2.
For an \(m\times {n}\) matrix \(M\) (without loss of generality, assuming \(m\le {n}\)), its SVD is defined by \(M=U[\varSigma ,0]V^T\), where \(U\) and \(V\) are orthogonal matrices and \(\varSigma =\mathrm {diag}\left( \sigma _1,\sigma _2,\ldots ,\sigma _m\right) \) with \(\{\sigma _i\}_{i=1}^m\) being the singular values. The SVD defined in this way is also called the full SVD. If we only calculate the first \(m\) column vectors of \(V\), i.e., \(M=U\varSigma {}V^T\) with \(U\in \fancyscript{R}^{m\times {}m}\), \(\varSigma \in \fancyscript{R}^{m\times {}m}\), and \(V\in \fancyscript{R}^{n\times {}m}\), the simplified form is called the thin SVD. If we further keep only the positive singular values, the reduced form is called the skinny SVD. For a matrix \(M\) of rank \(r\), its skinny SVD is given by \(M=U_r\varSigma _rV_r^T\), where \(\varSigma _r=\mathrm {diag}\left( \sigma _1,\sigma _2,\ldots ,\sigma _r\right) \) with \(\{\sigma _i\}_{i=1}^r\) being the positive singular values, and \(U_r\) and \(V_r\) are formed by taking the first \(r\) columns of \(U\) and \(V\), respectively. (A short numerical illustration of these definitions is given after these notes.)
- 3.
Nevertheless, as shown in Fig. 1b, the algorithm is less efficient when a smaller \(\rho \) is used, so one could choose this parameter by trading off efficiency against optimality. Here, we introduce a heuristic technique that modifies Step 5 of Algorithm 1 into:
$$\begin{aligned} \mu _{k+1} = \min (10^{6},\rho \mu _k). \end{aligned}$$
In this way, it is safe to use a relatively large \(\rho \).
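As a concrete companion to note 2, the following small NumPy check contrasts the thin SVD, which keeps all \(m\) singular triplets, with the skinny SVD, which keeps only the \(r\) positive ones; the \(4\times 6\) rank-2 test matrix and the numerical rank threshold are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# An m x n test matrix of rank 2 with m = 4 <= n = 6 (sizes are illustrative).
M = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 6))

# Thin SVD: U is 4 x 4, Sigma has m = 4 entries, V is 6 x 4.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
assert np.allclose(M, (U * s) @ Vt)

# Skinny SVD: keep only the positive singular values (here r = 2).
r = int(np.sum(s > 1e-10))
U_r, s_r, V_rt = U[:, :r], s[:r], Vt[:r, :]
assert np.allclose(M, (U_r * s_r) @ V_rt)   # M = U_r Sigma_r V_r^T
print("numerical rank:", r, "singular values:", s)
```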
References
1. F. Bach, Consistency of trace norm minimization. J. Mach. Learn. Res. 9, 1019–1048 (2008)
2. S. Burer, R. Monteiro, Local minima and convergence in low-rank semidefinite programming. Math. Program. 103, 427–444 (2005)
3. J. Cai, S. Osher, Fast singular value thresholding without singular value decomposition. UCLA Technical Report (2010)
4. J. Cai, E. Candès, Z. Shen, A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2010)
5. E. Candès, Y. Plan, Matrix completion with noise. Proc. IEEE 98(6), 925–936 (2010)
6. E. Candès, B. Recht, Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717–772 (2009)
7. E. Candès, X. Li, Y. Ma, J. Wright, Robust principal component analysis? J. ACM 58(3), 1–37 (2009)
8. V. Chandrasekaran, S. Sanghavi, P. Parrilo, A. Willsky, Rank-sparsity incoherence for matrix decomposition. SIAM J. Optim. 21(2), 572–596 (2009)
9. A. Edelman, T. Arias, S. Smith, The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 20, 303–353 (1999)
10. M. Fazel, Matrix rank minimization with applications. PhD thesis (2002)
11. N. Halko, P. Martinsson, J. Tropp, Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53(2), 217–288 (2011)
12. N. Higham, Matrix Procrustes problems (1995)
13. M. Jaggi, M. Sulovský, A simple algorithm for nuclear norm regularized problems, in International Conference on Machine Learning, pp. 471–478 (2010)
14. K.C. Lee, J. Ho, D. Kriegman, Acquiring linear subspaces for face recognition under variable lighting. IEEE Trans. Pattern Anal. Mach. Intell. 27(5), 684–698 (2005)
15. Z. Lin, M. Chen, L. Wu, Y. Ma, The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. Technical Report UILU-ENG-09-2215 (2009)
16. Z. Lin, R. Liu, Z. Su, Linearized alternating direction method with adaptive penalty for low-rank representation. Neural Inf. Process. Syst. 25, 612–620 (2011)
17. G. Liu, Z. Lin, Y. Yu, Robust subspace segmentation by low-rank representation. Int. Conf. Mach. Learn. 3, 663–670 (2010)
18. G. Liu, S. Yan, Active subspace: toward scalable low-rank learning. Neural Comput. 24(12), 3371–3394 (2012)
19. G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, Y. Ma, Robust recovery of subspace structures by low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell., preprint (2012)
20. K. Min, Z. Zhang, J. Wright, Y. Ma, Decomposing background topics from keywords by principal component pursuit. Conf. Inf. Knowl. Manag., pp. 269–278 (2010)
21. J. Nocedal, S. Wright, Numerical Optimization (Springer, New York, 2006)
22. S. Shalev-Shwartz, A. Gonen, O. Shamir, Large-scale convex minimization with a low-rank constraint. Int. Conf. Mach. Learn., pp. 329–336 (2011)
23. Y. Shen, Z. Wen, Y. Zhang, Augmented Lagrangian alternating direction method for matrix separation based on low-rank factorization. Technical Report (2011)
24. N. Srebro, N. Alon, T. Jaakkola, Generalization error bounds for collaborative prediction with low-rank matrices. Neural Inf. Process. Syst., pp. 5–27 (2005)
25. R. Tomioka, T. Suzuki, M. Sugiyama, H. Kashima, A fast augmented Lagrangian algorithm for learning low-rank matrices. Int. Conf. Mach. Learn., pp. 1087–1094 (2010)
26. P. Tseng, On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM J. Optim. (2008)
27. M. Weimer, A. Karatzoglou, Q. Le, A. Smola, CofiRank: maximum margin matrix factorization for collaborative ranking. Neural Inf. Process. Syst. (2007)
28. C. Williams, M. Seeger, The effect of the input density distribution on kernel-based classifiers. Int. Conf. Mach. Learn., pp. 1159–1166 (2000)
29. J. Wright, A. Ganesh, S. Rao, Y. Peng, Y. Ma, Robust principal component analysis: exact recovery of corrupted low-rank matrices via convex optimization. Neural Inf. Process. Syst., pp. 2080–2088 (2009)
30. J. Yang, X. Yuan, An inexact alternating direction method for trace norm regularized least squares problem. Under review in Math. Comput. (2010)
31. Y. Zhang, Recent advances in alternating direction methods: practice and theory. Tutorial (2010)
32. Z. Zhang, X. Liang, A. Ganesh, Y. Ma, TILT: transform invariant low-rank textures. Int. J. Comput. Vis. 99(1), 314–328 (2012)
33. G. Zhu, S. Yan, Y. Ma, Image tag refinement towards low-rank, content-tag prior and error sparsity. ACM Multimed., pp. 461–470 (2010)
Appendix
1.1 Proof of Lemma 1
The proof is based on the following two lemmas.
Lemma 3
The sequences \(\{Y_{k}\}\), \(\{\hat{Y}_k\}\) and \(\{\tilde{Y}_k\}\) are all bounded.
Proof
By the optimality of \(E_{k+1}\), the standard conclusion from convex optimization states that
i.e.,
which directly leads to
Hence, the sequence \(\{Y_k\}\) is bounded.
By the optimality of \(Q_{k+1}\), it can be calculated that
So \(\{\tilde{Y}_k\}\) is bounded due to the boundedness of \(\{Y_k\}\).
By the optimality of \(J_{k+1}\), the standard conclusion from convex optimization states that
which leads to
At the same time, letting \(Q_{k+1}^{\bot }\) be the orthogonal complement of \(Q_{k+1}\), it can be calculated that
Hence,
So both \(Q_{k+1}^{T}\hat{Y}_{k+1}\) and \((Q_{k+1}^{\bot })^T\hat{Y}_{k+1}\) are bounded, which implies that \(\hat{Y}_{k+1}\) is bounded. \(\square \)
Lemma 4
The sequences \(\{J_{k}\}\), \(\{E_k\}\) and \(\{Q_kJ_k\}\) are all bounded.
Proof
From the iteration procedure of Algorithm 1, we have that
So \(\{\fancyscript{L}(Q_{k+1},J_{k+1},E_{k+1},Y_k,\mu _k)\}\) is upper bounded due to the boundedness of \(\{Y_k\}\) and
Hence,
is upper bounded, which means that \(\{J_k\}\) and \(\{E_k\}\) are bounded. Since \(\Vert Q_{k}J_{k}\Vert _*=\Vert J_k\Vert _*\), \(\{Q_kJ_k\}\) is also bounded. \(\square \)
Proof
(of Lemma 1 ). By the boundedness of \(Y_k\), \(\hat{Y}_k\) and \(\tilde{Y}_{k+1}\) and the fact that \(\lim _{k\rightarrow \infty }\mu _k=\infty \),
According to the definitions of \(Y_{k}\) and \(\hat{Y}_{k}\), it can be also calculated that
Hence, the sequences \(\{J_k\}\), \(\{E_k\}\) and \(\{Q_kJ_k\}\) are Cauchy sequences, and Algorithm 1 can stop within a finite number of iterations.
By the convergence conditions of Algorithm 1, it can be calculated that
where \(k^*\) is defined in (6), and \(\varepsilon >0\) is the control parameter set in Algorithm 1. \(\square \)
Note. One may have noticed that \(\{Q_k\}\) may not converge, because the basis of a subspace is not unique. Nevertheless, whether or not \(\{Q_k\}\) converges is immaterial, because it is the product of \(Q^*\) and \(J^*\), namely \((X=Q^*J^*, E=E^*)\), that recovers a solution to the original RPCA problem.
1.2 Proof of Lemma 2
We prove the following lemma at first.
Lemma 5
Let \(X\), \(Y\), and \(Q\) be matrices of compatible dimensions. If \(Q\) obeys \(Q^TQ=\mathtt {I}\) and \(Y\in \partial \Vert X\Vert _*\), then
Proof
Let the skinny SVD of \(X\) be \(U\varSigma {}V^T\). By \(Y\in \partial \Vert X\Vert _*\), we have
Since \(Q\) is column-orthonormal, we have
With the above notations, it can be verified that \(QY\in \partial \Vert QX\Vert _*\). \(\square \)
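As an informal numerical companion to Lemma 5 (not part of the proof), one can check on random data that left-multiplication by a column-orthonormal \(Q\) preserves the nuclear norm and maps the canonical subgradient \(U_rV_r^T\in \partial \Vert X\Vert _*\) to a subgradient at \(QX\); the matrix sizes and the single test point \(Z\) below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p = 5, 4, 8
X = rng.standard_normal((m, 2)) @ rng.standard_normal((2, n))  # rank-2 test matrix

# Canonical subgradient of the nuclear norm at X: Y = U_r V_r^T from the skinny SVD.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = int(np.sum(s > 1e-10))
Y = U[:, :r] @ Vt[:r, :]

# A column-orthonormal Q (Q^T Q = I), e.g. from a QR factorization.
Q = np.linalg.qr(rng.standard_normal((p, m)))[0]

nuc = lambda A: np.linalg.svd(A, compute_uv=False).sum()

# ||QX||_* = ||X||_*, and QY = (Q U_r) V_r^T is the canonical subgradient at QX.
assert np.isclose(nuc(Q @ X), nuc(X))

# One-sample subgradient inequality: ||Z||_* >= ||QX||_* + <QY, Z - QX>.
Z = rng.standard_normal((p, n))
assert nuc(Z) >= nuc(Q @ X) + np.sum((Q @ Y) * (Z - Q @ X)) - 1e-9
```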
Proof
(of Lemma 2). Let the skinny SVD of \(D-E_k+Y_k/\mu _k\) be \(D-E_k+Y_k/\mu _k=U_k\varSigma _k{}V_k^T\); then it can be calculated that
Let the full SVD of \(\varSigma _kV_k^TJ_k^T\) be \(\varSigma _kV_k^TJ_k^T=U\varSigma {}V^T\) (note that \(U\) and \(V\) are orthogonal matrices), then it can be calculated that
which simply leads to
Hence,
i.e.,
According to (12) and Lemma 5, we have
Hence,
where the conclusion of \(Y_{k+1}\in {}\lambda \partial \left\| E_{k+1} \right\| _1\) is quoted from (11). Since the above conclusion holds for any \(k\), it naturally holds at \((Q^*,J^*,E^*)\):
Given any feasible solution \((Q,J,E)\) to problem (5), by the convexity of matrix norms and (13), it can be calculated that
By Lemma 1, we have that \(\Vert QJ+E-Q^*J^*-E^*\Vert _{\infty }\le \Vert D-Q^*J^*-E^*\Vert _{\infty }<\varepsilon \), which leads to
where \(\Vert \hat{Y}_*\Vert \le 1\) is due to (13). Hence,
\(\square \)
1.3 Proof of Theorem 1
Proof
Notice that \((Q^*,J=0,E=D)\) is feasible to (5). Let \((Q^g,J^g,E^g)\) be a globally optimal solution to (5), then we have
By the proof procedure of Lemma 4, we have that \(E^*\) is bounded by
Hence,
Note that \(|\langle {}M,N\rangle |\le \Vert M\Vert _{\infty }\Vert N\Vert _1\) holds for any matrices \(M\) and \(N\). By Lemma 2 and (14), we have
which simply leads to the inequality stated in Theorem 1. \(\square \)
1.4 Proof of Theorem 2
Proof
Let \(X=Q^*J^*\) and \(E=E^*\); then \((X,E)\) is a feasible solution to the original RPCA problem. By the convexity of the RPCA problem and the optimality of \((X^o,E^o)\), it naturally follows that
Let \(X^o=U^o\varSigma ^o{}(V^o)^T\) be the skinny SVD of \(X^o\). Construct \(Q^{\prime }=U^o\), \(J^{\prime }=\varSigma ^o{}(V^o)^T\) and \(E^{\prime }=E^o\). When \(r\ge {}r_0\), we have
i.e., \((Q^{\prime },J^{\prime },E^{\prime })\) is a feasible solution to problem (5). By Theorem 1, it can be concluded that
For \(r<r_0\), we decompose the skinny SVD of \(X^o\) as
where \(U_0,V_0\) (resp. \(U_1, V_1\)) are the singular vectors associated with the \(r\) largest singular values (resp. the remaining singular values, which are smaller than or equal to \(\sigma _{r}\)). With these notations, we have a feasible solution to problem (5) by constructing
By Theorem 1, it can be calculated that
\(\square \)