Abstract
Optimization over low rank matrices has broad applications in machine learning. For large-scale problems, an attractive heuristic is to factorize the low rank matrix into a product of two much smaller matrices. In this paper, we study the nonconvex problem \(\min _{\mathbf {U}\in \mathbb {R}^{n\times r}} g(\mathbf {U})=f(\mathbf {U}\mathbf {U}^T)\) under the assumptions that \(f(\mathbf {X})\) is restricted \(\mu \)-strongly convex and L-smooth on the set \(\{\mathbf {X}:\mathbf {X}\succeq 0,\text{ rank }(\mathbf {X})\le r\}\). We propose an accelerated gradient method with alternating constraint that operates directly on the \(\mathbf {U}\) factors and show that the method converges locally at a linear rate with the optimal \(\sqrt{L/\mu }\) dependence on the condition number. Globally, our method converges to a critical point with zero gradient from any initialization. Our method also applies to the problem with the asymmetric factorization \(\mathbf {X}={\widetilde{\mathbf {U}}}{\widetilde{\mathbf {V}}}^T\), and the same convergence result holds. Extensive experimental results verify the advantage of our method.
1 Introduction
Low rank matrix estimation has broad applications in machine learning, computer vision and signal processing. In this paper, we consider the problem of the form:
where there exists a minimizer \(\mathbf {X}^*\) of rank r. We consider the case of \(r\ll n\). Optimizing problem (1) in the \(\mathbf {X}\) space often requires computing at least the top-r singular values/vectors in each iteration and \(O(n^2)\) memory to store a large n by n matrix, which restricts applications with huge matrices. To reduce the computational cost as well as the storage space, many works exploit the observation that a positive semidefinite low rank matrix can be factorized as a product of two much smaller matrices, i.e., \(\mathbf {X}=\mathbf {U}\mathbf {U}^T\), and study the following nonconvex problem instead:
A wide family of problems can be cast as problem (2), including matrix sensing (Bhojanapalli et al. 2016b), matrix completion (Jain et al. 2013), one bit matrix completion (Davenport et al. 2014), sparse principal component analysis (Cai et al. 2013) and factorization machines (Lin and Ye 2016). In this paper, we study problem (2) and aim to propose an accelerated gradient method that operates on the \(\mathbf {U}\) factors directly. The factorization in problem (2) makes \(g(\mathbf {U})\) nonconvex, even if \(f(\mathbf {X})\) is convex. Thus, proving acceleration becomes a harder task than the analysis for convex programming.
1.1 Related work
Recently, there has been a trend to study the nonconvex problem (2) in the machine learning and optimization communities. Recent developments come from two aspects: (1) the geometric aspect, which proves that there is no spurious local minimum for some special cases of problem (2), e.g., matrix sensing (Bhojanapalli et al. 2016b) and matrix completion (Ge et al. 2016, 2017); see (Li et al. 2018; Zhu et al. 2018; Zhang et al. 2018) for unified analyses. (2) the algorithmic aspect, which analyzes the local linear convergence of some efficient schemes such as the gradient descent method. Examples include (Burer and Monteiro 2003, 2005; Boumal et al. 2016; Tu et al. 2016; Zhang and Lafferty 2015; Park et al. 2016) for semidefinite programs, (Sun and Luo 2015; Park et al. 2013; Hardt and Wootters 2014; Zheng and Lafferty 2016; Zhao et al. 2015) for matrix completion, (Zhao et al. 2015; Park et al. 2013) for matrix sensing and (Yi et al. 2016; Gu et al. 2016) for robust PCA. The local linear convergence rate of the gradient descent method for problem (2) is proved in a unified framework in Bhojanapalli et al. (2016a), Chen and Wainwright (2015) and Wang et al. (2017). However, no acceleration scheme is studied in these works. It remains an open problem how to analyze the accelerated gradient method for the nonconvex problem (2).
Nesterov’s acceleration technique (Nesterov 1983, 1988, 2004) has been empirically verified to be efficient on some nonconvex problems, e.g., deep learning (Sutskever et al. 2013). Several works studied the accelerated gradient method and the inertial gradient descent method for general nonconvex programming (Ghadimi and Lan 2016; Li and Lin 2015; Xu and Yin 2014). However, they only proved convergence and gave no guarantee of acceleration for nonconvex problems. Carmon et al. (2018), Carmon et al. (2017), Agarwal et al. (2017) and Jin et al. (2018) analyzed the accelerated gradient method for general nonconvex optimization and proved a complexity of \(O(\epsilon ^{-7/4}\text{ log }(1/\epsilon ))\) to escape saddle points or reach critical points. They studied the general problem and did not exploit the special structure of problem (2); thus, their complexity is sublinear. Necoara et al. (2019) studied several conditions under which the gradient descent and accelerated gradient methods converge linearly for non-strongly convex optimization. Their conclusion for the gradient descent method can be extended to the nonconvex problem (2). For the accelerated gradient method, Necoara et al. required a strong assumption that all \(\mathbf {y}^k,k=0,1,\ldots ,\) have the same projection onto the optimum solution set. It does not hold for problem (2).
1.2 Our contributions
In this paper, we apply Nesterov’s acceleration scheme to problem (2) and propose an efficient accelerated gradient method with alternating constraint, which operates on the \(\mathbf {U}\) factors directly. We back up our method with provable theoretical results. Specifically, our contributions can be summarized as follows:
-
1.
We establish the curvature of local restricted strong convexity along a certain trajectory by restricting the problem onto a constraint set, which allows us to use the classical accelerated gradient method for convex programs to solve the constrained problem. We build our result using polar decomposition as the main tool.
-
2.
In order to reduce the negative influence of the constraint and ensure the convergence to the critical point of the original unconstrained problem, rather than the reformulated constrained problem, we propose a novel alternating constraint strategy and combine it with the classical accelerated gradient method.
-
3.
When f is restricted \(\mu \)-strongly convex and restricted L-smooth, our method converges locally linearly to the optimum solution with the same dependence on \(\sqrt{L/\mu }\) as in convex programming. As far as we know, we are the first to establish a convergence rate matching the optimal dependence on \(\sqrt{L/\mu }\) for this kind of nonconvex problem. Globally, our method converges to a critical point of problem (2) from any initialization.
1.3 Notations and assumptions
For matrices \(\mathbf {U},\mathbf {V}\in \mathbb {R}^{n\times r}\), we use \(\Vert \mathbf {U}\Vert _F\) as the Frobenius norm, \(\Vert \mathbf {U}\Vert _2\) as the spectral norm and \(\left\langle \mathbf {U},\mathbf {V}\right\rangle =\text{ trace }(\mathbf {U}^T\mathbf {V})\) as their inner product. We denote \(\sigma _r(\mathbf {U})\) as the smallest singular value of \(\mathbf {U}\) and \(\sigma _1(\mathbf {U})=\Vert \mathbf {U}\Vert _2\) as the largest one. We use \(\mathbf {U}_{ S}\in \mathbb {R}^{r\times r}\) as the submatrix of \(\mathbf {U}\) with the rows indicated by the index set \(S\subseteq \{1,2,\ldots ,n\}\), \(\mathbf {U}_{-S}\in \mathbb {R}^{(n-r)\times r}\) as the submatrix with the rows indicated by the indices outside S and \(\mathbf {X}_{ S, S}\in \mathbb {R}^{r\times r}\) as the submatrix of \(\mathbf {X}\) with the rows and columns indicated by S. \(\mathbf {X}\succeq 0\) means that \(\mathbf {X}\) is symmetric and positive semidefinite. Let \(I_{\varOmega _{ S}}(\mathbf {U})\) be the indicator function of the set \(\varOmega _{ S}\). For the objective function \(g(\mathbf {U})\), its gradient w.r.t. \(\mathbf {U}\) is \(\nabla g(\mathbf {U})=2\nabla f(\mathbf {U}\mathbf {U}^T)\mathbf {U}\). We assume that \(\nabla f(\mathbf {U}\mathbf {U}^T)\) is symmetric for simplicity. Our conclusions generalize naturally to the asymmetric case, where \(\nabla g(\mathbf {U})=\nabla f(\mathbf {U}\mathbf {U}^T)\mathbf {U}+\nabla f(\mathbf {U}\mathbf {U}^T)^T\mathbf {U}\). Denote the optimum solution set of problem (2) as
where \(\mathbf {X}^*\) is a minimizer of problem (1). An important issue in minimizing \(g(\mathbf {U})\) is that its optimum solution is not unique, i.e., if \(\mathbf {U}^*\) is the optimum solution of problem (2), then \(\mathbf {U}^*\mathbf {R}\) is also an optimum solution for any orthogonal matrix \(\mathbf {R}\in \mathbb {R}^{r\times r}\). Given \(\mathbf {U}\), we define the optimum solution that is closest to \(\mathbf {U}\) as
1.3.1 Assumptions
In this paper, we assume that f is restricted \(\mu \)-strongly convex and L-smooth on the set \(\{\mathbf {X}:\mathbf {X}\succeq 0,\text{ rank }(\mathbf {X})\le r\}\). We state the standard definitions below.
Definition 1
Let \(f:\mathbb {R}^{n\times n}\rightarrow \mathbb {R}\) be a convex differentiable function. Then, f is restricted \(\mu \)-strongly convex on the set \(\{\mathbf {X}:\mathbf {X}\succeq 0,\text{ rank }(\mathbf {X})\le r\}\) if, for any \(\mathbf {X},\mathbf {Y}\in \{\mathbf {X}:\mathbf {X}\succeq 0,\text{ rank }(\mathbf {X})\le r\}\), we have
Definition 2
Let \(f:\mathbb {R}^{n\times n}\rightarrow \mathbb {R}\) be a convex differentiable function. Then, f is restricted L-smooth on the set \(\{\mathbf {X}:\mathbf {X}\succeq 0,\text{ rank }(\mathbf {X})\le r\}\) if, for any \(\mathbf {X},\mathbf {Y}\in \{\mathbf {X}:\mathbf {X}\succeq 0,\text{ rank }(\mathbf {X})\le r\}\), we have
and
1.3.2 Polar decomposition
Polar decomposition is a powerful tool for matrix analysis. We briefly review it in this section. We only describe the left polar decomposition of a square matrix.
Definition 3
The polar decomposition of a matrix \(\mathbf {A}\in \mathbb {R}^{r\times r}\) has the form \(\mathbf {A}=\mathbf {H}\mathbf {Q}\) where \(\mathbf {H}\in \mathbb {R}^{r\times r}\) is positive semidefinite and \(\mathbf {Q}\in \mathbb {R}^{r\times r}\) is an orthogonal matrix.
If \(\mathbf {A}\in \mathbb {R}^{r\times r}\) has full rank, then \(\mathbf {A}\) has a unique polar decomposition with positive definite \(\mathbf {H}\). In fact, since a symmetric positive semidefinite matrix has a unique positive semidefinite square root, \(\mathbf {H}\) is uniquely given by \(\mathbf {H}=\sqrt{\mathbf {A}\mathbf {A}^T}\), and \(\mathbf {Q}=\mathbf {H}^{-1}\mathbf {A}\) is also unique.
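As a concrete illustration (a standard construction, not specific to this paper; the function name is ours), the polar factors of a full-rank square matrix can be read off its SVD: if \(\mathbf {A}=\mathbf {W}\varSigma \mathbf {V}^T\), then \(\mathbf {H}=\mathbf {W}\varSigma \mathbf {W}^T\) and \(\mathbf {Q}=\mathbf {W}\mathbf {V}^T\). A minimal numpy sketch:

```python
import numpy as np

def polar_decomposition(A):
    """Left polar decomposition A = H @ Q with H symmetric PSD and Q orthogonal.

    Computed from the SVD A = W @ diag(s) @ Vt: H = W diag(s) W^T, Q = W Vt.
    """
    W, s, Vt = np.linalg.svd(A)
    H = (W * s) @ W.T    # W @ diag(s) @ W.T, the PSD square root of A A^T
    Q = W @ Vt           # orthogonal factor
    return H, Q
```

Note that \(\mathbf {H}\mathbf {H}=\mathbf {W}\varSigma ^2\mathbf {W}^T=\mathbf {A}\mathbf {A}^T\), so this matches \(\mathbf {H}=\sqrt{\mathbf {A}\mathbf {A}^T}\) above.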
In this paper, we use the perturbation theorem of the polar decomposition to build the restricted strong convexity of \(g(\mathbf {U})\). It is stated below.
Lemma 1
(Li 1995) Let \(\mathbf {A}\in \mathbb {R}^{r\times r}\) be of full rank and \(\mathbf {H}\mathbf {Q}\) be its unique polar decomposition, \(\mathbf {A}+\triangle \mathbf {A}\) be of full rank and \((\mathbf {H}+\triangle \mathbf {H})(\mathbf {Q}+\triangle \mathbf {Q})\) be its unique polar decomposition. Then, we have
2 The restricted strongly convex curvature
Function \(g(\mathbf {U})\) is a special kind of nonconvex function and the non-convexity only comes from the factorization of \(\mathbf {U}\mathbf {U}^T\). Based on this observation, we exploit the special curvature of \(g(\mathbf {U})\) in this section.
The existing works proved the local linear convergence of the gradient descent method for problem (2) by exploiting curvatures such as the local second order growth property (Sun and Luo 2015; Chen and Wainwright 2015) or the \((\alpha ,\beta )\) regularity condition (Jin et al. 2018; Bhojanapalli et al. 2016a, b; Wang et al. 2017). The former is described as
while the latter is defined as
where \(\mathbf {U}^*\in \mathcal {X}^*\) and \(P_{\mathcal {X}^*}(\mathbf {U})\) is defined in (4). Both (5) and (6) can be derived from the local weakly strongly convex condition (Necoara et al. 2019) combined with the smoothness of \(g(\mathbf {U})\). This condition is stated as
where \(\alpha =\mu \sigma ^2_r(\mathbf {U}^*)\). As discussed in Sect. 1.3, the optimum solution of problem (2) is not unique. This non-uniqueness is what distinguishes the weakly strongly convex condition from strong convexity; e.g., on the right hand side of (7), we use \(P_{\mathcal {X}^*}(\mathbf {U})\) rather than \(\mathbf {U}^*\). Moreover, the weakly strongly convex condition does not imply convexity, and \(g(\mathbf {U})\) is not convex even in a small neighborhood of the global optimum solution (Li et al. 2016).
Necoara et al. (2019) studied several conditions under which the linear convergence of the gradient descent method is guaranteed for general convex programming without strong convexity. The weakly strongly convex condition is the strongest one and implies all the other conditions. However, the weakly strongly convex condition alone is not enough to analyze the accelerated gradient method. Necoara et al. (2019) proved the acceleration of the classical accelerated gradient method under an additional assumption, besides the weakly strongly convex condition and the smoothness condition, that all the iterates \(\{\mathbf {y}^k,k=0,1,\ldots \}\) have the same projection onto the optimum solution set. From the proof in (Necoara et al. 2019, Sect. 5.2.1), we can see that the non-uniqueness of the optimum solution is the main obstacle in analyzing the accelerated gradient method. The additional assumption made in Necoara et al. (2019) essentially aims to remove this non-uniqueness. Since this assumption is not satisfied for problem (2), (7) alone is not enough to prove acceleration for problem (2), and we need to exploit a stronger curvature than (7) to analyze the accelerated gradient method.
Motivated by Necoara et al. (2019), we aim to remove the non-uniqueness in problem (2). Our intuition is based on the following observation. Suppose that we can find an index set \( S\subseteq \{1,2,\ldots ,n\}\) of size r such that \(\mathbf {X}^*_{ S, S}\) has full rank r. Then there exists a unique decomposition \(\mathbf {X}^*_{ S, S}=\mathbf {U}^*_{ S}(\mathbf {U}^*_{ S})^T\) where we require \(\mathbf {U}^*_{ S}\succ 0\). Thus, there exists a unique \(\mathbf {U}^*\) such that \(\mathbf {U}^*{\mathbf {U}^*}^T=\mathbf {X}^*\) and \(\mathbf {U}^*_{ S}\succ 0\). To verify this, consider \( S=\{1,\ldots ,r\}\) for simplicity. Then \(\mathbf {U}\mathbf {U}^T=\left( \begin{array}{cc} \mathbf {U}_{S}\mathbf {U}_{S}^T &{} \mathbf {U}_{S}\mathbf {U}_{-S}^T\\ \mathbf {U}_{-S}\mathbf {U}_{S}^T &{} \mathbf {U}_{-S}\mathbf {U}_{-S}^T \end{array} \right) =\left( \begin{array}{cc} \mathbf {X}_{S,S} &{} \mathbf {X}_{S,-S} \\ \mathbf {X}_{-S,S} &{} \mathbf {X}_{-S,-S} \end{array} \right) \). The uniqueness of \(\mathbf {U}_S\) comes from \(\mathbf {X}_{S,S}\succ 0\) and \(\mathbf {U}_S\succ 0\), and the uniqueness of \(\mathbf {U}_{-S}\) comes from \(\mathbf {U}_{-S}=\mathbf {X}_{-S,S}\mathbf {U}_{S}^{-T}\).
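The construction above can be checked numerically. A minimal numpy sketch, assuming as in the text that \(S=\{1,\ldots ,r\}\) and \(\mathbf {X}_{S,S}\succ 0\) (the function name is ours):

```python
import numpy as np

def unique_factor(X, r):
    """Recover the unique U with U @ U.T == X and U[:r] symmetric positive
    definite, assuming S = {1,...,r}, rank(X) = r and X[:r, :r] positive definite."""
    w, V = np.linalg.eigh(X[:r, :r])
    U_S = (V * np.sqrt(w)) @ V.T             # unique PSD square root of X_{S,S}
    U_rest = X[r:, :r] @ np.linalg.inv(U_S)  # U_{-S} = X_{-S,S} U_S^{-T}; U_S symmetric
    return np.vstack([U_S, U_rest])
```

For a rank-r PSD matrix \(\mathbf {X}\) with \(\mathbf {X}_{S,S}\succ 0\), the returned factor satisfies \(\mathbf {U}\mathbf {U}^T=\mathbf {X}\) with \(\mathbf {U}_S\succ 0\), illustrating the claimed uniqueness.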
Based on the above observation, we can reformulate problem (2) as
where
and \(\epsilon \) is a small enough constant such that \(\epsilon \ll \sigma _r(\mathbf {U}_{ S}^*)\). We require \(\mathbf {U}_{ S}\succeq \epsilon \mathbf {I}\) rather than \(\mathbf {U}_{ S}\succ 0\) to make the projection onto \(\varOmega _S\) computable. Due to the additional constraint \(\mathbf {U}\in \varOmega _S\), the optimum solution of problem (8) is unique. Moreover, the minimizer of (8) also minimizes (2).
We are now ready to establish a stronger curvature than (7) by restricting the variables of \(g(\mathbf {U})\) to the set \(\varOmega _S\). We need to lower bound \(\Vert P_{\mathcal {X}^*}(\mathbf {U})-\mathbf {U}\Vert _F^2\) in (7) by \(\Vert \mathbf {U}^*-\mathbf {U}\Vert _F^2\). Our result is built upon the perturbation theorem of the polar decomposition (Li 1995). Based on Lemma 1, we first establish the following critical lemma.
Lemma 2
For any \(\mathbf {U}\in \varOmega _{ S}\) and \(\mathbf {V}\in \varOmega _{ S}\), let \(\mathbf {R}=\text{ argmin }_{\mathbf {R}\in \mathbb {R}^{r\times r},\mathbf {R}\mathbf {R}^T=\mathbf {I}}\Vert \mathbf {V}\mathbf {R}-\mathbf {U}\Vert _F^2\) and \({\hat{\mathbf {V}}}=\mathbf {V}\mathbf {R}\). Then, we have
Proof
Since the conclusion is not affected by permuting the rows of \(\mathbf {U}\) and \(\mathbf {V}\) under the same permutation, we can consider the case of \( S=\{1,\ldots ,r\}\) for simplicity. Let \(\mathbf {U}=\left( \begin{array}{c} \mathbf {U}_1 \\ \mathbf {U}_2 \end{array} \right) \), \(\mathbf {V}=\left( \begin{array}{c} \mathbf {V}_1 \\ \mathbf {V}_2 \end{array} \right) \) and \({\hat{\mathbf {V}}}=\left( \begin{array}{c} {\hat{\mathbf {V}}}_1 \\ {\hat{\mathbf {V}}}_2 \end{array} \right) \), where \(\mathbf {U}_1,\mathbf {V}_1,{\hat{\mathbf {V}}}_1\in \mathbb {R}^{r\times r}\). Then, we have \({\hat{\mathbf {V}}}_1=\mathbf {V}_1\mathbf {R}\). From \(\mathbf {U}\in \varOmega _{ S}\) and \(\mathbf {V}\in \varOmega _{ S}\), we know \(\mathbf {U}_1\succ 0\) and \(\mathbf {V}_1\succ 0\). Thus, \(\mathbf {U}_1\mathbf {I}\) and \(\mathbf {V}_1\mathbf {R}\) are the unique polar decompositions of \(\mathbf {U}_1\) and \({\hat{\mathbf {V}}}_1\), respectively. From Lemma 1, we have
With some simple computations, we can have
where we use \(\sigma _r(\mathbf {U}_1)\le \Vert \mathbf {U}\Vert _2\) and \(\Vert {\hat{\mathbf {V}}}_1-\mathbf {U}_1\Vert _F\le \Vert {\hat{\mathbf {V}}}-\mathbf {U}\Vert _F\) in the last inequality. Replacing \(\mathbf {U}_1\) with \(\mathbf {U}_{ S}\), we can have the conclusion. \(\square \)
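The rotation \(\mathbf {R}\) in Lemma 2 is the classical orthogonal Procrustes solution and has a closed form via the SVD of \(\mathbf {V}^T\mathbf {U}\). A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def closest_rotation(V, U):
    """argmin_{R : R @ R.T = I} ||V @ R - U||_F (orthogonal Procrustes).

    With the SVD V.T @ U = A @ diag(s) @ Bt, the minimizer is R = A @ Bt.
    """
    A, _, Bt = np.linalg.svd(V.T @ U)
    return A @ Bt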
Built upon Lemma 2, we can give the local restricted strong convexity of \(g(\mathbf {U})\) on the set \(\varOmega _S\) in the following theorem. There are two differences between the restricted strong convexity and the weakly strongly convex condition: (i) the restricted strong convexity removes the non-uniqueness and (ii) the restricted strong convexity establishes the curvature between any two points \(\mathbf {U}\) and \(\mathbf {V}\) in a local neighborhood of \(\mathbf {U}^*\), while (7) only exploits the curvature between \(\mathbf {U}\) and the optimum solution.
Theorem 1
Let \(\mathbf {U}^*=\varOmega _{ S}\cap \mathcal {X}^*\) and assume that \(\mathbf {U}\in \varOmega _{ S}\) and \(\mathbf {V}\in \varOmega _{ S}\) with \(\Vert \mathbf {U}-\mathbf {U}^*\Vert _F\le C\) and \(\Vert \mathbf {V}-\mathbf {U}^*\Vert _F\le C\), where \(C=\frac{\mu \sigma _r^2(\mathbf {U}^*)\sigma _r^2(\mathbf {U}_{ S}^*)}{100L\Vert \mathbf {U}^*\Vert _2^3}\). Then, we have
Proof
From the restricted convexity of \(f(\mathbf {X})\), we have
where we use \(\nabla f(\mathbf {X}^*)\succeq 0\) proved in Lemma 7 and the fact that the inner product of two positive semidefinite matrices is nonnegative in the last inequality, i.e., \(\left\langle \nabla f(\mathbf {X}^*),(\mathbf {V}-\mathbf {U})(\mathbf {V}-\mathbf {U})^T\right\rangle \ge 0\). Applying Von Neumann’s trace inequality and Lemma 10 to bound the second term, applying Lemmas 2 and 8 to bound the third term, we can have
where we use Lemma 9 in the last inequality. From the assumption \(\Vert \mathbf {V}-\mathbf {U}^*\Vert _F\le C\), we obtain the conclusion. Lemmas 7, 8, 9 and 10 are given in “Appendix A”. \(\square \)
2.1 Smoothness of function \(g(\mathbf {U})\)
Besides the local restricted strong convexity, we can also prove the smoothness of \(g(\mathbf {U})\), which is established in the following theorem.
Theorem 2
Let \({\hat{L}}=2\Vert \nabla f(\mathbf {V}\mathbf {V}^T)\Vert _2+L(\Vert \mathbf {V}\Vert _2+\Vert \mathbf {U}\Vert _2)^2\). Then, we can have
Proof
From the restricted Lipschitz smoothness of f and a derivation similar to (10), we have
Applying Von Neumann’s trace inequality to the first term, applying Lemma 10 to the third term, we can have the conclusion. \(\square \)
When restricted in a small neighborhood of \(\mathbf {U}^*\), we can give a better estimate for the smoothness parameter \({\hat{L}}\), as follows. The proof is provided in “Appendix A”.
Corollary 1
Let \(\mathbf {U}^*=\varOmega _S\cap \mathcal {X}^*\) and assume that \(\mathbf {U}^k,\mathbf {V}^k,\mathbf {Z}^k\in \varOmega _{ S}\) with \(\Vert \mathbf {V}^k-\mathbf {U}^*\Vert _F\le C\), \(\Vert \mathbf {U}^k-\mathbf {U}^*\Vert _F\le C\) and \(\Vert \mathbf {Z}^k-\mathbf {U}^*\Vert _F\le C\), where C is defined in Theorem 1 and \(\mathbf {U}^k,\mathbf {V}^k,\mathbf {Z}^k\) are generated in Algorithm 1, which will be described later. Let \(L_g=38L\Vert \mathbf {U}^*\Vert _2^2+2\Vert \nabla f(\mathbf {X}^*)\Vert _2\) and \(\eta =\frac{1}{L_g}\). Then, we have
3 Accelerated gradient method with alternating constraint
From Theorem 1 and Corollary 1, we know that the objective \(g(\mathbf {U})\) behaves locally like a strongly convex and smooth function when restricted to the set \(\varOmega _S\). Thus, we can use a classical method for convex programming to solve problem (8), e.g., the accelerated gradient method.
However, a practical issue remains: when solving problem (8), we may get stuck at a critical point of problem (8) on the boundary of the constraint \(\mathbf {U}\in \varOmega _{ S}\), which is not the optimum solution of problem (2). In other words, we may halt before reaching the acceleration region, i.e., the local neighborhood of the optimum solution of problem (2). To overcome this difficulty, we propose a novel alternating trajectory strategy. Specifically, we define two sets \(\varOmega _{S^1}\) and \(\varOmega _{S^2}\) as follows
and minimize the objective \(g(\mathbf {U})\) along the trajectories of \(\varOmega _{S^1}\) and \(\varOmega _{S^2}\) alternately, i.e., when the iteration number t is odd, we minimize \(g(\mathbf {U})\) under the constraint \(\mathbf {U}\in \varOmega _{S^1}\), and when t is even, we minimize \(g(\mathbf {U})\) under the constraint \(\mathbf {U}\in \varOmega _{S^2}\). Intuitively, when the iterates approach the boundary of \(\varOmega _{S^1}\), we cancel the constraint of positive definiteness on \(\mathbf {U}_{ S^1}\) and put it on \(\mathbf {U}_{ S^2}\). Fortunately, with this strategy we can cancel the negative influence of the constraint. We require that the two index sets \( S^1\) and \( S^2\) are of size r with \( S^1\cap S^2=\emptyset \), so that \(\mathbf {U}^*_{ S^1}\) and \(\mathbf {U}^*_{ S^2}\) can be of full rank. Given proper \(S^1\) and \(S^2\), we can prove that the method globally converges to a critical point of problem (2), i.e., a point with \(\nabla g(\mathbf {U})=0\), rather than a critical point of problem (8) on the boundary of the constraint.
We describe our method in Algorithm 1. We use Nesterov’s acceleration scheme in the inner loop with finite K iterations and restart the acceleration scheme at each outer iteration. At the end of each outer iteration, we change the constraint and transform \(\mathbf {U}^{t,K+1}\in \varOmega _{ S}\) to a new point \(\mathbf {U}^{t+1,0}\in \varOmega _{ S'}\) via polar decomposition such that \(g(\mathbf {U}^{t,K+1})=g(\mathbf {U}^{t+1,0})\). At step (12), we need to project \(\mathbf {Z}\equiv \mathbf {Z}^{t,k}-\frac{\eta }{\theta _k}\nabla g(\mathbf {V}^{t,k})\) onto \(\varOmega _S\). Let \(\mathbf {A}\varSigma \mathbf {A}^T\) be the eigenvalue decomposition of \(\frac{\mathbf {Z}_S+\mathbf {Z}_S^T}{2}\) and \({\hat{\varSigma }}=\text{ diag }([\max \{\epsilon ,\varSigma _{1,1}\},\ldots ,\max \{\epsilon ,\varSigma _{r,r}\}])\), then \(\mathbf {Z}^{t,k+1}_S=\mathbf {A}{\hat{\varSigma }}\mathbf {A}^T\) and \(\mathbf {Z}^{t,k+1}_{-S}=\mathbf {Z}_{-S}\). At step (14), \(\theta _{k+1}\) is computed by \(\theta _{k+1}=\frac{\sqrt{\theta _{k}^4+4\theta _{k}^2}-\theta _{k}^2}{2}\). At the end of each outer iteration, we need to compute the polar decomposition. Let \(\mathbf {A}\varSigma \mathbf {B}^T\) be the SVD of \(\mathbf {U}_{S'}^{t,K+1}\), then we can set \(\mathbf {H}=\mathbf {A}\varSigma \mathbf {A}^T\) and \(\mathbf {Q}=\mathbf {A}\mathbf {B}^T\). In Algorithm 1, we predefine \(S^1\) and \(S^2\) and fix them during the iterations. In Sect. 3.1 we will discuss how to find \(S^1\) and \(S^2\) using some local information.
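The projection step described above is an eigenvalue clamping on the symmetrized \(\mathbf {Z}_S\) block. A direct numpy transcription (the function name is ours):

```python
import numpy as np

def project_onto_omega_S(Z, S, eps):
    """Project the n x r matrix Z onto Omega_S: symmetrize the r x r block
    formed by the rows in S, clamp its eigenvalues at eps, and leave the
    remaining rows of Z unchanged."""
    Zp = Z.copy()
    M = 0.5 * (Zp[S, :] + Zp[S, :].T)         # (Z_S + Z_S^T) / 2
    w, A = np.linalg.eigh(M)
    Zp[S, :] = (A * np.maximum(w, eps)) @ A.T  # A @ diag(max(eps, w)) @ A.T
    return Zp
```

Since only an \(r\times r\) eigendecomposition is involved, this step costs \(O(r^3)\), matching the complexity discussion below.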
Finally, let us compare the per-iteration cost of Algorithm 1 with that of methods operating in the \(\mathbf {X}\) space. Both the eigenvalue decomposition and the polar decomposition required in Algorithm 1 are performed on submatrices of size \(r\times r\), which need \(O(r^3)\) operations. Thus, the per-iteration complexity of Algorithm 1 is \(O(nr+r^3)\). As a comparison, the methods operating in the \(\mathbf {X}\) space require at least the top-r singular values/vectors, which need \(O(n^2r)\) operations for deterministic algorithms and \(O(n^2\log r)\) for randomized algorithms (Halko et al. 2011). Thus, our method is more efficient at each iteration when \(r\ll n\), especially when r is upper bounded by a constant independent of n.
3.1 Finding the index sets \( S^1\) and \( S^2\)
In this section, we consider how to find the index sets \( S^1\) and \( S^2\). \(S^1\cap S^2=\emptyset \) can be easily satisfied, and we only need to ensure that \(\mathbf {U}^*_{ S^1}\) and \(\mathbf {U}^*_{ S^2}\) are of full rank. Suppose that we have some initializer \(\mathbf {U}^0\) close to \(\mathbf {U}^*\). We want to use \(\mathbf {U}^0\) to find such \( S^1\) and \( S^2\). We first discuss how to select one index set S based on \(\mathbf {U}^0\). We can use the volume sampling subset selection algorithm (Guruswami and Sinop 2012; Avron and Boutsidis 2013), which selects S such that \(\sigma _r(\mathbf {U}^0_{ S})\ge \frac{\sigma _r(\mathbf {U}^0)}{\sqrt{2r(n-r+1)}}\) with probability \(1-\delta '\) in \(O(nr^3\log (1/\delta '))\) operations. Then, since \(\mathbf {U}^0\) is close to \(\mathbf {U}^*\), we can bound \(\sigma _r(\mathbf {U}^*_{ S})\) in the following lemma.
Lemma 3
If \(\Vert \mathbf {U}^0-\mathbf {U}^*\Vert _F\le 0.01\sigma _r(\mathbf {U}^*)\) and \(\Vert \mathbf {U}_S^0-\mathbf {U}_S^*\Vert _F\le \frac{0.99\sigma _r(\mathbf {U}^*)}{2\sqrt{2r(n-r+1)}}\), then for the index set S returned by the volume sampling subset selection algorithm performed on \(\mathbf {U}^0\) after \(O(nr^3\log (1/\delta '))\) operations, we have \(\sigma _r(\mathbf {U}^*_{ S})\ge \frac{0.99\sigma _r(\mathbf {U}^*)}{2\sqrt{2r(n-r+1)}}\) with probability of \(1-\delta '\).
Proof
From Theorem 3.11 in (Avron and Boutsidis 2013), we have \(\sigma _r(\mathbf {U}^0_{ S})\ge \frac{\sigma _r(\mathbf {U}^0)}{\sqrt{2r(n-r+1)}}\) with probability \(1-\delta '\) after \(O(nr^3\log (1/\delta '))\) operations. So we can obtain
which leads to
where we use \(0.99\sigma _r(\mathbf {U}^*)\le \sigma _r(\mathbf {U}^0)\), which is proved in Lemma 9 in “Appendix A”. \(\square \)
In the column selection problem and its variants, existing algorithms (see Avron and Boutsidis 2013 and the references therein) can only find one index set. Our purpose is to find both \( S^1\) and \( S^2\). We believe that this is a challenging target for the theoretical computer science community. In our applications, since \(n\gg r\), we may expect that the rank of \(\mathbf {U}^0_{- S^1}\) is unaffected by dropping r rows from \(\mathbf {U}^0\). Thus, we can apply the procedure discussed above again to find \( S^2\) from \(\mathbf {U}^0_{- S^1}\). From Lemma 3, we have \(\sigma _r(\mathbf {U}^0_{S^1})\ge \frac{\sigma _r(\mathbf {U}^0)}{\sqrt{2r(n-r+1)}}\) and \(\sigma _r(\mathbf {U}^0_{S^2})\ge \frac{\sigma _r(\mathbf {U}_{-S^1}^0)}{\sqrt{2r(n-2r+1)}}\). In the asymmetric case, this challenge disappears; see the details in Sect. 7. We show in experiments that Algorithm 1 works well even for the simple choice of \(S^1=\{1,\ldots ,r\}\) and \(S^2=\{r+1,\ldots ,2r\}\). The discussion of finding \( S^1\) and \( S^2\) in this section is only for theoretical purposes.
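The volume sampling algorithm of Avron and Boutsidis (2013) is beyond the scope of a short snippet, but a simple greedy surrogate (our own illustration, with no claim to their probabilistic guarantee) that selects r linearly independent, well-spread rows looks like:

```python
import numpy as np

def greedy_row_subset(U, r):
    """Greedily pick r row indices of U: at each step take the row with the
    largest residual norm, then project that direction out of all rows.
    A heuristic stand-in for volume sampling, without its guarantees."""
    R = U.astype(float).copy()
    S = []
    for _ in range(r):
        norms = np.linalg.norm(R, axis=1)
        i = int(np.argmax(norms))
        S.append(i)
        v = R[i] / norms[i]
        R -= np.outer(R @ v, v)   # deflate the chosen direction
    return S
```

Each chosen row has a nonzero component orthogonal to the previously chosen ones, so the returned rows form a full-rank \(r\times r\) block whenever \(\text{rank}(\mathbf {U})\ge r\).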
3.2 Initialization
Our theorem ensures the accelerated linear convergence given that the initial point \(\mathbf {U}^0\in \varOmega _{S^2}\) is within the local neighborhood of the optimum solution with radius C defined in Theorem 1. We use the initialization strategy in Bhojanapalli et al. (2016a). Specifically, let \(\mathbf {X}^0=\text{ Project }_{+}\left( \frac{-\nabla f(0)}{\Vert \nabla f(0)-\nabla f(11^T)\Vert _F}\right) \) and \(\mathbf {V}^0{\mathbf {V}^0}^T\) be the best rank-r approximation of \(\mathbf {X}^0\), where \(\text{ Project }_{+}\) denotes the projection operator onto the semidefinite cone. Then, Bhojanapalli et al. (2016a) proved \(\Vert \mathbf {V}^0-P_{\mathcal {X}^*}(\mathbf {V}^0)\Vert _F\le \frac{4\sqrt{2}r\Vert \mathbf {U}^*\Vert _2^2}{\sigma _r(\mathbf {U}^*)}\sqrt{\frac{L^2}{\mu ^2}-\frac{2\mu }{L}+1}\). Let \(\mathbf {H}\mathbf {Q}=\mathbf {V}^0_{S^2}\) be the polar decomposition of \(\mathbf {V}^0_{S^2}\) and \(\mathbf {U}^0=\mathbf {V}^0\mathbf {Q}^T\). Then, \(\mathbf {U}^0\) belongs to \(\varOmega _{S^2}\). Although this strategy does not produce an initial point close enough to the target, we show in experiments that our method performs well in practice. It should be noted that for the gradient descent method applied to the general problem (2), the initialization strategy in Bhojanapalli et al. (2016a) also does not satisfy the requirement of the theorems in Bhojanapalli et al. (2016a) for a general objective f.
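The PSD projection and best rank-r factor used in this initialization can be sketched as follows, abstracting the normalized gradient matrix as a generic symmetric input M (the function name is ours):

```python
import numpy as np

def rank_r_psd_factor(M, r):
    """Project the symmetric matrix M onto the PSD cone, then return V0 such
    that V0 @ V0.T is the best rank-r approximation of that projection."""
    w, A = np.linalg.eigh(0.5 * (M + M.T))
    w = np.maximum(w, 0.0)              # projection onto the PSD cone
    top = np.argsort(w)[::-1][:r]       # top-r eigenpairs
    return A[:, top] * np.sqrt(w[top])  # columns scaled by sqrt(eigenvalues)
```

The remaining alignment step, \(\mathbf {U}^0=\mathbf {V}^0\mathbf {Q}^T\) with \(\mathbf {H}\mathbf {Q}=\mathbf {V}^0_{S^2}\), only rotates the factor, so \(\mathbf {U}^0{\mathbf {U}^0}^T=\mathbf {V}^0{\mathbf {V}^0}^T\).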
4 Accelerated convergence rate analysis
In this section, we prove the local accelerated linear convergence rate of Algorithm 1. We first consider the inner loop. It uses the classical accelerated gradient method to solve problem (8) with a fixed index set S for K iterations. Thanks to the stronger curvature built in Theorem 1 and the smoothness in Corollary 1, we can use the standard proof framework, e.g., Tseng (2008), to analyze the inner loop. Some slight modifications are needed since we must ensure that all the iterates belong to the local neighborhood of \(\mathbf {U}^*\). We present the result in the following lemma and give its proof sketch. For simplicity, we omit the outer iteration number t.
Lemma 4
Let \(\mathbf {U}^*=\varOmega _{ S}\cap \mathcal {X}^*\) and assume that \(\mathbf {U}^0\in \varOmega _{ S}\) with \(\epsilon \le 0.99\sigma _r(\mathbf {U}_{ S'}^*)\) and \(\Vert \mathbf {U}^0-\mathbf {U}^*\Vert _F\le C\). Let \(\eta =\frac{1}{L_g}\), where C is defined in Theorem 1 and \(L_g\) is defined in Corollary 1. Then, we have \(\sigma _r(\mathbf {U}_{ S'}^{K+1})\ge \epsilon \), \(\Vert \mathbf {U}^{K+1}-\mathbf {U}^*\Vert _F\le C\) and
Proof
We follow four steps to prove the lemma.
Step 1 We can easily check that if \(\mathbf {U}^0\in \varOmega _{ S}\), then all the iterates of \(\{\mathbf {U}^k,\mathbf {V}^k,\mathbf {Z}^k\}\) belong to \(\varOmega _{ S}\) by \(0\le \theta _k\le 1\), the convexity of \(\varOmega _S\) and the convex combinations in (11) and (13).
Step 2 Consider the k-th iteration. If \(\Vert \mathbf {V}^k-\mathbf {U}^*\Vert _F\le C\), \(\Vert \mathbf {Z}^k-\mathbf {U}^*\Vert _F\le C\) and \(\Vert \mathbf {U}^k-\mathbf {U}^*\Vert _F\le C\), then Theorem 1 and Corollary 1 hold. From the standard analysis of the accelerated gradient method for convex programming, e.g., Proposition 1 in Tseng (2008), we can have
Step 3 Since Theorem 1 and Corollary 1 hold only in a local neighborhood of \(\mathbf {U}^*\), we need to check that \(\{\mathbf {U}^k,\mathbf {V}^k,\mathbf {Z}^k\}\) belongs to this neighborhood for all the iterations, which can be easily done via induction. In fact, from (15) and the convex combinations in (11) and (13), we know that if the following conditions hold,
then we can have
Step 4 From \(\frac{1}{\theta _{-1}}=0\) and Step 3, we know (15) holds for all the iterations. Thus, we have
where we use \(\theta _k\le \frac{2}{k+1}\) from \(\frac{1-\theta _{k+1}}{\theta _{k+1}^2}=\frac{1}{\theta _k^2}\) and \(\theta _0=1\).
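The bound \(\theta _k\le \frac{2}{k+1}\) used here can be checked numerically for the recursion \(\theta _{k+1}=\frac{\sqrt{\theta _{k}^4+4\theta _{k}^2}-\theta _{k}^2}{2}\) stated in Sect. 3:

```python
import math

theta = 1.0  # theta_0 = 1
for k in range(1, 200):
    # this update solves (1 - theta_{k+1}) / theta_{k+1}^2 = 1 / theta_k^2
    theta = (math.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2
    assert theta <= 2.0 / (k + 1)
```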
On the other hand, from the perturbation theorem of singular values, we have
which leads to \(\sigma _r(\mathbf {U}_{ S'}^{K+1})\ge 0.99\sigma _r(\mathbf {U}_{ S'}^*)\ge \epsilon \). \(\square \)
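The perturbation theorem invoked here is Weyl's inequality for singular values, \(|\sigma _i(\mathbf {A}+\mathbf {E})-\sigma _i(\mathbf {A})|\le \Vert \mathbf {E}\Vert _2\). A minimal NumPy check (dimensions and perturbation scale are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 5
U = rng.standard_normal((n, r))
E = 0.01 * rng.standard_normal((n, r))  # small perturbation

s_U = np.linalg.svd(U, compute_uv=False)
s_UE = np.linalg.svd(U + E, compute_uv=False)

# Weyl's inequality: |sigma_i(U+E) - sigma_i(U)| <= ||E||_2 for every i
gap = np.max(np.abs(s_UE - s_U))
spec_norm_E = np.linalg.svd(E, compute_uv=False)[0]
assert gap <= spec_norm_E + 1e-12
print(f"max singular-value shift {gap:.4f} <= ||E||_2 = {spec_norm_E:.4f}")
```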
Now we consider the outer loop of Algorithm 1. Based on Lemma 4, the second order growth property (5) and the perturbation theory of the polar decomposition, we can establish the exponential decrease of \(\Vert \mathbf {U}^{t,0}-\mathbf {U}^{t,*}\Vert _F\) in the following lemma.
Lemma 5
Let \(\mathbf {U}^{t,*}=\varOmega _{ S}\cap \mathcal {X}^*\) and \(\mathbf {U}^{t+1,*}=\varOmega _{ S'}\cap \mathcal {X}^*\) and assume that \(\mathbf {U}^{t,0}\in \varOmega _{ S}\) with \(\epsilon \le 0.99\sigma _r(\mathbf {U}_{ S'}^{t,*})\) and \(\Vert \mathbf {U}^{t,0}-\mathbf {U}^{t,*}\Vert _F\le C\). Let \(K+1=\frac{28\Vert \mathbf {U}^{*}\Vert _2}{\sqrt{\eta \mu }\sigma _r(\mathbf {U}^{*})\min \{\sigma _r(\mathbf {U}_{ S^1}^{*}),\sigma _r(\mathbf {U}_{ S^2}^{*})\}}\). Then, we can have \(\mathbf {U}^{t+1,0}\in \varOmega _{ S'}\) and
Proof
We follow four steps to prove the lemma.
Step 1 From Lemma 4, we have \(\sigma _r(\mathbf {U}_{ S'}^{t,K+1})\ge \epsilon \), \(\Vert \mathbf {U}^{t,K+1}-\mathbf {U}^{t,*}\Vert _F\le C\) and
From Algorithm 1, we have \(\sigma _r(\mathbf {U}_{ S'}^{t+1,0})=\sigma _r(\mathbf {U}_{ S'}^{t,K+1})\). So \(\mathbf {U}_{ S'}^{t+1,0}\succeq \epsilon \mathbf {I}\) and \(\mathbf {U}^{t+1,0}\in \varOmega _{ S'}\).
Step 2 From Lemma 11 in “Appendix B”, we have
where \({\hat{\mathbf {U}}}^{t,*}=P_{\mathcal {X}^*}(\mathbf {U}^{t,K+1})=\mathbf {U}^{t,*}\mathbf {R}\) and \(\mathbf {R}=\text{ argmin }_{\mathbf {R}\mathbf {R}^T=\mathbf {I}}\Vert \mathbf {U}^{t,*}\mathbf {R}-\mathbf {U}^{t,K+1}\Vert _F^2\).
Step 3 Given (17) and (18), in order to prove (16), we only need to lower bound \(\Vert \mathbf {U}^{t,K+1}-{\hat{\mathbf {U}}}^{t,*}\Vert _F\) by \(\Vert \mathbf {U}^{t+1,0}-\mathbf {U}^{t+1,*}\Vert _F\).
From Algorithm 1, we know that \(\mathbf {H}\mathbf {Q}=\mathbf {U}_{ S'}^{t,K+1}\) is the unique polar decomposition of \(\mathbf {U}_{ S'}^{t,K+1}\) and \(\mathbf {U}^{t+1,0}=\mathbf {U}^{t,K+1}\mathbf {Q}^T\). Let \(\mathbf {H}^*\mathbf {Q}^*={\hat{\mathbf {U}}}_{ S'}^{t,*}\) be its unique polar decomposition and \(\mathbf {U}^{t+1,*}={\hat{\mathbf {U}}}^{t,*}(\mathbf {Q}^*)^T\), then \(\mathbf {U}^{t+1,*}\in \varOmega _{ S'}\cap \mathcal {X}^*\). From the perturbation theorem of polar decomposition in Lemma 1, we have
Similar to (9), we have
Step 4 Combining (17), (18) and (19), we have
From the setting of \(K+1\), we can have the conclusion. \(\square \)
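The re-orthogonalization between outer loops rests on the polar decomposition \(\mathbf {H}\mathbf {Q}=\mathbf {U}_{ S'}^{t,K+1}\), which can be computed from an \(r\times r\) SVD. A minimal NumPy sketch with a random full-rank stand-in for \(\mathbf {U}_{ S'}^{t,K+1}\):

```python
import numpy as np

rng = np.random.default_rng(1)
r = 5
M = rng.standard_normal((r, r))  # stands in for U_{S'}^{t,K+1}, assumed full rank

# Polar decomposition M = H Q via the SVD M = A Sigma B^T:
# H = A Sigma A^T (symmetric positive definite), Q = A B^T (orthogonal)
A, sigma, Bt = np.linalg.svd(M)
H = A @ np.diag(sigma) @ A.T
Q = A @ Bt

assert np.allclose(H @ Q, M)                    # it is a factorization of M
assert np.allclose(Q @ Q.T, np.eye(r))          # Q is orthogonal
assert np.all(np.linalg.eigvalsh(H) > 1e-10)    # H is positive definite, so the decomposition is unique
```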
From Lemma 5, we can give the accelerated convergence rate in the following theorem. The proof is provided in "Appendix B". Theorem 3 involves several assumptions. For the trajectories, we assume that we can find two disjoint sets \(S^1\) and \(S^2\) such that \(\sigma _r(\mathbf {U}_{ S^1}^*)\) and \(\sigma _r(\mathbf {U}_{ S^2}^*)\) are as large as possible (please see Sect. 3.1 for the discussion). For the initialization, we assume that we can find an initial point \(\mathbf {U}^{0,0}\) close enough to \(\mathbf {U}^{0,*}\) (please see Sect. 3.2 for the discussion). Then, we can prove that when the outer iteration number t is odd, \(\mathbf {U}^{t,k}\) belongs to \(\varOmega _{S^1}\) and the iterates converge to the optimal solution in \(\varOmega _{S^1}\cap \mathcal {X}^*\). When t is even, the iterates belong to \(\varOmega _{S^2}\) and converge to another optimal solution in \(\varOmega _{S^2}\cap \mathcal {X}^*\). In our algorithm, we set \(\eta \) and K based on reliable knowledge of \(\Vert \mathbf {U}^*\Vert _2\), \(\sigma _r(\mathbf {U}^*)\) and \(\sigma _r(\mathbf {U}_{S}^*)\). As suggested by Bhojanapalli et al. (2016a) and Park et al. (2018), these quantities can be estimated, up to constants, by \(\Vert \mathbf {U}^0\Vert _2\), \(\sigma _r(\mathbf {U}^0)\) and \(\sigma _r(\mathbf {U}_S^0)\), since \(\mathbf {U}^0\) is close to \(\mathbf {U}^*\).
Theorem 3
Let \(\mathbf {U}^{t,*}=\varOmega _{S^1}\cap \mathcal {X}^*\) when t is odd and \(\mathbf {U}^{t,*}=\varOmega _{S^2}\cap \mathcal {X}^*\) when t is even. Assume that \(\mathbf {U}^*\in \mathcal {X}^*\) and \(\mathbf {U}^{0,0}\in \varOmega _{S^2}\) with \(\Vert \mathbf {U}^{0,0}-\mathbf {U}^{0,*}\Vert _F\le C\) and \(\epsilon \le \min \{0.99\sigma _r(\mathbf {U}^*_{ S^1}),0.99\sigma _r(\mathbf {U}^*_{ S^2})\}\). Then, we have
and
where \(\mu _g=\frac{\mu \sigma _r^2(\mathbf {U}^*)\min \{\sigma _r^2(\mathbf {U}_{ S^1}^*),\sigma _r^2(\mathbf {U}_{ S^2}^*)\}}{25\Vert \mathbf {U}^*\Vert _2^2}\), \(L_g=38L\Vert \mathbf {U}^*\Vert _2^2+2\Vert \nabla f(\mathbf {X}^*)\Vert _2\) and \(C=\frac{\mu \sigma _r^2(\mathbf {U}^*)\min \{\sigma _r^2(\mathbf {U}_{S^1}^*),\sigma _r^2(\mathbf {U}_{S^2}^*)\}}{100L\Vert \mathbf {U}^*\Vert _2^3}\).
4.1 Comparison to the gradient descent
Bhojanapalli et al. (2016a) used the gradient descent to solve problem (2), which consists of the following recursion:
With the restricted strong convexity and smoothness of \(f(\mathbf {X})\), Bhojanapalli et al. (2016a) proved the linear convergence of gradient descent in the form of
As a comparison, from Theorem 3, our method converges linearly within the error of \(\left( 1-\frac{\sigma _r(\mathbf {U}^*)\min \{\sigma _r(\mathbf {U}_{ S^1}^*),\sigma _r(\mathbf {U}_{ S^2}^*)\}}{\Vert \mathbf {U}^*\Vert _2^2}\sqrt{\frac{\mu }{L+\Vert \nabla f(\mathbf {X}^*)\Vert _2/\Vert \mathbf {U}^*\Vert _2^2}}\right) ^N\), where N is the total number of inner iterations. From Lemma 3, we know \(\sigma _r(\mathbf {U}^*_{ S})\approx \frac{1}{\sqrt{rn}}\sigma _r(\mathbf {U}^*)\) in the worst case and it is tight (Avron and Boutsidis 2013). Thus, our method has the convergence rate of \(\left( 1-\frac{\sigma _r^2(\mathbf {U}^*)}{\Vert \mathbf {U}^*\Vert _2^2}\sqrt{\frac{\mu }{nr(L+\Vert \nabla f(\mathbf {X}^*)\Vert _2/\Vert \mathbf {U}^*\Vert _2^2)}}\right) ^N\) in the worst case. When the function f is ill-conditioned, i.e., \(\frac{L}{\mu }\ge nr\), our method outperforms the gradient descent. This phenomenon is similar to the case observed in the stochastic optimization community: the non-accelerated methods such as SDCA (Shalev-Shwartz and Zhang 2013), SVRG (Xiao and Zhang 2014) and SAG (Schmidt et al. 2017) have the complexity of \(O\left( \frac{L}{\mu }\log \frac{1}{\epsilon }\right) \) while the accelerated methods such as Accelerated SDCA (Shalev-Shwartz and Zhang 2016), Catalyst (Lin et al. 2015) and Katyusha (Allen-Zhu 2017) have the complexity of \(O\left( \sqrt{\frac{mL}{\mu }}\log \frac{1}{\epsilon }\right) \), where m is the sample size. The latter is tight when \(\frac{L}{\mu }\ge m\) for stochastic programming (Woodworth and Srebro 2016). In matrix completion, the optimal sample complexity is \(O(rn\log n)\) (Candès and Recht 2009). It is unclear whether our convergence rate for problem (2) is tight or there exists a faster method. We leave it as an open problem.
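To make the comparison concrete, ignoring constants and the \(\sigma _r/\Vert \mathbf {U}^*\Vert _2\) factors common to both rates, the iteration counts behave as \(O(\kappa )\) for gradient descent versus \(O(\sqrt{nr\kappa })\) for our method in the worst case, where \(\kappa =L/\mu \). A back-of-the-envelope sketch (sizes are made up):

```python
import math

def iters(rho, eps=1e-6):
    """Iterations N such that (1 - rho)^N <= eps, i.e. N ~ log(1/eps) / rho."""
    return math.ceil(math.log(1 / eps) / rho)

n, r = 10_000, 10  # illustrative problem sizes
for kappa in (1e2, 1e5, 1e8):  # condition number L / mu
    n_gd = iters(1.0 / kappa)                      # gradient descent: O(kappa)
    n_agd = iters(1.0 / math.sqrt(n * r * kappa))  # this method, worst case: O(sqrt(n r kappa))
    print(f"kappa={kappa:.0e}: GD {n_gd}, AGD {n_agd}")
# acceleration pays off once kappa >= n * r (here 1e5), matching the discussion above
```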
For ease of reference, we summarize the comparisons in Table 1. We can see that our method has the same optimal dependence on \(\sqrt{\frac{L}{\mu }}\) as convex programming.
4.1.1 Dropping the dependence on n
Our convergence rate has an additional dependence on n compared with the gradient descent method. It comes from \(\sigma _r(\mathbf {U}^*_S)\), i.e., Lemma 2. In fact, we use a loose relaxation in the last inequality of (9), i.e., \(\frac{2\Vert \mathbf {U}\Vert _2}{\sigma _r(\mathbf {U}_S)}\Vert {\hat{\mathbf {V}}}_S-\mathbf {U}_S\Vert _F\le \frac{2\Vert \mathbf {U}\Vert _2}{\sigma _r(\mathbf {U}_S)}\Vert {\hat{\mathbf {V}}}-\mathbf {U}\Vert _F\). Since \(\mathbf {U}_S\in \mathbb {R}^{r\times r}\) and \(\mathbf {U}\in \mathbb {R}^{n\times r}\), a more suitable estimation should be
In practice, (21) holds when the entries of \(\mathbf {U}^{t,k}\) and \(\mathbf {V}^{t,k}\) converge nearly equally fast to those of \(\mathbf {U}^{t,*}\), which is often the case. Thus, under condition (21), our convergence rate improves to
We numerically verify (21) in Sect. 8.4.
4.1.2 Examples with ill-conditioned objective f
Although the condition number \(\frac{L}{\mu }\) is close to 1 for some well-known problems in machine learning, e.g., matrix regression and matrix completion (Chen and Wainwright 2015), many problems have ill-conditioned objectives, especially in computer vision applications. We give the example of low rank representation (LRR) (Liu et al. 2013), a well-known model in computer vision. It can be formulated as
where \(\mathbf {A}\) is the observed data and \(\mathbf {D}\) is a dictionary that linearly spans the data space. We can reformulate the problem as follows:
We know \(L/\mu =\kappa (\mathbf {D}^T\mathbf {D})\), i.e., the condition number of \(\mathbf {D}^T\mathbf {D}\). If we generate \(\mathbf {D}\in \mathbb {R}^{n\times n}\) as a random matrix with i.i.d. standard normal entries, then \(E\left[ \text{ log }\kappa (\mathbf {D})\right] \sim \text{ log }n\) as \(n\rightarrow \infty \) (Edelman 1988) and thus \(E\left[ \frac{L}{\mu }\right] \sim n^2\). We verify numerically in MATLAB that if \(n=1000\), then \(\frac{L}{\mu }\) is of the order \(10^7\), which is much larger than O(n).
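This growth is easy to reproduce. The following NumPy sketch (with a smaller \(n\) than in the text, to keep the check cheap) uses the identity \(\kappa (\mathbf {D}^T\mathbf {D})=\kappa (\mathbf {D})^2\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # smaller than the n = 1000 in the text, purely for speed
D = rng.standard_normal((n, n))

kappa_D = np.linalg.cond(D)
kappa_DtD = np.linalg.cond(D.T @ D)

# kappa(D^T D) = kappa(D)^2, and kappa(D) grows roughly linearly in n,
# so L / mu = kappa(D^T D) ~ n^2 for a random dictionary D
assert np.isclose(kappa_DtD, kappa_D ** 2, rtol=1e-3)
assert kappa_DtD > n  # far more ill-conditioned than O(n)
print(f"kappa(D) = {kappa_D:.1f}, kappa(D^T D) = {kappa_DtD:.1f}")
```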
Another example is the reduced rank logistic generalized linear model (RR-LGLM) (Yee and Hastie 2000; She 2013). Assume that the entries of \(\mathbf {A}\) are all binary and denote \(\mathbf {D}=[\mathbf {d}_1,\ldots ,\mathbf {d}_n]^T\) and \(\mathbf {X}=[\mathbf {x}_1,\ldots ,\mathbf {x}_n]\). RR-LGLM minimizes
The Hessian of the objective is \(\text{ diag }(\mathbf {D}^T\mathbf {G}_1\mathbf {D},\ldots ,\mathbf {D}^T\mathbf {G}_n\mathbf {D})\), where \(\mathbf {G}_j\) is the \(n\times n\) diagonal matrix whose i-th component is \(\frac{\exp (\mathbf {d}_i^T\mathbf {x}_j)}{(1+\exp (\mathbf {d}_i^T\mathbf {x}_j))^2}\). Thus, \(L/\mu \) is at least \(\kappa (\mathbf {D}^T\mathbf {D})\). As discussed above, it may be much larger than n. Other similar examples can be found in Wagner and Zuk (2015) and Liu and Li (2016).
5 Global convergence
In this section, we study the global convergence of Algorithm 1 without the assumption that \(f(\mathbf {X})\) is restricted strongly convex, allowing the algorithm to start from any initializer. Since we have no information about \(\mathbf {U}^*\) when \(\mathbf {U}^0\) is far from \(\mathbf {U}^*\), we use an adaptive index set selection procedure in Algorithm 1. That is, after each inner loop, we check whether \(\sigma _r(\mathbf {U}^{t,K+1}_{S'})\ge \epsilon \) holds. If not, we select a new index set \(S'\) using the volume sampling subset selection algorithm.
We first consider the inner loop and establish Lemma 6. We drop the outer iteration number t for simplicity and leave the proof in “Appendix C”.
Lemma 6
Assume that \(\{\mathbf {U}^k,\mathbf {V}^k\}\) is bounded and \(\mathbf {U}^0\in \varOmega _{ S}\). Let \(\eta \le \frac{1-\beta _{\max }^2}{{\hat{L}}(2\beta _{\max }+1)+2\gamma }\), where \({\hat{L}}=2D+4LM^2\), \(D=\max \{\Vert \nabla f(\mathbf {U}^k(\mathbf {U}^k)^T)\Vert _2,\Vert \nabla f(\mathbf {V}^k(\mathbf {V}^k)^T)\Vert _2,\forall k\}\), \(M=\max \{\Vert \mathbf {U}^k\Vert _2,\Vert \mathbf {V}^k\Vert _2,\forall k\}\), \(\beta _{\max }=\max \left\{ \beta _k,k=0,\ldots ,K\right\} \), \(\beta _k=\frac{\theta _k(1-\theta _{k-1})}{\theta _{k-1}}\) and \(\gamma \) is a small constant. Then, we have
Now we consider the outer loop. As discussed in Sect. 3, when solving problem (8) directly, we may get stuck at the boundary of the constraint. Thanks to the alternating constraint strategy, we can cancel the negative influence of the constraint and establish the global convergence to a critical point of problem (2), which is described in Theorem 4. It establishes that after at most \(O\left( \frac{1}{\varepsilon ^2}\log \frac{1}{\varepsilon }\right) \) operations, \(\mathbf {U}^{T,K+1}\) is an approximate zero gradient point in the precision of \(\varepsilon \). Briefly speaking, since the projection operation in (12) only influences the rows indicated by the index set S, a simple calculation yields that \(\Vert (\nabla g(\mathbf {Z}^{t,K+1}))_{- S^1}\Vert _F\le O(\varepsilon )\) and \(\Vert (\nabla g(\mathbf {Z}^{t,K+1}))_{- S^2}\Vert _F\le O(\varepsilon )\). From \(S^1\cap S^2=\emptyset \), we have \(\Vert \nabla g(\mathbf {Z}^{t,K+1})\Vert _F\le \Vert (\nabla g(\mathbf {Z}^{t,K+1}))_{- S^1}\Vert _F+\Vert (\nabla g(\mathbf {Z}^{t,K+1}))_{- S^2}\Vert _F\le O(\varepsilon )\), which explains why the alternating constraint strategy avoids the boundary of the constraint.
Theorem 4
Assume that \(\{\mathbf {U}^{t,k},\mathbf {V}^{t,k}\}\) is bounded and \(\sigma _r(\mathbf {U}_{S'}^{t,K+1})\ge \epsilon ,\forall t\). Let \(\eta \) be the one defined in Lemma 6. Then, after at most \(T=2\frac{f(\mathbf {U}^{t,0}(\mathbf {U}^{t,0})^T)-f(\mathbf {X}^*)}{\varepsilon ^2}\) outer iterations, we have
with probability \(1-\delta \). The volume sampling subset selection algorithm needs \(O\left( nr^3\log \left( \frac{f(\mathbf {U}^{t,0}(\mathbf {U}^{t,0})^T)-f(\mathbf {X}^*)}{\delta \varepsilon ^2}\right) \right) \) operations for each run.
Proof
We follow three steps to prove the theorem.
Step 1 Firstly, we bound the difference of two consecutive variables, i.e., \(\mathbf {U}^{t,k+1}-\mathbf {U}^{t,k}\).
From Lemma 6 we have
Summing over \(t=0,\ldots ,T\) yields
So after \(T=2\frac{g(\mathbf {U}^{t,0})-f(\mathbf {X}^*)}{\varepsilon ^2}\) outer iterations, we must have
for some \(t<T\). Thus, we can bound \(\Vert \mathbf {U}^{t',k+1}-\mathbf {U}^{t',k}\Vert _F\) by \(\varepsilon \), where \(t'=t\) or \(t'=t+1\). Moreover, from Lemma 13 in “Appendix C”, we can bound \(\Vert \mathbf {U}^{t',k+1}-\mathbf {Z}^{t',k+1}\Vert _F\), \(\Vert \mathbf {Z}^{t',k+1}-\mathbf {Z}^{t',k}\Vert _F\) and \(\Vert \mathbf {Z}^{t',k+1}-\mathbf {V}^{t',k}\Vert _F\) by \(\frac{\varepsilon }{\theta _k}\).
Step 2 Secondly, we bound parts of elements of the gradient, i.e., \(\left( \nabla g(\mathbf {Z}^{t,K+1})\right) _{- S^1}\) and \(\left( \nabla g(\mathbf {Z}^{t,K+1})\right) _{- S^2}\).
From the optimality condition of (12), we have
for \(j=1\) when \(t'=t\) and \(j=2\) when \(t'=t+1\). From Lemmas 10 and 13, we can easily check that
Thus, we obtain
Since \(\partial I_{\varOmega _{ S^j}}(\mathbf {Z}^{t',k+1})\) has zero elements for the rows indicated by the indexes out of \(S^j\), we can have
and
On the other hand,
where we use Lemma 13 in the last inequality. Combining (24) and (25), we obtain
Since \(\mathbf {Z}^{t+1,0}=\mathbf {Z}^{t,K+1}\mathbf {Q}^T\) for some orthogonal \(\mathbf {Q}\), we can have
Step 3 We bound all the elements of the gradient. Recall that we require \(S^1\cap S^2=\emptyset \). Thus, we have \(- S^1\cup - S^2=\{1,2,\ldots ,n\}\). Then, from (23) and (26), we have
At last, we can bound \(\left\| \nabla g(\mathbf {U}^{t,K+1})\right\| _F\) from Lemmas 10 and 13.
From Algorithm 1, we know that the index set is selected at most T times. Each run of the volume sampling subset selection algorithm succeeds with probability \(1-\delta '\), so the whole procedure succeeds with probability at least \(1-T\delta '=1-\delta \). On the other hand, the volume sampling subset selection algorithm needs \(O\left( nr^3\log \left( \frac{1}{\delta '}\right) \right) =O\left( nr^3\log \left( \frac{T}{\delta }\right) \right) =O\left( nr^3\log \left( \frac{f(\mathbf {U}^{t,0}(\mathbf {U}^{t,0})^T)-f(\mathbf {U}^*{\mathbf {U}^*}^T)}{\delta \varepsilon ^2}\right) \right) \) operations for each run. \(\square \)
6 Minimizing (2) directly without the constraint
One may question the necessity of the constraint in problem (8) and wonder about the performance of the classical accelerated gradient method applied directly to problem (2). In this case, the classical accelerated gradient method (Nesterov 1983, 1988; Tseng 2008) becomes
and it is equivalent to
where \(\beta _k\) is defined in Lemma 6. Another choice is a constant \(\beta <1\). Theorem 5 establishes the convergence rate for the above two recursions. We leave the proof in "Appendix D".
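For concreteness, recursions (30), (31) with a constant momentum \(\beta <1\) take the familiar form \(\mathbf {V}^k=\mathbf {U}^k+\beta (\mathbf {U}^k-\mathbf {U}^{k-1})\), \(\mathbf {U}^{k+1}=\mathbf {V}^k-\eta \nabla g(\mathbf {V}^k)\). A minimal NumPy sketch on the toy objective \(f(\mathbf {X})=\frac{1}{2}\Vert \mathbf {X}-\mathbf {X}^*\Vert _F^2\) (an assumed example; step size and momentum are hand-picked for this instance):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 30, 3
Ustar = rng.standard_normal((n, r))
Xstar = Ustar @ Ustar.T

def grad_g(U):
    # For the toy f(X) = 0.5 ||X - X*||_F^2, grad g(U) = 2 (U U^T - X*) U
    return 2.0 * (U @ U.T - Xstar) @ U

eta, beta = 1e-3, 0.9  # constant momentum beta < 1
U = Uprev = Ustar + 0.1 * rng.standard_normal((n, r))  # start near the optimum
for _ in range(300):
    V = U + beta * (U - Uprev)          # extrapolation step, as in (30)
    Uprev, U = U, V - eta * grad_g(V)   # gradient step, as in (31)

err0 = 0.5 * np.linalg.norm(Xstar, "fro") ** 2  # error of the zero matrix, for scale
err = 0.5 * np.linalg.norm(U @ U.T - Xstar, "fro") ** 2
assert err < 1e-6 * err0  # converges without any constraint on this easy instance
```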
Theorem 5
Assume that \(\mathbf {U}^*\in \mathcal {X}^*\) and \(\mathbf {V}^k\in \mathbb {R}^{n\times r}\) satisfy \(\Vert \mathbf {V}^k-P_{\mathcal {X}^*}(\mathbf {V}^k)\Vert _F\le \min \left\{ 0.01\sigma _r(\mathbf {U}^*), \frac{\mu \sigma _r^2(\mathbf {U}^*)}{6L\Vert \mathbf {U}^*\Vert _2}\right\} \). Let \(\eta \) be the one in Lemma 6. Then, we can have
where \(\gamma =\frac{1-\beta _{\max }^2}{4\eta }-\frac{\beta _{\max }{\hat{L}}}{2}-\frac{{\hat{L}}}{4}>0\) and \(\nu =\frac{1+\beta _{\max }^2}{4\eta }-\frac{{\hat{L}}}{4}>0\).
Consider the case that \(\beta _k\) is a constant. Then, we know that all of the constants \(\gamma ,\nu ,{\hat{L}}\) and \(\frac{1}{\eta }\) are of the order \(O\left( L\Vert \mathbf {U}^*\Vert _2^2+\Vert \nabla f(\mathbf {X}^*)\Vert _2\right) \). Thus, the convergence rate of recursion (30), (31) is in the form of
which is the same as that of the gradient descent method in (20). Thus, although the convergence of the classical accelerated gradient method for problem (2) can be proved, it is not easy to establish acceleration over gradient descent. As a comparison, Algorithm 1 has a theoretically better dependence on the condition number \(\frac{L}{\mu }\). Thus, reformulating problem (2) as a constrained problem is necessary to prove acceleration.
7 The asymmetric case
In this section, we consider the asymmetric case of problem (1):
where there exists a minimizer \({\widetilde{\mathbf {X}}}^*\) of rank-r. We follow Park et al. (2018) to assume \(\nabla f({\widetilde{\mathbf {X}}}^*)=0\). In the asymmetric case, we can factorize \({\widetilde{\mathbf {X}}}={\widetilde{\mathbf {U}}}{\widetilde{\mathbf {V}}}^T\) and reformulate problem (32) as a similar problem to (2). Moreover, we follow Park et al. (2018), Wang et al. (2017) to regularize the objective and force the solution pair \(({\widetilde{\mathbf {U}}},{\widetilde{\mathbf {V}}})\) to be balanced. Otherwise, the problem may be ill-conditioned since \(\left( \frac{1}{\delta }{\widetilde{\mathbf {U}}}\right) (\delta {\widetilde{\mathbf {V}}})\) is also a factorization of \({\widetilde{\mathbf {U}}}{\widetilde{\mathbf {V}}}^T\) for any large \(\delta \) (Park et al. 2018). Specifically, we consider the following problem
Let \({\widetilde{\mathbf {X}}}^*=\mathbf {A}\varSigma \mathbf {B}^T\) be its SVD. Then, \(({\widetilde{\mathbf {U}}}^*=\mathbf {A}\sqrt{\varSigma },{\widetilde{\mathbf {V}}}^*=\mathbf {B}\sqrt{\varSigma })\) is a minimizer of problem (33). Define a stacked matrix \(\mathbf {U}=\left( \begin{array}{c} {\widetilde{\mathbf {U}}} \\ {\widetilde{\mathbf {V}}} \end{array} \right) \) and let \(\mathbf {X}=\mathbf {U}\mathbf {U}^T=\left( \begin{array}{cc} {\widetilde{\mathbf {U}}}{\widetilde{\mathbf {U}}}^T &{} {\widetilde{\mathbf {U}}}{\widetilde{\mathbf {V}}}^T\\ {\widetilde{\mathbf {V}}}{\widetilde{\mathbf {U}}}^T &{} {\widetilde{\mathbf {V}}}{\widetilde{\mathbf {V}}}^T \end{array} \right) \). Then we can write the objective in (33) in the form of \({\hat{f}}(\mathbf {X})\), defined as \({\hat{f}}(\mathbf {X})=f({\widetilde{\mathbf {U}}}{\widetilde{\mathbf {V}}}^T)+\frac{\mu }{8}\Vert {\widetilde{\mathbf {U}}}{\widetilde{\mathbf {U}}}^T\Vert _F^2+\frac{\mu }{8}\Vert {\widetilde{\mathbf {V}}}{\widetilde{\mathbf {V}}}^T\Vert _F^2-\frac{\mu }{4}\Vert {\widetilde{\mathbf {U}}}{\widetilde{\mathbf {V}}}^T\Vert _F^2\). Since f is restricted \(\mu \)-strongly convex, we can easily check that \({\hat{f}}(\mathbf {X})\) is restricted \(\frac{\mu }{4}\)-strongly convex. On the other hand, we know that \({\hat{f}}(\mathbf {X})\) is restricted \(\left( L+\frac{\mu }{2}\right) \)-smooth. Applying the conclusions on the symmetric case to \({\hat{f}}(\mathbf {X})\), we can apply Algorithm 1 to the asymmetric case. From Theorem 3, we can get the convergence rate. Moreover, since \(\sigma _i(\mathbf {X}^*)=2\sigma _i({\widetilde{\mathbf {X}}}^*)\),
and \(\Vert \nabla {\hat{f}}(\mathbf {X}^*)\Vert _2=\frac{\mu }{4}\Vert \mathbf {X}^*\Vert _2\), where \(\mathbf {X}^*=\left( \begin{array}{cc} {\widetilde{\mathbf {U}}}^*{\widetilde{\mathbf {U}}}^{*T} &{} {\widetilde{\mathbf {U}}}^*{\widetilde{\mathbf {V}}}^{*T}\\ {\widetilde{\mathbf {V}}}^*{\widetilde{\mathbf {U}}}^{*T} &{} {\widetilde{\mathbf {V}}}^*{\widetilde{\mathbf {V}}}^{*T} \end{array} \right) \), we can simplify the worst-case convergence rate to \(\left( 1-\frac{\sigma _r({\widetilde{\mathbf {X}}}^*)}{\Vert {\widetilde{\mathbf {X}}}^*\Vert _2}\sqrt{\frac{\mu }{(m+n)rL}}\right) ^N\). As a comparison, the rate of the gradient descent is \(\left( 1-\frac{\sigma _r({\widetilde{\mathbf {X}}}^*)}{\Vert {\widetilde{\mathbf {X}}}^*\Vert _2}\frac{\mu }{L}\right) ^N\) (Park et al. 2018).
In the asymmetric case, both \({\widetilde{\mathbf {U}}}^*\) and \({\widetilde{\mathbf {V}}}^*\) are of full rank. Otherwise, \(\text{ rank }({\widetilde{\mathbf {X}}}^*)<r\). Thus, we can select the index set \( S^1\) from \({\widetilde{\mathbf {U}}}^0\) and select \( S^2\) from \({\widetilde{\mathbf {V}}}^0\) with the guarantee of \(\sigma _r({\widetilde{\mathbf {U}}}^0_{ S^1})\ge \frac{\sigma _r({\widetilde{\mathbf {U}}}^0)}{\sqrt{2r(n-r+1)}}\) and \(\sigma _r({\widetilde{\mathbf {V}}}^0_{ S^2})\ge \frac{\sigma _r({\widetilde{\mathbf {V}}}^0)}{\sqrt{2r(m-r+1)}}\).
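The balanced factorization and the identity \(\sigma _i(\mathbf {X}^*)=2\sigma _i({\widetilde{\mathbf {X}}}^*)\) used above are easy to verify numerically. A minimal NumPy sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 40, 30, 4
Xt = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r target X~*

# Balanced factorization from the SVD: U~* = A sqrt(Sigma), V~* = B sqrt(Sigma)
A, s, Bt = np.linalg.svd(Xt, full_matrices=False)
A, s, Bt = A[:, :r], s[:r], Bt[:r]
Ut = A * np.sqrt(s)
Vt = Bt.T * np.sqrt(s)

assert np.allclose(Ut @ Vt.T, Xt)         # exact factorization of X~*
assert np.allclose(Ut.T @ Ut, Vt.T @ Vt)  # balanced: equal Gram matrices

# Stacked matrix U and X = U U^T from the reduction to the symmetric case
U = np.vstack([Ut, Vt])
X = U @ U.T
assert np.allclose(X[:m, m:], Xt)                                  # off-diagonal block is U~ V~^T
assert np.allclose(np.linalg.svd(X, compute_uv=False)[:r], 2 * s)  # sigma_i(X*) = 2 sigma_i(X~*)
```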
8 Experiments
In this section, we test the efficiency of the proposed accelerated gradient descent (AGD) method on Matrix Completion, One Bit Matrix Completion and Matrix Regression.
8.1 Matrix completion
In matrix completion (Rohde and Tsybakov 2011; Koltchinskii et al. 2011; Negahban and Wainwright 2012), the goal is to recover the low rank matrix \(\mathbf {X}^*\) from a set of randomly observed entries \({\mathbf {O}}\). The traditional matrix completion problem solves the following model:
We consider the asymmetric case and solve the following model:
We set \(r=10\) and test the algorithms on the Movielen-10M, Movielen-20M and Netflix data sets. The corresponding observed matrices are of size \(69878\times 10677\) with \(o\%=1.34\%\), \(138493\times 26744\) with \(o\%=0.54\%\) and \(480189\times 17770\) with \(o\%=1.18\%\), respectively, where \(o\%\) means the percentage of the observed entries. We compare AGD and AGD-adp (AGD with adaptive index set selection) with GD and several variants of the original AGD:
1. AGD-original1: the classical AGD with recursions (30), (31).
2. AGD-original1-r: AGD-original1 with restart.
3. AGD-original1-f: AGD-original1 with fixed \(\beta _k\) of \(\frac{\sqrt{L}-\sqrt{\mu }}{\sqrt{L}+\sqrt{\mu }}\).
4. AGD-original2: the classical AGD with recursions (27)–(29).
5. AGD-original2-r: AGD-original2 with restart.
6. AGD-original2-f: AGD-original2 with fixed \(\theta \).
Let \(\mathbf {X}_{{\mathbf {O}}}\) be the observed data and \(\mathbf {A}\varvec{\varSigma }\mathbf {B}^T\) be its SVD. We initialize \({\widetilde{\mathbf {U}}}=\mathbf {A}_{:,1:r}\sqrt{\varvec{\varSigma }_{1:r,1:r}}\) and \({\widetilde{\mathbf {V}}}=\mathbf {B}_{:,1:r}\sqrt{\varvec{\varSigma }_{1:r,1:r}}\) for all the compared methods. Since \(\mathbf {X}_{{\mathbf {O}}}\) is sparse, it is efficient to find the top r singular values and the corresponding singular vectors even for large-scale matrices (Larsen 1998). We tune the step sizes to \(\eta =5\times 10^{-5},4\times 10^{-5}\) and \(1\times 10^{-5}\) on the three data sets, respectively, for all the compared methods. For AGD, we set \(\epsilon =10^{-10}\), \( S^1=\{1:r\}\) and \( S^2=\{r+1:2r\}\) for simplicity. We set \(K = 100\) for AGD, AGD-adp and the original AGD with restart. We run the compared methods for 500 iterations on the Movielen-10M and Movielen-20M data sets and 1000 iterations on the Netflix data set.
The top part of Fig. 1 plots the curves of the training RMSE vs. time (seconds). We can see that AGD is faster than GD. The performances of AGD, AGD-adp and the original AGD are similar. In fact, in AGD-adp, we observe that the index sets do not change during the iterations. Thus, the condition \(\sigma _r(\mathbf {U}^{t,K+1}_{ S'})\ge \epsilon \) for all t in Theorem 4 holds. The original AGD performs almost as fast as our modified AGD in practice; however, it has an inferior convergence rate theoretically. The bottom part of Fig. 1 plots the curves of the testing RMSE vs. time. Besides GD, we also compare AGD with LMaFit (Wen et al. 2012), Soft-ALS (Hastie et al. 2015) and MSS (Xu et al. 2017), which all solve a factorization based nonconvex model. From Fig. 1 we can see that AGD achieves the lowest testing RMSE with the fastest speed.
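As a minimal reference implementation of the factorized matrix-completion objective \(\frac{1}{2}\Vert P_{{\mathbf {O}}}({\widetilde{\mathbf {U}}}{\widetilde{\mathbf {V}}}^T-\mathbf {X})\Vert _F^2\), the following NumPy sketch uses plain gradient descent only (no alternating constraint, no momentum, balancing regularizer omitted), on synthetic data with the SVD-based initialization described above; the step size is hand-picked for this instance:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 60, 50, 5
Xstar = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # ground-truth rank-r matrix
O = rng.random((m, n)) < 0.3                                       # observed-entry mask
Xobs = np.where(O, Xstar, 0.0)

def loss_and_grads(Ut, Vt):
    # f = 0.5 || P_O(U~ V~^T - X) ||_F^2; gradients are R V~ and R^T U~
    R = np.where(O, Ut @ Vt.T - Xobs, 0.0)  # residual on observed entries only
    return 0.5 * np.sum(R ** 2), R @ Vt, R.T @ Ut

# SVD initialization from the observed matrix, as in the experiments
A, s, Bt = np.linalg.svd(Xobs, full_matrices=False)
Ut = A[:, :r] * np.sqrt(s[:r])
Vt = Bt[:r].T * np.sqrt(s[:r])

eta = 0.01
losses = []
for _ in range(500):
    f, gU, gV = loss_and_grads(Ut, Vt)
    losses.append(f)
    Ut, Vt = Ut - eta * gU, Vt - eta * gV

assert losses[-1] < 1e-2 * losses[0]  # the training loss drops by at least two orders of magnitude
```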
8.2 One bit matrix completion
In one bit matrix completion (Davenport et al. 2014), the sign of a random subset of the entries of the unknown low rank matrix \(\mathbf {X}^*\) is observed, instead of the actual entries. Given a function f, e.g., the logistic function \(f(x)=\frac{e^x}{1+e^x}\), we observe the sign of an entry x as \(+1\) with probability f(x) and as \(-1\) with probability \(1-f(x)\). The training objective is to minimize the negative log-likelihood:
In this section, we solve the following model:
We use the Movielen-10M, Movielen-20M and Netflix data sets. We set \(\mathbf {Y}_{i,j}=1\) if the (i, j)-th observation is larger than the average of all observations and \(\mathbf {Y}_{i,j}=-1\) otherwise. We set \(r=5\) and \(\eta =0.001\), 0.001, 0.0005 for all the compared methods on the three data sets. The other experimental settings are the same as for matrix completion. We run all the methods for 500 iterations. Figure 2 plots the curves of the objective value vs. time (seconds) and we can see that AGD is again faster than GD. The performances of AGD, AGD-adp and the original AGD are nearly the same.
8.3 Matrix regression
In matrix regression (Recht et al. 2010; Negahban and Wainwright 2011), the goal is to estimate the unknown low rank matrix \(\mathbf {X}^*\) from a set of measurements \(\mathbf {y}=\mathbf {A}(\mathbf {X}^*)+\varepsilon \), where \(\mathbf {A}\) is a linear operator and \(\varepsilon \) is the noise. A reasonable estimation of \(\mathbf {X}^*\) is to solve the following rank constrained problem:
We consider the symmetric case of \(\mathbf {X}\) and solve the following nonconvex model:
We follow Bhojanapalli et al. (2016a) to use the permuted and sub-sampled noiselets (Waters et al. 2011) for the linear operator \(\mathbf {A}\), and \(\mathbf {U}^*\) is generated from the standard Gaussian distribution without noise. We set \(r=10\) and test \(n = 512\), 1024 and 2048. We fix the number of measurements to 4nr and follow Bhojanapalli et al. (2016a) to use the initializer from the eigenvalue decomposition of \(\frac{\mathbf {X}^0+(\mathbf {X}^0)^T}{2}\) for all the compared methods, where \(\mathbf {X}^0=\text{ Project }_{+}\left( \frac{-\nabla f(0)}{\Vert \nabla f(0)-\nabla f(11^T)\Vert _F}\right) \). We set \(\eta =5,10\) and 20 for \(n=512,1024\) and 2048, respectively, for all the compared methods. In AGD, we set \(\epsilon =10^{-10}\), \(K=10\), \( S^1=\{1:r\}\) and \( S^2=\{r+1:2r\}\). Figure 3 plots the curves of the objective value vs. time (seconds). We run all the compared methods for 300 iterations. We can see that AGD and the original AGD with restart perform almost equally fast, and AGD runs faster than GD and the original AGD without restart.
8.4 Verifying (21) in practice
In this section, we verify that the conditions of \(\Vert {\hat{\mathbf {U}}}_S^{t,k}-\mathbf {U}_S^*\Vert _F\le c\sqrt{\frac{r}{n}}\Vert {\hat{\mathbf {U}}}^{t,k}-\mathbf {U}^*\Vert _F\) and \(\Vert {\hat{\mathbf {V}}}_S^{t,k}-\mathbf {U}_S^*\Vert _F\le c\sqrt{\frac{r}{n}}\Vert {\hat{\mathbf {V}}}^{t,k}-\mathbf {U}^*\Vert _F\) in (21) hold in our experiments, where \({\hat{\mathbf {U}}}^{t,k}=\mathbf {U}^{t,k}\mathbf {R}\) with \(\mathbf {R}=\text{ argmin }_{\mathbf {R}\in \mathbb {R}^{r\times r},\mathbf {R}\mathbf {R}^T=\mathbf {I}}\Vert \mathbf {U}^{t,k}\mathbf {R}-\mathbf {U}^*\Vert _F^2\) and \({\hat{\mathbf {V}}}^{t,k}\) is defined similarly. We use the final output \(\mathbf {U}^{T,K+1}\) as \(\mathbf {U}^*\). Table 2 lists the results. We can see that \(\frac{\Vert {\hat{\mathbf {U}}}^{t,k}_{S}-\mathbf {U}^{*}_{S}\Vert _F}{\Vert {\hat{\mathbf {U}}}^{t,k}-\mathbf {U}^{*}\Vert _F}\) and \(\frac{\Vert {\hat{\mathbf {V}}}^{t,k}_{S}-\mathbf {U}^{*}_{S}\Vert _F}{\Vert {\hat{\mathbf {V}}}^{t,k}-\mathbf {U}^{*}\Vert _F}\) have the same order as \(\sqrt{\frac{r}{n}}\).
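Condition (21) posits that the error is spread roughly evenly across rows. As a sanity check of the \(\sqrt{r/n}\) scale, for an error matrix with i.i.d. entries the row-subset ratio concentrates near \(\sqrt{r/n}\) (an idealized model of the error, not the actual iterate errors):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, trials = 1000, 10, 200
S = np.arange(r)  # any fixed index set of size r

ratios = []
for _ in range(trials):
    E = rng.standard_normal((n, r))  # stand-in for U-hat^{t,k} - U^*
    ratios.append(np.linalg.norm(E[S], "fro") / np.linalg.norm(E, "fro"))
mean_ratio = float(np.mean(ratios))

# With i.i.d. entries, E||E_S||_F^2 / E||E||_F^2 = r / n,
# so the ratio concentrates near sqrt(r / n) = 0.1 here
assert abs(mean_ratio - (r / n) ** 0.5) < 0.3 * (r / n) ** 0.5
print(mean_ratio, (r / n) ** 0.5)
```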
9 Conclusions
In this paper, we studied factorization based low rank optimization. We proposed a linearly convergent accelerated gradient method with alternating constraint that achieves the optimal \(\sqrt{L/\mu }\) dependence on the condition number, matching convex programming. As far as we know, this is the first work with a provable optimal dependence on \(\sqrt{L/\mu }\) for this kind of nonconvex problem. Globally, convergence to a critical point is proved.
Two problems are left unsolved in this paper. 1. How to find two disjoint sets \(S^1\) and \(S^2\) such that \(\sigma _r(\mathbf {U}_{S^1})\) and \(\sigma _r(\mathbf {U}_{S^2})\) are as large as possible? 2. How to find an initial point close enough to the optimal solution for general problems with large condition number?
Notes
Necoara et al. (2019) analyzed the method with recursions of \(\mathbf {y}^k=\mathbf {x}^k+\frac{\sqrt{L}-\sqrt{\mu }}{\sqrt{L}+\sqrt{\mu }}(\mathbf {x}^k-\mathbf {x}^{k-1})\) and \(\mathbf {x}^{k+1}=\mathbf {y}^k-\eta \nabla f(\mathbf {y}^k)\).
Necoara et al. (2019) used induction to prove (Necoara et al. 2019, Lemma 1). When the optimum solution is not unique, \(\mathbf {y}^*\) in (Necoara et al. 2019, Equation (57)) should be replaced by \(P_{\mathcal {X}^*}(\mathbf {y}^k)\), which takes different values for different k. Thus, the induction is no longer valid.
However, it is still more challenging than convex programming since we should guarantee that all the variables in Theorem 1 belong to \(\varOmega _S\), while it is not required in convex programming. So the conclusion in Necoara et al. (2019) cannot be applied to problem (8) since we cannot obtain \(\mathbf {y}^{k+1}\in \varOmega _{ S}\) given \(\mathbf {x}^{k+1}\in \varOmega _{S}\) and \(\mathbf {x}^{k}\in \varOmega _{S}\) because \(\mathbf {y}^{k+1}\) is not a convex combination of \(\mathbf {x}^{k+1}\) and \(\mathbf {x}^{k}\).
References
Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., & Ma, T. (2017). Finding approximate local minima faster than gradient descent. In STOC.
Allen-Zhu, Z. (2017). Katyusha: The first direct acceleration of stochastic gradient methods. In STOC.
Avron, H., & Boutsidis, C. (2013). Faster subset selection for matrices and applications. SIAM Journal on Matrix Analysis and Applications, 34(4), 1464–1499.
Bhojanapalli, S., Kyrillidis, A., & Sanghavi, S. (2016). Dropping convexity for faster semi-definite optimization. In COLT.
Bhojanapalli, S., Neyshabur, B., & Srebro, N. (2016). Global optimality of local search for low rank matrix recovery. In NIPS.
Boumal, N., Voroninski, V., & Bandeira, A. (2016). The non-convex Burer–Monteiro approach works on smooth semidefinite programs. In NIPS.
Burer, S., & Monteiro, R. (2003). A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2), 329–357.
Burer, S., & Monteiro, R. (2005). Local minima and convergence in low-rank semidefinite programming. Mathematical Programming, 103(3), 427–444.
Cai, T., Ma, Z., & Wu, Y. (2013). Sparse PCA: Optimal rates and adaptive estimation. The Annals of Statistics, 41(6), 3074–3110.
Candès, E., & Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6), 717–772.
Carmon, Y., Duchi, J., Hinder, O., & Sidford, A. (2018). Accelerated methods for non-convex optimization. SIAM Journal on Optimization, 28(2), 1751–1772.
Carmon, Y., Hinder, O., Duchi, J., & Sidford, A. (2017). Convex until proven guilty: Dimension-free acceleration of gradient descent on non-convex functions. In ICML.
Chen, Y., & Wainwright, M. (2015). Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arxiv:1509.03025.
Davenport, M., Plan, Y., Van Den Berg, E., & Wootters, M. (2014). 1-bit matrix completion. Information and Inference, 3(3), 189–223.
Edelman, A. (1988). Eigenvalues and condition numbers of random matrices. SIAM Journal on Matrix Analysis and Applications, 9(4), 543–560.
Ge, R., Jin, C., & Zheng, Y. (2017). No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In ICML.
Ge, R., Lee, J., & Ma, T. (2016). Matrix completion has no spurious local minimum. In NIPS.
Ghadimi, S., & Lan, G. (2016). Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1–2), 59–99.
Gu, Q., Wang, Z., & Liu, H. (2016). Low-rank and sparse structure pursuit via alternating minimization. In AISTATS.
Guruswami, V., & Sinop, A. (2012). Optimal column-based low-rank matrix reconstruction. In SODA.
Halko, N., Martinsson, P., & Tropp, J. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 217–288.
Hardt, M., & Wootters, M. (2014). Fast matrix completion without the condition number. In COLT.
Hastie, T., Mazumder, R., Lee, J., & Zadeh, R. (2015). Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research, 16(1), 3367–3402.
Jain, P., Netrapalli, P., & Sanghavi, S. (2013). Low-rank matrix completion using alternating minimization. In STOC.
Jin, C., Ge, R., Netrapalli, P., & Kakade, S. (2018). How to escape saddle points efficiently. In ICML.
Jin, C., Netrapalli, P., & Jordan, M. (2018). Accelerated gradient descent escapes saddle points faster than gradient descent. In COLT.
Koltchinskii, V., Lounici, K., & Tsybakov, A. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5), 2302–2329.
Larsen, R. (1998). Lanczos bidiagonalization with partial reorthogonalization. Technical report, Aarhus University.
Li, H., & Lin, Z. (2015). Accelerated proximal gradient methods for nonconvex programming. In NIPS.
Li, Q., Zhu, Z., & Tang, G. (2018). The non-convex geometry of low-rank matrix optimization. Information and Inference: A Journal of the IMA, 8(1), 51–96.
Li, R. (1995). New perturbation bounds for the unitary polar factor. SIAM Journal on Matrix Analysis and Applications, 16(1), 327–332.
Li, X., Wang, Z., Lu, J., Arora, R., Haupt, J., Liu, H., et al. (2016). Symmetry, saddle points, and global geometry of nonconvex matrix factorization. arxiv:1612.09296.
Lin, H., Mairal, J., & Harchaoui, Z. (2015). A universal catalyst for first-order optimization. In NIPS.
Lin, M., & Ye, J. (2016). A non-convex one-pass framework for generalized factorization machines and rank-one matrix sensing. In NIPS.
Liu, G., & Li, P. (2016). Low-rank matrix completion in the presence of high coherence. IEEE Transactions on Signal Processing, 64(21), 5623–5633.
Liu, G., Lin, Z., Yan, S., Sun, J., & Ma, Y. (2013). Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 171–184.
Necoara, I., Nesterov, Yu., & Glineur, F. (2019). Linear convergence of first order methods for non-strongly convex optimization. Mathematical Programming, 175(1–2), 69–107.
Negahban, S., & Wainwright, M. (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2), 1069–1097.
Negahban, S., & Wainwright, M. (2012). Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, 13(May), 1665–1697.
Nesterov, Yu. (1983). A method for unconstrained convex minimization problem with the rate of convergence \({O}(1/k^2)\). Soviet Mathematics Doklady, 27(2), 372–376.
Nesterov, Yu. (1988). On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonomika i Matematicheskie Metody, 24, 509–517.
Nesterov, Yu. (Ed.). (2004). Introductory lectures on convex optimization: A basic course. Berlin: Springer.
Park, D., Kyrillidis, A., Bhojanapalli, S., Caramanis, C., & Sanghavi, S. (2016). Provable Burer-Monteiro factorization for a class of norm-constrained matrix problems. arxiv:1606.01316.
Park, D., Kyrillidis, A., Caramanis, C., & Sanghavi, S. (2018). Finding low-rank solutions via non-convex matrix factorization, efficiently and provably. SIAM Journal on Imaging Sciences, 11(4), 333–361.
Park, D., Netrapalli, P., & Sanghavi, S. (2013). Low-rank matrix completion using alternating minimization. In STOC.
Recht, B., Fazel, M., & Parrilo, P. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3), 471–501.
Rohde, A., & Tsybakov, A. (2011). Estimation of high-dimensional low-rank matrices. The Annals of Statistics, 39(2), 887–930.
Schmidt, M., Le Roux, N., & Bach, F. (2017). Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1), 83–112.
Shalev-Shwartz, S., & Zhang, T. (2013). Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb), 567–599.
Shalev-Shwartz, S., & Zhang, T. (2016). Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1), 105–145.
She, Y. (2013). Reduced rank vector generalized linear models for feature extraction. Statistics and Its Interface, 6(2), 197–209.
Sun, R., & Luo, Z. (2015). Guaranteed matrix completion via nonconvex factorization. In FOCS.
Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In ICML.
Tseng, P. (2008). On accelerated proximal gradient methods for convex-concave optimization. Technical report, University of Washington, Seattle.
Tu, S., Boczar, R., Soltanolkotabi, M., & Recht, B. (2016). Low-rank solutions of linear matrix equations via procrustes flow. In ICML.
Wagner, A., & Zuk, O. (2015). Low-rank matrix recovery from row-and-column affine measurements. In ICML.
Wang, L., Zhang, X., & Gu, Q. (2017). A unified computational and statistical framework for nonconvex low rank matrix estimation. In AISTATS.
Waters, A., Sankaranarayanan, A., & Baraniuk, R. (2011). SpaRCS: Recovering low rank and sparse matrices from compressive measurements. In NIPS.
Wen, Z., Yin, W., & Zhang, Y. (2012). Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm. Mathematical Programming Computation, 4(4), 333–361.
Woodworth, B., & Srebro, N. (2016). Tight complexity bounds for optimizing composite objectives. In NIPS.
Xiao, L., & Zhang, T. (2014). A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4), 2057–2075.
Xu, Y., & Yin, W. (2014). A globally convergent algorithm for nonconvex optimization based on block coordinate update. Journal of Scientific Computing, 72(2), 700–734.
Xu, C., Lin, Z., & Zha, H. (2017). A unified convex surrogate for the Schatten-\(p\) norm. In AAAI.
Yee, T., & Hastie, T. (2000). Reduced-rank vector generalized linear models. Statistical Modelling, 3(1), 15–41.
Yi, X., Park, D., Chen, Y., & Caramanis, C. (2016). Fast algorithms for robust PCA via gradient descent. In NIPS.
Zhang, Q., & Lafferty, J. (2015). A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. In NIPS.
Zhang, X., Wang, L., Yu, Y., & Gu, Q. (2018). A primal-dual analysis of global optimality in nonconvex low-rank matrix recovery. In ICML.
Zhao, T., Wang, Z., & Liu, H. (2015). A nonconvex optimization framework for low rank matrix estimation. In NIPS.
Zheng, Q., & Lafferty, J. (2016). Convergence analysis for rectangular matrix completion using Burer–Monteiro factorization and gradient descent. arxiv:1605.07051.
Zhu, Z., Li, Q., Tang, G., & Wakin, M. (2018). The global optimization geometry on low-rank matrix optimization. arxiv:1703.01256.
Acknowledgements
Zhouchen Lin is supported by National Basic Research Program of China (973 Program) (grant no. 2015CB352502), National Natural Science Foundation (NSF) of China (grant nos. 61625301 and 61731018), Beijing Academy of Artificial Intelligence, Qualcomm and Microsoft Research Asia.
Editor: Tijl De Bie.
This work was done when Huan Li was a Ph.D. student at Peking University.
Appendices
Appendix A
Lemma 7
For problem (1) and its minimizer \(\mathbf {X}^*\), we have
Proof
Introduce the Lagrange function
Since \(\mathbf {X}^*\) is the minimizer of problem (1), we know that there exists \(\varvec{\varLambda }^*\) such that
Thus, we can have the conclusion. \(\square \)
Lemma 8
(Tu et al. 2016) For any \(\mathbf {U}\in \mathbb {R}^{n\times r},\mathbf {V}\in \mathbb {R}^{n\times r}\), let \(\mathbf {R}=\text{ argmin }_{\mathbf {R}\mathbf {R}^T=\mathbf {I}}\Vert \mathbf {V}\mathbf {R}-\mathbf {U}\Vert _F^2\) and \({\hat{\mathbf {V}}}=\mathbf {V}\mathbf {R}\). Then, we can have
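The alignment in Lemma 8 is the classical orthogonal Procrustes problem, which has a closed-form solution via the SVD of \(\mathbf {V}^T\mathbf {U}\). The following sketch (an illustration, not the authors' code) computes \(\mathbf {R}\) and the aligned factor \({\hat{\mathbf {V}}}=\mathbf {V}\mathbf {R}\):

```python
import numpy as np

def procrustes_align(V, U):
    """Solve R = argmin_{R R^T = I} ||V R - U||_F via SVD of V^T U."""
    A, _, Bt = np.linalg.svd(V.T @ U)
    R = A @ Bt
    return R, V @ R  # R and the aligned factor V_hat = V R

rng = np.random.default_rng(0)
U = rng.standard_normal((6, 2))
Q, _ = np.linalg.qr(rng.standard_normal((2, 2)))
V = U @ Q.T                       # V is U rotated; alignment should undo it
R, V_hat = procrustes_align(V, U)
print(np.linalg.norm(V_hat - U))  # ~0: alignment removes the rotation
```

Such an alignment is needed because \(\mathbf {U}\mathbf {U}^T\) is invariant to right-multiplication of \(\mathbf {U}\) by any orthogonal matrix, so distances between factors are only meaningful after fixing this rotation.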
Lemma 9
(Bhojanapalli et al. 2016a) Assume that \(\Vert \mathbf {U}-\mathbf {U}^*\Vert _F\le 0.01\sigma _r(\mathbf {U}^*)\). Then, we can have
Lemma 10
For any \(\mathbf {U},\mathbf {V}\in \mathbb {R}^{n\times r}\), we have
Proof
For the first inequality, we have
For the second one, we have
where we use (34). For the third one, we have
where we use the restricted smoothness of f and (34) in the last inequality. \(\square \)
Now we give the proof of Corollary 1.
Proof
From Lemma 9 and the assumptions, we have
where \(\mathbf {U}\) can be \(\mathbf {U}^k\), \(\mathbf {V}^k\) and \(\mathbf {Z}^k\). From (35), we have
where we use \(\Vert \mathbf {V}^k-\mathbf {U}^*\Vert _F\le 0.01\Vert \mathbf {U}^*\Vert _2\). On the other hand, let
then we have \(\mathbf {Z}^{k+1}=\text{ Project }_{\varOmega _{ S}}({\hat{\mathbf {Z}}}^{k+1})\) and
where we use \(\Vert \mathbf {Z}^k\Vert _2\le 1.01\Vert \mathbf {U}^*\Vert _2\), \(\Vert \mathbf {V}^k\Vert _2\le 1.01\Vert \mathbf {U}^*\Vert _2\), (37) and the setting of \(\eta \). Let \({\hat{\varOmega }}_{ S}=\{\mathbf {U}_{ S}\in \mathbb {R}^{r\times r}:\mathbf {U}_{ S}\succeq \epsilon \mathbf {I}\}\), then
where we use \(\text{ trace }(\mathbf {A}\mathbf {B})=0\) if \(\mathbf {A}=\mathbf {A}^T\) and \(\mathbf {B}=-\mathbf {B}^T\), and \(\mathbf {U}=\mathbf {U}^T\) from \(\mathbf {U}\in {\hat{\varOmega }}_{ S}\). Let \(\mathbf {U}\varSigma \mathbf {U}^T\) be the eigenvalue decomposition of \(\frac{{\hat{\mathbf {Z}}}_{ S}^{k+1}+({\hat{\mathbf {Z}}}_{ S}^{k+1})^T}{2}\) and \({\hat{\varSigma }}_{i,i}=\max \{\epsilon ,\varSigma _{i,i}\}\). Then \(\text{ Project }_{{\hat{\varOmega }}_{ S}}({\hat{\mathbf {Z}}}_{ S}^{k+1})=\mathbf {U}{\hat{\varSigma }}\mathbf {U}^T\) and
where \(\varSigma _{1,1}\) is the largest eigenvalue of \(\frac{{\hat{\mathbf {Z}}}_{ S}^{k+1}+({\hat{\mathbf {Z}}}_{ S}^{k+1})^T}{2}\). Then, we have
and
where we use (13) in the first inequality, \(0\le \theta _k\le 1\) in the third and fourth inequalities and \(\Vert \mathbf {U}^*\Vert _2\ge \Vert \mathbf {U}^*_{ S}\Vert _2\ge \sigma _r(\mathbf {U}^*_{ S})\ge \epsilon \) in the last inequality. So
From Theorem 2, we can have the conclusion. \(\square \)
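The projection onto \({\hat{\varOmega }}_{ S}=\{\mathbf {U}_{ S}\succeq \epsilon \mathbf {I}\}\) used in the proof above symmetrizes the block and then clips its eigenvalues at \(\epsilon \). A minimal sketch of this step (an illustration under the conventions above, not the authors' implementation):

```python
import numpy as np

def project_eps_psd(Z_S, eps):
    """Project an r x r block onto {U_S : U_S symmetric, U_S >= eps * I}."""
    S = 0.5 * (Z_S + Z_S.T)          # keep the symmetric part; the
                                     # antisymmetric part is orthogonal to it
    w, Q = np.linalg.eigh(S)         # eigen-decomposition S = Q diag(w) Q^T
    w_clipped = np.maximum(w, eps)   # Sigma_hat_{i,i} = max(eps, Sigma_{i,i})
    return Q @ np.diag(w_clipped) @ Q.T

rng = np.random.default_rng(1)
Z = rng.standard_normal((3, 3))
P = project_eps_psd(Z, eps=0.1)
print(np.linalg.eigvalsh(P).min())   # >= 0.1 up to numerical error
```

Symmetrizing first is valid because the antisymmetric part of \({\hat{\mathbf {Z}}}_{ S}^{k+1}\) is Frobenius-orthogonal to every symmetric matrix, which is exactly the trace identity invoked in the proof.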
Appendix B
Lemma 11
Assume that \(\mathbf {U}^*\in \mathcal {X}^*\). Then, for any \(\mathbf {U}\), we have
Proof
From (10), we have
Since \(\mathbf {U}^*\) is a minimizer of problem (2), we have \(\nabla f(\mathbf {U}^*{\mathbf {U}^*}^T)\mathbf {U}^*=0\). From \(\big \langle \nabla f(\mathbf {U}^*{\mathbf {U}^*}^T), (\mathbf {U}^*-\mathbf {U})(\mathbf {U}^*-\mathbf {U})^T \big \rangle \ge 0\) and Lemma 8, we can have the conclusion. \(\square \)
Now we give the proof of Theorem 3.
Proof
From (16), we have
where we use \(4^{-x}\le 1-x\le e^{-x}\) for \(x\in [0,1/2]\).
From Theorem 2 and \(\nabla g(\mathbf {U}^{t+1,*})=0\), we have
which leads to the conclusion. \(\square \)
Appendix C
Proof of Lemma 6.
Proof
We can easily check that \(\beta _{\max }<1\) due to \(\beta _k\le 1-\theta _{k-1}\) and the fact that K is a finite constant. From Theorem 2, we have
Applying the inequality of \(\left\langle \mathbf {u},\mathbf {v}\right\rangle \le \Vert \mathbf {u}\Vert \Vert \mathbf {v}\Vert \), Lemma 10 and the inequality of \(2\Vert \mathbf {u}\Vert \Vert \mathbf {v}\Vert \le \alpha \Vert \mathbf {u}\Vert ^2+\frac{1}{\alpha }\Vert \mathbf {v}\Vert ^2\) to the second term, we can have
Applying Lemma 12 in “Appendix C” to bound the third term, we can have
for all \(k=1,2,\ldots ,K\), where we use \(\mathbf {V}^k-\mathbf {U}^k=\beta _k(\mathbf {U}^k-\mathbf {U}^{k-1})\) proved in Lemma 12. Specially, from \(\mathbf {U}^0=\mathbf {V}^0\) we have
Summing (38) over \(k=1,2,\ldots ,K\) and (39), we have
Letting \(\alpha =1/\beta _{\max }\), from the setting of \(\eta \), we have the desired conclusion. \(\square \)
Lemma 12
For Algorithm 1, we have
and
Proof
From the optimality condition of (12), we have
Since \(\varOmega _{ S}\) is a convex set, we have
and
With some simple computations, we have
which leads to the second conclusion. \(\square \)
Lemma 13
Under the assumptions in Theorem 4, if (22) holds, then we have \(\Vert \mathbf {U}^{t',k+1}-\mathbf {Z}^{t',k+1}\Vert _F\le \frac{2\varepsilon }{\theta _k}\), \(\Vert \mathbf {Z}^{t',k+1}-\mathbf {Z}^{t',k}\Vert _F\le \frac{5\varepsilon }{\theta _k}\) and \(\Vert \mathbf {Z}^{t',k+1}-\mathbf {V}^{t',k}\Vert _F\le \frac{9\varepsilon }{\theta _k}\) for \(t'=t\) or \(t'=t+1\).
Proof
From (22), for \(t'=t\) or \(t+1\) and \(\forall k=0,\ldots ,K\), we can have the following easy-to-check inequalities.
From \(\theta _k\le \theta _{k-1}\le 1\), we can have the conclusions. \(\square \)
Appendix D
Lemma 14
Assume that \(\mathbf {U}^*\in \mathcal {X}^*\) and \(\mathbf {V}\in \mathbb {R}^{n\times r}\) satisfy \(\Vert \mathbf {V}-P_{\mathcal {X}^*}(\mathbf {V})\Vert _F\le \min \left\{ 0.01\sigma _r(\mathbf {U}^*), \frac{\mu \sigma _r^2(\mathbf {U}^*)}{6L\Vert \mathbf {U}^*\Vert _2}\right\} \). Then, we have
Proof
Similar to the proof of Theorem 1, we have
where we use Lemma 8 to bound \(\Vert \mathbf {V}\mathbf {V}^T-P_{\mathcal {X}^*}(\mathbf {V})(P_{\mathcal {X}^*}(\mathbf {V}))^T\Vert _F^2\). Since \(g(\mathbf {U}^*)\le g(\mathbf {V})\), we can have
which leads to the conclusion. \(\square \)
Lemma 15
Under the assumptions of Lemma 6, we have
where \(\gamma =\frac{1-\beta _{\max }^2}{4\eta }-\frac{\beta _{\max }{\hat{L}}}{2}-\frac{{\hat{L}}}{4}>0\) and \(\nu =\frac{1+\beta _{\max }^2}{4\eta }-\frac{{\hat{L}}}{4}>0\).
Proof
Letting \(\alpha =\frac{1}{\beta _{\max }}\) in (38), we can have the conclusion. \(\square \)
Now we give the proof of Theorem 5.
Proof
Denote \({\hat{\mathbf {U}}}^*=P_{\mathcal {X}^*}(\mathbf {V}^k)\). From Theorem 2, we can have
where we use (48) in the second inequality, (31) in the second equality, \(\eta <\frac{1}{{\hat{L}}}\) in the third inequality, Lemma 14 in the fifth inequality, (30) in the fourth equality and \(\beta _{\max }<1\) in the last inequality. So we have
Combining Lemma 15 and (49), we can have the conclusion. \(\square \)
Li, H., Lin, Z. Provable accelerated gradient method for nonconvex low rank optimization. Mach Learn 109, 103–134 (2020). https://doi.org/10.1007/s10994-019-05819-w