1 Introduction

Low rank matrix estimation has broad applications in machine learning, computer vision and signal processing. In this paper, we consider a problem of the form:

$$\begin{aligned} \min _{\mathbf {X}\in \mathbb {R}^{n\times n}} f(\mathbf {X}),\quad s.t.\quad \mathbf {X}\succeq 0, \end{aligned}$$
(1)

where there exists a rank-r minimizer \(\mathbf {X}^*\). We consider the case \(r\ll n\). Optimizing problem (1) in the \(\mathbf {X}\) space often requires computing at least the top-r singular values/vectors in each iteration and \(O(n^2)\) memory to store a large \(n\times n\) matrix, which rules out applications with huge matrices. To reduce both the computational cost and the storage, many works exploit the observation that a positive semidefinite low rank matrix can be factorized as a product of two much smaller matrices, i.e., \(\mathbf {X}=\mathbf {U}\mathbf {U}^T\), and study the following nonconvex problem instead:

$$\begin{aligned} \min _{\mathbf {U}\in \mathbb {R}^{n\times r}} g(\mathbf {U})=f(\mathbf {U}\mathbf {U}^T). \end{aligned}$$
(2)

A wide family of problems can be cast as problem (2), including matrix sensing (Bhojanapalli et al. 2016b), matrix completion (Jain et al. 2013), one bit matrix completion (Davenport et al. 2014), sparse principal component analysis (Cai et al. 2013) and factorization machines (Lin and Ye 2016). In this paper, we study problem (2) and aim to propose an accelerated gradient method that operates on the \(\mathbf {U}\) factors directly. The factorization in problem (2) makes \(g(\mathbf {U})\) nonconvex, even if \(f(\mathbf {X})\) is convex. Thus, proving acceleration is harder than in the analysis of convex programming.

1.1 Related work

Recently, there has been a trend in the machine learning and optimization communities to study the nonconvex problem (2). Recent developments come from two aspects: (1) the geometric aspect, which proves that there is no spurious local minimum for some special cases of problem (2), e.g., matrix sensing (Bhojanapalli et al. 2016b) and matrix completion (Ge et al. 2016, 2017; Li et al. 2018; Zhu et al. 2018; see Zhang et al. 2018 for a unified analysis); (2) the algorithmic aspect, which analyzes the local linear convergence of some efficient schemes such as the gradient descent method. Examples include Burer and Monteiro (2003, 2005), Boumal et al. (2016), Tu et al. (2016), Zhang and Lafferty (2015) and Park et al. (2016) for semidefinite programs, Sun and Luo (2015), Park et al. (2013), Hardt and Wootters (2014), Zheng and Lafferty (2016) and Zhao et al. (2015) for matrix completion, Zhao et al. (2015) and Park et al. (2013) for matrix sensing, and Yi et al. (2016) and Gu et al. (2016) for Robust PCA. The local linear convergence rate of the gradient descent method for problem (2) is proved in a unified framework in Bhojanapalli et al. (2016a), Chen and Wainwright (2015) and Wang et al. (2017). However, no acceleration scheme is studied in these works. It remains an open problem how to analyze the accelerated gradient method for the nonconvex problem (2).

Nesterov’s acceleration technique (Nesterov 1983, 1988, 2004) has been empirically verified to be efficient on some nonconvex problems, e.g., deep learning (Sutskever et al. 2013). Several works studied the accelerated gradient method and the inertial gradient descent method for general nonconvex programming (Ghadimi and Lan 2016; Li and Lin 2015; Xu and Yin 2014). However, they only proved convergence and gave no guarantee of acceleration for nonconvex problems. Carmon et al. (2018), Carmon et al. (2017), Agarwal et al. (2017) and Jin et al. (2018) analyzed the accelerated gradient method for general nonconvex optimization and proved a complexity of \(O(\epsilon ^{-7/4}\log (1/\epsilon ))\) to escape saddle points or reach critical points. They studied the general problem and did not exploit the special structure of problem (2); thus, their complexity is sublinear. Necoara et al. (2019) studied several conditions under which the gradient descent and accelerated gradient methods converge linearly for non-strongly convex optimization. Their conclusion for the gradient descent method can be extended to the nonconvex problem (2). For the accelerated gradient method, Necoara et al. required a strong assumption that all \(\mathbf {y}^k,k=0,1,\ldots \), have the same projection onto the optimum solution set, which does not hold for problem (2).

1.2 Our contributions

In this paper, we apply Nesterov’s acceleration scheme to problem (2) and propose an efficient accelerated gradient method with alternating constraint, which operates on the \(\mathbf {U}\) factors directly. We back up our method with provable theoretical results. Specifically, our contributions can be summarized as follows:

  1.

    We establish the curvature of local restricted strong convexity along a certain trajectory by restricting the problem to a constraint set, which allows us to use the classical accelerated gradient method for convex programs to solve the constrained problem. We build our result with the tool of polar decomposition.

  2.

    In order to reduce the negative influence of the constraint and ensure the convergence to the critical point of the original unconstrained problem, rather than the reformulated constrained problem, we propose a novel alternating constraint strategy and combine it with the classical accelerated gradient method.

  3.

    When f is restricted \(\mu \)-strongly convex and restricted L-smooth, our method converges locally linearly to the optimum solution with the same dependence on \(\sqrt{L/\mu }\) as in convex programming. As far as we know, we are the first to establish a convergence rate matching the optimal dependence on \(\sqrt{L/\mu }\) for this kind of nonconvex problem. Globally, our method converges to a critical point of problem (2) from any initializer.

1.3 Notations and assumptions

For matrices \(\mathbf {U},\mathbf {V}\in \mathbb {R}^{n\times r}\), we use \(\Vert \mathbf {U}\Vert _F\) as the Frobenius norm, \(\Vert \mathbf {U}\Vert _2\) as the spectral norm and \(\left\langle \mathbf {U},\mathbf {V}\right\rangle =\text{ trace }(\mathbf {U}^T\mathbf {V})\) as their inner product. We denote \(\sigma _r(\mathbf {U})\) as the smallest singular value of \(\mathbf {U}\) and \(\sigma _1(\mathbf {U})=\Vert \mathbf {U}\Vert _2\) as the largest one. We use \(\mathbf {U}_{ S}\in \mathbb {R}^{r\times r}\) as the submatrix of \(\mathbf {U}\) with the rows indicated by the index set \(S\subseteq \{1,2,\ldots ,n\}\), \(\mathbf {U}_{-S}\in \mathbb {R}^{(n-r)\times r}\) as the submatrix with the rows indicated by the indices outside S and \(\mathbf {X}_{ S, S}\in \mathbb {R}^{r\times r}\) as the submatrix of \(\mathbf {X}\) with the rows and columns indicated by S. \(\mathbf {X}\succeq 0\) means that \(\mathbf {X}\) is symmetric and positive semidefinite. Let \(I_{\varOmega _{ S}}(\mathbf {U})\) be the indicator function of the set \(\varOmega _{ S}\). For the objective function \(g(\mathbf {U})\), its gradient w.r.t. \(\mathbf {U}\) is \(\nabla g(\mathbf {U})=2\nabla f(\mathbf {U}\mathbf {U}^T)\mathbf {U}\). We assume that \(\nabla f(\mathbf {U}\mathbf {U}^T)\) is symmetric for simplicity. Our conclusions generalize naturally to the asymmetric case since \(\nabla g(\mathbf {U})=\nabla f(\mathbf {U}\mathbf {U}^T)\mathbf {U}+\nabla f(\mathbf {U}\mathbf {U}^T)^T\mathbf {U}\) in that case. Denote the optimum solution set of problem (2) as

$$\begin{aligned} \mathcal {X}^*=\{\mathbf {U}^*:\mathbf {U}^*\in \mathbb {R}^{n\times r},\mathbf {U}^*{\mathbf {U}^*}^T=\mathbf {X}^*\}, \end{aligned}$$
(3)

where \(\mathbf {X}^*\) is a minimizer of problem (1). An important issue in minimizing \(g(\mathbf {U})\) is that its optimum solution is not unique, i.e., if \(\mathbf {U}^*\) is an optimum solution of problem (2), then \(\mathbf {U}^*\mathbf {R}\) is also an optimum solution for any orthogonal matrix \(\mathbf {R}\in \mathbb {R}^{r\times r}\). Given \(\mathbf {U}\), we define the optimum solution that is closest to \(\mathbf {U}\) as

$$\begin{aligned} P_{\mathcal {X}^*}(\mathbf {U})=\mathbf {U}^*\mathbf {R}, \text{ where } \mathbf {R}=\text{ argmin }_{\mathbf {R}\in \mathbb {R}^{r\times r},\mathbf {R}\mathbf {R}^T=\mathbf {I}}\Vert \mathbf {U}^*\mathbf {R}-\mathbf {U}\Vert _F^2. \end{aligned}$$
(4)
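
To make these definitions concrete, here is a minimal NumPy sketch (not from the paper; the callable grad_f standing in for \(\nabla f\) is a placeholder) that evaluates \(\nabla g(\mathbf {U})=2\nabla f(\mathbf {U}\mathbf {U}^T)\mathbf {U}\) and the closest optimum \(P_{\mathcal {X}^*}(\mathbf {U})\) in (4); the optimal rotation in (4) is an orthogonal Procrustes problem solved by one SVD.

```python
import numpy as np

def grad_g(U, grad_f):
    """Gradient of g(U) = f(U U^T), assuming grad_f(X) returns a symmetric matrix."""
    return 2.0 * grad_f(U @ U.T) @ U

def closest_optimum(U_star, U):
    """P_{X*}(U) = U* R, with R solving the orthogonal Procrustes problem in (4)."""
    # Minimizing ||U* R - U||_F over orthogonal R is equivalent to maximizing
    # trace(R^T U*^T U); the maximizer is R = A B^T, where A S B^T is the SVD of U*^T U.
    A, _, Bt = np.linalg.svd(U_star.T @ U)
    return U_star @ (A @ Bt)
```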

1.3.1 Assumptions

In this paper, we assume that f is restricted \(\mu \)-strongly convex and L-smooth on the set \(\{\mathbf {X}:\mathbf {X}\succeq 0,\text{ rank }(\mathbf {X})\le r\}\). We state the standard definitions below.

Definition 1

Let \(f:\mathbb {R}^{n\times n}\rightarrow \mathbb {R}\) be a convex differentiable function. Then, f is restricted \(\mu \)-strongly convex on the set \(\{\mathbf {X}:\mathbf {X}\succeq 0,\text{ rank }(\mathbf {X})\le r\}\) if, for any \(\mathbf {X},\mathbf {Y}\in \{\mathbf {X}:\mathbf {X}\succeq 0,\text{ rank }(\mathbf {X})\le r\}\), we have

$$\begin{aligned}&f(\mathbf {Y})\ge f(\mathbf {X})+\left\langle \nabla f(\mathbf {X}),\mathbf {Y}-\mathbf {X}\right\rangle +\frac{\mu }{2}\Vert \mathbf {Y}-\mathbf {X}\Vert _F^2. \end{aligned}$$

Definition 2

Let \(f:\mathbb {R}^{n\times n}\rightarrow \mathbb {R}\) be a convex differentiable function. Then, f is restricted L-smooth on the set \(\{\mathbf {X}:\mathbf {X}\succeq 0,\text{ rank }(\mathbf {X})\le r\}\) if, for any \(\mathbf {X},\mathbf {Y}\in \{\mathbf {X}:\mathbf {X}\succeq 0,\text{ rank }(\mathbf {X})\le r\}\), we have

$$\begin{aligned}&f(\mathbf {Y})\le f(\mathbf {X})+\left\langle \nabla f(\mathbf {X}),\mathbf {Y}-\mathbf {X}\right\rangle +\frac{L}{2}\Vert \mathbf {Y}-\mathbf {X}\Vert _F^2 \end{aligned}$$

and

$$\begin{aligned}&\Vert \nabla f(\mathbf {Y})-\nabla f(\mathbf {X})\Vert _F\le L\Vert \mathbf {Y}-\mathbf {X}\Vert _F. \end{aligned}$$

1.3.2 Polar decomposition

Polar decomposition is a powerful tool for matrix analysis. We briefly review it in this section. We only describe the left polar decomposition of a square matrix.

Definition 3

The polar decomposition of a matrix \(\mathbf {A}\in \mathbb {R}^{r\times r}\) has the form \(\mathbf {A}=\mathbf {H}\mathbf {Q}\) where \(\mathbf {H}\in \mathbb {R}^{r\times r}\) is positive semidefinite and \(\mathbf {Q}\in \mathbb {R}^{r\times r}\) is an orthogonal matrix.

If \(\mathbf {A}\in \mathbb {R}^{r\times r}\) is of full rank, then \(\mathbf {A}\) has a unique polar decomposition with positive definite \(\mathbf {H}\). In fact, since a positive semidefinite Hermitian matrix has a unique positive semidefinite square root, \(\mathbf {H}\) is uniquely given by \(\mathbf {H}=\sqrt{\mathbf {A}\mathbf {A}^T}\), and \(\mathbf {Q}=\mathbf {H}^{-1}\mathbf {A}\) is also unique.
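
For reference, a minimal NumPy sketch of this construction via the SVD: if \(\mathbf {A}=\mathbf {P}\varSigma \mathbf {V}^T\), then \(\mathbf {H}=\mathbf {P}\varSigma \mathbf {P}^T=\sqrt{\mathbf {A}\mathbf {A}^T}\) and \(\mathbf {Q}=\mathbf {P}\mathbf {V}^T\).

```python
import numpy as np

def polar_decomposition(A):
    """Left polar decomposition A = H Q with H symmetric PSD and Q orthogonal."""
    P, s, Vt = np.linalg.svd(A)        # A = P diag(s) Vt
    H = (P * s) @ P.T                  # H = P diag(s) P^T = sqrt(A A^T)
    Q = P @ Vt                         # orthogonal factor
    return H, Q

# quick sanity check on a random (almost surely full-rank) square matrix
A = np.random.randn(4, 4)
H, Q = polar_decomposition(A)
assert np.allclose(H @ Q, A) and np.allclose(Q @ Q.T, np.eye(4))
```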

In this paper, we use the perturbation theorem of the polar decomposition to build the restricted strong convexity of \(g(\mathbf {U})\). It is described below.

Lemma 1

(Li 1995) Let \(\mathbf {A}\in \mathbb {R}^{r\times r}\) be of full rank and \(\mathbf {H}\mathbf {Q}\) be its unique polar decomposition, \(\mathbf {A}+\triangle \mathbf {A}\) be of full rank and \((\mathbf {H}+\triangle \mathbf {H})(\mathbf {Q}+\triangle \mathbf {Q})\) be its unique polar decomposition. Then, we have

$$\begin{aligned} \Vert \triangle \mathbf {Q}\Vert _F\le \frac{2}{\sigma _r(\mathbf {A})}\Vert \triangle \mathbf {A}\Vert _F. \end{aligned}$$

2 The restricted strongly convex curvature

Function \(g(\mathbf {U})\) is a special kind of nonconvex function, whose non-convexity comes only from the factorization \(\mathbf {X}=\mathbf {U}\mathbf {U}^T\). Based on this observation, we exploit the special curvature of \(g(\mathbf {U})\) in this section.

The existing works proved the local linear convergence of the gradient descent method for problem (2) by exploiting curvatures such as the local second order growth property (Sun and Luo 2015; Chen and Wainwright 2015) or the \((\alpha ,\beta )\) regularity condition (Jin et al. 2018; Bhojanapalli et al. 2016a, b; Wang et al. 2017). The former is described as

$$\begin{aligned} g(\mathbf {U})\ge g(\mathbf {U}^*)+\frac{\alpha }{2}\Vert P_{\mathcal {X}^*}(\mathbf {U})-\mathbf {U}\Vert _F^2,\forall \mathbf {U} \end{aligned}$$
(5)

while the latter is defined as

$$\begin{aligned} \left\langle \nabla g(\mathbf {U}),\mathbf {U}-P_{\mathcal {X}^*}(\mathbf {U})\right\rangle \ge \frac{\alpha }{2}\Vert P_{\mathcal {X}^*}(\mathbf {U})-\mathbf {U}\Vert _F^2+\frac{1}{2\beta }\Vert \nabla g(\mathbf {U})\Vert _F^2,\forall \mathbf {U}, \end{aligned}$$
(6)

where \(\mathbf {U}^*\in \mathcal {X}^*\) and \(P_{\mathcal {X}^*}(\mathbf {U})\) is defined in (4). Both (5) and (6) can be derived from the local weakly strongly convex condition (Necoara et al. 2019) combined with the smoothness of \(g(\mathbf {U})\). The weakly strongly convex condition is described as

$$\begin{aligned} g(\mathbf {U}^*)\ge g(\mathbf {U})+\left\langle \nabla g(\mathbf {U}),P_{\mathcal {X}^*}(\mathbf {U})-\mathbf {U}\right\rangle +\frac{\alpha }{2}\Vert P_{\mathcal {X}^*}(\mathbf {U})-\mathbf {U}\Vert _F^2, \end{aligned}$$
(7)

where \(\alpha =\mu \sigma ^2_r(\mathbf {U}^*)\). As discussed in Sect. 1.3, the optimum solution of problem (2) is not unique. This non-uniqueness is what distinguishes the weakly strongly convex condition from strong convexity; e.g., on the right hand side of (7) we use \(P_{\mathcal {X}^*}(\mathbf {U})\) rather than \(\mathbf {U}^*\). Moreover, the weakly strongly convex condition does not imply convexity, and \(g(\mathbf {U})\) is not convex even in a small neighborhood of the global optimum solution (Li et al. 2016).

Necoara et al. (2019) studied several conditions under which the linear convergence of the gradient descent method is guaranteed for general convex programming without strong convexity. The weakly strongly convex condition is the strongest one and implies all the other conditions. However, the weakly strongly convex condition alone is not enough to analyze the accelerated gradient method. Necoara et al. (2019) proved the acceleration of the classical accelerated gradient method under an additional assumption, besides the weakly strongly convex condition and the smoothness condition, that all the iterates \(\{\mathbf {y}^k,k=0,1,\ldots \}\) have the same projection onto the optimum solution set. From the proof in (Necoara et al. 2019, Sect. 5.2.1), we can see that the non-uniqueness of the optimum solution is the main obstacle in analyzing the accelerated gradient method. The additional assumption made in Necoara et al. (2019) essentially aims to remove this non-uniqueness. Since this assumption is not satisfied for problem (2), (7) alone is not enough to prove acceleration for problem (2), and we need to exploit a stronger curvature than (7) to analyze the accelerated gradient method.

Motivated by Necoara et al. (2019), we aim to remove the non-uniqueness in problem (2). Our intuition is based on the following observation. Suppose that we can find an index set \( S\subseteq \{1,2,\ldots ,n\}\) with size r such that \(\mathbf {X}^*_{ S, S}\) has full rank r; then there exists a unique decomposition \(\mathbf {X}^*_{ S, S}=\mathbf {U}^*_{ S}(\mathbf {U}^*_{ S})^T\) with \(\mathbf {U}^*_{ S}\succ 0\). Thus, there exists a unique \(\mathbf {U}^*\) such that \(\mathbf {U}^*{\mathbf {U}^*}^T=\mathbf {X}^*\) and \(\mathbf {U}^*_{ S}\succ 0\). To verify this, consider \( S=\{1,\ldots ,r\}\) for simplicity. Then \(\mathbf {U}\mathbf {U}^T=\left( \begin{array}{cc} \mathbf {U}_{S}\mathbf {U}_{S}^T &{} \mathbf {U}_{S}\mathbf {U}_{-S}^T\\ \mathbf {U}_{-S}\mathbf {U}_{S}^T &{} \mathbf {U}_{-S}\mathbf {U}_{-S}^T \end{array} \right) =\left( \begin{array}{cc} \mathbf {X}_{S,S} &{} \mathbf {X}_{S,-S} \\ \mathbf {X}_{-S,S} &{} \mathbf {X}_{-S,-S} \end{array} \right) \). The uniqueness of \(\mathbf {U}_S\) comes from \(\mathbf {X}_{S,S}\succ 0\) and \(\mathbf {U}_S\succ 0\), and the uniqueness of \(\mathbf {U}_{-S}\) comes from \(\mathbf {U}_{-S}=\mathbf {X}_{-S,S}\mathbf {U}_{S}^{-T}\).
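
A small NumPy sketch of this observation for \( S=\{1,\ldots ,r\}\) (a toy illustration, not part of the paper's algorithm): the unique factor with \(\mathbf {U}^*_{ S}\succ 0\) is recovered from the symmetric square root of \(\mathbf {X}^*_{ S, S}\) together with \(\mathbf {U}^*_{-S}=\mathbf {X}^*_{-S,S}(\mathbf {U}^*_{ S})^{-T}\).

```python
import numpy as np

def unique_factor(X_star, r):
    """Recover the unique U* with U* U*^T = X_star and U*_S > 0, for S = first r rows."""
    lam, A = np.linalg.eigh(X_star[:r, :r])              # X*_{S,S} = A diag(lam) A^T
    U_S = (A * np.sqrt(np.maximum(lam, 0.0))) @ A.T      # symmetric square root, U*_S > 0
    U_rest = X_star[r:, :r] @ np.linalg.inv(U_S).T       # U*_{-S} = X*_{-S,S} U*_S^{-T}
    return np.vstack([U_S, U_rest])

# sanity check: build X* from a factor whose top r x r block is symmetric positive definite
r, n = 3, 8
U = np.random.randn(n, r)
U[:r] = U[:r] @ U[:r].T + np.eye(r)
X_star = U @ U.T
U_rec = unique_factor(X_star, r)
assert np.allclose(U_rec @ U_rec.T, X_star)
```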

Based on the above observation, we can reformulate problem (2) as

$$\begin{aligned} \min _{\mathbf {U}\in \varOmega _{ S}} g(\mathbf {U}) \end{aligned}$$
(8)

where

$$\begin{aligned} \varOmega _{ S}=\{\mathbf {U}\in \mathbb {R}^{n\times r}:\mathbf {U}_{ S}\succeq \epsilon \mathbf {I}\} \end{aligned}$$

and \(\epsilon \) is a small enough constant such that \(\epsilon \ll \sigma _r(\mathbf {U}_{ S}^*)\). We require \(\mathbf {U}_{ S}\succeq \epsilon \mathbf {I}\) rather than \(\mathbf {U}_{ S}\succ 0\) to make the projection onto \(\varOmega _S\) computable. Due to the additional constraint \(\mathbf {U}\in \varOmega _S\), the optimum solution of problem (8) is unique. Moreover, the minimizer of (8) also minimizes (2).

We are now ready to establish a stronger curvature than (7) by restricting the variables of \(g(\mathbf {U})\) to the set \(\varOmega _S\). The key step is to lower bound \(\Vert P_{\mathcal {X}^*}(\mathbf {U})-\mathbf {U}\Vert _F^2\) in (7) by \(\Vert \mathbf {U}^*-\mathbf {U}\Vert _F^2\). Our result is built upon the perturbation theorem of the polar decomposition (Li 1995). Based on Lemma 1, we first establish the following critical lemma.

Lemma 2

For any \(\mathbf {U}\in \varOmega _{ S}\) and \(\mathbf {V}\in \varOmega _{ S}\), let \(\mathbf {R}=\text{ argmin }_{\mathbf {R}\in \mathbb {R}^{r\times r},\mathbf {R}\mathbf {R}^T=\mathbf {I}}\Vert \mathbf {V}\mathbf {R}-\mathbf {U}\Vert _F^2\) and \({\hat{\mathbf {V}}}=\mathbf {V}\mathbf {R}\). Then, we have

$$\begin{aligned} \Vert \mathbf {V}-\mathbf {U}\Vert _F\le \frac{3\Vert \mathbf {U}\Vert _2}{\sigma _r(\mathbf {U}_S)}\Vert {\hat{\mathbf {V}}}-\mathbf {U}\Vert _F. \end{aligned}$$

Proof

Since the conclusion is not affected by permuting the rows of \(\mathbf {U}\) and \(\mathbf {V}\) under the same permutation, we can consider the case of \( S=\{1,\ldots ,r\}\) for simplicity. Let \(\mathbf {U}=\left( \begin{array}{c} \mathbf {U}_1 \\ \mathbf {U}_2 \end{array} \right) \), \(\mathbf {V}=\left( \begin{array}{c} \mathbf {V}_1 \\ \mathbf {V}_2 \end{array} \right) \) and \({\hat{\mathbf {V}}}=\left( \begin{array}{c} {\hat{\mathbf {V}}}_1 \\ {\hat{\mathbf {V}}}_2 \end{array} \right) \), where \(\mathbf {U}_1,\mathbf {V}_1,{\hat{\mathbf {V}}}_1\in \mathbb {R}^{r\times r}\). Then, we have \({\hat{\mathbf {V}}}_1=\mathbf {V}_1\mathbf {R}\). From \(\mathbf {U}\in \varOmega _{ S}\) and \(\mathbf {V}\in \varOmega _{ S}\), we know \(\mathbf {U}_1\succ 0\) and \(\mathbf {V}_1\succ 0\). Thus, \(\mathbf {U}_1\mathbf {I}\) and \(\mathbf {V}_1\mathbf {R}\) are the unique polar decompositions of \(\mathbf {U}_1\) and \({\hat{\mathbf {V}}}_1\), respectively. From Lemma 1, we have

$$\begin{aligned} \Vert \mathbf {R}-\mathbf {I}\Vert _F\le \frac{2}{\sigma _r(\mathbf {U}_1)}\Vert {\hat{\mathbf {V}}}_1-\mathbf {U}_1\Vert _F. \end{aligned}$$

With some simple computations, we can have

$$\begin{aligned} \begin{aligned} \Vert \mathbf {V}-\mathbf {U}\Vert _F=&\Vert {\hat{\mathbf {V}}}\mathbf {R}^T-\mathbf {U}\Vert _F\\ =&\Vert {\hat{\mathbf {V}}}\mathbf {R}^T-\mathbf {U}\mathbf {R}^T+\mathbf {U}\mathbf {R}^T-\mathbf {U}\Vert _F\\ \le&\Vert {\hat{\mathbf {V}}}\mathbf {R}^T-\mathbf {U}\mathbf {R}^T\Vert _F+\Vert \mathbf {U}\mathbf {R}^T-\mathbf {U}\Vert _F\\ \le&\Vert {\hat{\mathbf {V}}}-\mathbf {U}\Vert _F+\Vert \mathbf {U}\Vert _2\Vert \mathbf {R}-\mathbf {I}\Vert _F\\ \le&\Vert {\hat{\mathbf {V}}}-\mathbf {U}\Vert _F+\frac{2\Vert \mathbf {U}\Vert _2}{\sigma _r(\mathbf {U}_1)}\Vert {\hat{\mathbf {V}}}_1-\mathbf {U}_1\Vert _F\\ \le&\frac{3\Vert \mathbf {U}\Vert _2}{\sigma _r(\mathbf {U}_1)}\Vert {\hat{\mathbf {V}}}-\mathbf {U}\Vert _F, \end{aligned} \end{aligned}$$
(9)

where we use \(\sigma _r(\mathbf {U}_1)\le \Vert \mathbf {U}\Vert _2\) and \(\Vert {\hat{\mathbf {V}}}_1-\mathbf {U}_1\Vert _F\le \Vert {\hat{\mathbf {V}}}-\mathbf {U}\Vert _F\) in the last inequality. Replacing \(\mathbf {U}_1\) with \(\mathbf {U}_{ S}\), we can have the conclusion. \(\square \)

Built upon Lemma 2, we can give the local restricted strong convexity of \(g(\mathbf {U})\) on the set \(\varOmega _S\) in the following theorem. There are two differences between the restricted strong convexity and the weakly strong convexity: (i) the restricted strong convexity removes the non-uniqueness and (ii) the restricted strong convexity establishes the curvature between any two points \(\mathbf {U}\) and \(\mathbf {V}\) in a local neighborhood of \(\mathbf {U}^*\), while (7) only exploits the curvature between \(\mathbf {U}\) and the optimum solution.

Theorem 1

Let \(\mathbf {U}^*=\varOmega _{ S}\cap \mathcal {X}^*\) and assume that \(\mathbf {U}\in \varOmega _{ S}\) and \(\mathbf {V}\in \varOmega _{ S}\) with \(\Vert \mathbf {U}-\mathbf {U}^*\Vert _F\le C\) and \(\Vert \mathbf {V}-\mathbf {U}^*\Vert _F\le C\), where \(C=\frac{\mu \sigma _r^2(\mathbf {U}^*)\sigma _r^2(\mathbf {U}_{ S}^*)}{100L\Vert \mathbf {U}^*\Vert _2^3}\). Then, we have

$$\begin{aligned} g(\mathbf {U})\ge g(\mathbf {V})+\left\langle \nabla g(\mathbf {V}),\mathbf {U}-\mathbf {V}\right\rangle +\frac{\mu \sigma _r^2(\mathbf {U}^*)\sigma _r^2(\mathbf {U}_{ S}^*)}{50\Vert \mathbf {U}^*\Vert _2^2}\Vert \mathbf {U}-\mathbf {V}\Vert _F^2. \end{aligned}$$

Proof

From the restricted convexity of \(f(\mathbf {X})\), we have

$$\begin{aligned} \begin{aligned}&f(\mathbf {V}\mathbf {V}^T)-f(\mathbf {U}\mathbf {U}^T)\\&\quad \le \left\langle \nabla f(\mathbf {V}\mathbf {V}^T),\mathbf {V}\mathbf {V}^T-\mathbf {U}\mathbf {U}^T\right\rangle -\frac{\mu }{2}\Vert \mathbf {V}\mathbf {V}^T-\mathbf {U}\mathbf {U}^T\Vert _F^2\\&\quad =\left\langle \nabla f(\mathbf {V}\mathbf {V}^T),(\mathbf {V}-\mathbf {U})\mathbf {V}^T\right\rangle +\left\langle \nabla f(\mathbf {V}\mathbf {V}^T),\mathbf {V}(\mathbf {V}-\mathbf {U})^T\right\rangle \\&\qquad -\left\langle \nabla f(\mathbf {V}\mathbf {V}^T),(\mathbf {V}-\mathbf {U})(\mathbf {V}-\mathbf {U})^T\right\rangle -\frac{\mu }{2}\Vert \mathbf {V}\mathbf {V}^T-\mathbf {U}\mathbf {U}^T\Vert _F^2\\&\quad =2\left\langle \nabla f(\mathbf {V}\mathbf {V}^T)\mathbf {V},\mathbf {V}-\mathbf {U}\right\rangle -\left\langle \nabla f(\mathbf {V}\mathbf {V}^T),(\mathbf {V}-\mathbf {U})(\mathbf {V}-\mathbf {U})^T\right\rangle \\&\qquad -\frac{\mu }{2}\Vert \mathbf {V}\mathbf {V}^T-\mathbf {U}\mathbf {U}^T\Vert _F^2\\&\quad \le 2\left\langle \nabla f(\mathbf {V}\mathbf {V}^T)\mathbf {V},\mathbf {V}-\mathbf {U}\right\rangle -\left\langle \nabla f(\mathbf {V}\mathbf {V}^T)-\nabla f(\mathbf {X}^*),(\mathbf {V}-\mathbf {U})(\mathbf {V}-\mathbf {U})^T\right\rangle \\&\qquad -\frac{\mu }{2}\Vert \mathbf {V}\mathbf {V}^T-\mathbf {U}\mathbf {U}^T\Vert _F^2. \end{aligned} \end{aligned}$$
(10)

where the last inequality uses \(\nabla f(\mathbf {X}^*)\succeq 0\), which is proved in Lemma 7, and the fact that the inner product of two positive semidefinite matrices is nonnegative, i.e., \(\left\langle \nabla f(\mathbf {X}^*),(\mathbf {V}-\mathbf {U})(\mathbf {V}-\mathbf {U})^T\right\rangle \ge 0\). Applying Von Neumann’s trace inequality and Lemma 10 to bound the second term, and Lemmas 2 and 8 to bound the third term, we can have

$$\begin{aligned}&f(\mathbf {V}\mathbf {V}^T)-f(\mathbf {U}\mathbf {U}^T)\\&\quad \le 2\left\langle \nabla f(\mathbf {V}\mathbf {V}^T)\mathbf {V},\mathbf {V}-\mathbf {U}\right\rangle +L(\Vert \mathbf {U}^*\Vert _2+\Vert \mathbf {V}\Vert _2)\Vert \mathbf {V}-\mathbf {U}^*\Vert _F\Vert \mathbf {V}-\mathbf {U}\Vert _F^2\\&\qquad -\frac{(\sqrt{2}-1)\mu \sigma _r^2(\mathbf {U})\sigma _r^2(\mathbf {U}_{ S})}{9\Vert \mathbf {U}\Vert _2^2}\Vert \mathbf {V}-\mathbf {U}\Vert _F^2\\&\quad \le \left\langle \nabla g(\mathbf {V}),\mathbf {V}-\mathbf {U}\right\rangle -\left( \frac{\mu \sigma _r^2(\mathbf {U}^*)\sigma _r^2(\mathbf {U}_{ S}^*)}{23.1\Vert \mathbf {U}^*\Vert _2^2}-2.01L\Vert \mathbf {U}^*\Vert _2\Vert \mathbf {V}-\mathbf {U}^*\Vert _F\right) \Vert \mathbf {V}-\mathbf {U}\Vert _F^2, \end{aligned}$$

where we use Lemma 9 in the last inequality. From the assumption \(\Vert \mathbf {V}-\mathbf {U}^*\Vert _F\le C\), we can have the conclusion. We leave Lemmas 7, 8, 9 and 10 to “Appendix A”. \(\square \)

2.1 Smoothness of function \(g(\mathbf {U})\)

Besides the local restricted strong convexity, we can also prove the smoothness of \(g(\mathbf {U})\), which is established in the following theorem.

Theorem 2

Let \({\hat{L}}=2\Vert \nabla f(\mathbf {V}\mathbf {V}^T)\Vert _2+L(\Vert \mathbf {V}\Vert _2+\Vert \mathbf {U}\Vert _2)^2\). Then, we can have

$$\begin{aligned} g(\mathbf {U})\le g(\mathbf {V})+\left\langle \nabla g(\mathbf {V}),\mathbf {U}-\mathbf {V}\right\rangle +\frac{{\hat{L}}}{2}\Vert \mathbf {U}-\mathbf {V}\Vert _F^2. \end{aligned}$$

Proof

From the restricted L-smoothness of f and a derivation similar to (10), we have

$$\begin{aligned}&f(\mathbf {U}\mathbf {U}^T)-f(\mathbf {V}\mathbf {V}^T)\\&\quad \le \left\langle \nabla f(\mathbf {V}\mathbf {V}^T),\mathbf {U}\mathbf {U}^T-\mathbf {V}\mathbf {V}^T\right\rangle +\frac{L}{2}\Vert \mathbf {U}\mathbf {U}^T-\mathbf {V}\mathbf {V}^T\Vert _F^2\\&\quad = \left\langle \nabla f(\mathbf {V}\mathbf {V}^T),(\mathbf {U}-\mathbf {V})(\mathbf {U}-\mathbf {V})^T\right\rangle \\&\qquad +2\left\langle \nabla f(\mathbf {V}\mathbf {V}^T)\mathbf {V},\mathbf {U}-\mathbf {V}\right\rangle +\frac{L}{2}\Vert \mathbf {U}\mathbf {U}^T-\mathbf {V}\mathbf {V}^T\Vert _F^2. \end{aligned}$$

Applying Von Neumann’s trace inequality to the first term and Lemma 10 to the third term, we can have the conclusion. \(\square \)

When restricted in a small neighborhood of \(\mathbf {U}^*\), we can give a better estimate for the smoothness parameter \({\hat{L}}\), as follows. The proof is provided in “Appendix A”.

Corollary 1

Let \(\mathbf {U}^*=\varOmega _S\cap \mathcal {X}^*\) and assume that \(\mathbf {U}^k,\mathbf {V}^k,\mathbf {Z}^k\in \varOmega _{ S}\) with \(\Vert \mathbf {V}^k-\mathbf {U}^*\Vert _F\le C\), \(\Vert \mathbf {U}^k-\mathbf {U}^*\Vert _F\le C\) and \(\Vert \mathbf {Z}^k-\mathbf {U}^*\Vert _F\le C\), where C is defined in Theorem 1 and \(\mathbf {U}^k,\mathbf {V}^k,\mathbf {Z}^k\) are generated in Algorithm 1, which will be described later. Let \(L_g=38L\Vert \mathbf {U}^*\Vert _2^2+2\Vert \nabla f(\mathbf {X}^*)\Vert _2\) and \(\eta =\frac{1}{L_g}\). Then, we have

$$\begin{aligned}&g(\mathbf {U}^{k+1})\le g(\mathbf {V}^k)+\left\langle \nabla g(\mathbf {V}^k),\mathbf {U}^{k+1}-\mathbf {V}^k\right\rangle +\frac{L_g}{2}\Vert \mathbf {U}^{k+1}-\mathbf {V}^k\Vert _F^2. \end{aligned}$$

3 Accelerated gradient method with alternating constraint

From Theorem 1 and Corollary 1, we know that the objective \(g(\mathbf {U})\) behaves locally like a strongly convex and smooth function when restricted to the set \(\varOmega _S\). Thus, we can use a classical method for convex programming, e.g., the accelerated gradient method, to solve problem (8).

However, a practical issue remains: when solving problem (8), we may get stuck at a critical point of problem (8) on the boundary of the constraint \(\mathbf {U}\in \varOmega _{ S}\), which is not the optimum solution of problem (2). In other words, we may halt before reaching the acceleration region, i.e., the local neighborhood of the optimum solution of problem (2). To overcome this issue, we propose a novel alternating constraint strategy. Specifically, we define two sets \(\varOmega _{S^1}\) and \(\varOmega _{S^2}\) as follows

$$\begin{aligned} \varOmega _{ S^1}=\{\mathbf {U}\in \mathbb {R}^{n\times r}:\mathbf {U}_{ S^1}\succeq \epsilon \mathbf {I}\},\quad \varOmega _{ S^2}=\{\mathbf {U}\in \mathbb {R}^{n\times r}:\mathbf {U}_{ S^2}\succeq \epsilon \mathbf {I}\} \end{aligned}$$

and minimize the objective \(g(\mathbf {U})\) alternately along the trajectories in \(\varOmega _{S^1}\) and \(\varOmega _{S^2}\), i.e., when the outer iteration number t is odd we minimize \(g(\mathbf {U})\) subject to \(\mathbf {U}\in \varOmega _{S^1}\), and when t is even we minimize \(g(\mathbf {U})\) subject to \(\mathbf {U}\in \varOmega _{S^2}\). Intuitively, when the iterates approach the boundary of \(\varOmega _{S^1}\), we drop the positive definiteness constraint on \(\mathbf {U}_{ S^1}\) and impose it on \(\mathbf {U}_{ S^2}\) instead. With this strategy we can cancel the negative influence of the constraint. We require that both index sets \( S^1\) and \( S^2\) have size r, that \( S^1\cap S^2=\emptyset \), and that \(\mathbf {U}^*_{ S^1}\) and \(\mathbf {U}^*_{ S^2}\) are of full rank. Given proper \(S^1\) and \(S^2\), we can prove that the method globally converges to a critical point of problem (2), i.e., a point with \(\nabla g(\mathbf {U})=0\), rather than to a critical point of problem (8) on the boundary of the constraint.

We describe our method in Algorithm 1. We use Nesterov’s acceleration scheme in the inner loop for a finite number of iterations and restart the acceleration scheme at each outer iteration. At the end of each outer iteration, we change the constraint and transform \(\mathbf {U}^{t,K+1}\in \varOmega _{ S}\) into a new point \(\mathbf {U}^{t+1,0}\in \varOmega _{ S'}\) via polar decomposition such that \(g(\mathbf {U}^{t,K+1})=g(\mathbf {U}^{t+1,0})\). At step (12), we need to project \(\mathbf {Z}\equiv \mathbf {Z}^{t,k}-\frac{\eta }{\theta _k}\nabla g(\mathbf {V}^{t,k})\) onto \(\varOmega _S\). Let \(\mathbf {A}\varSigma \mathbf {A}^T\) be the eigenvalue decomposition of \(\frac{\mathbf {Z}_S+\mathbf {Z}_S^T}{2}\) and \({\hat{\varSigma }}=\text{ diag }([\max \{\epsilon ,\varSigma _{1,1}\},\ldots ,\max \{\epsilon ,\varSigma _{r,r}\}])\); then \(\mathbf {Z}^{t,k+1}_S=\mathbf {A}{\hat{\varSigma }}\mathbf {A}^T\) and \(\mathbf {Z}^{t,k+1}_{-S}=\mathbf {Z}_{-S}\). At step (14), \(\theta _{k+1}\) is computed as \(\theta _{k+1}=\frac{\sqrt{\theta _{k}^4+4\theta _{k}^2}-\theta _{k}^2}{2}\). At the end of each outer iteration, we need to compute the polar decomposition: let \(\mathbf {A}\varSigma \mathbf {B}^T\) be the SVD of \(\mathbf {U}_{S'}^{t,K+1}\); then we can set \(\mathbf {H}=\mathbf {A}\varSigma \mathbf {A}^T\) and \(\mathbf {Q}=\mathbf {A}\mathbf {B}^T\). In Algorithm 1, we predefine \(S^1\) and \(S^2\) and fix them during the iterations. In Sect. 3.1 we will discuss how to find \(S^1\) and \(S^2\) using some local information.

Finally, let us compare the per-iteration cost of Algorithm 1 with that of the methods operating in the \(\mathbf {X}\) space. Both the eigenvalue decomposition and the polar decomposition required in Algorithm 1 are performed on submatrices of size \(r\times r\), which needs \(O(r^3)\) operations. Thus, the per-iteration complexity of Algorithm 1 is \(O(nr+r^3)\). As a comparison, the methods operating in the \(\mathbf {X}\) space require at least the top-r singular values/vectors, which needs \(O(n^2r)\) operations for deterministic algorithms and \(O(n^2\log r)\) for randomized algorithms (Halko et al. 2011). Thus, our method is more efficient at each iteration when \(r\ll n\), especially when r is upper bounded by a constant independent of n.

Algorithm 1 Accelerated gradient method with alternating constraint (pseudocode figure)
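
Since the pseudocode figure is not reproduced here, the following Python sketch reconstructs Algorithm 1 from the description above; it reflects our reading of steps (11)-(14) and the constraint schedule, so minor details may differ from the published figure. The callables and parameters (grad_g, eta, eps, K, T) are supplied by the caller.

```python
import numpy as np

def project_Omega(Z, S, eps):
    """Projection onto Omega_S = {U : U_S >= eps*I}; only the rows indexed by S change."""
    Z = Z.copy()
    lam, A = np.linalg.eigh(0.5 * (Z[S] + Z[S].T))       # eigen-decompose the symmetrized block
    Z[S] = (A * np.maximum(lam, eps)) @ A.T              # clamp eigenvalues at eps
    return Z

def polar_orthogonal_factor(B):
    """Orthogonal factor Q of the polar decomposition B = H Q (B square, full rank)."""
    A, _, Bt = np.linalg.svd(B)
    return A @ Bt

def agd_alternating_constraint(grad_g, U0, S1, S2, eta, eps, K, T):
    """Sketch of Algorithm 1; grad_g(U) should return 2 * grad_f(U U^T) U."""
    U = U0.copy()
    for t in range(T):
        # alternate the constraint set: Omega_{S^2} when t is even, Omega_{S^1} when t is odd
        S, S_next = (S2, S1) if t % 2 == 0 else (S1, S2)
        Z, theta = U.copy(), 1.0
        for k in range(K + 1):
            V = (1.0 - theta) * U + theta * Z                                 # step (11)
            Z = project_Omega(Z - (eta / theta) * grad_g(V), S, eps)          # step (12)
            U = (1.0 - theta) * U + theta * Z                                 # step (13)
            theta = 0.5 * (np.sqrt(theta**4 + 4.0 * theta**2) - theta**2)     # step (14)
        # rotate U^{t,K+1} into Omega_{S'}: (U Q^T)_{S'} = H >= 0 and g(U Q^T) = g(U)
        Q = polar_orthogonal_factor(U[S_next])
        U = U @ Q.T
    return U
```

For the theory, \(\eta =1/L_g\) and K are set as in Sect. 4, where they are estimated from \(\mathbf {U}^0\) up to constants.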

3.1 Finding the index sets \( S^1\) and \( S^2\)

In this section, we consider how to find the index sets \( S^1\) and \( S^2\). \(S^1\cap S^2=\emptyset \) can be easily satisfied and we only need to ensure that \(\mathbf {U}^*_{ S^1}\) and \(\mathbf {U}^*_{ S^2}\) are of full rank. Suppose that we have some initializer \(\mathbf {U}^0\) close to \(\mathbf {U}^*\). We want to use \(\mathbf {U}^0\) to find such \( S^1\) and \( S^2\). We first discuss how to select one index set S based on \(\mathbf {U}^0\). We can use the volume sampling subset selection algorithm (Guruswami and Sinop 2012; Avron and Boutsidis 2013), which can select S such that \(\sigma _r(\mathbf {U}^0_{ S})\ge \frac{\sigma _r(\mathbf {U}^0)}{\sqrt{2r(n-r+1)}}\) with probability of \(1-\delta '\) in \(O(nr^3\log (1/\delta '))\) operations. Then, we can bound \(\sigma _r(\mathbf {U}^*_{ S})\) in the following lemma since \(\mathbf {U}^0\) is close to \(\mathbf {U}^*\).

Lemma 3

If \(\Vert \mathbf {U}^0-\mathbf {U}^*\Vert _F\le 0.01\sigma _r(\mathbf {U}^*)\) and \(\Vert \mathbf {U}_S^0-\mathbf {U}_S^*\Vert _F\le \frac{0.99\sigma _r(\mathbf {U}^*)}{2\sqrt{2r(n-r+1)}}\), then for the index set S returned by the volume sampling subset selection algorithm performed on \(\mathbf {U}^0\) after \(O(nr^3\log (1/\delta '))\) operations, we have \(\sigma _r(\mathbf {U}^*_{ S})\ge \frac{0.99\sigma _r(\mathbf {U}^*)}{2\sqrt{2r(n-r+1)}}\) with probability of \(1-\delta '\).

Proof

From Theorem 3.11 in Avron and Boutsidis (2013), we have \(\sigma _r(\mathbf {U}^0_{ S})\ge \frac{\sigma _r(\mathbf {U}^0)}{\sqrt{2r(n-r+1)}}\) with probability \(1-\delta '\) after \(O(nr^3\log (1/\delta '))\) operations. So we can obtain

$$\begin{aligned} \sigma _r(\mathbf {U}^0_{ S})-\sigma _r(\mathbf {U}^*_{ S})\le \Vert \mathbf {U}_S^0-\mathbf {U}_S^*\Vert _F\le \frac{0.99\sigma _r(\mathbf {U}^*)}{2\sqrt{2r(n-r+1)}}\le \frac{\sigma _r(\mathbf {U}^0)}{2\sqrt{2r(n-r+1)}}\le \frac{\sigma _r(\mathbf {U}^0_{ S})}{2}, \end{aligned}$$

which leads to

$$\begin{aligned} \sigma _r(\mathbf {U}^*_{ S})\ge \frac{\sigma _r(\mathbf {U}^0_{ S})}{2}\ge \frac{\sigma _r(\mathbf {U}^0)}{2\sqrt{2r(n-r+1)}}\ge \frac{0.99\sigma _r(\mathbf {U}^*)}{2\sqrt{2r(n-r+1)}}, \end{aligned}$$

where we use \(0.99\sigma _r(\mathbf {U}^*)\le \sigma _r(\mathbf {U}^0)\), which is proved in Lemma 9 in “Appendix A”. \(\square \)

In the column selection problem and its variants, existing algorithms (see Avron and Boutsidis 2013 and the references therein) can only find one index set, while our purpose is to find both \( S^1\) and \( S^2\). We believe that this is a challenging target in the theoretical computer science community. In our applications, since \(n\gg r\), we may expect that \(\mathbf {U}^0_{- S^1}\) still has rank r after dropping the r rows indexed by \(S^1\) from \(\mathbf {U}^0\). Thus, we can use the procedure discussed above again to find \( S^2\) from \(\mathbf {U}^0_{- S^1}\). From Lemma 3, we have \(\sigma _r(\mathbf {U}^0_{S^1})\ge \frac{\sigma _r(\mathbf {U}^0)}{\sqrt{2r(n-r+1)}}\) and \(\sigma _r(\mathbf {U}^0_{S^2})\ge \frac{\sigma _r(\mathbf {U}_{-S^1}^0)}{\sqrt{2r(n-2r+1)}}\). In the asymmetric case, this challenge disappears; please see the details in Sect. 7. We show in experiments that Algorithm 1 works well even for the simple choice of \(S^1=\{1,\ldots ,r\}\) and \(S^2=\{r+1,\ldots ,2r\}\). The discussion of finding \( S^1\) and \( S^2\) in this section is only for theoretical purposes.
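
The volume sampling subset selection algorithm itself is not reproduced here. As a rough stand-in (our own heuristic, without the guarantee of Lemma 3), one can rank the rows of \(\mathbf {U}^0\) with a pivoted QR factorization, take the r strongest rows as \( S^1\), and repeat on the remaining rows to obtain a disjoint \( S^2\); the sketch below assumes SciPy is available.

```python
import numpy as np
from scipy.linalg import qr

def pick_disjoint_index_sets(U0):
    """Heuristic stand-in for volume sampling: two disjoint row index sets of size r."""
    n, r = U0.shape
    _, _, piv = qr(U0.T, pivoting=True)           # pivoted QR ranks the rows of U0
    S1 = np.sort(piv[:r])
    rest = np.setdiff1d(np.arange(n), S1)         # drop the rows in S1 and repeat
    _, _, piv2 = qr(U0[rest].T, pivoting=True)
    S2 = np.sort(rest[piv2[:r]])
    return S1, S2
```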

3.2 Initialization

Our theorem ensures the accelerated linear convergence given that the initial point \(\mathbf {U}^0\in \varOmega _{S^2}\) is within the local neighborhood of the optimum solution with radius C defined in Theorem 1. We use the initialization strategy in Bhojanapalli et al. (2016a). Specifically, let \(\mathbf {X}^0=\text{ Project }_{+}\left( \frac{-\nabla f(0)}{\Vert \nabla f(0)-\nabla f(11^T)\Vert _F}\right) \) and let \(\mathbf {V}^0{\mathbf {V}^0}^T\) be the best rank-r approximation of \(\mathbf {X}^0\), where \(\text{ Project }_{+}\) denotes the projection operator onto the positive semidefinite cone. Then, Bhojanapalli et al. (2016a) proved \(\Vert \mathbf {V}^0-P_{\mathcal {X}^*}(\mathbf {V}^0)\Vert _F\le \frac{4\sqrt{2}r\Vert \mathbf {U}^*\Vert _2^2}{\sigma _r(\mathbf {U}^*)}\sqrt{\frac{L^2}{\mu ^2}-\frac{2\mu }{L}+1}\). Let \(\mathbf {H}\mathbf {Q}=\mathbf {V}^0_{S^2}\) be its polar decomposition and \(\mathbf {U}^0=\mathbf {V}^0\mathbf {Q}^T\). Then, \(\mathbf {U}^0\) belongs to \(\varOmega _{S^2}\). Although this strategy is not guaranteed to produce an initial point close enough to the target, we show in experiments that our method performs well in practice. It should be noted that, for the gradient descent method applied to the general problem (2), the initialization strategy in Bhojanapalli et al. (2016a) likewise does not satisfy the requirement of the theorems therein for a general objective f.
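
A sketch of this initialization in NumPy (the callable grad_f standing in for \(\nabla f\) is a placeholder; \(\text{ Project }_{+}\) is implemented by clipping negative eigenvalues, and \(11^T\) is the all-ones matrix):

```python
import numpy as np

def initialize(grad_f, n, r, S2):
    """Initialization of Bhojanapalli et al. (2016a), then rotated into Omega_{S^2}."""
    G0 = grad_f(np.zeros((n, n)))
    scale = np.linalg.norm(G0 - grad_f(np.ones((n, n))), 'fro')
    lam, A = np.linalg.eigh(-G0 / scale)                 # X^0 = Project_+(-grad_f(0) / scale)
    lam = np.maximum(lam, 0.0)
    idx = np.argsort(lam)[::-1][:r]                      # top-r eigenpairs of X^0
    V0 = A[:, idx] * np.sqrt(lam[idx])                   # V^0 (V^0)^T = best rank-r approximation
    P, _, Bt = np.linalg.svd(V0[S2])                     # polar decomposition of V^0_{S^2}
    return V0 @ (P @ Bt).T                               # U^0 = V^0 Q^T lies in Omega_{S^2}
```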

4 Accelerated convergence rate analysis

In this section, we prove the local accelerated linear convergence rate of Algorithm 1. We first consider the inner loop. It uses the classical accelerated gradient method to solve problem (8) with a fixed index set S for a finite number of iterations. Thanks to the stronger curvature built in Theorem 1 and the smoothness in Corollary 1, we can use the standard proof framework to analyze the inner loop, e.g., Tseng (2008). Some slight modifications are needed since we should ensure that all the iterates belong to the local neighborhood of \(\mathbf {U}^*\). We present the result in the following lemma and give its proof sketch. For simplicity, we omit the outer iteration number t.

Lemma 4

Let \(\mathbf {U}^*=\varOmega _{ S}\cap \mathcal {X}^*\) and assume that \(\mathbf {U}^0\in \varOmega _{ S}\) with \(\epsilon \le 0.99\sigma _r(\mathbf {U}_{ S'}^*)\) and \(\Vert \mathbf {U}^0-\mathbf {U}^*\Vert _F\le C\). Let \(\eta =\frac{1}{L_g}\), where C is defined in Theorem 1 and \(L_g\) is defined in Corollary 1. Then, we have \(\sigma _r(\mathbf {U}_{ S'}^{K+1})\ge \epsilon \), \(\Vert \mathbf {U}^{K+1}-\mathbf {U}^*\Vert _F\le C\) and

$$\begin{aligned} g(\mathbf {U}^{K+1})-g(\mathbf {U}^*)\le \frac{2}{(K+1)^2\eta }\left\| \mathbf {U}^*-\mathbf {U}^0\right\| _F^2. \end{aligned}$$

Proof

We follow four steps to prove the lemma.

Step 1 We can easily check that if \(\mathbf {U}^0\in \varOmega _{ S}\), then all the iterates of \(\{\mathbf {U}^k,\mathbf {V}^k,\mathbf {Z}^k\}\) belong to \(\varOmega _{ S}\) by \(0\le \theta _k\le 1\), the convexity of \(\varOmega _S\) and the convex combinations in (11) and (13).

Step 2 Consider the k-th iteration. If \(\Vert \mathbf {V}^k-\mathbf {U}^*\Vert _F\le C\), \(\Vert \mathbf {Z}^k-\mathbf {U}^*\Vert _F\le C\) and \(\Vert \mathbf {U}^k-\mathbf {U}^*\Vert _F\le C\), then Theorem 1 and Corollary 1 hold. From the standard analysis of the accelerated gradient method for convex programming, e.g., Proposition 1 in Tseng (2008), we can have

$$\begin{aligned} \begin{aligned}&\frac{1}{\theta _k^2}\left( g(\mathbf {U}^{k+1})-g(\mathbf {U}^*)\right) +\frac{1}{2\eta }\Vert \mathbf {Z}^{k+1}-\mathbf {U}^*\Vert _F^2\\&\quad \le \frac{1}{\theta _{k-1}^2}\left( g(\mathbf {U}^k)-g(\mathbf {U}^*)\right) +\frac{1}{2\eta }\Vert \mathbf {Z}^k-\mathbf {U}^*\Vert _F^2. \end{aligned} \end{aligned}$$
(15)

Step 3 Since Theorem 1 and Corollary 1 hold only in a local neighborhood of \(\mathbf {U}^*\), we need to check that \(\{\mathbf {U}^k,\mathbf {V}^k,\mathbf {Z}^k\}\) belongs to this neighborhood for all the iterations, which can be easily done via induction. In fact, from (15) and the convex combinations in (11) and (13), we know that if the following conditions hold,

$$\begin{aligned}&\Vert \mathbf {V}^k-\mathbf {U}^*\Vert _F\le C,\quad \Vert \mathbf {U}^k-\mathbf {U}^*\Vert _F\le C,\quad \Vert \mathbf {Z}^k-\mathbf {U}^*\Vert _F\le C,\\&\quad \frac{1}{\theta _{k-1}^2}\left( g(\mathbf {U}^k)-g(\mathbf {U}^*)\right) +\frac{1}{2\eta }\Vert \mathbf {Z}^k-\mathbf {U}^*\Vert _F^2\le \frac{C^2}{2\eta }, \end{aligned}$$

then we can have

$$\begin{aligned}&\Vert \mathbf {V}^{k+1}-\mathbf {U}^*\Vert _F\le C,\quad \Vert \mathbf {U}^{k+1}-\mathbf {U}^*\Vert _F\le C,\quad \Vert \mathbf {Z}^{k+1}-\mathbf {U}^*\Vert _F\le C,\\&\quad \frac{1}{\theta _k^2}\left( g(\mathbf {U}^{k+1})-g(\mathbf {U}^*)\right) +\frac{1}{2\eta }\Vert \mathbf {Z}^{k+1}-\mathbf {U}^*\Vert _F^2\le \frac{C^2}{2\eta }. \end{aligned}$$

Step 4 From \(\frac{1}{\theta _{-1}}=0\) and Step 3, we know (15) holds for all the iterations. Thus, we have

$$\begin{aligned} g(\mathbf {U}^{K+1})-g(\mathbf {U}^*)\le \frac{\theta _K^2}{2\eta }\left\| \mathbf {Z}^0-\mathbf {U}^*\right\| _F^2\le \frac{2}{(K+1)^2\eta }\Vert \mathbf {Z}^0-\mathbf {U}^*\Vert _F^2, \end{aligned}$$

where we use \(\theta _k\le \frac{2}{k+1}\) from \(\frac{1-\theta _{k+1}}{\theta _{k+1}^2}=\frac{1}{\theta _k^2}\) and \(\theta _0=1\).

On the other hand, from the perturbation theorem of singular values, we have

$$\begin{aligned} \sigma _r(\mathbf {U}_{ S'}^*)-\sigma _r(\mathbf {U}_{ S'}^{K+1})\le \Vert \mathbf {U}_{ S'}^{K+1}-\mathbf {U}_{ S'}^*\Vert _F\le \Vert \mathbf {U}^{K+1}-\mathbf {U}^*\Vert _F\le C\le 0.01\sigma _r(\mathbf {U}_{ S'}^*), \end{aligned}$$

which leads to \(\sigma _r(\mathbf {U}_{ S'}^{K+1})\ge 0.99\sigma _r(\mathbf {U}_{ S'}^*)\ge \epsilon \). \(\square \)

Now we consider the outer loop of Algorithm 1. Based on Lemma 4, the second order growth property (5) and the perturbation theory of the polar decomposition, we can establish the exponential decrease of \(\Vert \mathbf {U}^{t,0}-\mathbf {U}^{t,*}\Vert _F\) in the following lemma.

Lemma 5

Let \(\mathbf {U}^{t,*}=\varOmega _{ S}\cap \mathcal {X}^*\) and \(\mathbf {U}^{t+1,*}=\varOmega _{ S'}\cap \mathcal {X}^*\) and assume that \(\mathbf {U}^{t,0}\in \varOmega _{ S}\) with \(\epsilon \le 0.99\sigma _r(\mathbf {U}_{ S'}^{t,*})\) and \(\Vert \mathbf {U}^{t,0}-\mathbf {U}^{t,*}\Vert _F\le C\). Let \(K+1=\frac{28\Vert \mathbf {U}^{*}\Vert _2}{\sqrt{\eta \mu }\sigma _r(\mathbf {U}^{*})\min \{\sigma _r(\mathbf {U}_{ S^1}^{*}),\sigma _r(\mathbf {U}_{ S^2}^{*})\}}\). Then, we can have \(\mathbf {U}^{t+1,0}\in \varOmega _{ S'}\) and

$$\begin{aligned} \Vert \mathbf {U}^{t+1,0}-\mathbf {U}^{t+1,*}\Vert _F\le \frac{1}{4}\Vert \mathbf {U}^{t,0}-\mathbf {U}^{t,*}\Vert _F. \end{aligned}$$
(16)

Proof

We follow four steps to prove the lemma.

Step 1 From Lemma 4, we have \(\sigma _r(\mathbf {U}_{ S'}^{t,K+1})\ge \epsilon \), \(\Vert \mathbf {U}^{t,K+1}-\mathbf {U}^{t,*}\Vert _F\le C\) and

$$\begin{aligned} g(\mathbf {U}^{t,K+1})-g(\mathbf {U}^{t,*})\le \frac{2}{(K+1)^2\eta }\Vert \mathbf {U}^{t,0}-\mathbf {U}^{t,*}\Vert _F^2. \end{aligned}$$
(17)

From Algorithm 1, we have \(\sigma _r(\mathbf {U}_{ S'}^{t+1,0})=\sigma _r(\mathbf {U}_{ S'}^{t,K+1})\). So \(\mathbf {U}_{ S'}^{t+1,0}\succeq \epsilon \mathbf {I}\) and \(\mathbf {U}^{t+1,0}\in \varOmega _{ S'}\).

Step 2 From Lemma 11 in “Appendix B”, we have

$$\begin{aligned} g(\mathbf {U}^{t,K+1})-g(\mathbf {U}^{t,*})\ge 0.4\mu \sigma _r^2(\mathbf {U}^{t,*})\Vert \mathbf {U}^{t,K+1}-{\hat{\mathbf {U}}}^{t,*}\Vert _F^2, \end{aligned}$$
(18)

where \({\hat{\mathbf {U}}}^{t,*}=P_{\mathcal {X}^*}(\mathbf {U}^{t,K+1})=\mathbf {U}^{t,*}\mathbf {R}\) and \(\mathbf {R}=\text{ argmin }_{\mathbf {R}\mathbf {R}^T=\mathbf {I}}\Vert \mathbf {U}^{t,*}\mathbf {R}-\mathbf {U}^{t,K+1}\Vert _F^2\).

Step 3 Given (17) and (18), in order to prove (16), we only need to lower bound \(\Vert \mathbf {U}^{t,K+1}-{\hat{\mathbf {U}}}^{t,*}\Vert _F\) by \(\Vert \mathbf {U}^{t+1,0}-\mathbf {U}^{t+1,*}\Vert _F\).

From Algorithm 1, we know that \(\mathbf {H}\mathbf {Q}=\mathbf {U}_{ S'}^{t,K+1}\) is the unique polar decomposition of \(\mathbf {U}_{ S'}^{t,K+1}\) and \(\mathbf {U}^{t+1,0}=\mathbf {U}^{t,K+1}\mathbf {Q}^T\). Let \(\mathbf {H}^*\mathbf {Q}^*={\hat{\mathbf {U}}}_{ S'}^{t,*}\) be its unique polar decomposition and \(\mathbf {U}^{t+1,*}={\hat{\mathbf {U}}}^{t,*}(\mathbf {Q}^*)^T\), then \(\mathbf {U}^{t+1,*}\in \varOmega _{ S'}\cap \mathcal {X}^*\). From the perturbation theorem of polar decomposition in Lemma 1, we have

$$\begin{aligned} \Vert \mathbf {Q}-\mathbf {Q}^*\Vert _F\le \frac{2}{\sigma _r({\hat{\mathbf {U}}}_{ S'}^{t,*})}\Vert \mathbf {U}_{ S'}^{t,K+1}-{\hat{\mathbf {U}}}_{ S'}^{t,*}\Vert _F. \end{aligned}$$

Similar to (9), we have

$$\begin{aligned} \begin{aligned}&\Vert \mathbf {U}^{t+1,0}-\mathbf {U}^{t+1,*}\Vert _F\\&\quad =\Vert \mathbf {U}^{t,K+1}\mathbf {Q}^T-{\hat{\mathbf {U}}}^{t,*}(\mathbf {Q}^*)^T\Vert _F\\&\quad =\Vert \mathbf {U}^{t,K+1}\mathbf {Q}^T-{\hat{\mathbf {U}}}^{t,*}\mathbf {Q}^T+{\hat{\mathbf {U}}}^{t,*}\mathbf {Q}^T-{\hat{\mathbf {U}}}^{t,*}(\mathbf {Q}^*)^T\Vert _F\\&\quad \le \Vert \mathbf {U}^{t,K+1}-{\hat{\mathbf {U}}}^{t,*}\Vert _F+\Vert {\hat{\mathbf {U}}}^{t,*}\Vert _2\Vert \mathbf {Q}-\mathbf {Q}^*\Vert _F\\&\quad \le \frac{3\Vert \mathbf {U}^{t,*}\Vert _2}{\sigma _r(\mathbf {U}^{t,*}_{ S'})}\Vert \mathbf {U}^{t,K+1}-{\hat{\mathbf {U}}}^{t,*}\Vert _F. \end{aligned} \end{aligned}$$
(19)

Step 4 Combining (17), (18) and (19), we have

$$\begin{aligned}&\Vert \mathbf {U}^{t+1,0}-\mathbf {U}^{t+1,*}\Vert _F\\&\quad \le \frac{3\Vert \mathbf {U}^{t,*}\Vert _2}{\sigma _r(\mathbf {U}^{t,*}_{ S'})}\Vert \mathbf {U}^{t,K+1}-{\hat{\mathbf {U}}}^{t,*}\Vert _F\\&\quad \le \frac{3\Vert \mathbf {U}^{t,*}\Vert _2}{\sigma _r(\mathbf {U}^{t,*}_{ S'})}\frac{\sqrt{5}}{\sqrt{\eta \mu }(K+1)\sigma _r(\mathbf {U}^{t,*})}\Vert \mathbf {U}^{t,0}-\mathbf {U}^{t,*}\Vert _F\\&\quad \le \frac{7\Vert \mathbf {U}^{t,*}\Vert _2}{\sqrt{\eta \mu }(K+1)\sigma _r(\mathbf {U}^{t,*})\min \{\sigma _r(\mathbf {U}_{ S}^{t,*}),\sigma _r(\mathbf {U}_{ S'}^{t,*})\}}\Vert \mathbf {U}^{t,0}-\mathbf {U}^{t,*}\Vert _F. \end{aligned}$$

From the setting of \(K+1\), we can have the conclusion. \(\square \)

From Lemma 5, we can give the accelerated convergence rate in the following theorem. The proof is provided in “Appendix B”. Theorem 3 contains several assumptions. For the trajectories, we assume that we can find two disjoint sets \(S^1\) and \(S^2\) such that \(\sigma _r(\mathbf {U}_{ S^1}^*)\) and \(\sigma _r(\mathbf {U}_{ S^2}^*)\) are as large as possible (see Sect. 3.1 for the discussion). For the initialization, we assume that we can find an initial point \(\mathbf {U}^{0,0}\) close enough to \(\mathbf {U}^{0,*}\) (see Sect. 3.2 for the discussion). Then, we can prove that when the outer iteration number t is odd, \(\mathbf {U}^{t,k}\) belongs to \(\varOmega _{S^1}\) and the iterates converge to the optimum solution in \(\varOmega _{S^1}\cap \mathcal {X}^*\); when t is even, the iterates belong to \(\varOmega _{S^2}\) and converge to another optimum solution, the one in \(\varOmega _{S^2}\cap \mathcal {X}^*\). In our algorithm, we set \(\eta \) and K based on reliable knowledge of \(\Vert \mathbf {U}^*\Vert _2\), \(\sigma _r(\mathbf {U}^*)\) and \(\sigma _r(\mathbf {U}_{S}^*)\). As suggested by Bhojanapalli et al. (2016a) and Park et al. (2018), these quantities can be estimated, up to constants, by \(\Vert \mathbf {U}^0\Vert _2\), \(\sigma _r(\mathbf {U}^0)\) and \(\sigma _r(\mathbf {U}_S^0)\), since \(\mathbf {U}^0\) is close to \(\mathbf {U}^*\).
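
As an illustration, the following sketch (our own reading, not part of the paper) sets \(\eta \) and K by replacing the \(\mathbf {U}^*\) quantities in Corollary 1 and Lemma 5 with their \(\mathbf {U}^0\) counterparts; here L, mu and the callable grad_f are the assumed restricted smoothness constant, restricted strong convexity constant and gradient oracle of f.

```python
import numpy as np

def choose_eta_and_K(U0, S1, S2, L, mu, grad_f):
    """Estimate eta = 1/L_g and the inner-loop length K from U0 (cf. Corollary 1, Lemma 5)."""
    sigma_1 = np.linalg.norm(U0, 2)                              # proxy for ||U*||_2
    sigma_r = np.linalg.svd(U0, compute_uv=False)[-1]            # proxy for sigma_r(U*)
    sigma_S = min(np.linalg.svd(U0[S1], compute_uv=False)[-1],
                  np.linalg.svd(U0[S2], compute_uv=False)[-1])   # proxy for sigma_r(U*_{S^j})
    L_g = 38.0 * L * sigma_1**2 + 2.0 * np.linalg.norm(grad_f(U0 @ U0.T), 2)
    eta = 1.0 / L_g
    K = int(np.ceil(28.0 * sigma_1 / (np.sqrt(eta * mu) * sigma_r * sigma_S))) - 1
    return eta, K
```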

Theorem 3

Let \(\mathbf {U}^{t,*}=\varOmega _{S^1}\cap \mathcal {X}^*\) when t is odd and \(\mathbf {U}^{t,*}=\varOmega _{S^2}\cap \mathcal {X}^*\) when t is even. Assume that \(\mathbf {U}^*\in \mathcal {X}^*\) and \(\mathbf {U}^{0,0}\in \varOmega _{S^2}\) with \(\Vert \mathbf {U}^{0,0}-\mathbf {U}^{0,*}\Vert _F\le C\) and \(\epsilon \le \min \{0.99\sigma _r(\mathbf {U}^*_{ S^1}),0.99\sigma _r(\mathbf {U}^*_{ S^2})\}\). Then, we have

$$\begin{aligned} \Vert \mathbf {U}^{t+1,0}-\mathbf {U}^{t+1,*}\Vert _F\le \left( 1-\frac{1}{6}\sqrt{\frac{\mu _g}{L_g}}\right) ^{(t+1)(K+1)}\Vert \mathbf {U}^{0,0}-\mathbf {U}^*\Vert _F, \end{aligned}$$

and

$$\begin{aligned} g(\mathbf {U}^{t+1,0})-g(\mathbf {U}^{*})\le L_g\left( 1-\frac{1}{6}\sqrt{\frac{\mu _g}{L_g}}\right) ^{2(t+1)(K+1)}\Vert \mathbf {U}^{0,0}-\mathbf {U}^*\Vert _F^2, \end{aligned}$$

where \(\mu _g=\frac{\mu \sigma _r^2(\mathbf {U}^*)\min \{\sigma _r^2(\mathbf {U}_{ S^1}^*),\sigma _r^2(\mathbf {U}_{ S^2}^*)\}}{25\Vert \mathbf {U}^*\Vert _2^2}\), \(L_g=38L\Vert \mathbf {U}^*\Vert _2^2+2\Vert \nabla f(\mathbf {X}^*)\Vert _2\) and \(C=\frac{\mu \sigma _r^2(\mathbf {U}^*)\min \{\sigma _r^2(\mathbf {U}_{S^1}^*),\sigma _r^2(\mathbf {U}_{S^2}^*)\}}{100L\Vert \mathbf {U}^*\Vert _2^3}\).

4.1 Comparison to the gradient descent

Bhojanapalli et al. (2016a) used the gradient descent to solve problem (2), which consists of the following recursion:

$$\begin{aligned} \mathbf {U}^{k+1}=\mathbf {U}^k-\eta \nabla g(\mathbf {U}^k). \end{aligned}$$

With the restricted strong convexity and smoothness of \(f(\mathbf {X})\), Bhojanapalli et al. (2016a) proved the linear convergence of gradient descent in the form of

$$\begin{aligned} \begin{aligned}&\Vert \mathbf {U}^{N+1}-P_{\mathcal {X}^*}(\mathbf {U}^{N+1})\Vert _F^2\\&\quad \le \left( 1-\frac{\sigma _r^2(\mathbf {U}^*)}{\Vert \mathbf {U}^*\Vert _2^2}\frac{\mu }{L+\Vert \nabla f(\mathbf {X}^*)\Vert _2/\Vert \mathbf {U}^*\Vert _2^2}\right) ^N\Vert \mathbf {U}^0-P_{\mathcal {X}^*}(\mathbf {U}^0)\Vert _F^2. \end{aligned} \end{aligned}$$
(20)

As a comparison, from Theorem 3, our method converges linearly with the error decreasing as \(\left( 1-\frac{\sigma _r(\mathbf {U}^*)\min \{\sigma _r(\mathbf {U}_{ S^1}^*),\sigma _r(\mathbf {U}_{ S^2}^*)\}}{\Vert \mathbf {U}^*\Vert _2^2}\sqrt{\frac{\mu }{L+\Vert \nabla f(\mathbf {X}^*)\Vert _2/\Vert \mathbf {U}^*\Vert _2^2}}\right) ^N\), where N is the total number of inner iterations. From Lemma 3, we know that \(\sigma _r(\mathbf {U}^*_{ S})\approx \frac{1}{\sqrt{rn}}\sigma _r(\mathbf {U}^*)\) in the worst case and that this is tight (Avron and Boutsidis 2013). Thus, our method has the convergence rate of \(\left( 1-\frac{\sigma _r^2(\mathbf {U}^*)}{\Vert \mathbf {U}^*\Vert _2^2}\sqrt{\frac{\mu }{nr(L+\Vert \nabla f(\mathbf {X}^*)\Vert _2/\Vert \mathbf {U}^*\Vert _2^2)}}\right) ^N\) in the worst case. When the function f is ill-conditioned, i.e., \(\frac{L}{\mu }\ge nr\), our method outperforms the gradient descent. This phenomenon is similar to the one observed in the stochastic optimization community: the non-accelerated methods such as SDCA (Shalev-Shwartz and Zhang 2013), SVRG (Xiao and Zhang 2014) and SAG (Schmidt et al. 2017) have the complexity of \(O\left( \frac{L}{\mu }\log \frac{1}{\epsilon }\right) \) while the accelerated methods such as Accelerated SDCA (Shalev-Shwartz and Zhang 2016), Catalyst (Lin et al. 2015) and Katyusha (Allen-Zhu 2017) have the complexity of \(O\left( \sqrt{\frac{mL}{\mu }}\log \frac{1}{\epsilon }\right) \), where m is the sample size. The latter is tight when \(\frac{L}{\mu }\ge m\) for stochastic programming (Woodworth and Srebro 2016). In matrix completion, the optimal sample complexity is \(O(rn\log n)\) (Candès and Recht 2009). It is unclear whether our convergence rate for problem (2) is tight or whether there exists a faster method. We leave this as an open problem.

For better reference, we summarize the comparisons in Table 1. We can see that our method has the same optimal dependence on \(\sqrt{\frac{L}{\mu }}\) as convex programming.

Table 1 Convergence rate comparisons of the gradient descent method (GD) and accelerated gradient descent method (AGD)

4.1.1 Dropping the dependence on n

Our convergence rate has an additional dependence on n compared with the gradient descent method. It comes from \(\sigma _r(\mathbf {U}^*_S)\), i.e., Lemma 2. In fact, we use a loose relaxation in the last inequality of (9), i.e., \(\frac{2\Vert \mathbf {U}\Vert _2}{\sigma _r(\mathbf {U}_S)}\Vert {\hat{\mathbf {V}}}_S-\mathbf {U}_S\Vert _F\le \frac{2\Vert \mathbf {U}\Vert _2}{\sigma _r(\mathbf {U}_S)}\Vert {\hat{\mathbf {V}}}-\mathbf {U}\Vert _F\). Since \(\mathbf {U}_S\in \mathbb {R}^{r\times r}\) and \(\mathbf {U}\in \mathbb {R}^{n\times r}\), a more suitable estimation should be

$$\begin{aligned} \frac{2\Vert \mathbf {U}\Vert _2}{\sigma _r(\mathbf {U}_S)}\Vert {\hat{\mathbf {V}}}_S-\mathbf {U}_S\Vert _F\approx \frac{2\Vert \mathbf {U}\Vert _2}{\sigma _r(\mathbf {U}_S)}\sqrt{\frac{r}{n}}\Vert {\hat{\mathbf {V}}}-\mathbf {U}\Vert _F\approx \frac{2r\Vert \mathbf {U}\Vert _2}{\sigma _r(\mathbf {U})}\Vert {\hat{\mathbf {V}}}-\mathbf {U}\Vert _F. \end{aligned}$$
(21)

In practice, (21) holds when the entries of \(\mathbf {U}^{t,k}\) and \(\mathbf {V}^{t,k}\) converge nearly equally fast to those of \(\mathbf {U}^{t,*}\), which is often the case. Thus, under condition (21), our convergence rate can be improved to

$$\begin{aligned} \left( 1-\frac{\sigma _r^2(\mathbf {U}^*)}{r\Vert \mathbf {U}^*\Vert _2^2}\sqrt{\frac{\mu }{L+\Vert \nabla f(\mathbf {X}^*)\Vert _2/\Vert \mathbf {U}^*\Vert _2^2}}\right) ^N. \end{aligned}$$

We numerically verify (21) in Sect. 8.4.

4.1.2 Examples with ill-conditioned objective f

Although the condition number \(\frac{L}{\mu }\) is approximately 1 for some well-known problems in machine learning, e.g., matrix regression and matrix completion (Chen and Wainwright 2015), we can still find many problems with an ill-conditioned objective, especially in computer vision applications. We give the example of low rank representation (LRR) (Liu et al. 2013), a well-known model in computer vision. It can be formulated as

$$\begin{aligned} \min _{\mathbf {X}} \text{ rank }(\mathbf {X}) \quad s.t.\quad \mathbf {D}\mathbf {X}=\mathbf {A}, \end{aligned}$$

where \(\mathbf {A}\) is the observed data and \(\mathbf {D}\) is a dictionary that linearly spans the data space. We can reformulate the problem as follows:

$$\begin{aligned} \min _{\mathbf {X}} \Vert \mathbf {D}\mathbf {X}-\mathbf {A}\Vert _F^2 \quad s.t.\quad \text{ rank }(\mathbf {X})\le r. \end{aligned}$$

We know that \(L/\mu =\kappa (\mathbf {D}^T\mathbf {D})\), i.e., the condition number of \(\mathbf {D}^T\mathbf {D}\). If we generate \(\mathbf {D}\in \mathbb {R}^{n\times n}\) as a random matrix with normal distribution, then \(E\left[ \log \kappa (\mathbf {D})\right] \sim \log n\) as \(n\rightarrow \infty \) (Edelman 1988) and thus \(E\left[ \frac{L}{\mu }\right] \sim n^2\). We numerically verify in MATLAB that for \(n=1000\), \(\frac{L}{\mu }\) is of the order \(10^7\), which is much larger than O(n).
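
A quick NumPy check of this claim (a sketch of the experiment described above, not the original MATLAB script):

```python
import numpy as np

n = 1000
D = np.random.randn(n, n)                  # Gaussian dictionary
s = np.linalg.svd(D, compute_uv=False)
cond = (s[0] / s[-1]) ** 2                 # L/mu = kappa(D^T D) = kappa(D)^2
print(f"L/mu = kappa(D^T D) ~ {cond:.1e}  (n = {n}, n^2 = {n**2:.0e})")
```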

Another example is the reduced rank logistic generalized linear model (RR-LGLM) (Yee and Hastie 2000; She 2013). Assume that the entries of \(\mathbf {A}\) are all binary and denote \(\mathbf {D}=[\mathbf {d}_1,\ldots ,\mathbf {d}_n]^T\) and \(\mathbf {X}=[\mathbf {x}_1,\ldots ,\mathbf {x}_n]\). RR-LGLM minimizes

$$\begin{aligned} \min _{\mathbf {X}} -\sum _{i=1}^n\sum _{j=1}^n \left( \mathbf {A}_{i,j}\mathbf {d}_i^T\mathbf {x}_j-\log \left( 1+\exp (\mathbf {d}_i^T\mathbf {x}_j)\right) \right) \quad s.t.\quad \text{ rank }(\mathbf {X})\le r. \end{aligned}$$

The Hessian of the objective is \(\text{ diag }(\mathbf {D}^T\mathbf {G}_1\mathbf {D},\ldots ,\mathbf {D}^T\mathbf {G}_n\mathbf {D})\), where \(\mathbf {G}_j\) is the \(n\times n\) diagonal matrix whose i-th component is \(\frac{\exp (\mathbf {d}_i^T\mathbf {x}_j)}{(1+\exp (\mathbf {d}_i^T\mathbf {x}_j))^2}\). Thus, \(L/\mu \) is at least \(\kappa (\mathbf {D}^T\mathbf {D})\). As discussed above, it may be much larger than n. Other similar examples can be found in Wagner and Zuk (2015) and Liu and Li (2016).

5 Global convergence

In this section, we study the global convergence of Algorithm 1 without the assumption that \(f(\mathbf {X})\) is restricted strongly convex. We allow the algorithm to start from any initializer. Since we have no information about \(\mathbf {U}^*\) when \(\mathbf {U}^0\) is far from \(\mathbf {U}^*\), we use an adaptive index set selection procedure for Algorithm 1. That is, after each inner loop we check whether \(\sigma _r(\mathbf {U}^{t,K+1}_{S'})\ge \epsilon \) holds; if not, we select a new index set \(S'\) using the volume sampling subset selection algorithm.

We first consider the inner loop and establish Lemma 6. We drop the outer iteration number t for simplicity and leave the proof in “Appendix C”.

Lemma 6

Assume that \(\{\mathbf {U}^k,\mathbf {V}^k\}\) is bounded and \(\mathbf {U}^0\in \varOmega _{ S}\). Let \(\eta \le \frac{1-\beta _{\max }^2}{{\hat{L}}(2\beta _{\max }+1)+2\gamma }\), where \({\hat{L}}=2D+4LM^2\), \(D=\max \{\Vert \nabla f(\mathbf {U}^k(\mathbf {U}^k)^T)\Vert _2,\Vert \nabla f(\mathbf {V}^k(\mathbf {V}^k)^T)\Vert _2,\forall k\}\), \(M=\max \{\Vert \mathbf {U}^k\Vert _2,\Vert \mathbf {V}^k\Vert _2,\forall k\}\), \(\beta _{\max }=\max \left\{ \beta _k,k=0,\ldots ,K\right\} \), \(\beta _k=\frac{\theta _k(1-\theta _{k-1})}{\theta _{k-1}}\) and \(\gamma \) is a small constant. Then, we have

$$\begin{aligned} g(\mathbf {U}^{K+1})-g(\mathbf {U}^0)\le -\sum _{k=0}^K \gamma \Vert \mathbf {U}^{k+1}-\mathbf {U}^k\Vert _F^2. \end{aligned}$$

Now we consider the outer loop. As discussed in Sect. 3, when solving problem (8) directly, we may get stuck at the boundary of the constraint. Thanks to the alternating constraint strategy, we can cancel the negative influence of the constraint and establish the global convergence to a critical point of problem (2), which is described in Theorem 4. It establishes that after at most \(O\left( \frac{1}{\varepsilon ^2}\log \frac{1}{\varepsilon }\right) \) operations, \(\mathbf {U}^{T,K+1}\) is an approximate zero-gradient point with precision \(\varepsilon \). Briefly speaking, since the projection operation in (12) only influences the rows indicated by the index set S, a simple calculation yields \(\Vert (\nabla g(\mathbf {Z}^{t,K+1}))_{- S^1}\Vert _F\le O(\varepsilon )\) and \(\Vert (\nabla g(\mathbf {Z}^{t,K+1}))_{- S^2}\Vert _F\le O(\varepsilon )\). From \(S^1\cap S^2=\emptyset \), we have \(\Vert \nabla g(\mathbf {Z}^{t,K+1})\Vert _F\le \Vert (\nabla g(\mathbf {Z}^{t,K+1}))_{- S^1}\Vert _F+\Vert (\nabla g(\mathbf {Z}^{t,K+1}))_{- S^2}\Vert _F\le O(\varepsilon )\), which explains why the alternating constraint strategy avoids the boundary of the constraint.

Theorem 4

Assume that \(\{\mathbf {U}^{t,k},\mathbf {V}^{t,k}\}\) is bounded and \(\sigma _r(\mathbf {U}_{S'}^{t,K+1})\ge \epsilon ,\forall t\). Let \(\eta \) be as defined in Lemma 6. Then, after at most \(T=2\frac{f(\mathbf {U}^{0,0}(\mathbf {U}^{0,0})^T)-f(\mathbf {X}^*)}{\varepsilon ^2}\) outer iterations, we have

$$\begin{aligned} \left\| \nabla g(\mathbf {U}^{T,K+1})\right\| _F\le \frac{35\varepsilon }{\eta \theta _K} \end{aligned}$$

with probability \(1-\delta \). Each run of the volume sampling subset selection algorithm needs \(O\left( nr^3\log \left( \frac{f(\mathbf {U}^{0,0}(\mathbf {U}^{0,0})^T)-f(\mathbf {X}^*)}{\delta \varepsilon ^2}\right) \right) \) operations.

Proof

We follow three steps to prove the theorem.

Step 1 We first bound the difference between two consecutive iterates, i.e., \(\mathbf {U}^{t,k+1}-\mathbf {U}^{t,k}\).

From Lemma 6 we have

$$\begin{aligned} \gamma \sum _{k=0}^K \Vert \mathbf {U}^{t,k+1}-\mathbf {U}^{t,k}\Vert _F^2\le g(\mathbf {U}^{t,0})-g(\mathbf {U}^{t,K+1}). \end{aligned}$$

Summing over \(t=0,\ldots ,T\) yields

$$\begin{aligned}&\gamma \sum _{t=0}^{T}\sum _{k=0}^K \Vert \mathbf {U}^{t,k+1}-\mathbf {U}^{t,k}\Vert _F^2\le \sum _{t=0}^{T}\left( g(\mathbf {U}^{t,0})-g(\mathbf {U}^{t,K+1})\right) \\&\quad =\sum _{t=0}^{T}\left( g(\mathbf {U}^{t,0})-g(\mathbf {U}^{t+1,0})\right) \le g(\mathbf {U}^{0,0})-f(\mathbf {U}^*{\mathbf {U}^*}^T). \end{aligned}$$

So after \(T=2\frac{g(\mathbf {U}^{0,0})-f(\mathbf {X}^*)}{\varepsilon ^2}\) outer iterations, we must have

$$\begin{aligned} \sum _{k=0}^K \Vert \mathbf {U}^{t,k+1}-\mathbf {U}^{t,k}\Vert _F^2+\sum _{k=0}^K \Vert \mathbf {U}^{t+1,k+1}-\mathbf {U}^{t+1,k}\Vert _F^2\le \varepsilon ^2 \end{aligned}$$
(22)

for some \(t<T\). Thus, we can bound \(\Vert \mathbf {U}^{t',k+1}-\mathbf {U}^{t',k}\Vert _F\) by \(\varepsilon \), where \(t'=t\) or \(t'=t+1\). Moreover, from Lemma 13 in “Appendix C”, we can bound \(\Vert \mathbf {U}^{t',k+1}-\mathbf {Z}^{t',k+1}\Vert _F\), \(\Vert \mathbf {Z}^{t',k+1}-\mathbf {Z}^{t',k}\Vert _F\) and \(\Vert \mathbf {Z}^{t',k+1}-\mathbf {V}^{t',k}\Vert _F\) by \(\frac{\varepsilon }{\theta _k}\).

Step 2 We then bound part of the gradient, i.e., \(\left( \nabla g(\mathbf {Z}^{t,K+1})\right) _{- S^1}\) and \(\left( \nabla g(\mathbf {Z}^{t,K+1})\right) _{- S^2}\).

From the optimality condition of (12), we have

$$\begin{aligned} -\frac{\theta _k}{\eta }\left( \mathbf {Z}^{t',k+1}-\mathbf {Z}^{t',k}\right) +\nabla g(\mathbf {Z}^{t',k+1})-\nabla g(\mathbf {V}^{t',k})\in \nabla g(\mathbf {Z}^{t',k+1})+\partial I_{\varOmega _{ S^j}}(\mathbf {Z}^{t',k+1}) \end{aligned}$$

for \(j=1\) when \(t'=t\) and \(j=2\) when \(t'=t+1\). From Lemmas 10 and 13, we can easily check that

$$\begin{aligned} \left\| -\frac{\theta _k}{\eta }\left( \mathbf {Z}^{t',k+1}-\mathbf {Z}^{t',k}\right) +\nabla g(\mathbf {Z}^{t',k+1})-\nabla g(\mathbf {V}^{t',k})\right\| _F\le \frac{14\varepsilon }{\eta \theta _k}. \end{aligned}$$

Thus, we obtain

$$\begin{aligned} \text{ dist }\left( 0,\nabla g(\mathbf {Z}^{t',k+1})+\partial I_{\varOmega _{ S^j}}(\mathbf {Z}^{t',k+1})\right) \le \frac{14\varepsilon }{\eta \theta _k},\forall k=0,\ldots ,K. \end{aligned}$$

Since \(\partial I_{\varOmega _{ S^j}}(\mathbf {Z}^{t',k+1})\) has zero entries in the rows whose indexes lie outside \(S^j\), we have

$$\begin{aligned} \left\| \left( \nabla g(\mathbf {Z}^{t,K+1})\right) _{- S^1}\right\| _F\le \frac{14\varepsilon }{\eta \theta _K}, \end{aligned}$$
(23)

and

$$\begin{aligned} \left\| \left( \nabla g(\mathbf {Z}^{t+1,1})\right) _{- S^2}\right\| _F\le \frac{14\varepsilon }{\eta \theta _K}. \end{aligned}$$
(24)

On the other hand,

$$\begin{aligned} \begin{aligned}&\left\| \left( \nabla g(\mathbf {Z}^{t+1,0})\right) _{- S^2}\right\| _F-\left\| \left( \nabla g(\mathbf {Z}^{t+1,1})\right) _{- S^2}\right\| _F\\&\quad \le \left\| \left( \nabla g(\mathbf {Z}^{t+1,0})-\nabla g(\mathbf {Z}^{t+1,1})\right) _{- S^2}\right\| _F\\&\quad \le \left\| \nabla g(\mathbf {Z}^{t+1,0})-\nabla g(\mathbf {Z}^{t+1,1})\right\| _F\le {\hat{L}}\Vert \mathbf {Z}^{t+1,0}-\mathbf {Z}^{t+1,1}\Vert _F\le \frac{5{\hat{L}}\varepsilon }{\theta _K}, \end{aligned} \end{aligned}$$
(25)

where we use Lemma 13 in the last inequality. Combining (24) and (25), we obtain

$$\begin{aligned} \left\| \left( \nabla g(\mathbf {Z}^{t+1,0})\right) _{- S^2}\right\| _F\le \frac{19\varepsilon }{\eta \theta _K}. \end{aligned}$$

Since \(\mathbf {Z}^{t+1,0}=\mathbf {Z}^{t,K+1}\mathbf {Q}^T\) for some orthogonal \(\mathbf {Q}\), we have

$$\begin{aligned} \begin{aligned} \frac{19\varepsilon }{\eta \theta _K}\ge&\left\| \left( \nabla g(\mathbf {Z}^{t+1,0})\right) _{- S^2}\right\| _F=\left\| \left( \nabla g(\mathbf {Z}^{t,K+1})\mathbf {Q}^T\right) _{- S^2}\right\| _F\\ =&\left\| \left( \nabla g(\mathbf {Z}^{t,K+1})\right) _{- S^2}\mathbf {Q}^T\right\| _F=\left\| \left( \nabla g(\mathbf {Z}^{t,K+1})\right) _{- S^2}\right\| _F. \end{aligned} \end{aligned}$$
(26)

Step 3 Finally, we bound the full gradient. Recall that we require \(S^1\cap S^2=\emptyset \), so the complements satisfy \(- S^1\cup - S^2=\{1,2,\ldots ,n\}\). Then, from (23) and (26), we have

$$\begin{aligned} \left\| \nabla g(\mathbf {Z}^{t,K+1})\right\| _F\le \left\| \left( \nabla g(\mathbf {Z}^{t,K+1})\right) _{- S^1}\right\| _F+\left\| \left( \nabla g(\mathbf {Z}^{t,K+1})\right) _{- S^2}\right\| _F\le \frac{33\varepsilon }{\eta \theta _K}. \end{aligned}$$

Finally, we can bound \(\left\| \nabla g(\mathbf {U}^{t,K+1})\right\| _F\) using Lemmas 10 and 13, which gives the claimed bound of \(\frac{35\varepsilon }{\eta \theta _K}\).

From the algorithm, we know that the index set is selected at most T times. Each run of the volume sampling subset selection algorithm succeeds with probability \(1-\delta '\), so the whole algorithm succeeds with probability at least \(1-T\delta '=1-\delta \). On the other hand, each run of the volume sampling subset selection algorithm needs \(O\left( nr^3\log \left( \frac{1}{\delta '}\right) \right) =O\left( nr^3\log \left( \frac{T}{\delta }\right) \right) =O\left( nr^3\log \left( \frac{f(\mathbf {U}^{0,0}(\mathbf {U}^{0,0})^T)-f(\mathbf {U}^*{\mathbf {U}^*}^T)}{\delta \varepsilon ^2}\right) \right) \) operations. \(\square \)

6 Minimizing (2) directly without the constraint

One may question the necessity of the constraint in problem (8) and wonder how the classical accelerated gradient method performs when minimizing problem (2) directly. In this case, the classical accelerated gradient method (Nesterov 1983, 1988; Tseng 2008) becomes

$$\begin{aligned} \mathbf {V}^{k}= & {} (1-\theta _k)\mathbf {U}^{k}+\theta _k\mathbf {Z}^{k}, \end{aligned}$$
(27)
$$\begin{aligned} \mathbf {Z}^{k+1}= & {} \mathbf {Z}^{k}-\eta \nabla g(\mathbf {V}^k),\end{aligned}$$
(28)
$$\begin{aligned} \mathbf {U}^{k+1}= & {} (1-\theta _k)\mathbf {U}^{k}+\theta _k\mathbf {Z}^{k+1}, \end{aligned}$$
(29)

and it is equivalent to

$$\begin{aligned} \mathbf {V}^k= & {} \mathbf {U}^k+\beta _k(\mathbf {U}^k-\mathbf {U}^{k-1}),\end{aligned}$$
(30)
$$\begin{aligned} \mathbf {U}^{k+1}= & {} \mathbf {V}^k-\eta \nabla g(\mathbf {V}^{k}). \end{aligned}$$
(31)

where \(\beta _k\) is defined in Lemma 6; another choice is a constant \(\beta <1\). Theorem 5 establishes the convergence rate for the above two recursions, and we leave the proof to “Appendix D”.
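
For reference, a minimal sketch of the momentum form (30)–(31) with the constant-\(\beta \) choice (Python/NumPy; grad_g is assumed to be supplied by the user):

import numpy as np

def agd_unconstrained(grad_g, U0, eta, beta, num_iters):
    # Momentum form (30)-(31) applied to g(U) directly, with a constant
    # momentum parameter beta < 1; grad_g returns the gradient of g at its input.
    U_prev, U = U0.copy(), U0.copy()
    for _ in range(num_iters):
        V = U + beta * (U - U_prev)           # (30): extrapolation step
        U_prev, U = U, V - eta * grad_g(V)    # (31): gradient step at V
    return U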

Theorem 5

Assume that \(\mathbf {U}^*\in \mathcal {X}^*\) and \(\mathbf {V}^k\in \mathbb {R}^{n\times r}\) satisfy \(\Vert \mathbf {V}^k-P_{\mathcal {X}^*}(\mathbf {V}^k)\Vert _F\le \min \left\{ 0.01\sigma _r(\mathbf {U}^*), \frac{\mu \sigma _r^2(\mathbf {U}^*)}{6L\Vert \mathbf {U}^*\Vert _2}\right\} \). Let \(\eta \) be the one in Lemma 6. Then, we can have

$$\begin{aligned}&g(\mathbf {U}^{k+1})+\nu \Vert \mathbf {U}^{k+1}-\mathbf {U}^k\Vert _F^2-g(\mathbf {U}^*)\\&\quad \le \frac{1}{1+\frac{\gamma }{\frac{5}{\eta ^2\mu \sigma _r^2(\mathbf {U}^*)}+\nu }}\left[ g(\mathbf {U}^k)+\nu \Vert \mathbf {U}^k-\mathbf {U}^{k-1}\Vert _F^2-g(\mathbf {U}^*)\right] . \end{aligned}$$

where \(\gamma =\frac{1-\beta _{\max }^2}{4\eta }-\frac{\beta _{\max }{\hat{L}}}{2}-\frac{{\hat{L}}}{4}>0\) and \(\nu =\frac{1+\beta _{\max }^2}{4\eta }-\frac{{\hat{L}}}{4}>0\).

Consider the case where \(\beta _k\) is a constant. Then all of the constants \(\gamma ,\nu ,{\hat{L}}\) and \(\frac{1}{\eta }\) are of order \(O\left( L\Vert \mathbf {U}^*\Vert _2^2+\Vert \nabla f(\mathbf {X}^*)\Vert _2\right) \). Thus, the convergence rate of recursions (30)–(31) takes the form

$$\begin{aligned} \left( 1-\frac{\mu \sigma _r^2(\mathbf {U}^*)}{L\Vert \mathbf {U}^*\Vert _2^2+\Vert \nabla f(\mathbf {X}^*)\Vert _2}\right) ^N, \end{aligned}$$

which is the same as that of the gradient descent method in (20). Thus, although the convergence of the classical accelerated gradient method for problem (2) can be proved, it is not easy to establish acceleration over gradient descent. In contrast, Algorithm 1 has a theoretically better dependence on the condition number \(\frac{L}{\mu }\). Hence, reformulating problem (2) as a constrained problem is necessary to prove acceleration.

7 The asymmetric case

In this section, we consider the asymmetric case of problem (1):

$$\begin{aligned} \min _{{\widetilde{\mathbf {X}}}\in \mathbb {R}^{n\times m}} f({\widetilde{\mathbf {X}}}), \end{aligned}$$
(32)

where there exists a minimizer \({\widetilde{\mathbf {X}}}^*\) of rank-r. We follow Park et al. (2018) to assume \(\nabla f({\widetilde{\mathbf {X}}}^*)=0\). In the asymmetric case, we can factorize \({\widetilde{\mathbf {X}}}={\widetilde{\mathbf {U}}}{\widetilde{\mathbf {V}}}^T\) and reformulate problem (32) as a problem similar to (2). Moreover, we follow Park et al. (2018) and Wang et al. (2017) to regularize the objective and force the solution pair \(({\widetilde{\mathbf {U}}},{\widetilde{\mathbf {V}}})\) to be balanced. Otherwise, the problem may be ill-conditioned since \(\left( \frac{1}{\delta }{\widetilde{\mathbf {U}}}\right) (\delta {\widetilde{\mathbf {V}}})^T\) is also a factorization of \({\widetilde{\mathbf {U}}}{\widetilde{\mathbf {V}}}^T\) for any large \(\delta \) (Park et al. 2018). Specifically, we consider the following problem

$$\begin{aligned} \min _{{\widetilde{\mathbf {U}}}\in \mathbb {R}^{n\times r},{\widetilde{\mathbf {V}}}\in \mathbb {R}^{m\times r}} f({\widetilde{\mathbf {U}}}{\widetilde{\mathbf {V}}}^T)+\frac{\mu }{8}\Vert {\widetilde{\mathbf {U}}}^T{\widetilde{\mathbf {U}}}-{\widetilde{\mathbf {V}}}^T{\widetilde{\mathbf {V}}}\Vert _F^2. \end{aligned}$$
(33)
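
A minimal sketch of objective (33) and its partial gradients (Python/NumPy; f and grad_f are assumed to be supplied by the user, and mu is the regularization weight in (33)):

import numpy as np

def asym_objective_and_grads(f, grad_f, U, V, mu):
    # Objective (33): f(U V^T) + (mu/8) * ||U^T U - V^T V||_F^2,
    # together with its gradients with respect to U and V.
    X = U @ V.T
    B = U.T @ U - V.T @ V                      # balancing residual
    obj = f(X) + mu / 8.0 * np.linalg.norm(B, 'fro') ** 2
    G = grad_f(X)                              # gradient of f at X = U V^T
    grad_U = G @ V + mu / 2.0 * U @ B
    grad_V = G.T @ U - mu / 2.0 * V @ B
    return obj, grad_U, grad_V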

Let \({\widetilde{\mathbf {X}}}^*=\mathbf {A}\varSigma \mathbf {B}^T\) be its SVD. Then, \(({\widetilde{\mathbf {U}}}^*=\mathbf {A}\sqrt{\varSigma },{\widetilde{\mathbf {V}}}^*=\mathbf {B}\sqrt{\varSigma })\) is a minimizer of problem (33). Define a stacked matrix \(\mathbf {U}=\left( \begin{array}{c} {\widetilde{\mathbf {U}}} \\ {\widetilde{\mathbf {V}}} \end{array} \right) \) and let \(\mathbf {X}=\mathbf {U}\mathbf {U}^T=\left( \begin{array}{cc} {\widetilde{\mathbf {U}}}{\widetilde{\mathbf {U}}}^T &{} {\widetilde{\mathbf {U}}}{\widetilde{\mathbf {V}}}^T\\ {\widetilde{\mathbf {V}}}{\widetilde{\mathbf {U}}}^T &{} {\widetilde{\mathbf {V}}}{\widetilde{\mathbf {V}}}^T \end{array} \right) \). Then the objective in (33) can be written as \({\hat{f}}(\mathbf {X})\), defined by \({\hat{f}}(\mathbf {X})=f({\widetilde{\mathbf {U}}}{\widetilde{\mathbf {V}}}^T)+\frac{\mu }{8}\Vert {\widetilde{\mathbf {U}}}{\widetilde{\mathbf {U}}}^T\Vert _F^2+\frac{\mu }{8}\Vert {\widetilde{\mathbf {V}}}{\widetilde{\mathbf {V}}}^T\Vert _F^2-\frac{\mu }{4}\Vert {\widetilde{\mathbf {U}}}{\widetilde{\mathbf {V}}}^T\Vert _F^2\). Since f is restricted \(\mu \)-strongly convex, we can easily check that \({\hat{f}}(\mathbf {X})\) is restricted \(\frac{\mu }{4}\)-strongly convex. On the other hand, we know that \({\hat{f}}(\mathbf {X})\) is restricted \(\left( L+\frac{\mu }{2}\right) \)-smooth. Hence, the conclusions for the symmetric case apply to \({\hat{f}}(\mathbf {X})\), Algorithm 1 can be used in the asymmetric case, and Theorem 3 gives the convergence rate. Moreover, since \(\sigma _i(\mathbf {X}^*)=2\sigma _i({\widetilde{\mathbf {X}}}^*)\),

$$\begin{aligned} \begin{aligned} \nabla {\hat{f}}(\mathbf {X}^*)=&\left( \begin{array}{cc} 0 &{} \nabla f({\widetilde{\mathbf {X}}}^*)\\ \nabla f({\widetilde{\mathbf {X}}}^*)^T &{} 0 \end{array} \right) +\frac{\mu }{4}\left( \begin{array}{c} {\widetilde{\mathbf {U}}}^* \\ -{\widetilde{\mathbf {V}}}^* \end{array} \right) \left( {\widetilde{\mathbf {U}}^*}{^T},-\widetilde{\mathbf {V}}^{*}{^T}\right) \\ =&\frac{\mu }{4}\left( \begin{array}{c} {\widetilde{\mathbf {U}}}^* \\ -{\widetilde{\mathbf {V}}}^* \end{array} \right) \left( \widetilde{\mathbf {U}}^{*}{^T},-\widetilde{\mathbf {V}}^{*}{^T}\right) \end{aligned} \end{aligned}$$

and \(\Vert \nabla {\hat{f}}(\mathbf {X}^*)\Vert _2=\frac{\mu }{4}\Vert \mathbf {X}^*\Vert _2\), where \(\mathbf {X}^*=\left( \begin{array}{cc} {\widetilde{\mathbf {U}}}^*\widetilde{\mathbf {U}}^{*}{^{T}} &{} {\widetilde{\mathbf {U}}}^{*}\widetilde{\mathbf {V}}^{*}{^{T}}\\ {\widetilde{\mathbf {V}}}^*\widetilde{\mathbf {U}}^{*}{^{T}} &{} \widetilde{\mathbf {V}}^{*}\widetilde{\mathbf {V}}^{*}{^{T}} \end{array} \right) \), we can simplify the worst case convergence rate to \(\left( 1-\frac{\sigma _r({\widetilde{\mathbf {X}}}^*)}{\Vert {\widetilde{\mathbf {X}}}^*\Vert _2}\sqrt{\frac{\mu }{(m+n)rL}}\right) ^N\). As a comparison, the rate of the gradient descent is \(\left( 1-\frac{\sigma _r({\widetilde{\mathbf {X}}}^*)}{\Vert {\widetilde{\mathbf {X}}}^*\Vert _2}\frac{\mu }{L}\right) ^N\) (Park et al. 2018).

In the asymmetric case, both \({\widetilde{\mathbf {U}}}^*\) and \({\widetilde{\mathbf {V}}}^*\) have full column rank; otherwise, \(\text{ rank }({\widetilde{\mathbf {X}}}^*)<r\). Thus, we can select the index set \( S^1\) from \({\widetilde{\mathbf {U}}}^0\) and \( S^2\) from \({\widetilde{\mathbf {V}}}^0\) with the guarantees \(\sigma _r({\widetilde{\mathbf {U}}}^0_{ S^1})\ge \frac{\sigma _r({\widetilde{\mathbf {U}}}^0)}{\sqrt{2r(n-r+1)}}\) and \(\sigma _r({\widetilde{\mathbf {V}}}^0_{ S^2})\ge \frac{\sigma _r({\widetilde{\mathbf {V}}}^0)}{\sqrt{2r(m-r+1)}}\).

8 Experiments

In this section, we test the efficiency of the proposed accelerated gradient descent (AGD) method on Matrix Completion, One Bit Matrix Completion and Matrix Regression.

8.1 Matrix completion

In matrix completion (Rohde and Tsybakov 2011; Koltchinskii et al. 2011; Negahban and Wainwright 2012), the goal is to recover the low rank matrix \(\mathbf {X}^*\) based on a set of randomly observed entries \({\mathbf {O}}\) from \(\mathbf {X}^*\). The traditional matrix completion problem is to solve the following model:

$$\begin{aligned} \min _{\mathbf {X}} \frac{1}{2}\sum _{(i,j)\in {\mathbf {O}}}(\mathbf {X}_{i,j}-\mathbf {X}_{i,j}^*)^2,\quad s.t.\quad \text{ rank }(\mathbf {X})\le r. \end{aligned}$$

We consider the asymmetric case and solve the following model:

$$\begin{aligned} \min _{{\widetilde{\mathbf {U}}}\in \mathbb {R}^{n\times r},{\widetilde{\mathbf {V}}}\in \mathbb {R}^{m\times r}} \frac{1}{2}\sum _{(i,j)\in {\mathbf {O}}}(({\widetilde{\mathbf {U}}}{\widetilde{\mathbf {V}}}^T)_{i,j}-\mathbf {X}_{i,j}^*)^2+\frac{1}{200}\Vert {\widetilde{\mathbf {U}}}^T{\widetilde{\mathbf {U}}}-{\widetilde{\mathbf {V}}}^T{\widetilde{\mathbf {V}}}\Vert _F^2. \end{aligned}$$

We set \(r=10\) and test the algorithms on the Movielen-10M, Movielen-20M and Netflix data sets. The corresponding observed matrices are of size \(69878\times 10677\) with \(o\%=1.34\%\), \(138493\times 26744\) with \(o\%=0.54\%\) and \(480189\times 17770\) with \(o\%=1.18\%\), respectively, where \(o\%\) means the percentage of the observed entries. We compare AGD and AGD-adp (AGD with adaptive index set selection) with GD and several variants of the original AGD:

1. AGD-original1: the classical AGD with recursions (30)–(31).
2. AGD-original1-r: AGD-original1 with restart.
3. AGD-original1-f: AGD-original1 with fixed \(\beta _k\) of \(\frac{\sqrt{L}-\sqrt{\mu }}{\sqrt{L}+\sqrt{\mu }}\).
4. AGD-original2: the classical AGD with recursions (27)–(29).
5. AGD-original2-r: AGD-original2 with restart.
6. AGD-original2-f: AGD-original2 with fixed \(\theta \).

Let \(\mathbf {X}_{{\mathbf {O}}}\) be the observed data and \(\mathbf {A}\varvec{\varSigma }\mathbf {B}^T\) be its SVD. We initialize \({\widetilde{\mathbf {U}}}=\mathbf {A}_{:,1:r}\sqrt{\varvec{\varSigma }_{1:r,1:r}}\) and \({\widetilde{\mathbf {V}}}=\mathbf {B}_{:,1:r}\sqrt{\varvec{\varSigma }_{1:r,1:r}}\) for all the compared methods. Since \(\mathbf {X}_{{\mathbf {O}}}\) is sparse, it is efficient to find the top r singular values and the corresponding singular vectors for large-scale matrices (Larsen 1998). We tune the best step sizes of \(\eta =5\times 10^{-5},4\times 10^{-5}\) and \(1\times 10^{-5}\) for all the compared methods on the three data sets, respectively. For AGD, we set \(\epsilon =10^{-10}\), \( S^1=\{1:r\}\) and \( S^2=\{r+1:2r\}\) for simplicity. We set \(K = 100\) for AGD, AGD-adp and the original AGD with restart. We run the compared methods 500 iterations for the Movielen-10M and Movielen-20M data sets and 1000 iterations for the Netflix data set.
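
A minimal sketch of this initialization (Python with SciPy's sparse partial SVD; X_obs denotes the sparse observed matrix and the function name is ours):

import numpy as np
from scipy.sparse.linalg import svds

def spectral_init(X_obs, r):
    # Rank-r initialization U0 = A_{:,1:r} sqrt(Sigma), V0 = B_{:,1:r} sqrt(Sigma)
    # from a partial SVD of the sparse observed matrix.
    u, s, vt = svds(X_obs, k=r)        # Lanczos-based top-r SVD
    order = np.argsort(s)[::-1]        # svds does not guarantee descending order
    u, s, vt = u[:, order], s[order], vt[order, :]
    root = np.sqrt(s)
    return u * root, vt.T * root       # U0, V0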

The top part of Fig. 1 plots the training RMSE vs. time (seconds). We can see that AGD is faster than GD, and the performances of AGD, AGD-adp and the original AGD are similar. In fact, in AGD-adp we observe that the index sets do not change during the iterations, so the condition \(\sigma _r(\mathbf {U}^{t,K+1}_{ S'})\ge \epsilon \) for all \(t\) in Theorem 4 holds. The original AGD performs almost as fast as our modified AGD in practice; however, its theoretical convergence rate is inferior. The bottom part of Fig. 1 plots the testing RMSE vs. time. Besides GD, we also compare AGD with LMaFit (Wen et al. 2012), Soft-ALS (Hastie et al. 2015) and MSS (Xu et al. 2017), which all solve factorization based nonconvex models. From Fig. 1 we can see that AGD achieves the lowest testing RMSE with the fastest speed.

Fig. 1 Top: Compare the training RMSE of GD, AGD, AGD-adp and several variants of the original AGD. Bottom: Compare the testing RMSE of GD, AGD, LMaFit, Soft-ALS and MSS

8.2 One bit matrix completion

In one bit matrix completion (Davenport et al. 2014), the signs of a random subset of entries of the unknown low rank matrix \(\mathbf {X}^*\) are observed, instead of the actual entries. Given a differentiable function f mapping to [0, 1], e.g., the logistic function \(f(x)=\frac{e^x}{1+e^x}\), the sign of an entry x is observed as \(+1\) with probability \(f(x)\) and as \(-1\) with probability \(1-f(x)\). The training objective is to minimize the negative log-likelihood:

$$\begin{aligned} \min _{\mathbf {X}} -\sum _{(i,j)\in {\mathbf {O}}}\{{\mathbf {1}}_{\mathbf {Y}_{i,j}=1}\text{ log }(f(\mathbf {X}_{i,j}))+{\mathbf {1}}_{\mathbf {Y}_{i,j}=-1}\text{ log }(1-f(\mathbf {X}_{i,j}))\},s.t.\, \text{ rank }(\mathbf {X})\le r. \end{aligned}$$

In this section, we solve the following model:

$$\begin{aligned}&\min _{{\widetilde{\mathbf {U}}},{\widetilde{\mathbf {V}}}}-\sum _{(i,j)\in {\mathbf {O}}}\left\{ {\mathbf {1}}_{\mathbf {Y}_{i,j}=1}\text{ log }(f(({\widetilde{\mathbf {U}}}{\widetilde{\mathbf {V}}}^T)_{i,j}))+{\mathbf {1}}_{\mathbf {Y}_{i,j}=-1}\text{ log }(1-f(({\widetilde{\mathbf {U}}}{\widetilde{\mathbf {V}}}^T)_{i,j}))\right\} \\&\quad +\frac{1}{200}\Vert {\widetilde{\mathbf {U}}}^T{\widetilde{\mathbf {U}}}-{\widetilde{\mathbf {V}}}^T{\widetilde{\mathbf {V}}}\Vert _F^2. \end{aligned}$$
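
A minimal sketch of this objective with the logistic link (Python/NumPy; rows, cols and y encode the observed positions and their \(\pm 1\) labels, and the function name is ours):

import numpy as np

def one_bit_objective(U, V, rows, cols, y, reg=1.0 / 200):
    # Negative log-likelihood of the observed signs y in {+1, -1} under the
    # logistic link, evaluated only at the observed entries (rows, cols),
    # plus the balancing regularizer used above.
    z = np.sum(U[rows] * V[cols], axis=1)      # (U V^T)_{i,j} at observed (i, j)
    nll = np.sum(np.logaddexp(0.0, -y * z))    # -log f(z) if y=+1, -log(1-f(z)) if y=-1
    B = U.T @ U - V.T @ V
    return nll + reg * np.linalg.norm(B, 'fro') ** 2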

We use the data sets of Movielen-10M, Movielen-20M and Netflix. We set \(\mathbf {Y}_{i,j}=1\) if the (i, j)-th observation is larger than the average of all observations and \(\mathbf {Y}_{i,j}=-1\) otherwise. We set \(r=5\) and \(\eta =0.001\), 0.001, 0.0005 for all the compared methods on the three data sets, respectively. The other experimental settings are the same as in the matrix completion experiment. We run all the methods for 500 iterations. Figure 2 plots the objective value vs. time (seconds), and we can see that AGD is also faster than GD. The performances of AGD, AGD-adp and the original AGD are nearly the same.

Fig. 2 Compare AGD and AGD-adp with GD and several variants of the original AGD on the One Bit Matrix Completion problem

8.3 Matrix regression

In matrix regression (Recht et al. 2010; Negahban and Wainwright 2011), the goal is to estimate the unknown low rank matrix \(\mathbf {X}^*\) from a set of measurements \(\mathbf {y}=\mathbf {A}(\mathbf {X}^*)+\varepsilon \), where \(\mathbf {A}\) is a linear operator and \(\varepsilon \) is the noise. A reasonable way to estimate \(\mathbf {X}^*\) is to solve the following rank constrained problem:

$$\begin{aligned} \min _{\mathbf {X}}f(\mathbf {X})=\frac{1}{2}\Vert \mathbf {A}(\mathbf {X})-\mathbf {y}\Vert _F^2,\quad s.t. \quad \text{ rank }(\mathbf {X})\le r. \end{aligned}$$

We consider the symmetric case of \(\mathbf {X}\) and solve the following nonconvex model:

$$\begin{aligned} \min _{\mathbf {U}\in \mathbb {R}^{n\times r}} f(\mathbf {U})=\frac{1}{2}\Vert \mathbf {A}(\mathbf {U}\mathbf {U}^T)-\mathbf {y}\Vert _F^2. \end{aligned}$$
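
A minimal sketch of this factorized objective and its gradient for a generic measurement operator (Python/NumPy; A_op and A_adj, the operator and its adjoint, are assumed to be supplied by the user, since the noiselet operator is not reproduced here):

import numpy as np

def matrix_regression_value_and_grad(U, y, A_op, A_adj):
    # g(U) = 0.5 * ||A(U U^T) - y||^2 and its gradient with respect to U, for a
    # generic linear operator A_op with adjoint A_adj.
    X = U @ U.T
    residual = A_op(X) - y
    G = A_adj(residual)                  # gradient of f at X = U U^T
    grad_U = (G + G.T) @ U               # chain rule through X = U U^T
    return 0.5 * np.dot(residual, residual), grad_U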

We follow Bhojanapalli et al. (2016a) to use the permuted and sub-sampled noiselets (Waters et al. 2011) for the linear operator \(\mathbf {A}\), and \(\mathbf {U}^*\) is generated from the standard Gaussian distribution without noise. We set \(r=10\) and test different n with \(n = 512\), 1024 and 2048. We fix the number of measurements to 4nr and follow Bhojanapalli et al. (2016a) to use the initializer from the eigenvalue decomposition of \(\frac{\mathbf {X}^0+(\mathbf {X}^0)^T}{2}\) for all the compared methods, where \(\mathbf {X}^0=\text{ Project }_{+}\left( \frac{-\nabla f(0)}{\Vert \nabla f(0)-\nabla f(\mathbf {1}\mathbf {1}^T)\Vert _F}\right) \). We set \(\eta =5,10\) and 20 for all the compared methods for \(n=512,1024\) and 2048, respectively. In AGD, we set \(\epsilon =10^{-10}\), \(K=10\), \( S^1=\{1:r\}\) and \( S^2=\{r+1:2r\}\). Figure 3 plots the objective value vs. time (seconds). We run all the compared methods for 300 iterations. We can see that AGD and the original AGD with restart perform almost equally fast, and AGD runs faster than GD and the original AGD without restart.

Fig. 3 Compare AGD with GD and several variants of the original AGD on the Matrix Regression problem

8.4 Verifying (21) in practice

In this section, we verify that the conditions of \(\Vert {\hat{\mathbf {U}}}_S^{t,k}-\mathbf {U}_S^*\Vert _F\le c\sqrt{\frac{r}{n}}\Vert {\hat{\mathbf {U}}}^{t,k}-\mathbf {U}^*\Vert _F\) and \(\Vert {\hat{\mathbf {V}}}_S^{t,k}-\mathbf {U}_S^*\Vert _F\le c\sqrt{\frac{r}{n}}\Vert {\hat{\mathbf {V}}}^{t,k}-\mathbf {U}^*\Vert _F\) in (21) hold in our experiments, where \({\hat{\mathbf {U}}}^{t,k}=\mathbf {U}^{t,k}\mathbf {R}\) with \(\mathbf {R}=\text{ argmin }_{\mathbf {R}\in \mathbb {R}^{r\times r},\mathbf {R}\mathbf {R}^T=\mathbf {I}}\Vert \mathbf {U}^{t,k}\mathbf {R}-\mathbf {U}^*\Vert _F^2\) and \({\hat{\mathbf {V}}}^{t,k}\) is defined similarly. We use the final output \(\mathbf {U}^{T,K+1}\) as \(\mathbf {U}^*\). Table 2 lists the results. We can see that \(\frac{\Vert {\hat{\mathbf {U}}}^{t,k}_{S}-\mathbf {U}^{*}_{S}\Vert _F}{\Vert {\hat{\mathbf {U}}}^{t,k}-\mathbf {U}^{*}\Vert _F}\) and \(\frac{\Vert {\hat{\mathbf {V}}}^{t,k}_{S}-\mathbf {U}^{*}_{S}\Vert _F}{\Vert {\hat{\mathbf {V}}}^{t,k}-\mathbf {U}^{*}\Vert _F}\) have the same order as \(\sqrt{\frac{r}{n}}\).
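
A minimal sketch of how these ratios can be computed (Python/NumPy; the alignment \(\mathbf {R}\) is the orthogonal Procrustes solution obtained from an SVD):

import numpy as np

def alignment_ratio(U, U_star, S):
    # Align U to U_star by the orthogonal Procrustes rotation
    # R = argmin_{R R^T = I} ||U R - U_star||_F, then compare the rows in S
    # against the full Frobenius distance.
    W, _, Zt = np.linalg.svd(U.T @ U_star)
    U_hat = U @ (W @ Zt)                      # hat{U} = U R
    return np.linalg.norm(U_hat[S] - U_star[S]) / np.linalg.norm(U_hat - U_star)

The returned ratio is then compared with \(\sqrt{r/n}\), as reported in Table 2.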

Table 2 Testing the order of \(\frac{\Vert {\hat{\mathbf {U}}}^{t,k}_{S}-\mathbf {U}^{*}_{S}\Vert _F}{\Vert {\hat{\mathbf {U}}}^{t,k}-\mathbf {U}^{*}\Vert _F}\) and \(\frac{\Vert {\hat{\mathbf {V}}}^{t,k}_{S}-\mathbf {U}^{*}_{S}\Vert _F}{\Vert {\hat{\mathbf {V}}}^{t,k}-\mathbf {U}^{*}\Vert _F}\)

9 Conclusions

In this paper, we study factorization based low rank optimization. We propose a linearly convergent accelerated gradient method with alternating constraint, whose dependence on the condition number, \(\sqrt{L/\mu }\), matches the optimal one for convex programming. As far as we know, this is the first work with a provable optimal dependence on \(\sqrt{L/\mu }\) for this kind of nonconvex problem. Globally, convergence to a critical point is proved.

Two problems remain open. 1. How to find two disjoint sets \(S^1\) and \(S^2\) such that \(\sigma _r(\mathbf {U}_{S^1})\) and \(\sigma _r(\mathbf {U}_{S^2})\) are as large as possible? 2. How to find an initial point close enough to the optimal solution for general problems with a large condition number?