
Average Performance Analysis of the Stochastic Gradient Method for Online PCA

  • Conference paper in Machine Learning, Optimization, and Data Science (LOD 2018)

Abstract

This paper studies the complexity of the stochastic gradient algorithm for PCA when the data are observed in a streaming setting. We also propose an online approach for selecting the learning rate. Simulation experiments confirm the practical relevance of the plain stochastic gradient approach and show that drastic improvements can be achieved by learning the learning rate.
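As a point of reference for the plain approach, the core update is an Oja-type projected stochastic gradient step on the unit sphere. The sketch below is a generic illustration of that update with a fixed learning rate; the dimension, step size and toy covariance are illustrative choices, not taken from the paper.

```python
import numpy as np

# Minimal sketch of a plain stochastic-gradient (Oja-type) update for the top
# eigenvector in a streaming setting; a generic illustration only, not
# necessarily the exact variant analysed in the paper.
rng = np.random.default_rng(0)
d, T, eta = 10, 20000, 0.005

evals = np.array([3.0] + [1.0] * (d - 1))          # spiked spectrum (toy example)
X = rng.multivariate_normal(np.zeros(d), np.diag(evals), size=T)

w = rng.standard_normal(d)
w /= np.linalg.norm(w)
for x in X:                                        # one pass over the stream
    w += eta * x * (x @ w)                         # stochastic gradient step
    w /= np.linalg.norm(w)                         # project back onto the sphere

print(abs(w[0]))        # alignment with the leading eigenvector, typically close to 1
```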




A Technical lemmæ

Recall that

$$\begin{aligned} B_T&= \prod _{t=T}^1 (I+\eta A_t)^{\top } ((1-\varepsilon )I-A) \prod _{t=1}^{T} (I+\eta A_t). \end{aligned}$$
(25)
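For concreteness, the product above can be formed numerically. The sketch below assumes the one-entry-per-step sampling model \(A_t = d^2 A_{i_t j_t}\,\mathbf {e}_{i_t}\mathbf {e}_{j_t}^{\top }\) with \((i_t,j_t)\) uniform, which is the form manipulated in the proofs below; the recursion (12) itself is defined in the main text.

```python
import numpy as np

# Sketch: form B_T from (25) for a sampled stream, assuming the sampling model
# A_t = d^2 * A[i_t, j_t] * e_{i_t} e_{j_t}^T (so that E[A_t] = A).
rng = np.random.default_rng(0)
d, T, eta, eps = 5, 20, 0.01, 0.1

M = rng.standard_normal((d, d))
A = (M + M.T) / 2
A /= np.linalg.norm(A, 2)                      # normalise so that ||A|| = 1

P = np.eye(d)                                  # running product prod_{t=1}^{T} (I + eta A_t)
for _ in range(T):
    i, j = rng.integers(d), rng.integers(d)
    A_t = np.zeros((d, d))
    A_t[i, j] = d**2 * A[i, j]
    P = P @ (np.eye(d) + eta * A_t)

B_T = P.T @ ((1 - eps) * np.eye(d) - A) @ P    # matches (25)
```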

Lemma 2

In the case of matrix completion, given a matrix X, we have

$$\begin{aligned} \mathbb {E}[A_t^{\top } XA_t]=d^2 \ \mathrm {diag}(A\ \mathrm {diag}(X)A). \end{aligned}$$

Proof

The matrix \(A_t^{\top } X A_t\) can be written as

$$\begin{aligned} A_t^{\top } XA_t&=d^4 A_{i_tj_t}A_{j_ti_t}\mathbf {e}_{j_t}\mathbf {e}_{i_t}^{\top } X \mathbf {e}_{i_t} \mathbf {e}_{j_t}^{\top } \\&=d^4 A_{i_tj_t}A_{j_ti_t} X_{i_ti_t} \mathbf {e}_{j_t} \mathbf {e}_{j_t}^{\top } . \end{aligned}$$

Taking the expectation over the index pair \((i_t,j_t)\), drawn uniformly at random among the \(d^2\) entries, therefore gives

$$\begin{aligned} \mathbb {E}[A_t^{\top } X A_t]=d^2 \sum _{i,j=1}^d A_{ij}A_{ji} X_{ii} \mathbf {e}_j \mathbf {e}_j^{\top } . \end{aligned}$$

Using the symmetry of A gives the result.
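The identity of Lemma 2 can be checked numerically by enumerating the \(d^2\) equally likely index pairs, again assuming the sampling model \(A_t = d^2 A_{i_t j_t}\,\mathbf {e}_{i_t}\mathbf {e}_{j_t}^{\top }\):

```python
import numpy as np

# Exact check of Lemma 2: average A_t^T X A_t over the d^2 equally likely
# index pairs and compare with d^2 * diag(A diag(X) A).
rng = np.random.default_rng(1)
d = 6
M = rng.standard_normal((d, d))
A = (M + M.T) / 2                              # A must be symmetric
X = rng.standard_normal((d, d))

E = np.zeros((d, d))
I = np.eye(d)
for i in range(d):
    for j in range(d):
        A_t = d**2 * A[i, j] * np.outer(I[i], I[j])
        E += (A_t.T @ X @ A_t) / d**2          # uniform weight 1/d^2

rhs = d**2 * np.diag(np.diag(A @ np.diag(np.diag(X)) @ A))
print(np.allclose(E, rhs))                     # True
```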

Our next goal is to see how \(\mathrm {diag}\left( A^{\top } \mathrm {diag}(\mathbb E[B_{T-1}])A\right) \) evolves with the iterations. For this purpose, take the diagonal of (12), multiply from the left by \(A^{\top } \) and from the right by \(A\), and take the diagonal of the resulting expression.

Lemma 3

We have that

$$\begin{aligned} \Vert \mathrm {diag} \left( \mathbb E [B_{T}] \right) \Vert&\le 2\eta \ \Vert \mathbb {E}[B_{T-1}]\Vert _{1\rightarrow 2 } + (1+\eta ^2d^2)\ \Vert \mathrm {diag}(\mathbb E [B_{T-1}])\Vert \end{aligned}$$
(26)

Proof

Expanding the recurrence relationship (12) gives

$$\begin{aligned} \mathrm {diag}(\mathbb {E}[B_T])&=\mathrm {diag}(\mathbb {E}[B_{T-1}])+\eta \ \mathrm {diag}\left( A^{\top } \mathbb E[B_{T-1}]+\mathbb E[B_{T-1}] A \right) \\&\quad +\,\eta ^2 d^2 \mathrm {diag}\left( A^{\top } \mathrm {diag}(\mathbb E[B_{T-1}])A\right) . \end{aligned}$$
(27)

For any diagonal matrix \(\varDelta \) and symmetric matrix A, we have

$$\begin{aligned} \Vert \mathrm {diag}( A^{\top } \varDelta A) \Vert \le \Vert A\Vert _{1\rightarrow 2 }^2 \Vert \varDelta \Vert . \end{aligned}$$
(28)

Therefore, by taking the operator norm on both sides of the equality, we have

$$\begin{aligned} \Vert \mathrm {diag}(\mathbb {E}[B_T])\Vert&\le (1+\eta ^2 d^2 \Vert A\Vert _{1\rightarrow 2}^2)\Vert \mathrm {diag}(\mathbb {E}[B_{T-1}])\Vert + 2\eta \Vert \mathrm {diag}(A^{\top } \mathbb E[B_{T-1}])\Vert \end{aligned}$$
(29)

We conclude using \(\Vert \mathrm {diag}(A^{\top } \mathbb E[B_{T-1}])\Vert \le \Vert A\Vert _{1\rightarrow 2}\Vert \mathbb E[B_{T-1}]\Vert _{1\rightarrow 2}\) and \(\Vert A\Vert _{1\rightarrow 2 }\le 1\).
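Inequality (28) is easy to sanity-check numerically, with \(\Vert \cdot \Vert _{1\rightarrow 2}\) read as the induced \(1\rightarrow 2\) norm, i.e. the largest Euclidean column norm (an interpretation assumed here, consistent with how the norm is used above):

```python
import numpy as np

# Check of (28): ||diag(A^T Delta A)|| <= ||A||_{1->2}^2 ||Delta||, with the
# 1->2 norm taken to be the maximum Euclidean column norm.
rng = np.random.default_rng(2)
d = 8
M = rng.standard_normal((d, d))
A = (M + M.T) / 2
Delta = np.diag(rng.standard_normal(d))

norm_1to2 = np.max(np.linalg.norm(A, axis=0))        # ||A||_{1->2}
lhs = np.max(np.abs(np.diag(A.T @ Delta @ A)))       # ||diag(A^T Delta A)||
rhs = norm_1to2**2 * np.max(np.abs(np.diag(Delta)))  # ||A||_{1->2}^2 ||Delta||
assert lhs <= rhs + 1e-12
```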

We also have to understand how the \(\ell _{1\rightarrow 2}\) norm evolves.

Lemma 4

We have

$$\begin{aligned} \Vert \mathbb E [B_{T}]\Vert _{1\rightarrow 2}&\le \eta \ \Vert \mathbb E[B_{T-1}]\Vert + (1+\eta ) \ \Vert \mathbb {E}[B_{T-1}]\Vert _{1\rightarrow 2 }+\eta ^2 d^2\ \Vert \mathrm {diag}(\mathbb E [B_{T-1}])\Vert . \end{aligned}$$
(30)

Proof

Expanding the recurrence relationship (12), taking the \(\Vert \cdot \Vert _{1\rightarrow 2}\) norm and using the triangle inequality gives

$$\begin{aligned} \Vert \mathbb E [B_T]\Vert _{1\rightarrow 2}&\le \Vert \mathbb E [B_{T-1}]\Vert _{1\rightarrow 2}+ \eta \ \left( \Vert A^{\top } \mathbb E [B_{T-1}]\Vert _{1\rightarrow 2} +\Vert \mathbb E [B_{T-1}]^{\top } A\Vert _{1\rightarrow 2}\right) \\&\quad +\,\eta ^2 d^2 \Vert \mathrm {diag}(A^{\top } \mathrm {diag}(\mathbb E[B_{T-1}])A)\Vert _{1 \rightarrow 2}. \end{aligned}$$

For a diagonal matrix \(\varDelta \), we have \(\Vert \varDelta \Vert _{1\rightarrow 2}=\Vert \varDelta \Vert \). Combined with \(\Vert A^{\top } M\Vert _{1\rightarrow 2}\le \Vert A\Vert \, \Vert M\Vert _{1\rightarrow 2}\) and \(\Vert M^{\top } A\Vert _{1\rightarrow 2}\le \Vert M\Vert \, \Vert A\Vert _{1\rightarrow 2}\) for any matrix \(M\), this leads to

$$\begin{aligned} \Vert \mathbb E [B_T]\Vert _{1\rightarrow 2}&\le \Vert \mathbb E [B_{T-1}]\Vert _{1\rightarrow 2}+ \eta \ \left( \Vert A\Vert \ \Vert \mathbb E [B_{T-1}]\Vert _{1\rightarrow 2} +\Vert \mathbb E [B_{T-1}]\Vert \ \Vert A\Vert _{1\rightarrow 2}\right) \\&\quad +\,\eta ^2d^2\Vert A\Vert _{1\rightarrow 2}^2 \ \Vert \mathrm {diag}(\mathbb E [B_{T-1}])\Vert . \end{aligned}$$

Finally, using \(\Vert A\Vert _{1\rightarrow 2}\le 1\) and \(\Vert A\Vert \le 1\) concludes the proof.

We then have to understand how the operator norm of \(\mathbb E [B_T]\) evolves.

Lemma 5

We have

$$\begin{aligned} \Vert \mathbb E [B_T]\Vert \le (1+2\eta ) \Vert \mathbb E [B_{T-1}]\Vert + \eta ^2 d^2 \ \Vert \mathrm {diag}(\mathbb E [B_{T-1}])\Vert . \end{aligned}$$
(31)

Proof

Expanding the recurrence relationship (12), taking operator norms and using the triangle inequality gives

$$\begin{aligned} \Vert \mathbb {E}[B_T]\Vert&\le \Vert \mathbb E [B_{T-1}]\Vert +\eta \left( \Vert A^{\top } \mathbb E [B_{T-1}]\Vert +\Vert \mathbb E [B_{T-1}] A\Vert \right) \\&\quad +\eta ^2 d^2 \Vert \mathrm {diag}(A^{\top } \mathrm {diag}(\mathbb E [B_{T-1}])A)\Vert . \end{aligned}$$

Using the same inequalities as in the proofs of the lemmas above then yields the result.
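The three bounds of Lemmas 3, 4 and 5 are deterministic consequences of the triangle inequality, so they can be verified along the iterations. The sketch below assumes that (12) is the recursion \(\mathbb E[B_T]=\mathbb E[B_{T-1}]+\eta (A^{\top }\mathbb E[B_{T-1}]+\mathbb E[B_{T-1}]A)+\eta ^2 d^2\,\mathrm {diag}(A^{\top }\mathrm {diag}(\mathbb E[B_{T-1}])A)\), which is the form expanded in the proofs above; (12) itself is stated in the main text.

```python
import numpy as np

# Iterate the (assumed) recursion (12) for E[B_T] and check the bounds of
# Lemmas 3, 4 and 5 at every step.
rng = np.random.default_rng(3)
d, eta, eps, T = 6, 0.01, 0.1, 50

M = rng.standard_normal((d, d))
A = (M + M.T) / 2
A /= np.linalg.norm(A, 2)                             # ||A|| = 1

def op(X): return np.linalg.norm(X, 2)                # operator norm
def n12(X): return np.max(np.linalg.norm(X, axis=0))  # ||.||_{1->2}
def ndiag(X): return np.max(np.abs(np.diag(X)))       # ||diag(.)||

B = (1 - eps) * np.eye(d) - A                         # E[B_0]
for _ in range(T):
    prev = (op(B), n12(B), ndiag(B))
    B = (B + eta * (A.T @ B + B @ A)
           + eta**2 * d**2 * np.diag(np.diag(A.T @ np.diag(np.diag(B)) @ A)))
    assert ndiag(B) <= 2*eta*prev[1] + (1 + eta**2*d**2)*prev[2] + 1e-10              # Lemma 3
    assert n12(B)   <= eta*prev[0] + (1 + eta)*prev[1] + eta**2*d**2*prev[2] + 1e-10  # Lemma 4
    assert op(B)    <= (1 + 2*eta)*prev[0] + eta**2*d**2*prev[2] + 1e-10              # Lemma 5
```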

Lemma 6

Assume \(\Vert A\Vert =1\). Then we have

$$\begin{aligned} \Vert \mathrm {diag}(\mathbb {E}[B_T])\Vert&\le \alpha \max _j (1-\varepsilon -s_j)+\beta \Vert (1-\varepsilon )I-A\Vert _{1\rightarrow 2}+\gamma \max _j (1-\varepsilon -A_{jj}) \end{aligned}$$
(32)

where

$$\begin{aligned} \alpha&=2\frac{\eta }{\eta d^2+1}\left( \frac{1-\eta ^{T-2}(\eta d^2+2)^{T-2}}{1-\eta (\eta d^2+2)}-\frac{1-\eta ^{T-2}}{1-\eta } \right) \\ \beta&=2\frac{\eta }{\eta d^2 +1}\left( \eta d^2 \frac{1-\eta ^{T-2}(\eta d^2+2)^{T-2}}{1-\eta (\eta d^2+2)}+\frac{1-\eta ^{T-2}}{1-\eta } \right) \\ \gamma&=1+\eta ^2d^2 \frac{1-\eta ^{T-2}(\eta d^2+2)^{T-2}}{1-\eta (\eta d^2 +2)} \end{aligned}$$

Proof

Expanding the recurrence and using the bounds (26), (30), and (31) of Lemmas 3, 4 and 5 yields the following system

$$\begin{aligned} \begin{bmatrix} \Vert \mathbb E[B_T]\Vert \\ \Vert \mathbb E[B_T]\Vert _{1\rightarrow 2}\\ \Vert \mathrm {diag}(\mathbb E[B_T])\Vert \end{bmatrix}\le \left( I+\eta \begin{bmatrix} 2&0&\eta d^2\\ 1&1&\eta d^2\\ 0&2&\eta d^2 \end{bmatrix}\right) \begin{bmatrix} \Vert \mathbb E[B_{T-1}]\Vert \\ \Vert \mathbb E[B_{T-1}]\Vert _{1\rightarrow 2}\\ \Vert \mathrm {diag}(\mathbb E[B_{T-1}])\Vert \end{bmatrix} \end{aligned}$$
(33)

To obtain the result, we expand the inequality by recurrence. Therefore, we are interested in computing the T-th power of the matrix in inequality (33). We have

$$\begin{aligned} \left( I+\eta \begin{bmatrix} 2&0&\eta d^2\\ 1&1&\eta d^2\\ 0&2&\eta d^2 \end{bmatrix}\right) ^{T}=I+\sum _{i=1}^{T} \eta ^i \begin{bmatrix} 2&0&\eta d^2\\ 1&1&\eta d^2\\ 0&2&\eta d^2 \end{bmatrix}^i. \end{aligned}$$
(34)

After computing the matrix powers, it results that

$$\begin{aligned} \Vert \mathrm {diag}(\mathbb E[B_T])\Vert&\le \sum _{i=1}^{T} \left( \eta ^i \frac{2(\eta d^2 +2)^{i-1}-1}{\eta d^2 +1}\right) \Vert \mathbb {E} [B_0]\Vert \nonumber \\&+\,\sum _{i=1}^{T} \left( \eta ^i \frac{2\eta d^2(\eta d^2 +2)^{i-1}+1}{\eta d^2 +1}\right) \Vert \mathbb {E}[B_0]\Vert _{1\rightarrow 2 } \nonumber \\&+\,\left( 1+\eta ^2 d^2\sum _{i=1}^{T} (\eta ^2 d^2+2\eta )^{i-1}\right) \Vert \mathrm {diag}(\mathbb {E}[B_0])\Vert . \end{aligned}$$
(35)

We conclude after computing the sums and bounding from above \(\Vert \mathbb {E}[B_0]\Vert \) by \(\max _{j}(1-\varepsilon -s_j)\).
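The coupled bound (33) can also be propagated numerically: iterating \(v_T \le (I+\eta M)\,v_{T-1}\) componentwise gives explicit upper bounds on the three norms after T steps. The starting vector below is illustrative; in the lemma it collects the norms of \(\mathbb E[B_0]=(1-\varepsilon )I-A\).

```python
import numpy as np

# Propagate the componentwise bound (33): v_T <= (I + eta*M)^T v_0, where
# v = (||E[B]||, ||E[B]||_{1->2}, ||diag(E[B])||).
d, eta, T = 6, 0.01, 50
M = np.array([[2.0, 0.0, eta * d**2],
              [1.0, 1.0, eta * d**2],
              [0.0, 2.0, eta * d**2]])
v = np.array([1.0, 1.0, 1.0])          # illustrative stand-in for the norms of E[B_0]
for _ in range(T):
    v = (np.eye(3) + eta * M) @ v
print(v)                               # componentwise upper bounds after T steps
```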

Lemma 7

For \(\eta <1\) and \(\varepsilon >0\), we have

$$\begin{aligned} \max _{s\in [0,1]} (1+2\eta \ s)^{T}(1-\varepsilon -s) \le 1+\frac{(1+2\eta (1-\varepsilon ))^{T}}{\eta (T+1)} \end{aligned}$$
(36)

Proof

Denote \(f(s)=(1+2\eta \ s)^{T}(1-\varepsilon -s)\). Differentiating f and setting the derivative to zero, we obtain

$$\begin{aligned}&2\eta T(1+2\eta \ s)^{T-1}(1-\varepsilon -s)-(1+2\eta \ s)^{T}=0\\&\Longleftrightarrow 2\eta T(1-\varepsilon -s)-(1+2\eta \ s) =0\\&\Longleftrightarrow \frac{T(1-\varepsilon )-1/(2\eta )}{T+1}=s \end{aligned}$$

Let \(s_c=\frac{T(1-\varepsilon )-1/(2\eta )}{T+1}\) denote this critical point. Consider the following two cases:

  • if \(s_c \notin [0,1]\), then f has no critical point in the domain and therefore is maximised at either domain endpoint, i.e.

    $$\max _{s\in [0,1]} f(s)=\max \{ f(0)=1-\varepsilon ,f(1)=-\varepsilon (1+2\eta )^T\}\le 1$$
  • if \(s_c \in [0,1]\), then f is maximised at \(s_c\) and the value of f at \(s_c\) is

    $$\begin{aligned}&\Bigg ( 1+2\eta \frac{T(1-\varepsilon )-1/(2\eta )}{T+1}\Bigg )^{T} \Bigg (1-\varepsilon -\frac{T(1-\varepsilon )-1/(2\eta )}{T+1}\Bigg )\\&=\Bigg (1+\frac{2\eta T(1-\varepsilon )-1}{T+1}\Bigg )^{T} \Bigg (\frac{1-\varepsilon + 1/(2\eta )}{T+1}\Bigg )\\&\le (1+2\eta (1-\varepsilon ))^{T} \Bigg (\frac{1+1/(2\eta )}{T+1}\Bigg ) \le \frac{(1+2\eta (1-\varepsilon ))^{T}}{\eta (T+1)}. \end{aligned}$$

This analysis proves that the maximum value f can achieve is at most \(\max \{1, \frac{(1+2\eta (1-\varepsilon ))^{T}}{\eta (T+1)}\}\le 1+ \frac{(1+2\eta (1-\varepsilon ))^{T}}{\eta (T+1)}\). Hence the result.
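Lemma 7 is elementary enough to check by brute force on a grid; the parameter values below are arbitrary examples satisfying \(\eta <1\) and \(\varepsilon >0\).

```python
import numpy as np

# Grid check of Lemma 7: max over s in [0,1] of (1+2*eta*s)^T (1-eps-s)
# against the bound 1 + (1+2*eta*(1-eps))^T / (eta*(T+1)).
s = np.linspace(0.0, 1.0, 100001)
for eta in (0.05, 0.2, 0.4):
    for eps, T in ((0.1, 10), (0.01, 100)):
        f_max = np.max((1 + 2 * eta * s) ** T * (1 - eps - s))
        bound = 1 + (1 + 2 * eta * (1 - eps)) ** T / (eta * (T + 1))
        assert f_max <= bound
print("Lemma 7 bound holds on all tested grids")
```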


Copyright information

© 2019 Springer Nature Switzerland AG


Cite this paper

Chrétien, S., Guyeux, C., Ho, ZW.O. (2019). Average Performance Analysis of the Stochastic Gradient Method for Online PCA. In: Nicosia, G., Pardalos, P., Giuffrida, G., Umeton, R., Sciacca, V. (eds) Machine Learning, Optimization, and Data Science. LOD 2018. Lecture Notes in Computer Science, vol 11331. Springer, Cham. https://doi.org/10.1007/978-3-030-13709-0_19
