
Average Performance Analysis of the Stochastic Gradient Method for Online PCA

  • Conference paper in Machine Learning, Optimization, and Data Science (LOD 2018)

Abstract

This paper studies the complexity of the stochastic gradient algorithm for PCA when the data are observed in a streaming setting. We also propose an online approach for selecting the learning rate. Simulation experiments confirm the practical relevance of the plain stochastic gradient approach and show that drastic improvements can be achieved by learning the learning rate.
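As a point of reference for the plain approach, the core update is an Oja-type projected stochastic gradient step on the unit sphere. The sketch below is a generic illustration of that update with a fixed learning rate; the dimension, step size and toy covariance are illustrative choices, not taken from the paper.

```python
import numpy as np

# Minimal sketch of a plain stochastic-gradient (Oja-type) update for the top
# eigenvector in a streaming setting; a generic illustration only, not
# necessarily the exact variant analysed in the paper.
rng = np.random.default_rng(0)
d, T, eta = 10, 20000, 0.005

evals = np.array([3.0] + [1.0] * (d - 1))          # spiked spectrum (toy example)
X = rng.multivariate_normal(np.zeros(d), np.diag(evals), size=T)

w = rng.standard_normal(d)
w /= np.linalg.norm(w)
for x in X:                                        # one pass over the stream
    w += eta * x * (x @ w)                         # stochastic gradient step
    w /= np.linalg.norm(w)                         # project back onto the sphere

print(abs(w[0]))        # alignment with the leading eigenvector, typically close to 1
```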




A Technical lemmæ

Recall that

$$\begin{aligned} B_T&= \prod _{t=T}^1 (I+\eta A_t)^{\top } ((1-\varepsilon )I-A) \prod _{t=1}^{T} (I+\eta A_t). \end{aligned}$$
(25)
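For concreteness, the product above can be formed numerically. The sketch below assumes the one-entry-per-step sampling model \(A_t = d^2 A_{i_t j_t}\,\mathbf {e}_{i_t}\mathbf {e}_{j_t}^{\top }\) with \((i_t,j_t)\) uniform, which is the form manipulated in the proofs below; the recursion (12) itself is defined in the main text.

```python
import numpy as np

# Sketch: form B_T from (25) for a sampled stream, assuming the sampling model
# A_t = d^2 * A[i_t, j_t] * e_{i_t} e_{j_t}^T (so that E[A_t] = A).
rng = np.random.default_rng(0)
d, T, eta, eps = 5, 20, 0.01, 0.1

M = rng.standard_normal((d, d))
A = (M + M.T) / 2
A /= np.linalg.norm(A, 2)                      # normalise so that ||A|| = 1

P = np.eye(d)                                  # running product prod_{t=1}^{T} (I + eta A_t)
for _ in range(T):
    i, j = rng.integers(d), rng.integers(d)
    A_t = np.zeros((d, d))
    A_t[i, j] = d**2 * A[i, j]
    P = P @ (np.eye(d) + eta * A_t)

B_T = P.T @ ((1 - eps) * np.eye(d) - A) @ P    # matches (25)
```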

Lemma 2

In the case of matrix completion, given a matrix X, we have

$$\begin{aligned} \mathbb {E}[A_t^{\top } XA_t]=d^2 \ \mathrm {diag}(A\ \mathrm {diag}(X)A). \end{aligned}$$

Proof

The matrix \(A_t^{\top } X A_t\) can be written as

$$\begin{aligned} A_t^{\top } XA_t&=d^4 A_{i_tj_t}A_{j_ti_t}\mathbf {e}_{j_t}\mathbf {e}_{i_t}^{\top } X \mathbf {e}_{i_t} \mathbf {e}_{j_t}^{\top } \\&=d^4 A_{i_tj_t}A_{j_ti_t} X_{i_ti_t} \mathbf {e}_{j_t} \mathbf {e}_{j_t}^{\top } . \end{aligned}$$

Taking the expectation over the index pair \((i_t,j_t)\), drawn uniformly at random among the \(d^2\) entries, therefore gives

$$\begin{aligned} \mathbb {E}[A_t^{\top } X A_t]=d^2 \sum _{i,j=1}^d A_{ij}A_{ji} X_{ii} \mathbf {e}_j \mathbf {e}_j^{\top } . \end{aligned}$$

Using the symmetry of A gives the result.
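The identity of Lemma 2 can be checked numerically by enumerating the \(d^2\) equally likely index pairs, again assuming the sampling model \(A_t = d^2 A_{i_t j_t}\,\mathbf {e}_{i_t}\mathbf {e}_{j_t}^{\top }\):

```python
import numpy as np

# Exact check of Lemma 2: average A_t^T X A_t over the d^2 equally likely
# index pairs and compare with d^2 * diag(A diag(X) A).
rng = np.random.default_rng(1)
d = 6
M = rng.standard_normal((d, d))
A = (M + M.T) / 2                              # A must be symmetric
X = rng.standard_normal((d, d))

E = np.zeros((d, d))
I = np.eye(d)
for i in range(d):
    for j in range(d):
        A_t = d**2 * A[i, j] * np.outer(I[i], I[j])
        E += (A_t.T @ X @ A_t) / d**2          # uniform weight 1/d^2

rhs = d**2 * np.diag(np.diag(A @ np.diag(np.diag(X)) @ A))
print(np.allclose(E, rhs))                     # True
```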

Our next goal is to see how \(\mathrm {diag}\left( A^{\top } \mathrm {diag}(\mathbb E[B_{T-1}])A\right) \) evolves with the iterations. For this purpose, take the diagonal of (12), multiply from the left by \(A^{\top } \) and from the right by \(A\), and take the diagonal of the resulting expression.

Lemma 3

We have that

$$\begin{aligned} \Vert \mathrm {diag} \left( \mathbb E [B_{T}] \right) \Vert&\le 2\eta \ \Vert \mathbb {E}[B_{T-1}]\Vert _{1\rightarrow 2 } + (1+\eta ^2d^2)\ \Vert \mathrm {diag}(\mathbb E [B_{T-1}])\Vert \end{aligned}$$
(26)

Proof

Expanding the recurrence relationship (12) gives

$$\begin{aligned} \mathrm {diag}(\mathbb {E}[B_T])&=\mathrm {diag}(\mathbb {E}[B_{T-1}])+\eta \ \mathrm {diag}\left( A^{\top } \mathbb E[B_{T-1}]+\mathbb E[B_{T-1}] A \right) \\&\quad +\,\eta ^2 d^2 \mathrm {diag}\left( A^{\top } \mathrm {diag}(\mathbb E[B_{T-1}])A\right) . \end{aligned}$$
(27)

For any diagonal matrix \(\varDelta \) and symmetric matrix A, we have

$$\begin{aligned} \Vert \mathrm {diag}( A^{\top } \varDelta A) \Vert \le \Vert A\Vert _{1\rightarrow 2 }^2 \Vert \varDelta \Vert . \end{aligned}$$
(28)

Therefore, by taking the operator norm on both sides of the equality, we have

$$\begin{aligned} \Vert \mathrm {diag}(\mathbb {E}[B_T])\Vert&\le (1+\eta ^2 d^2 \Vert A\Vert _{1\rightarrow 2}^2)\Vert \mathrm {diag}(\mathbb {E}[B_{T-1}])\Vert + 2\eta \Vert \mathrm {diag}(A^{\top } \mathbb E[B_{T-1}])\Vert \end{aligned}$$
(29)

We conclude using \(\Vert \mathrm {diag}(A^{\top } \mathbb E[B_{T-1}])\Vert \le \Vert A\Vert _{1\rightarrow 2}\Vert \mathbb E[B_{T-1}]\Vert _{1\rightarrow 2}\) and \(\Vert A\Vert _{1\rightarrow 2 }\le 1\).
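Inequality (28) is easy to sanity-check numerically, with \(\Vert \cdot \Vert _{1\rightarrow 2}\) read as the induced \(1\rightarrow 2\) norm, i.e. the largest Euclidean column norm (an interpretation assumed here, consistent with how the norm is used above):

```python
import numpy as np

# Check of (28): ||diag(A^T Delta A)|| <= ||A||_{1->2}^2 ||Delta||, with the
# 1->2 norm taken to be the maximum Euclidean column norm.
rng = np.random.default_rng(2)
d = 8
M = rng.standard_normal((d, d))
A = (M + M.T) / 2
Delta = np.diag(rng.standard_normal(d))

norm_1to2 = np.max(np.linalg.norm(A, axis=0))        # ||A||_{1->2}
lhs = np.max(np.abs(np.diag(A.T @ Delta @ A)))       # ||diag(A^T Delta A)||
rhs = norm_1to2**2 * np.max(np.abs(np.diag(Delta)))  # ||A||_{1->2}^2 ||Delta||
assert lhs <= rhs + 1e-12
```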

We also have to understand how the \(\ell _{1\rightarrow 2}\) norm evolves.

Lemma 4

We have

$$\begin{aligned} \Vert \mathbb E [B_{T}]\Vert _{1\rightarrow 2}&\le \eta \ \Vert \mathbb E[B_{T-1}]\Vert + (1+\eta ) \ \Vert \mathbb {E}[B_{T-1}]\Vert _{1\rightarrow 2 }+\eta ^2 d^2\ \Vert \mathrm {diag}(\mathbb E [B_{T-1}])\Vert . \end{aligned}$$
(30)

Proof

Expanding the recurrence relationship (12), taking the \(\Vert \cdot \Vert _{1\rightarrow 2}\) norm and using the triangle inequality gives

$$\begin{aligned} \Vert \mathbb E [B_T]\Vert _{1\rightarrow 2}&\le \Vert \mathbb E [B_{T-1}]\Vert _{1\rightarrow 2}+ \eta \ \left( \Vert A^{\top } \mathbb E [B_{T-1}]\Vert _{1\rightarrow 2} +\Vert \mathbb E [B_{T-1}]^{\top } A\Vert _{1\rightarrow 2}\right) \\&\quad +\,\eta ^2 d^2 \Vert \mathrm {diag}(A^{\top } \mathrm {diag}(\mathbb E[B_{T-1}])A)\Vert _{1 \rightarrow 2}. \end{aligned}$$

For a diagonal matrix \(\varDelta \), we have \(\Vert \varDelta \Vert _{1\rightarrow 2}=\Vert \varDelta \Vert \). Combined with \(\Vert A^{\top } M\Vert _{1\rightarrow 2}\le \Vert A\Vert \, \Vert M\Vert _{1\rightarrow 2}\) and \(\Vert M^{\top } A\Vert _{1\rightarrow 2}\le \Vert M\Vert \, \Vert A\Vert _{1\rightarrow 2}\) for any matrix \(M\), this leads to

$$\begin{aligned} \Vert \mathbb E [B_T]\Vert _{1\rightarrow 2}&\le \Vert \mathbb E [B_{T-1}]\Vert _{1\rightarrow 2}+ \eta \ \left( \Vert A\Vert \ \Vert \mathbb E [B_{T-1}]\Vert _{1\rightarrow 2} +\Vert \mathbb E [B_{T-1}]\Vert \ \Vert A\Vert _{1\rightarrow 2}\right) \\&\quad +\,\eta ^2d^2\Vert A\Vert _{1\rightarrow 2}^2 \ \Vert \mathrm {diag}(\mathbb E [B_{T-1}])\Vert . \end{aligned}$$

Finally, using \(\Vert A\Vert _{1\rightarrow 2}\le 1\) and \(\Vert A\Vert \le 1\) concludes the proof.

We then have to understand how the operator norm of \(\mathbb E [B_T]\) evolves.

Lemma 5

We have

$$\begin{aligned} \Vert \mathbb E [B_T]\Vert \le (1+2\eta ) \Vert \mathbb E [B_{T-1}]\Vert + \eta ^2 d^2 \ \Vert \mathrm {diag}(\mathbb E [B_{T-1}])\Vert . \end{aligned}$$
(31)

Proof

Expanding the recurrence relationship (12), taking operator norms and using the triangle inequality gives

$$\begin{aligned} \Vert \mathbb {E}[B_T]\Vert&\le \Vert \mathbb E [B_{T-1}]\Vert +\eta \left( \Vert A^{\top } \mathbb E [B_{T-1}]\Vert +\Vert \mathbb E [B_{T-1}] A\Vert \right) \\&\quad +\eta ^2 d^2 \Vert \mathrm {diag}(A^{\top } \mathrm {diag}(\mathbb E [B_{T-1}])A)\Vert . \end{aligned}$$

Using the same inequalities as in the proofs of the lemmas above then yields the result.
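The three bounds of Lemmas 3, 4 and 5 are deterministic consequences of the triangle inequality, so they can be verified along the iterations. The sketch below assumes that (12) is the recursion \(\mathbb E[B_T]=\mathbb E[B_{T-1}]+\eta (A^{\top }\mathbb E[B_{T-1}]+\mathbb E[B_{T-1}]A)+\eta ^2 d^2\,\mathrm {diag}(A^{\top }\mathrm {diag}(\mathbb E[B_{T-1}])A)\), which is the form expanded in the proofs above; (12) itself is stated in the main text.

```python
import numpy as np

# Iterate the (assumed) recursion (12) for E[B_T] and check the bounds of
# Lemmas 3, 4 and 5 at every step.
rng = np.random.default_rng(3)
d, eta, eps, T = 6, 0.01, 0.1, 50

M = rng.standard_normal((d, d))
A = (M + M.T) / 2
A /= np.linalg.norm(A, 2)                             # ||A|| = 1

def op(X): return np.linalg.norm(X, 2)                # operator norm
def n12(X): return np.max(np.linalg.norm(X, axis=0))  # ||.||_{1->2}
def ndiag(X): return np.max(np.abs(np.diag(X)))       # ||diag(.)||

B = (1 - eps) * np.eye(d) - A                         # E[B_0]
for _ in range(T):
    prev = (op(B), n12(B), ndiag(B))
    B = (B + eta * (A.T @ B + B @ A)
           + eta**2 * d**2 * np.diag(np.diag(A.T @ np.diag(np.diag(B)) @ A)))
    assert ndiag(B) <= 2*eta*prev[1] + (1 + eta**2*d**2)*prev[2] + 1e-10              # Lemma 3
    assert n12(B)   <= eta*prev[0] + (1 + eta)*prev[1] + eta**2*d**2*prev[2] + 1e-10  # Lemma 4
    assert op(B)    <= (1 + 2*eta)*prev[0] + eta**2*d**2*prev[2] + 1e-10              # Lemma 5
```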

Lemma 6

Assume \(\Vert A\Vert =1\). Then we have

$$\begin{aligned} \Vert \mathrm {diag}(\mathbb {E}[B_T])\Vert&\le \alpha \max _j (1-\varepsilon -s_j)+\beta \Vert (1-\varepsilon )I-A\Vert _{1\rightarrow 2}+\gamma \max _j (1-\varepsilon -A_{jj}) \end{aligned}$$
(32)

where

$$\begin{aligned} \alpha&=2\frac{\eta }{\eta d^2+1}\left( \frac{1-\eta ^{T-2}(\eta d^2+2)^{T-2}}{1-\eta (\eta d^2+2)}-\frac{1-\eta ^{T-2}}{1-\eta } \right) \\ \beta&=2\frac{\eta }{\eta d^2 +1}\left( \eta d^2 \frac{1-\eta ^{T-2}(\eta d^2+2)^{T-2}}{1-\eta (\eta d^2+2)}+\frac{1-\eta ^{T-2}}{1-\eta } \right) \\ \gamma&=1+\eta ^2d^2 \frac{1-\eta ^{T-2}(\eta d^2+2)^{T-2}}{1-\eta (\eta d^2 +2)} \end{aligned}$$

Proof

Expanding the recurrence and using the bounds (26), (30), and (31) of Lemmas 3, 4 and 5 yields the following system

$$\begin{aligned} \begin{bmatrix} \Vert \mathbb E[B_T]\Vert \\ \Vert \mathbb E[B_T]\Vert _{1\rightarrow 2}\\ \Vert \mathrm {diag}(\mathbb E[B_T])\Vert \end{bmatrix}\le \left( I+\eta \begin{bmatrix} 2&0&\eta d^2\\ 1&1&\eta d^2\\ 0&2&\eta d^2 \end{bmatrix}\right) \begin{bmatrix} \Vert \mathbb E[B_{T-1}]\Vert \\ \Vert \mathbb E[B_{T-1}]\Vert _{1\rightarrow 2}\\ \Vert \mathrm {diag}(\mathbb E[B_{T-1}])\Vert \end{bmatrix} \end{aligned}$$
(33)

To obtain the result, we expand the inequality by recurrence. Therefore, we are interested in computing the T-th power of the matrix in inequality (33). We have

$$\begin{aligned} \left( I+\eta \begin{bmatrix} 2&0&\eta d^2\\ 1&1&\eta d^2\\ 0&2&\eta d^2 \end{bmatrix}\right) ^{T}=I+\sum _{i=1}^{T} \eta ^i \begin{bmatrix} 2&0&\eta d^2\\ 1&1&\eta d^2\\ 0&2&\eta d^2 \end{bmatrix}^i. \end{aligned}$$
(34)

After computing the matrix powers, it results that

$$\begin{aligned} \Vert \mathrm {diag}(\mathbb E[B_T])\Vert&\le \sum _{i=1}^{T} \left( \eta ^i \frac{2(\eta d^2 +2)^{i-1}-1}{\eta d^2 +1}\right) \Vert \mathbb {E} [B_0]\Vert \nonumber \\&+\,\sum _{i=1}^{T} \left( \eta ^i \frac{2\eta d^2(\eta d^2 +2)^{i-1}+1}{\eta d^2 +1}\right) \Vert \mathbb {E}[B_0]\Vert _{1\rightarrow 2 } \nonumber \\&+\,\left( 1+\eta ^2 d^2\sum _{i=1}^{T} (\eta ^2 d^2+2\eta )^{i-1}\right) \Vert \mathrm {diag}(\mathbb {E}[B_0])\Vert . \end{aligned}$$
(35)

We conclude after computing the sums and bounding from above \(\Vert \mathbb {E}[B_0]\Vert \) by \(\max _{j}(1-\varepsilon -s_j)\).
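The coupled bound (33) can also be propagated numerically: iterating \(v_T \le (I+\eta M)\,v_{T-1}\) componentwise gives explicit upper bounds on the three norms after T steps. The starting vector below is illustrative; in the lemma it collects the norms of \(\mathbb E[B_0]=(1-\varepsilon )I-A\).

```python
import numpy as np

# Propagate the componentwise bound (33): v_T <= (I + eta*M)^T v_0, where
# v = (||E[B]||, ||E[B]||_{1->2}, ||diag(E[B])||).
d, eta, T = 6, 0.01, 50
M = np.array([[2.0, 0.0, eta * d**2],
              [1.0, 1.0, eta * d**2],
              [0.0, 2.0, eta * d**2]])
v = np.array([1.0, 1.0, 1.0])          # illustrative stand-in for the norms of E[B_0]
for _ in range(T):
    v = (np.eye(3) + eta * M) @ v
print(v)                               # componentwise upper bounds after T steps
```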

Lemma 7

For \(\eta <1\) and \(\varepsilon >0\), we have

$$\begin{aligned} \max _{s\in [0,1]} (1+2\eta \ s)^{T}(1-\varepsilon -s) \le 1+\frac{(1+2\eta (1-\varepsilon ))^{T}}{\eta (T+1)} \end{aligned}$$
(36)

Proof

Denote \(f(s)=(1+2\eta \ s)^{T}(1-\varepsilon -s)\). Differentiating f and setting the derivative to zero, we obtain

$$\begin{aligned}&2\eta T(1+2\eta \ s)^{T-1}(1-\varepsilon -s)-(1+2\eta \ s)^{T}=0\\&\Longleftrightarrow 2\eta T(1-\varepsilon -s)-(1+2\eta \ s) =0\\&\Longleftrightarrow \frac{T(1-\varepsilon )-1/(2\eta )}{T+1}=s \end{aligned}$$

Let \(s_c=\frac{T(1-\varepsilon )-1/(2\eta )}{T+1}\) denote this critical point. Consider the following two cases:

  • if \(s_c \notin [0,1]\), then f has no critical point in the domain and therefore is maximised at either domain endpoint, i.e.

    $$\max _{s\in [0,1]} f(s)=\max \{ f(0)=1-\varepsilon ,f(1)=-\varepsilon (1+2\eta )^T\}\le 1$$
  • if \(s_c \in [0,1]\), then f is maximised at \(s_c\) and the value of f at \(s_c\) is

    $$\begin{aligned}&\Bigg ( 1+2\eta \frac{T(1-\varepsilon )-1/(2\eta )}{T+1}\Bigg )^{T} \Bigg (1-\varepsilon -\frac{T(1-\varepsilon )-1/(2\eta )}{T+1}\Bigg )\\&=\Bigg (1+\frac{2\eta T(1-\varepsilon )-1}{T+1}\Bigg )^{T} \Bigg (\frac{1-\varepsilon + 1/(2\eta )}{T+1}\Bigg )\\&\le (1+2\eta (1-\varepsilon ))^{T} \Bigg (\frac{1+1/(2\eta )}{T+1}\Bigg ) \le \frac{(1+2\eta (1-\varepsilon ))^{T}}{\eta (T+1)}. \end{aligned}$$

This analysis proves that the maximum value f can achieve is at most \(\max \{1, \frac{(1+2\eta (1-\varepsilon ))^{T}}{\eta (T+1)}\}\le 1+ \frac{(1+2\eta (1-\varepsilon ))^{T}}{\eta (T+1)}\). Hence the result.
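Lemma 7 is elementary enough to check by brute force on a grid; the parameter values below are arbitrary examples satisfying \(\eta <1\) and \(\varepsilon >0\).

```python
import numpy as np

# Grid check of Lemma 7: max over s in [0,1] of (1+2*eta*s)^T (1-eps-s)
# against the bound 1 + (1+2*eta*(1-eps))^T / (eta*(T+1)).
s = np.linspace(0.0, 1.0, 100001)
for eta in (0.05, 0.2, 0.4):
    for eps, T in ((0.1, 10), (0.01, 100)):
        f_max = np.max((1 + 2 * eta * s) ** T * (1 - eps - s))
        bound = 1 + (1 + 2 * eta * (1 - eps)) ** T / (eta * (T + 1))
        assert f_max <= bound
print("Lemma 7 bound holds on all tested grids")
```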


Copyright information

© 2019 Springer Nature Switzerland AG


Cite this paper

Chrétien, S., Guyeux, C., Ho, ZW.O. (2019). Average Performance Analysis of the Stochastic Gradient Method for Online PCA. In: Nicosia, G., Pardalos, P., Giuffrida, G., Umeton, R., Sciacca, V. (eds) Machine Learning, Optimization, and Data Science. LOD 2018. Lecture Notes in Computer Science, vol 11331. Springer, Cham. https://doi.org/10.1007/978-3-030-13709-0_19
