
Efficient, certifiably optimal clustering with applications to latent variable graphical models


Abstract

Motivated by the task of clustering either d variables or d points into K groups, we investigate efficient algorithms to solve the Peng–Wei (P–W) K-means semi-definite programming (SDP) relaxation. The P–W SDP has been shown in the literature to have good statistical properties in a variety of settings, but remains intractable to solve in practice. To this end, we propose FORCE, a new algorithm to solve this SDP relaxation. Compared to off-the-shelf interior point solvers, our method reduces the computational complexity of solving the SDP from \({\widetilde{{\mathcal {O}}}}(d^7\log \epsilon ^{-1})\) to \({\widetilde{{\mathcal {O}}}}(d^{6}K^{-2}\epsilon ^{-1})\) arithmetic operations for an \(\epsilon \)-optimal solution. Our method combines a primal first-order method with a dual optimality certificate search, which, when successful, allows for early termination of the primal method. We show for certain variable clustering problems that, with high probability, FORCE is guaranteed to find the optimal solution to the SDP relaxation and provide a certificate of exact optimality. As verified by our numerical experiments, this allows FORCE to solve the P–W SDP with dimensions in the hundreds in only tens of seconds. For a variation of the P–W SDP where K is not known a priori, a slight modification of FORCE reduces the computational complexity of solving this problem as well: from \({\widetilde{{\mathcal {O}}}}(d^7\log \epsilon ^{-1})\) using a standard SDP solver to \({\widetilde{{\mathcal {O}}}}(d^{4}\epsilon ^{-1})\).


Notes

  1. If an event occurs with probability q(d) for dimension d, it is said to occur with high probability if \(q(d)\ge 1 - C/d\) for all d sufficiently large.

  2. Note the switch to \(\delta \), with which we denote an additive error; \(\epsilon \), used to quantify the error of FORCE, more properly corresponds to a type of relative additive error (Sect. 3).

  3. The notation \({{\widetilde{{\mathcal {O}}}}}\) is used to suppress poly-log factors of d.

  4. In this context, \(\epsilon \) is a multiplicative error.

  5. We also implemented an MMW algorithm for the P–W SDP, but found that it did not converge in practice; we suspect this is due to the presence of the \(d^2\) equality constraints, which are not satisfied at each iteration of MMW, but we did not investigate this further.

  6. Actually, Renegar [17] works in the setting \({\mathbf {F}}={\mathbf {I}}\); what we present here is a slightly modified version, and later we use the results of the correspondingly adjusted theoretical analysis.

  7. The equalities are inexact because we make no assumptions on the mean of \({{\widehat{\varvec{\varGamma }}}}\), only its convergence rate.

  8. The authors have made the code available on-line at http://bpames.people.ua.edu/software.html.

  9. Similar results can be obtained for other graph structures, such as Band or Hub graphs.

References

  1. Bunea, F., Giraud, C., Royer, M., Verzelen, N.: PECOK: a convex optimization approach to variable clustering. arXiv:1606.05100 (2016)

  2. Dasgupta, S.: The hardness of k-means clustering. University of California, San Diego, Tech. rep. (2008)

  3. Mahajan, M., Nimbhorkar, P., Varadarajan, K.: The planar k-means problem is NP-hard. Theor. Comput. Sci. 442, 13–21 (2012). https://doi.org/10.1016/j.tcs.2010.05.034

  4. Lloyd, S.P.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982). https://doi.org/10.1109/TIT.1982.1056489

  5. Defays, D.: An efficient algorithm for a complete link method. Comput. J. 20(4), 364–366 (1977)

  6. Kumar, A., Kannan, R.: Clustering with spectral norm and the k-means algorithm. In: FOCS. arXiv:1004.1823v1 (2010)

  7. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: SODA (2007)

  8. Peng, J., Wei, Y.: Approximating K-means-type clustering via semidefinite programming. SIAM J. Optim. 18(1), 186–205 (2007). https://doi.org/10.1137/050641983

  9. Vazirani, V.: Approximation Algorithms. Springer, Berlin (2001)

  10. Awasthi, P., Bandeira, A.S.: Relax, no need to round: integrality of clustering formulations. In: ITCS, p. 27, https://doi.org/10.1145/2688073.2688116. arXiv:1408.4045 (2015)

  11. Iguchi, T., Mixon, D.G., Peterson, J., Villar, S.: Probably certifiably correct k-means clustering. Math. Program. pp. 1–29. https://doi.org/10.1007/s10107-016-1097-0. arXiv:1509.07983 (2016)

  12. Bunea, F., Giraud, C., Luo, X., Royer, M., Verzelen, N.: Model assisted variable clustering: minimax-optimal recovery and algorithms. arXiv:1508.01939 (2018)

  13. Bunea, F., Ning, Y., Wegkamp, M.: Overlapping variable clustering with statistical guarantees. arXiv:1704.06977v1 (2017)

  14. Bandeira, A.S.: A note on probably certifiably correct algorithms. arXiv:1509.00824v1 (2015)

  15. Ames, B.P.W.: Guaranteed clustering and biclustering via semidefinite programming. Math. Program. Ser. A 147(1–2), 429–465 (2014). https://doi.org/10.1007/s10107-013-0729-x. arXiv:1202.3663

  16. Iguchi, T., Mixon, D.G., Peterson, J., Villar, S.: On the tightness of an SDP relaxation of k-means. arXiv:1505.04778 (2015)

  17. Renegar, J.: Efficient first-order methods for linear programming and semidefinite programming. arXiv:1409.5832 (2014)

  18. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)

  19. Arora, S., Hazan, E., Kale, S.: Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. In: FOCS (2005)

  20. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011). https://doi.org/10.1561/2200000016. arXiv:1408.2927

  21. Awasthi, P., Sheffet, O.: Improved spectral-norm bounds for clustering. In: Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pp. 37–49. arXiv:1206.3204v2 (2012)

  22. Li, X., Chen, Y., Xu, J.: Convex relaxation methods for community detection. arXiv:1810.00315 (2018)

  23. Abbe, E., Bandeira, A.S., Hall, G.: Exact recovery in the stochastic block model. IEEE Trans. Inf. Theory 62(1) (2016)

  24. Pirinen, A., Ames, B.: Clustering of sparse and approximately sparse graphs by semidefinite programming. arXiv:1603.05296 (2016)

  25. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer US. https://doi.org/10.1007/978-1-4419-8853-9 (2004)

  26. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. Ser. A 103, 127–152 (2005). https://doi.org/10.1007/s10107-004-0552-5

  27. Nesterov, Y.: Smoothing technique and its applications in semidefinite optimization. Oper. Res. 259(1991), 245–259 (2007)

  28. Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015). https://doi.org/10.1561/2200000050. arXiv:1405.4980v2

  29. Ketchen, D., Shook, C.: The application of cluster analysis in strategic management research: an analysis and critique. Strategic Manag. J. 17(6), 441–458 (1996)

  30. Goutte, C., Toft, P., Rostrup, E., Nielsen, F.Å., Hansen, L.K.: On clustering fMRI time series. NeuroImage 9(3), 298–310 (1999). https://doi.org/10.1006/NIMG.1998.0391

  31. O’Donoghue, B., Candès, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15(3), 715–732 (2015). https://doi.org/10.1007/s10208-013-9150-3. arXiv:1204.3982

  32. Andersen, E.D., Andersen, K.D.: The Mosek interior point optimizer for linear programming: an implementation of the homogeneous algorithm. In: High Performance Optimization, Springer, pp. 197–232. https://doi.org/10.1007/978-1-4757-3216-0_8 (2000)

  33. Sun, D., Toh, K.C., Yuan, Y., Zhao, X.Y.: SDPNAL+: A Matlab software for semidefinite programming with bound constraints (version 1.0). arXiv:1710.10604 (2017)

  34. Rudelson, M., Vershynin, R.: Hanson-Wright inequality and sub-Gaussian concentration. arXiv:1306.2872 (2013)

  35. Vershynin, R.: Introduction to the non-asymptotic analysis of random matrices. https://doi.org/10.1017/CBO9780511794308.006. arXiv:1011.3027 (2011)

Download references

Author information

Corresponding author

Correspondence to Carson Eisenach.


Appendices

Appendix A: Proofs omitted in Sect. 3.2

First we have a lemma regarding the concentration of the noise terms \({\mathbf {E}}\) about their mean. Sometimes rather than state these concentration results in terms of d, we state them in terms of \(t \ge d\) to allow for more precise control of constants in our main theorems. We let \({\mathcal {E}}\) denote the event that \(||{{\widehat{\varvec{\varGamma }}}} - \varvec{\varGamma }^*||_{\infty } \le p_1||\varvec{\varGamma }^*||_{\max }\sqrt{\frac{\log d}{n}}\).

Lemma 9

Under the notation and assumptions from previous sections, if \(t\ge d\) then

$$\begin{aligned} \left| \sum _{j=1}^n {\mathbf {1}}^T\varvec{E}_{G^*_{i}}^j\varvec{E}_{G^*_{i}}^{jT}{\mathbf {1}}- {\mathbf {1}}^T\varvec{\varGamma }^*_{G^*_{i},G^*_{i}}{\mathbf {1}}\right| \le c_0 ||\varvec{\varGamma }^*||_{\infty }\sqrt{|G^*_{i}|^2 n \log t}, \end{aligned}$$

with probability at least \(1-\frac{2}{t}\), where \(c_0 = c'(1+\sqrt{p_0})\) is a constant that depends only on \(p_0\) and the absolute constant \(c'\) from Corollary 1. Similarly, with probability at least \(1-\frac{2}{t}\), for \(a \in G^*_{i}\),

$$\begin{aligned} \left| \sum _{j=1}^n {\mathbf {1}}^T\varvec{E}_{G^*_{i}}^j E^j_{a} - \gamma ^*_a \right| \le c_0 ||\varvec{\varGamma }^*||_{\infty }\sqrt{|G^*_{i}| n \log t} , \end{aligned}$$

Proof

To obtain the result, we observe that

$$\begin{aligned} \sum _{j=1}^n {\mathbf {1}}^T\varvec{E}_{G^*_{i}}^j\varvec{E}_{G^*_{i}}^{jT}{\mathbf {1}}- {\mathbf {1}}^T\varvec{\varGamma }^*_{G^*_{i},G^*_{i}}{\mathbf {1}}\end{aligned}$$

is a quadratic form of an \(n|G^*_{i}|\)-dimensional Gaussian random vector with independent entries. In particular, if we define \({\mathbf {M}}\) to be block diagonal with the jth \(|G^*_{i}|\times |G^*_{i}|\) diagonal block equal to \((\varvec{\varGamma }^*_{G^*_{i},G^*_{i}})^{1/2}{\mathbf {1}}{\mathbf {1}}^T(\varvec{\varGamma }^*_{G^*_{i},G^*_{i}})^{1/2}\), then we can apply Corollary 1 with matrix \({\mathbf {M}}\). Because \(||{\mathbf {M}}||_2 \le ||\varvec{\varGamma }^*||_{\infty }|G^*_{i}|\) and \(||{\mathbf {M}}||_F \le ||\varvec{\varGamma }^*||_{\infty }|G^*_{i}|\sqrt{n}\), applying the corollary gives

$$\begin{aligned} \left| \sum _{j=1}^n {\mathbf {1}}^T\varvec{E}_{G^*_{i}}^j\varvec{E}_{G^*_{i}}^{jT}{\mathbf {1}}- {\mathbf {1}}^T\varvec{\varGamma }^*_{G^*_{i},G^*_{i}}{\mathbf {1}}\right| \le c' ||\varvec{\varGamma }^*||_{\infty }\left( \sqrt{|G^*_{i}|^2 n \log t} + |G^*_{i}|\log t \right) , \end{aligned}$$

with probability at least \(1-\frac{2}{t}\). Using the assumption \(\log d \le p_0 n\) gives the desired result. The proof of the second statement follows similarly, taking instead the diagonal blocks of \({\mathbf {M}}\) as \((\varvec{\varGamma }^*_{G^*_{i},G^*_{i}})^{1/2}{\mathbf {1}}{\mathbf {e}}_a^T(\varvec{\varGamma }^*_{G^*_{i},G^*_{i}})^{1/2}\), giving \(||{\mathbf {M}}||_2 \le ||\varvec{\varGamma }^*||_{\infty }\sqrt{|G^*_{i}|}\) and \(||{\mathbf {M}}||_F \le ||\varvec{\varGamma }^*||_{\infty }\sqrt{n|G^*_{i}|}\). \(\square \)
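
The norm bounds on \({\mathbf {M}}\) used above follow from the rank-one structure of each diagonal block: writing \({\mathbf {v}}= (\varvec{\varGamma }^*_{G^*_{i},G^*_{i}})^{1/2}{\mathbf {1}}\), each block equals \({\mathbf {v}}{\mathbf {v}}^T\), so that

$$\begin{aligned} ||{\mathbf {M}}||_2 = ||{\mathbf {v}}{\mathbf {v}}^T||_2 = {\mathbf {1}}^T\varvec{\varGamma }^*_{G^*_{i},G^*_{i}}{\mathbf {1}}\le ||\varvec{\varGamma }^*||_{\infty }|G^*_{i}|, \qquad ||{\mathbf {M}}||_F = \sqrt{n}\,||{\mathbf {v}}{\mathbf {v}}^T||_F \le ||\varvec{\varGamma }^*||_{\infty }|G^*_{i}|\sqrt{n}, \end{aligned}$$

since the \(n\) diagonal blocks are identical and a rank-one matrix has equal spectral and Frobenius norms.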

1.1 Proof of Lemma 6

Step 1 For notation, \(c_i\) will be used to denote absolute constants. The first step is to decompose \({\mathbf {Q}}_{i}^{\perp }(\varvec{X})\). Recall that under the G-Latent model, \({\mathbf {D}}= -{{\widehat{\varvec{\varSigma }}}} + {{\widehat{\varvec{\varGamma }}}}\). Substituting that into the expression for \({\mathbf {Q}}_{i}^{\perp }(\varvec{X})\) gives

$$\begin{aligned} {\mathbf {Q}}_{i}^{\perp }(\varvec{X})&= \mathop {\underbrace{-\frac{1}{|G^*_{i}|^2} \left( {\mathbf {1}}^T{\widehat{\varvec{\varSigma }}}_{G^*_{i},G^*_{i}}{\mathbf {1}}\right) {\mathbf {1}}{\mathbf {1}}^T + \frac{1}{|G^*_{i}|}\left( {\mathbf {1}}{\mathbf {1}}^T {\widehat{\varvec{\varSigma }}}_{G^*_{i},G^*_{i}} + {\widehat{\varvec{\varSigma }}}_{G^*_{i},G^*_{i}}{\mathbf {1}}{\mathbf {1}}^T\right) - {\widehat{\varvec{\varSigma }}}_{G^*_{i},G^*_{i}}}}\limits _{\mathrm{(i)}}\\&\quad + \mathop {\underbrace{\frac{1}{|G^*_{i}|^2} \left( {\mathbf {1}}^T{\widehat{\varvec{\varGamma }}}_{G^*_{i},G^*_{i}}{\mathbf {1}}\right) {\mathbf {1}}{\mathbf {1}}^T - \frac{1}{|G^*_{i}|}\left( {\mathbf {1}}{\mathbf {1}}^T {\widehat{\varvec{\varGamma }}}_{G^*_{i},G^*_{i}} + {\widehat{\varvec{\varGamma }}}_{G^*_{i},G^*_{i}}{\mathbf {1}}{\mathbf {1}}^T\right) + {\widehat{\varvec{\varGamma }}}_{G^*_{i},G^*_{i}}}}\limits _{\mathrm{(ii)}}. \end{aligned}$$

For (i), recall that by the definition of the G-Latent model,

$$\begin{aligned} {\widehat{\varvec{\varSigma }}}_{G^*_{i},G^*_{i}} = \frac{1}{n}\sum _{j=1}^n\varvec{X}_{G^*_{i}}^j\varvec{X}_{G^*_{i}}^{jT} = \frac{1}{n}\sum _{j=1}^n(Z_i^j{\mathbf {1}} + \varvec{E}_{G^*_{i}}^j)(Z_i^j{\mathbf {1}} + \varvec{E}_{G^*_{i}}^j)^T. \end{aligned}$$

Plugging this into (i) and simplifying gives us that

$$\begin{aligned} \text {(i)} = \frac{1}{n}\sum _{j=1}^n\left( -\frac{{\mathbf {1}}^T\varvec{E}_{G^*_{i}}^j\varvec{E}_{G^*_{i}}^{jT}{\mathbf {1}}}{|G^*_{i}|^2} {\mathbf {1}}{\mathbf {1}}^T + \frac{{\mathbf {1}}^T\varvec{E}_{G^*_{i}}^j}{|G^*_{i}|}\left( {\mathbf {1}}\varvec{E}_{G^*_{i}}^{jT} + \varvec{E}_{G^*_{i}}^j{\mathbf {1}}^T\right) - \varvec{E}_{G^*_{i}}^j\varvec{E}_{G^*_{i}}^{jT} \right) . \end{aligned}$$

Now we see that, again, the expression for \({\mathbf {Q}}_{i}^{\perp }(\varvec{X})\) has eight terms. We first show that each concentrates around its mean at the desired rate, and then use the triangle inequality to obtain the final result. Fortunately, we can subtract the mean from each of the eight terms in the expression for \({\mathbf {Q}}_{i}^{\perp }(\varvec{X})\), as the means for (i) are offset by the means for (ii). To give the new decomposition of \({\mathbf {Q}}_{i}^{\perp }(\varvec{X})\) explicitly,

$$\begin{aligned} {\mathbf {Q}}_{i}^{\perp }(\varvec{X})&= -\mathop {\underbrace{\sum _{j=1}^n\frac{{\mathbf {1}}^T\varvec{E}_{G^*_{i}}^j\varvec{E}_{G^*_{i}}^{jT}{\mathbf {1}}}{n|G^*_{i}|^2} {\mathbf {1}}{\mathbf {1}}^T}}\limits _{\mathrm{(i).a}} + \mathop {\underbrace{\sum _{j=1}^n\frac{{\mathbf {1}}^T\varvec{E}_{G^*_{i}}^j}{n|G^*_{i}|}{\mathbf {1}}\varvec{E}_{G^*_{i}}^{jT}}}\limits _{\mathrm{(i).b}} + \mathop {\underbrace{\sum _{j=1}^n\frac{{\mathbf {1}}^T\varvec{E}_{G^*_{i}}^j}{n|G^*_{i}|}\varvec{E}_{G^*_{i}}^j{\mathbf {1}}^T}}\limits _{\mathrm{(i).c}} - \mathop {\underbrace{\frac{1}{n}\sum _{j=1}^n\varvec{E}_{G^*_{i}}^j\varvec{E}_{G^*_{i}}^{jT}}}\limits _{\mathrm{(i).d}}\nonumber \\&\quad + \mathop {\underbrace{\frac{1}{|G^*_{i}|^2} \left( {\mathbf {1}}^T{\widehat{\varvec{\varGamma }}}_{G^*_{i},G^*_{i}}{\mathbf {1}}\right) {\mathbf {1}}{\mathbf {1}}^T}}\limits _{\mathrm{(ii).a}} - \mathop {\underbrace{\frac{1}{|G^*_{i}|}{\mathbf {1}}{\mathbf {1}}^T {\widehat{\varvec{\varGamma }}}_{G^*_{i},G^*_{i}}}}\limits _{\mathrm{(ii).b}} + \mathop {\underbrace{\frac{1}{|G^*_{i}|}{\widehat{\varvec{\varGamma }}}_{G^*_{i},G^*_{i}}{\mathbf {1}}{\mathbf {1}}^T}}\limits _{\mathrm{(ii).c}} + \mathop {\underbrace{{\widehat{\varvec{\varGamma }}}_{G^*_{i},G^*_{i}}}}\limits _{\mathrm{(ii).d}}. \end{aligned}$$
(30)

Step 2 For the term (i).a, we can directly apply Lemma 9. Doing so, it follows immediately that with probability at least \(1-\frac{2}{t}\)

$$\begin{aligned} \left| \left| \sum _{j=1}^n\frac{{\mathbf {1}}^T\varvec{E}_{G^*_{i}}^j\varvec{E}_{G^*_{i}}^{jT}{\mathbf {1}}}{n|G^*_{i}|^2} {\mathbf {1}}{\mathbf {1}}^T - \frac{1}{|G^*_{i}|^2}\left( {\mathbf {1}}^T\varvec{\varGamma }^*_{G^*_{i},G^*_{i}}{\mathbf {1}}\right) {\mathbf {1}}{\mathbf {1}}^T\right| \right| _2 \le c_0 ||\varvec{\varGamma }^*||_{\infty }\sqrt{\frac{\log t}{n}}. \end{aligned}$$

For the term (i).c (and so by symmetry (i).b), we observe that it has the form \({\mathbf {u}}{\mathbf {v}}^T\) and that \(||{\mathbf {u}}{\mathbf {v}}^T||_2 = ||{\mathbf {u}}||_2||{\mathbf {v}}||_2\). Therefore, we can apply Lemma 9 and obtain that with probability at least \(1-2|G^*_{i}|/t^2\),

$$\begin{aligned} \bigg |\bigg |\sum _{j=1}^n\frac{{\mathbf {1}}^T\varvec{E}_{G^*_{i}}^j}{n|G^*_{i}|}\varvec{E}_{G^*_{i}}^j{\mathbf {1}}^T - \frac{1}{|G^*_{i}|}{\mathbf {1}}{\mathbf {1}}^T \varvec{\varGamma }^*_{G^*_{i},G^*_{i}}\bigg |\bigg |_2 \le c_0 ||\varvec{\varGamma }^*||_{\infty } \sqrt{\frac{2\log t}{n}}. \end{aligned}$$

Step 3 Now we control the term (i).d, the sample covariance matrix of the errors. We can directly apply Corollary 3 to obtain that with probability at least \(1-2/t\)

$$\begin{aligned}&\left| \left| {\frac{1}{n}\sum _{j=1}^n\varvec{E}_{G^*_{i}}^j\varvec{E}_{G^*_{i}}^{jT} - \varvec{\varGamma }^*_{G^*_{i},G^*_{i}}}\right| \right| _2 \\&\quad \le ||\varvec{\varGamma }^*||_{\infty }\left( \frac{|G^*_{i}|}{n} + 2\frac{\sqrt{2|G^*_{i}|\log t}}{n} + 2\sqrt{\frac{|G^*_{i}|}{n}} + (2+\sqrt{p_0})\sqrt{\frac{2\log t}{n}}\right) \\&\quad \le ||\varvec{\varGamma }^*||_{\infty }\left( \frac{d}{n} + (2+2\sqrt{2p_0})\sqrt{\frac{d}{n}} + (2+\sqrt{p_0})\sqrt{\frac{2\log t}{n}}\right) . \end{aligned}$$

Step 4 For the terms in (ii), consider first (ii).a. We see that

$$\begin{aligned} \bigg |\bigg |\left( {\mathbf {1}}^T{\widehat{\varvec{\varGamma }}}_{G^*_{i},G^*_{i}}{\mathbf {1}}\right) {\mathbf {1}}{\mathbf {1}}^T - \left( {\mathbf {1}}^T\varvec{\varGamma }^*_{G^*_{i},G^*_{i}}{\mathbf {1}}\right) {\mathbf {1}}{\mathbf {1}}^T\bigg |\bigg |_{\max } \le |G^*_{i}|||{\widehat{\varvec{\varGamma }}}_{G^*_{i},G^*_{i}} - \varvec{\varGamma }^*_{G^*_{i},G^*_{i}} ||_{\infty } \end{aligned}$$

Conditional on event \({\mathcal {E}}\),

$$\begin{aligned} \bigg |\bigg |\frac{1}{|G^*_{i}|^2} \left( {\mathbf {1}}^T{{\widehat{\varvec{\varGamma }}}}_{G^*_{i},G^*_{i}}{\mathbf {1}}\right) {\mathbf {1}}{\mathbf {1}}^T - \frac{1}{|G^*_{i}|^2} \left( {\mathbf {1}}^T\varvec{\varGamma }^*_{G^*_{i},G^*_{i}}{\mathbf {1}}\right) {\mathbf {1}}{\mathbf {1}}^T\bigg |\bigg |_{\max } \le \frac{p_1||\varvec{\varGamma }^*||_{\infty }}{|G^*_{i}|}\sqrt{\frac{\log d}{n}}. \end{aligned}$$

Because the matrices above are a multiple of \({\mathbf {1}}{\mathbf {1}}^T\), it follows that

$$\begin{aligned} \bigg |\bigg |\frac{1}{|G^*_{i}|^2} \left( {\mathbf {1}}^T{{\widehat{\varvec{\varGamma }}}}_{G^*_{i},G^*_{i}}{\mathbf {1}}\right) {\mathbf {1}}{\mathbf {1}}^T - \frac{1}{|G^*_{i}|^2} \left( {\mathbf {1}}^T\varvec{\varGamma }^*_{G^*_{i},G^*_{i}}{\mathbf {1}}\right) {\mathbf {1}}{\mathbf {1}}^T\bigg |\bigg |_{2} \le p_1 ||\varvec{\varGamma }^*||_{\infty }\sqrt{\frac{\log d}{n}}. \end{aligned}$$

Next for (ii).b (and (ii).c by symmetry), we can see that

$$\begin{aligned} \bigg |\bigg |\frac{1}{|G^*_{i}|}{\mathbf {1}}{\mathbf {1}}^T {{\widehat{\varvec{\varGamma }}}}_{G^*_{i},G^*_{i}} - \frac{1}{|G^*_{i}|}{\mathbf {1}}{\mathbf {1}}^T \varvec{\varGamma }^*_{G^*_{i},G^*_{i}}\bigg |\bigg |_{2} = \frac{1}{|G^*_{i}|}\bigg |\bigg |{\mathbf {1}}{\mathbf {1}}^T \left( {{\widehat{\varvec{\varGamma }}}}_{G^*_{i},G^*_{i}} - \varvec{\varGamma }^*_{G^*_{i},G^*_{i}} \right) \bigg |\bigg |_{2}. \end{aligned}$$
(31)

Because \({\widehat{\varvec{\varGamma }}}\) and \(\varvec{\varGamma }^*\) are diagonal, we can use event \({\mathcal {E}}\) and the fact that for matrices of the form \({\mathbf {u}}{\mathbf {v}}^T\), \(||{\mathbf {u}}{\mathbf {v}}^T||_2 = ||{\mathbf {u}}||_2||{\mathbf {v}}||_2\), to obtain

$$\begin{aligned} \bigg |\bigg |\frac{1}{|G^*_{i}|}{\mathbf {1}}{\mathbf {1}}^T {{\widehat{\varvec{\varGamma }}}}_{G^*_{i},G^*_{i}} - \frac{1}{|G^*_{i}|}{\mathbf {1}}{\mathbf {1}}^T \varvec{\varGamma }^*_{G^*_{i},G^*_{i}}\bigg |\bigg |_{2} \le p_1 ||\varvec{\varGamma }^*||_{\infty }\sqrt{\frac{\log d}{n}}. \end{aligned}$$

The same result is immediate for (ii).d by (4). Therefore, by combining the above, applying the triangle inequality to (30), using that \({\mathcal {E}}\) occurs with probability at least \(1-p_2/d^2\), and choosing \(t=d^2\), we find that with probability at least \(1-\frac{c_2}{d^2}\)

$$\begin{aligned} ||{\mathbf {Q}}_{i}^{\perp }(\varvec{X})||_2 \le c_1||\varvec{\varGamma }^*||_{\infty }\left( \frac{d}{n} +\sqrt{\frac{d}{n}} + \sqrt{\frac{\log d}{n}} \right) , \end{aligned}$$

concluding the proof. \(\square \)

1.2 Proof of Lemma 7

Under the G-Latent model,

$$\begin{aligned} y'_{a,b}({\mathbf {X}},y_T) = -\mathop {\underbrace{{\widehat{\varSigma }}_{ a, b }}}\limits _{\mathrm{(i)}} + \mathop {\underbrace{y_a({\mathbf {X}},y_T)}}\limits _{\mathrm{(ii)}} + \mathop {\underbrace{y_b({\mathbf {X}},y_T)}}\limits _{\mathrm{(iii)}} \end{aligned}$$

Above, we saw that

$$\begin{aligned} y_a(\varvec{X},y_T) = \frac{1}{2|G^*_{i}|^2}{\mathbf {1}}^T{\mathbf {D}}_{G^*_{i},G^*_{i}}{\mathbf {1}}- \frac{1}{|G^*_{i}|}{\mathbf {D}}_{a,G^*_{i}}{\mathbf {1}}- \frac{1}{2|G^*_{i}|}y_T, \end{aligned}$$

and likewise for \(y_b\). Below we set \(\sigma _1 = \max _i C_{i,i}^*\) and \(\sigma _2 = \max \{\max _i C_{i,i}^*,||\varvec{\varGamma }^*||_{\infty }\}\). Following the same decomposition as in the proof of Lemma 6, we get that

$$\begin{aligned} y_a(\varvec{X},y_T)&= -\frac{1}{2|G^*_{i}|^2}{\mathbf {1}}^T{\widehat{\varvec{\varSigma }}}_{ G^*_{i},G^*_{i} }{\mathbf {1}}+ \frac{1}{2|G^*_{i}|^2}{\mathbf {1}}^T{\widehat{\varvec{\varGamma }}}_{G^*_{i},G^*_{i}}{\mathbf {1}}+ \frac{1}{|G^*_{i}|}{\widehat{\varvec{\varSigma }}}_{ a,G^*_{i} } {\mathbf {1}}- \frac{1}{|G^*_{i}|}{\widehat{\varGamma }}_{a,a} - \frac{1}{2|G^*_{i}|}y_T \\&= \mathop {\underbrace{\frac{1}{n}\sum _{l=1}^n \frac{1}{2}(Z_i^l)^2}}\limits _{\mathrm{(ii).a}} - \mathop {\underbrace{\frac{1}{2n|G^*_{i}|^2}\sum _{l=1}^n ({\mathbf {1}}^T\varvec{E}_{G^*_{i}}^l)^2}}\limits _{\mathrm{(ii).b}} + \mathop {\underbrace{\frac{1}{n|G^*_{i}|}\sum _{l=1}^n E_a^l {\mathbf {1}}^T\varvec{E}_{G^*_{i}}^l}}\limits _{\mathrm{(ii).c}} + \mathop {\underbrace{\frac{1}{n}\sum _{l=1}^n E^l_a Z_i^l}}\limits _{\mathrm{(ii).d}}\\&\quad + \mathop {\underbrace{\frac{1}{2|G^*_{i}|^2}{\mathbf {1}}^T{\widehat{\varvec{\varGamma }}}_{G^*_{i},G^*_{i}}{\mathbf {1}}}}\limits _{\mathrm{(ii).e}} - \mathop {\underbrace{\frac{1}{|G^*_{i}|}{\widehat{\varGamma }}_{a,a}}}\limits _{\mathrm{(ii).f}} - \frac{1}{2|G^*_{i}|}y_T. \end{aligned}$$

As in the proof of Lemma 6, the means of (ii).b and (ii).c offset the means of (ii).e and (ii).f. To control terms (ii).b and (ii).c, we apply Lemma 9: with probability at least \(1-1/t\),

$$\begin{aligned} \frac{1}{2n|G^*_{i}|^2}\sum _{j=1}^n\left( {\mathbf {1}}^T\varvec{E}_{G^*_{i}}^j\varvec{E}_{G^*_{i}}^{jT}{\mathbf {1}}- {\mathbf {1}}^T\varvec{\varGamma }^*_{G^*_{i},G^*_{i}}{\mathbf {1}}\right) \le \frac{c_0 ||\varvec{\varGamma }^*||_{\infty }}{2} \sqrt{\frac{\log t}{n |G^*_{i}|^2 }}. \end{aligned}$$

Likewise, by Lemma 9,

$$\begin{aligned} \frac{1}{n|G^*_{i}|}\sum _{j=1}^n \left( E_a^j \varvec{E}_{G^*_{i}}^{jT}{\mathbf {1}}- \gamma _{a}^*\right) \ge -c_0 ||\varvec{\varGamma }^*||_{\infty }\sqrt{\frac{\log t}{n |G^*_{i}| }}, \end{aligned}$$

with probability at least \(1-1/t\). Conditional on event \({\mathcal {E}}\), (4) shows that

$$\begin{aligned} \frac{1}{2|G^*_{i}|^2}\left( {\mathbf {1}}^T{\widehat{\varvec{\varGamma }}}_{G^*_{i},G^*_{i}}{\mathbf {1}}- {\mathbf {1}}^T\varvec{\varGamma }^*_{G^*_{i},G^*_{i}}{\mathbf {1}}\right)&\ge -p_1 ||\varvec{\varGamma }^*||_{\infty }\sqrt{\frac{\log d}{n |G^*_{i}| }},\\ \frac{1}{|G^*_{i}|}\left( {\widehat{\varGamma }}_{a,a} - \varGamma ^*_{a,a} \right)&\le p_1||\varvec{\varGamma }^*||_{\infty }\sqrt{\frac{\log d}{n |G^*_{i}| }}. \end{aligned}$$

Lastly, recalling that \(\sigma _1= \max _i C_{i,i}^*\), term (ii).d can be bounded using Corollary 1, which gives that

$$\begin{aligned} \frac{1}{n}\sum _{l=1}^n E^l_a Z_i^l \ge - c_0||\varvec{\varGamma }^*||_{\infty }^{1/2}\sigma _1^{1/2} \sqrt{\frac{\log t}{n}}, \end{aligned}$$
(32)

with probability at least \(1-1/t\). The same results can be obtained for \(y_b\). For the terms in (i), we expand as before:

$$\begin{aligned} {\widehat{\varSigma }}_{a,b}= \mathop {\underbrace{\frac{1}{n}\sum _{l=1}^n Z_i^lZ_j^l}}\limits _{\mathrm{(i).a}} + \mathop {\underbrace{\frac{1}{n}\sum _{l=1}^nE_a^lZ_j^l}}\limits _{\mathrm{(i).b}} + \mathop {\underbrace{\frac{1}{n}\sum _{l=1}^nE_b^lZ_i^l}}\limits _{\mathrm{(i).c}} + \mathop {\underbrace{\frac{1}{n}\sum _{l=1}^n E_a^l E_b^l}}\limits _{\mathrm{(i).d}}. \end{aligned}$$

Terms (i).b and (i).c can be bounded in the same way as (32). Term (i).d can be bounded by Corollary 1, giving that

$$\begin{aligned} \frac{1}{n}\sum _{l=1}^n E_a^l E_b^l \ge -c_0||\varvec{\varGamma }^*||_{\infty }\sqrt{\frac{\log t}{n}}, \end{aligned}$$

with probability at least \(1-1/t\). All that remains is to bound the terms (i).a, (ii).a and (iii).a. Fortunately, these correspond to the population quantity \(\varDelta ({\mathbf {C}}^*)\). Observing that this is just a quadratic form of a 2n-dimensional Gaussian vector, we can apply the argument of Lemma 9. Doing so gives that

$$\begin{aligned} \frac{1}{2n}\left( \sum _{l=1}^n (Z_i^l)^2 + \sum _{l=1}^n (Z_j^l)^2 - 2\sum _{l=1}^n Z_i^lZ_j^l \right) \ge \frac{1}{2}\left( C_{i,i}^* + C_{j,j}^* - 2C_{i,j}^*\right) - 2c_0 \sigma _1\sqrt{\frac{\log t}{n}} \end{aligned}$$

with probability at least \(1-1/t\). Combining all the bounds for (i)-(iii), using that \({\mathcal {E}}\) occurs with probability at least \(1-p_2/d^3\), and selecting \(t=d^3\), we can see that, with probability at least \(1-c_1/d^3\)

$$\begin{aligned} y'_{a,b}&\ge \frac{1}{2}(C_{i,i}^* + C_{j,j}^* - 2C_{i,j}^*) - \frac{1}{2|G^*_{i}|}y_T - \frac{1}{2|G^*_{j}|}y_T - c_1||\varvec{\varGamma }^*||_{\infty }\sqrt{\frac{\log d}{n |G^*_{i}| }} - c_2\sigma _2\sqrt{\frac{\log d}{n}}\\&\ge \frac{1}{2}\varDelta ({\mathbf {C}}^*) - \frac{1}{2|G^*_{i}|}y_T - \frac{1}{2|G^*_{j}|}y_T - c_1||\varvec{\varGamma }^*||_{\infty }\sqrt{\frac{\log d}{n |G^*_{i}| }} - c_2\sigma _2\sqrt{\frac{\log d}{n}}. \end{aligned}$$

\(\square \)

Appendix B: Some technical lemmas

Lemma 10

Let \({\mathbf {M}}\) be a \(d\times d\) real, symmetric matrix of the form

$$\begin{aligned} {\mathbf {M}}= a{\mathbf {I}}+ b{\mathbf {1}}{\mathbf {1}}^T, \end{aligned}$$

where \(a,b \in {\mathbb {R}}\). Then \({\mathbf {M}}\) has eigenvalues \(a+db\) with multiplicity 1 and \(a\) with multiplicity \(d-1\). If \(a,b > 0\), then \({\mathbf {M}}\) also has the property that

$$\begin{aligned} {\mathbf {M}}^{1/2} =\,&\sqrt{a}{\mathbf {I}}+ \frac{\sqrt{a+db} - \sqrt{a}}{d}{\mathbf {1}}{\mathbf {1}}^T,\\ {\mathbf {M}}^{-1} =\,&\frac{1}{a}{\mathbf {I}}- \frac{b}{a^2+abd}{\mathbf {1}}{\mathbf {1}}^T,\\ {\mathbf {M}}^{-1/2} =\,&\frac{1}{\sqrt{a}}{\mathbf {I}}- \frac{\sqrt{a+db} - \sqrt{a}}{d\sqrt{a^2 + dab}}{\mathbf {1}}{\mathbf {1}}^T. \end{aligned}$$

Proof

By the Sherman–Morrison formula, a matrix of the form \({\mathbf {M}}= a{\mathbf {I}}+ b{\mathbf {1}}{\mathbf {1}}^T\) with \(a,b > 0\) has the inverse

$$\begin{aligned} {\mathbf {M}}^{-1} = \frac{1}{a}{\mathbf {I}}- \frac{b}{a^2+abd}{\mathbf {1}}{\mathbf {1}}^T. \end{aligned}$$

Because \({\mathbf {M}}\succ 0\), all eigenvalues are strictly positive; denote by \(\lambda _i\) and \({\mathbf {q}}_i\) the eigenvalues and corresponding eigenvectors. Without loss of generality, let the \({\mathbf {q}}_i\) be orthonormal. Then we can write \({\mathbf {M}}= \sum _i \lambda _i {\mathbf {q}}_i{\mathbf {q}}_i^T\). By the form of \({\mathbf {M}}\), clearly \(\frac{1}{\sqrt{d}}{\mathbf {1}}\) is always an eigenvector of \({\mathbf {M}}\) with eigenvalue \(a+db\), so we can take \({\mathbf {q}}_1 = \frac{1}{\sqrt{d}}{\mathbf {1}}\) and \(\lambda _1 = a+db\). The remaining \({\mathbf {q}}_i\) span the orthogonal complement of \({\mathbf {1}}\) and have corresponding eigenvalues \(\lambda _i = a\). Therefore,

$$\begin{aligned} {\mathbf {M}}^{1/2} = \frac{\sqrt{a+db}}{d}{\mathbf {1}}{\mathbf {1}}^T + \sum _{i=2}^d \sqrt{a}{\mathbf {q}}_i{\mathbf {q}}_i^T. \end{aligned}$$

Because \(\sum _{i=2}^d {\mathbf {q}}_i{\mathbf {q}}_i^T = {\mathbf {I}}- \frac{1}{d}{\mathbf {1}}{\mathbf {1}}^T\), the above gives

$$\begin{aligned} {\mathbf {M}}^{1/2} = \sqrt{a}{\mathbf {I}}+ \frac{\sqrt{a+db} - \sqrt{a}}{d}{\mathbf {1}}{\mathbf {1}}^T. \end{aligned}$$

Using the expression for \({\mathbf {M}}^{-1}\) given above, it follows that

$$\begin{aligned} {\mathbf {M}}^{-1/2} = \frac{1}{\sqrt{a}}{\mathbf {I}}- \frac{\sqrt{a+db} - \sqrt{a}}{d\sqrt{a^2 + dab}}{\mathbf {1}}{\mathbf {1}}^T. \end{aligned}$$

\(\square \)
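
The closed forms in Lemma 10 are also easy to check numerically; the following minimal script (an illustrative sketch only, assuming NumPy is available) verifies them for one choice of \(a, b > 0\):

import numpy as np

# Numerical sanity check of the closed forms in Lemma 10 for M = a*I + b*(1 1^T).
d, a, b = 7, 1.3, 0.4
I, J = np.eye(d), np.ones((d, d))
M = a * I + b * J

M_half = np.sqrt(a) * I + (np.sqrt(a + d * b) - np.sqrt(a)) / d * J
M_inv = (1 / a) * I - b / (a**2 + a * b * d) * J
M_inv_half = (1 / np.sqrt(a)) * I - (
    (np.sqrt(a + d * b) - np.sqrt(a)) / (d * np.sqrt(a**2 + d * a * b))
) * J

# Eigenvalues: a + d*b with multiplicity 1, and a with multiplicity d - 1.
print(np.round(np.linalg.eigvalsh(M), 6))
print(np.allclose(M_half @ M_half, M))              # (M^{1/2})^2 recovers M
print(np.allclose(M_inv @ M, I))                    # M^{-1} M = I
print(np.allclose(M_inv_half @ M_inv_half, M_inv))  # (M^{-1/2})^2 recovers M^{-1}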

The following result for quadratic forms of standard multivariate Gaussian random variables can be found in many forms in the literature (for example, Rudelson and Vershynin [34]).

Lemma 11

(Hanson-Wright inequality for Gaussian random variables) Let \(\varvec{X}\sim N(0,{\mathbf {I}})\) be a d-dimensional random vector and let \({\mathbf {A}}\in {\mathbb {R}}^{d \times d}\). Then

$$\begin{aligned} {\mathbb {P}}\left( |\varvec{X}^T {\mathbf {A}}\varvec{X}- {\mathbb {E}}\left[ \varvec{X}^T {\mathbf {A}}\varvec{X}\right] | \ge t \right) \le 2 \exp \left( -c \min \left\{ \frac{t^2}{||{\mathbf {A}}||_F^2},\frac{t}{||{\mathbf {A}}||_2} \right\} \right) , \end{aligned}$$

for some absolute constant c.

In particular, the following corollary is useful.

Corollary 1

Let \(\varvec{X}\sim N(0,{\mathbf {I}})\) be a d-dimensional random vector and let \({\mathbf {A}}\in {\mathbb {R}}^{d \times d}\). Then

$$\begin{aligned} {\mathbb {P}}\left( |\varvec{X}^T {\mathbf {A}}\varvec{X}- {\mathbb {E}}\left[ \varvec{X}^T {\mathbf {A}}\varvec{X}\right] | \ge ||{\mathbf {A}}||_F\sqrt{t} + ||{\mathbf {A}}||_2 t \right) \le 2 \exp \left( -c t \right) , \end{aligned}$$

for some absolute constant c. Equivalently,

$$\begin{aligned} |\varvec{X}^T {\mathbf {A}}\varvec{X}- {\mathbb {E}}\left[ \varvec{X}^T {\mathbf {A}}\varvec{X}\right] | \le c'\left( ||{\mathbf {A}}||_F\sqrt{\log t} + ||{\mathbf {A}}||_2 \log t \right) \end{aligned}$$

with probability at least \(1-2/t\) for some absolute constant \(c'\).
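
Corollary 1 is a direct specialization of Lemma 11: setting \(s = ||{\mathbf {A}}||_F\sqrt{t} + ||{\mathbf {A}}||_2\,t\) gives \(s^2/||{\mathbf {A}}||_F^2 \ge t\) and \(s/||{\mathbf {A}}||_2 \ge t\), so the minimum in the exponent of Lemma 11 is at least \(t\) and the tail probability is at most \(2\exp (-ct)\). The second form follows by replacing \(t\) with \(\log t\) and taking \(c' = \max \{1, 1/c\}\), which makes the failure probability at most \(2\exp (-c\,c'\log t) \le 2/t\).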

Below we are concerned with the rate at which a sample covariance matrix concentrates around its mean in spectral norm, that is, with \(||{{\widehat{\varvec{\varSigma }}}} - \varvec{\varSigma }^*||_2\). If we write \({{\widehat{\varvec{\varSigma }}}} = \frac{1}{n}\varvec{X}^T\varvec{X}\), where \(\varvec{X}\) refers to the \(n\times d\) matrix whose rows are the observations \(\varvec{X}_i\), we see how such a result is directly applicable to the problem at hand. We repeat the statement of Gordon’s Theorem given in Vershynin [35] below as Proposition 1. We use the notation from Vershynin [35] of \(s_{\min }\) and \(s_{\max }\) to denote the smallest and largest singular values, respectively.

Proposition 1

Let \(\varvec{X}\) be an \(n \times d\) matrix whose entries are independent standard normal random variables. Then

$$\begin{aligned} \sqrt{n} - \sqrt{d} \le {\mathbb {E}}[s_{\min }(\varvec{X})] \le {\mathbb {E}}[s_{\max }(\varvec{X})] \le \sqrt{n} + \sqrt{d} \end{aligned}$$

Using the result on sub-Gaussian concentration of a Lipschitz function of independent random variables, we immediately obtain the following corollary (also given in Vershynin [35]).

Corollary 2

Let \(\varvec{X}\) be an \(n \times d\) matrix whose entries are independent standard normal random variables. Then for every \(t \ge 0\),

$$\begin{aligned} \sqrt{n} - \sqrt{d} -t \le s_{\min }(\varvec{X}) \le s_{\max }(\varvec{X}) \le \sqrt{n} + \sqrt{d} + t \end{aligned}$$

with probability at least \(1-2\exp (-t^2/2)\).

Proof

Observing that the functions \(s_{\min }\) and \(s_{\max }\) are 1-Lipschitz and using the sub-Gaussian tail bound, the result is immediate from the above. \(\square \)
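
As a quick empirical illustration of Corollary 2 (a simulation sketch only, assuming NumPy; the constants are arbitrary), one can check the singular value bounds directly:

import numpy as np

# Empirical check of the singular value bounds in Corollary 2.
rng = np.random.default_rng(0)
n, d, t, trials = 2000, 100, 3.0, 200
lo, hi = np.sqrt(n) - np.sqrt(d) - t, np.sqrt(n) + np.sqrt(d) + t
hits = 0
for _ in range(trials):
    s = np.linalg.svd(rng.standard_normal((n, d)), compute_uv=False)
    if lo <= s[-1] and s[0] <= hi:
        hits += 1

# Should be close to 1; Corollary 2 guarantees at least 1 - 2*exp(-t**2 / 2), about 0.978.
print(hits / trials)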

Corollary 3

Let \(\varvec{X}_i\), for \(i=1,\dots ,n\), be independent d-dimensional random vectors sampled from \(N(0,\varvec{\varSigma })\). Denoting \({\widehat{\varvec{\varSigma }}}:= n^{-1}\sum _{i=1}^n\varvec{X}_i\varvec{X}_i^\top \), we have that

$$\begin{aligned} \lambda _{\min }\left( {\widehat{\varvec{\varSigma }}}_{ }-\varvec{\varSigma }\right)&\ge \lambda _{\min }(\varvec{\varSigma })\left( \frac{d}{n} + \frac{2t\sqrt{d}}{n} + \frac{t^2}{n} - \frac{2(\sqrt{d} + t)}{\sqrt{n}} \right) ,\\ \lambda _{\max }\left( {\widehat{\varvec{\varSigma }}}_{ }-\varvec{\varSigma }\right)&\le \lambda _{\max }(\varvec{\varSigma })\left( \frac{d}{n} + \frac{2t\sqrt{d}}{n} + \frac{t^2}{n} + \frac{2(\sqrt{d} + t)}{\sqrt{n}} \right) , \end{aligned}$$

with probability at least \(1-2\exp (-t^2/2)\).

Proof

This follows directly from Corollary 2. \(\square \)
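
To spell out the reduction (the standard whitening argument): writing \(\varvec{X}_i = \varvec{\varSigma }^{1/2}{\mathbf {g}}_i\) with \({\mathbf {g}}_i \sim N(0,{\mathbf {I}})\) independent, and letting \({\mathbf {G}}\) denote the \(n\times d\) matrix with rows \({\mathbf {g}}_i^T\), we have

$$\begin{aligned} {\widehat{\varvec{\varSigma }}} - \varvec{\varSigma }= \varvec{\varSigma }^{1/2}\left( \frac{1}{n}{\mathbf {G}}^T{\mathbf {G}}- {\mathbf {I}}\right) \varvec{\varSigma }^{1/2}. \end{aligned}$$

On the event of Corollary 2 (assuming \(\sqrt{n} \ge \sqrt{d} + t\)), the eigenvalues of \(\frac{1}{n}{\mathbf {G}}^T{\mathbf {G}}\) lie between \(\frac{1}{n}(\sqrt{n}-\sqrt{d}-t)^2\) and \(\frac{1}{n}(\sqrt{n}+\sqrt{d}+t)^2\); expanding these squares and weighting by the extreme eigenvalues of \(\varvec{\varSigma }\) gives the displayed bounds.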

Appendix C: Numerical results for unknown K

In this section we consider exactly the same setup as in Sect. 6, but now we solve (22) using FORCE. For K unknown we do not have a readily available baseline comparison in high dimensions (prior work considers an ADMM-based approach for (2), but not for (22)), so we present only the results from FORCE and FORCE-P.

The main differences between the results displayed in Tables 3 and 4 and those in Sect. 6 are that (1) SDPNAL+ reached its iteration limit in some cases, leading to a higher average error than for K known, and (2) for the setting \((d,K,\rho ,\gamma ) = (500,100,0.3,3.0)\) the SNR was too low and FORCE did not converge. Otherwise, the results shown below are very similar to those when K is known, and we refer the reader to the discussion in Sect. 6. The fact that we do not observe any significant difference in empirical performance between K fixed and K unknown (indeed, the K unknown case appears to take longer in practice, as intuition suggests) indicates that Theorem 1 either may not be tight or that instances on which it achieves the worst-case bound are encountered with low probability.

Table 3 Benchmark results for low dimensional designs comparing FORCE and FORCE-P with MOSEK and SDPNAL+ with unknown K
Table 4 Benchmark results for high dimensional designs comparing FORCE and FORCE-P for K unknown


Cite this article

Eisenach, C., Liu, H. Efficient, certifiably optimal clustering with applications to latent variable graphical models. Math. Program. 176, 137–173 (2019). https://doi.org/10.1007/s10107-019-01375-2

