Abstract
Motivated by the task of clustering either d variables or d points into K groups, we investigate efficient algorithms to solve the Peng–Wei (P–W) K-means semi-definite programming (SDP) relaxation. The P–W SDP has been shown in the literature to have good statistical properties in a variety of settings, but remains intractable to solve in practice. To this end, we propose FORCE, a new algorithm to solve this SDP relaxation. Compared to off-the-shelf interior point solvers, our method reduces the computational complexity of solving the SDP from \({\widetilde{{\mathcal {O}}}}(d^7\log \epsilon ^{-1})\) to \({\widetilde{{\mathcal {O}}}}(d^{6}K^{-2}\epsilon ^{-1})\) arithmetic operations for an \(\epsilon \)-optimal solution. Our method combines a primal first-order method with a dual optimality certificate search, which, when successful, allows for early termination of the primal method. We show for certain variable clustering problems that, with high probability, FORCE is guaranteed to find the optimal solution to the SDP relaxation and provide a certificate of exact optimality. As verified by our numerical experiments, this allows FORCE to solve the P–W SDP with dimensions in the hundreds in only tens of seconds. For a variation of the P–W SDP where K is not known a priori, a slight modification of FORCE reduces the computational complexity of solving this problem as well: from \({\widetilde{{\mathcal {O}}}}(d^7\log \epsilon ^{-1})\) using a standard SDP solver to \({\widetilde{{\mathcal {O}}}}(d^{4}\epsilon ^{-1})\).
Notes
If an event occurs with probability q(d) for dimension d, it is said to occur with high probability if \(q(d)\ge 1 - C/d\) for all d sufficiently large.
Note the switch to \(\delta \), with which we denote an additive error; \(\epsilon \), used to quantify the error of FORCE, more properly corresponds to a type of relative additive error (Sect. 3).
The notation \({{\widetilde{{\mathcal {O}}}}}\) is used to suppress poly-log factors of d.
In this context, \(\epsilon \) is a multiplicative error.
We also implemented an MMW algorithm for the P–W SDP, but found that it did not converge in practice; we suspect this is due to the presence of the \(d^2\) equality constraints, which are not satisfied at each iteration of MMW, but we did not investigate this further.
Renegar [17] actually works in the setting \({\mathbf {F}}={\mathbf {I}}\); what we present here is a slightly modified version, and later we use the results of the correspondingly adjusted theoretical analysis.
The equalities are inexact because we make no assumptions on the mean of \({{\widehat{\varvec{\varGamma }}}}\), only its convergence rate.
The authors have made the code available on-line at http://bpames.people.ua.edu/software.html.
Similar results can be obtained for other graph structures, such as Band or Hub graphs.
References
Bunea, F., Giraud, C., Royer, M., Verzelen, N.: PECOK: a convex optimization approach to variable clustering. arXiv:1606.05100 (2016)
Dasgupta, S.: The hardness of k-means clustering. Tech. rep., University of California, San Diego (2008)
Mahajan, M., Nimbhorkar, P., Varadarajan, K.: The planar k-means problem is NP-hard. Theor. Comput. Sci. 442, 13–21 (2012). https://doi.org/10.1016/j.tcs.2010.05.034
Lloyd, S.P.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982). https://doi.org/10.1109/TIT.1982.1056489
Defays, D.: An efficient algorithm for a complete link method. Comput. J. 20(4), 364–366 (1977)
Kumar, A., Kannan, R.: Clustering with spectral norm and the k-means algorithm. In: FOCS. arXiv:1004.1823v1 (2010)
Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: SODA (2007)
Peng, J., Wei, Y.: Approximating K-means-type clustering via semidefinite programming. SIAM J. Optim. 18(1), 186–205 (2007). https://doi.org/10.1137/050641983
Vazirani, V.: Approximation Algorithms. Springer, Berlin (2001)
Awasthi, P., Bandeira, A.S.: Relax, no need to round: integrality of clustering formulations. In: ITCS, p. 27, https://doi.org/10.1145/2688073.2688116. arXiv:1408.4045 (2015)
Iguchi, T., Mixon, D.G., Peterson, J., Villar, S.: Probably certifiably correct k-means clustering. Math. Program. pp. 1–29. https://doi.org/10.1007/s10107-016-1097-0. arXiv:1509.07983 (2016)
Bunea, F., Giraud, C., Luo, X., Royer, M., Verzelen, N.: Model assisted variable clustering: minimax-optimal recovery and algorithms. arXiv:1508.01939 (2018)
Bunea, F., Ning, Y., Wegkamp, M.: Overlapping variable clustering with statistical guarantees. arXiv:1704.06977v1 (2017)
Bandeira, A.S.: A note on probably certifiably correct algorithms. arXiv:1509.00824v1 (2015)
Ames, B.P.W.: Guaranteed clustering and biclustering via semidefinite programming. Math. Program. Ser. A 147(1–2), 429–465 (2014). https://doi.org/10.1007/s10107-013-0729-x. arXiv:1202.3663
Iguchi, T., Mixon, D.G., Peterson, J., Villar, S.: On the tightness of an SDP relaxation of k-means. arXiv:1505.04778 (2015)
Renegar, J.: Efficient first-order methods for linear programming and semidefinite programming. arXiv:1409.5832 (2014)
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
Arora, S., Hazan, E., Kale, S.: Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. In: FOCS (2005)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011). https://doi.org/10.1561/2200000016. arXiv:1408.2927
Awasthi, P., Sheffet, O.: Improved spectral-norm bounds for clustering. In: Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pp. 37–49. arXiv:1206.3204v2 (2012)
Li, X., Chen, Y., Xu, J.: Convex relaxation methods for community detection. arXiv:1810.00315 (2018)
Abbe, E., Bandeira, A.S., Hall, G.: Exact recovery in the stochastic block model. IEEE Trans. Inf. Theory 62(1) (2016)
Pirinen, A., Ames, B.: Clustering of sparse and approximately sparse graphs by semidefinite programming. arXiv:1603.05296 (2016)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer US (2004). https://doi.org/10.1007/978-1-4419-8853-9
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. Ser. A 103, 127–152 (2005). https://doi.org/10.1007/s10107-004-0552-5
Nesterov, Y.: Smoothing technique and its applications in semidefinite optimization. Math. Program. 110(2), 245–259 (2007)
Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015). https://doi.org/10.1561/2200000050. arXiv:1405.4980v2
Ketchen, D., Shook, C.: The application of cluster analysis in strategic management research: an analysis and critique. Strategic Manag. J. 17(6), 441–458 (1996)
Goutte, C., Toft, P., Rostrup, E., Nielsen, F.Å., Hansen, L.K.: On clustering fMRI time series. NeuroImage 9(3), 298–310 (1999). https://doi.org/10.1006/NIMG.1998.0391
O’Donoghue, B., Candès, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15(3), 715–732 (2015). https://doi.org/10.1007/s10208-013-9150-3. arXiv:1204.3982
Andersen, E.D., Andersen, K.D.: The Mosek interior point optimizer for linear programming: an implementation of the homogeneous algorithm. In: High Performance Optimization, Springer, pp. 197–232. https://doi.org/10.1007/978-1-4757-3216-0_8 (2000)
Sun, D., Toh, K.C., Yuan, Y., Zhao, X.Y.: SDPNAL+: A Matlab software for semidefinite programming with bound constraints (version 1.0). arXiv:1710.10604 (2017)
Rudelson, M., Vershynin, R.: Hanson-Wright inequality and sub-Gaussian concentration. arXiv:1306.2872 (2013)
Vershynin, R.: Introduction to the non-asymptotic analysis of random matrices. https://doi.org/10.1017/CBO9780511794308.006. arXiv:1011.3027 (2011)
Appendices
Appendix A: Proofs omitted in Sect. 3.2
First we have a lemma regarding the concentration of the noise terms \({\mathbf {E}}\) about their mean. Sometimes rather than state these concentration results in terms of d, we state them in terms of \(t \ge d\) to allow for more precise control of constants in our main theorems. We let \({\mathcal {E}}\) denote the event that \(||{{\widehat{\varvec{\varGamma }}}} - \varvec{\varGamma }^*||_{\infty } \le p_1||\varvec{\varGamma }^*||_{\max }\sqrt{\frac{\log d}{n}}\).
Lemma 9
Under the notation and assumptions from previous sections, if \(t\ge d\) then
with probability at least \(1-\frac{2}{t}\), where \(c_0 = c'(1+\sqrt{p_0})\) is a constant that depends only on \(p_0\) and the absolute constant \(c'\) from Proposition 11. Similarly with probability at least \(1-\frac{2}{t}\), for \(a \in G^*_{i}\),
Proof
To obtain the result, we observe that
is a quadratic form of an \(n|G^*_{i}|\)-dimensional Gaussian random vector with independent entries. In particular, if we define \({\mathbf {M}}\) to be block diagonal with the ith \(n\times n\) diagonal block as \((\varvec{\varGamma }^*_{G^*_{i},G^*_{i}})^{1/2}{\mathbf {1}}{\mathbf {1}}^T(\varvec{\varGamma }^*_{G^*_{i},G^*_{i}})^{1/2}\), then we can apply Corollary 1 with matrix \({\mathbf {M}}\). Because \(||{\mathbf {M}}||_2 \le ||\varvec{\varGamma }^*||_{\infty }|G^*_{i}|\) and \(||{\mathbf {M}}||_F \le ||\varvec{\varGamma }^*||_{\infty }|G^*_{i}|\sqrt{n}\), applying the corollary gives
with probability at least \(1-\frac{2}{t}\). Using the assumption \(\log d \le p_0 n\) gives the desired result. The proof of the second statement follows similarly, taking instead the diagonal blocks of \({\mathbf {M}}\) as \((\varvec{\varGamma }^*_{G^*_{i},G^*_{i}})^{1/2}{\mathbf {1}}{\mathbf {e}}_a^T(\varvec{\varGamma }^*_{G^*_{i},G^*_{i}})^{1/2}\), giving \(||{\mathbf {M}}||_2 \le ||\varvec{\varGamma }^*||_{\infty }\sqrt{|G^*_{i}|}\) and \(||{\mathbf {M}}||_F \le ||\varvec{\varGamma }^*||_{\infty }\sqrt{n|G^*_{i}|}\). \(\square \)
1.1 Proof of Lemma 6
Step 1 For notation, \(c_i\) will be used to denote absolute constants. The first step is to decompose \({\mathbf {Q}}_{i}^{\perp }(\varvec{X})\). Recall that under the G-Latent model, \({\mathbf {D}}= -{{\widehat{\varvec{\varSigma }}}} + {{\widehat{\varvec{\varGamma }}}}\). Substituting that into the expression for \({\mathbf {Q}}_{i}^{\perp }(\varvec{X})\) gives
For (i), we recall that, by the definition of the G-Latent model,
Plugging this into (i) and simplifying gives us that
Now we see that, again, the expression for \({\mathbf {Q}}_{i}^{\perp }(\varvec{X})\) has eight terms. We first show that each concentrates to its mean at the desired rate, and then use the triangle inequality to obtain the final result. Fortunately, we can subtract the mean of each of the eight terms from the expression for \({\mathbf {Q}}_{i}^{\perp }(\varvec{X})\), as the means for (i) are offset by the means for (ii). To give the new decomposition of \({\mathbf {Q}}_{i}^{\perp }(\varvec{X})\) explicitly,
Step 2 For the term (i).a, we can directly apply Lemma 9. Doing so, it follows immediately that with probability at least \(1-\frac{2}{t}\)
For the term (i).c (and so, by symmetry, (i).b), we observe that it has the form \({\mathbf {u}}{\mathbf {v}}^T\) and that \(||{\mathbf {u}}{\mathbf {v}}^T||_2 = ||{\mathbf {u}}||_2||{\mathbf {v}}||_2\). Therefore, we can apply Lemma 9 and obtain that, with probability at least \(1-2|G^*_{i}|/t^2\),
Step 3 Now we control the term (i).d, the sample covariance matrix of the errors. We can directly apply Corollary 3 to obtain that with probability at least \(1-2/t\)
Step 4 For the terms in (ii), consider first (ii).a. We see that
Conditional on event \({\mathcal {E}}\),
Because the matrices above are a multiple of \({\mathbf {1}}{\mathbf {1}}^T\), it follows that
Next for (ii).b (and (ii).c by symmetry), we can see that
Because \({\widehat{\varvec{\varGamma }}}\) and \(\varvec{\varGamma }^*\) are diagonal, we can use event \({\mathcal {E}}\) and the fact that for matrices of the form \({\mathbf {u}}{\mathbf {v}}^T\), \(||{\mathbf {u}}{\mathbf {v}}^T||_2 = ||{\mathbf {u}}||_2||{\mathbf {v}}||_2\), to obtain
The same result is immediate for (ii).a by (4). Therefore by combining the above, applying the triangle inequality to (30), using that \({\mathcal {E}}\) occurs with probability at least \(1-p_2/d^2\), and choosing \(t=d^2\), we find that with probability at least \(1-\frac{c_2}{d^2}\)
concluding the proof. \(\square \)
1.2 Proof of Lemma 7
Under the G-Latent model,
Above, we saw that
and likewise for \(y_b\). Below we denote by \(\sigma _1 = \max _i C_{i,i}^*\) and \(\sigma _2 = \max \{\max _i C_{i,i}^*,||\varvec{\varGamma }^*||_{\infty }\}\). Following the same decomposition as in Lemma 6, we get that
As in the proof of Lemma 6, the means of (ii).b and (ii).c offset the means of (ii).e and (ii).f. To control terms (ii).b and (ii).c, by Lemma 9 with probability at least \(1-1/t\),
Likewise, by Lemma 9,
with probability at least \(1-1/t\). Conditional on event \({\mathcal {E}}\), (4) shows that
Lastly, term (ii).d can be bounded using Corollary 1, which gives that
with probability at least \(1-1/t\). The same results can be obtained for \(y_b\). For the terms in (i), we expand as before:
Terms (i).b and (i).c can be bounded in the same way as (32). Term (i).d can be bounded by Corollary 1, giving that
with probability at least \(1-1/t\). All that remains is to bound the terms (i).a, (ii).a and (iii).a. Fortunately, these correspond to the population quantity \(\varDelta {\mathbf {C}}_{ }^*\). Observing that this is just a quadratic form of a 2n-dimensional Gaussian vector, we can apply Lemma 9. Doing so gives that
with probability at least \(1-1/t\). Combining all the bounds for (i)-(iii), using that \({\mathcal {E}}\) occurs with probability at least \(1-p_2/d^3\), and selecting \(t=d^3\), we can see that, with probability at least \(1-c_1/d^3\)
\(\square \)
Appendix B: Some technical lemmas
Lemma 10
Let \({\mathbf {M}}\) be a \(d\times d\) real, symmetric matrix of the form
where \(a,b \in {\mathbb {R}}\). Then \({\mathbf {M}}\) has eigenvalues \(a+db\) with multiplicity 1 and \(a\) with multiplicity \(d-1\). If \(a,b > 0\), then \({\mathbf {M}}\) also has the property that
Proof
Using the Sherman-Morrison formula, a matrix of the form \({\mathbf {M}}= a{\mathbf {I}}+ b{\mathbf {1}}{\mathbf {1}}^T\), where \(a,b > 0\) has the inverse
Because \({\mathbf {M}}\succ 0\), all eigenvalues are strictly positive; denote by \(\lambda _i\) and \({\mathbf {q}}_i\) the eigenvalues and corresponding eigenvectors. Without loss of generality, let the \({\mathbf {q}}_i\) be orthonormal. Then we can write \({\mathbf {M}}= \sum _i \lambda _i {\mathbf {q}}_i{\mathbf {q}}_i^T\). By the form of \({\mathbf {M}}\), clearly \(\frac{1}{\sqrt{d}}{\mathbf {1}}\) is always an eigenvector of \({\mathbf {M}}\) with eigenvalue \(a+db\), so we can take \({\mathbf {q}}_1 = \frac{1}{\sqrt{d}}{\mathbf {1}}\) and \(\lambda _1 = a+db\). The remaining \({\mathbf {q}}_i\) span \(({\mathbf {1}}{\mathbf {1}}^T)^{\perp }\) and have corresponding eigenvalues \(\lambda _i = a\). Therefore,
Because this eigen-decomposition is unique, the above gives
Using the expression for \({\mathbf {M}}^{-1}\) given above, it follows that
\(\square \)
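The claims of Lemma 10 can be verified numerically. The sketch below, which assumes the matrix in the lemma statement has the form \({\mathbf {M}}= a{\mathbf {I}}+ b{\mathbf {1}}{\mathbf {1}}^T\) as in the proof, checks both the eigenvalue structure and the Sherman-Morrison expression for the inverse (the dimension and the values of a and b are arbitrary choices for illustration):

```python
import numpy as np

# Sanity check for Lemma 10, assuming M = a*I + b*(1 1^T) as in the proof.
d, a, b = 5, 2.0, 0.5
ones = np.ones((d, 1))
M = a * np.eye(d) + b * (ones @ ones.T)

# Eigenvalues: a + d*b with multiplicity 1, and a with multiplicity d - 1.
eigs = np.sort(np.linalg.eigvalsh(M))
assert np.allclose(eigs[:-1], a)          # d - 1 copies of a
assert np.isclose(eigs[-1], a + d * b)    # single eigenvalue a + d*b

# Sherman-Morrison: M^{-1} = (1/a) I - (b / (a (a + d b))) 1 1^T.
M_inv = (1.0 / a) * np.eye(d) - (b / (a * (a + d * b))) * (ones @ ones.T)
assert np.allclose(M_inv, np.linalg.inv(M))
```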
The following result for quadratic forms of standard multivariate Gaussian random variables can be found in many forms in the literature (for example, Rudelson and Vershynin [34]).
Lemma 11
(Hanson-Wright inequality for Gaussian random variables) Let \(\varvec{X}\sim N(0,{\mathbf {I}})\) be a d-dimensional random vector and let \({\mathbf {A}}\in {\mathbb {R}}^{d \times d}\). Then
for some absolute constant c.
In particular, the following corollary is useful.
Corollary 1
Let \(\varvec{X}\sim N(0,{\mathbf {I}})\) be a d-dimensional random vector and let \({\mathbf {A}}\in {\mathbb {R}}^{d \times d}\). Then
for some absolute constant c. Equivalently,
with probability at least \(1-2/t\) for some absolute constant \(c'\).
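The qualitative content of Corollary 1 — that the quadratic form \(\varvec{X}^T{\mathbf {A}}\varvec{X}\) concentrates around its mean \({{\,\mathrm{tr}\,}}({\mathbf {A}})\) with fluctuations on the order of \(||{\mathbf {A}}||_F\) — can be illustrated by simulation. The sketch below is an empirical illustration only (the dimensions, trial count, and random \({\mathbf {A}}\) are arbitrary choices); it does not verify the constants in the bound:

```python
import numpy as np

# Empirical illustration of Corollary 1 (Hanson-Wright for Gaussians):
# for X ~ N(0, I_d), the quadratic form X^T A X has mean trace(A) and
# typical fluctuations on the order of the Frobenius norm ||A||_F.
rng = np.random.default_rng(0)
d, trials = 200, 2000
A = rng.standard_normal((d, d))

X = rng.standard_normal((trials, d))
quad = np.einsum('ti,ij,tj->t', X, A, X)  # X^T A X for each trial

mean_dev = np.abs(quad.mean() - np.trace(A))
fro = np.linalg.norm(A, 'fro')
# The empirical mean sits far closer to trace(A) than ||A||_F.
assert mean_dev < fro
# Single-draw fluctuations are on the scale of ||A||_F.
assert 0.1 * fro < quad.std() < 10 * fro
```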
Below we are concerned with the rate of concentration in the spectral norm of a sample covariance matrix to its mean: \(||{{\widehat{\varvec{\varSigma }}}} - \varvec{\varSigma }^*||_2\). If we write \({{\widehat{\varvec{\varSigma }}}} = \frac{1}{n}\varvec{X}^T\varvec{X}\), where \(\varvec{X}\) refers to the \(n\times d\) matrix in which the rows are the observations \(\varvec{X}_i\), we see how such a result is directly applicable to the problem at hand. We repeat the statement of Gordon’s Theorem given in Vershynin [35] below as Proposition 1. We use the notation from Vershynin [35] of \(s_{\min }\) and \(s_{\max }\) to denote the smallest and largest singular values, respectively.
Proposition 1
Let \(\varvec{X}\) be an \(n \times d\) matrix whose entries are independent standard normal random variables. Then
Using the result on sub-Gaussian concentration of a Lipschitz function of independent random variables, we immediately obtain the following corollary (also given in Vershynin [35]).
Corollary 2
Let \(\varvec{X}\) be an \(n \times d\) matrix whose entries are independent standard normal random variables, then for every \(t \ge 0\)
with probability at least \(1-2\exp (-t^2/2)\).
Proof
Observing that the functions \(s_{\min }\) and \(s_{\max }\) are 1-Lipschitz and using the sub-Gaussian tail bound, the result is immediate from the above. \(\square \)
Corollary 3
Let \(\varvec{X}_i\), for \(i=1,\dots ,n\), be d-dimensional random vectors sampled independently from \(N(0,\varvec{\varSigma })\). Denoting \({\widehat{\varvec{\varSigma }}}_{ }:= n^{-1}\sum _{i=1}^n\varvec{X}_i\varvec{X}_i^\top \), we have that
with probability at least \(1-2\exp (-t^2/2)\).
Proof
This follows directly from Corollary 2. \(\square \)
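The rate behind Corollary 3 can also be observed empirically: the spectral-norm error of the sample covariance shrinks roughly like \(\sqrt{d/n}\) as the sample size grows. The sketch below is an illustration under an identity population covariance (an arbitrary simplifying choice), not a verification of the stated constants:

```python
import numpy as np

# Empirical check of the spectral-norm concentration behind Corollary 3:
# for X_i ~ N(0, Sigma), ||Sigma_hat - Sigma||_2 decays roughly like
# sqrt(d/n) (up to constants), consistent with Gordon-type bounds.
rng = np.random.default_rng(1)
d = 50
Sigma = np.eye(d)  # identity covariance for a simple illustration

def spec_err(n):
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    Sigma_hat = X.T @ X / n
    return np.linalg.norm(Sigma_hat - Sigma, 2)  # largest singular value

# The error decreases as n grows, consistent with the sqrt(d/n) rate.
errs = [spec_err(n) for n in (200, 2000, 20000)]
assert errs[0] > errs[-1]
assert errs[-1] < 0.5
```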
Appendix C: Numerical results for unknown K
In this section we consider the exact same setup as in Sect. 6, but now we solve (22) using FORCE. For K unknown we do not have a readily available baseline comparison in high dimensions—prior work considers an ADMM based approach for (2), but not (22)—so we present only the results from FORCE and FORCE-P.
The main differences between the results displayed in Tables 3 and 4 and those in Sect. 6 are that (1) SDPNAL reached its iteration limit in some cases, leading to a higher average error than for K known, and (2) for the setting \((d,k,\rho ,\gamma ) = (500,100,0.3,3.0)\) the SNR was too low and FORCE did not converge. Otherwise, the results shown below are very similar to those when K is known, and we refer the reader to the discussion in Sect. 6. The fact that we do not observe any significant difference in empirical performance between K fixed and K unknown—indeed, as intuition suggests, K unknown appears to take longer in practice—indicates that Theorem 1 either may not be tight or that instances on which it achieves the worst-case bound are encountered with low probability.
Cite this article
Eisenach, C., Liu, H. Efficient, certifiably optimal clustering with applications to latent variable graphical models. Math. Program. 176, 137–173 (2019). https://doi.org/10.1007/s10107-019-01375-2