A framework for statistical clustering with constant time approximation algorithms for K-median and K-means clustering

Ben-David, Shai

doi:10.1007/s10994-006-0587-3

Published: 30 November 2006

Machine Learning, volume 66, pages 243–257 (2007)

Shai Ben-David¹

Abstract

We consider a framework of sample-based clustering. In this setting, the input to a clustering algorithm is a sample generated i.i.d. by some unknown, arbitrary distribution. Based on such a sample, the algorithm has to output a clustering of the full domain set, which is evaluated with respect to the underlying distribution. We provide general conditions on clustering problems that imply the existence of sampling-based clustering algorithms that approximate the optimal clustering. We show that K-median clustering, as well as the K-means and Vector Quantization problems, satisfy these conditions. Our results apply to the combinatorial optimization setting where, assuming that sampling uniformly over an input set can be done in constant time, we get a sampling-based algorithm for the K-median and K-means clustering problems that finds an almost optimal set of centers in time depending only on the confidence and accuracy parameters of the approximation, but independent of the input size. Furthermore, in the Euclidean input case, the running time of our algorithm depends only linearly on the Euclidean dimension. Our main technical tool is a uniform convergence result for center-based clustering that can be viewed as showing that the effective VC-dimension of k-center clustering equals k.
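The sample-based idea described in the abstract can be illustrated with a minimal sketch. This is not the algorithm analysed in the article: it assumes one-dimensional data with absolute distance, restricts candidate centers to the sampled points, and searches them exhaustively. The function names and the sample size are hypothetical choices for illustration; the article's actual bounds on sample size in terms of accuracy and confidence are not reproduced here.

```python
import itertools
import random


def k_median_cost(points, centers):
    """Average distance from each point to its nearest center (1-D, absolute distance)."""
    return sum(min(abs(x - c) for c in centers) for x in points) / len(points)


def sample_based_k_median(data, k, sample_size, rng):
    """Choose k centers by exhaustive search over a uniform sample of the data.

    Assumption for this sketch: sample_size is fixed by the caller and does not
    grow with len(data), so the search cost depends on sample_size and k only.
    """
    # Draw i.i.d. uniform samples (with replacement) from the input set.
    sample = [rng.choice(data) for _ in range(sample_size)]
    candidates = sorted(set(sample))
    if len(candidates) <= k:
        return tuple(candidates)
    # Minimise the empirical K-median cost measured on the sample only.
    return min(
        itertools.combinations(candidates, k),
        key=lambda centers: k_median_cost(sample, centers),
    )


# Illustrative input: two tight groups of points, near 0 and near 100.
data = [i * 0.002 for i in range(500)] + [100 + i * 0.002 for i in range(500)]
centers = sample_based_k_median(data, k=2, sample_size=30, rng=random.Random(0))
full_cost = k_median_cost(data, centers)  # cost of the sampled solution on all data
```

The centers are selected using the sample alone; the full data set is touched only to evaluate the result, which mirrors the setting in which a clustering is learned from a sample and judged against the underlying distribution.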

References

Anthony, M., & Bartlett, P. L. (1999). Neural network learning: Theoretical foundations. Cambridge University Press.
Bartlett, P., Linder, T., & Lugosi, G. (1998). The minimax distortion redundancy in empirical quantizer design. IEEE Transactions on Information Theory, 44, 1802–1813.
Ben-David, S. (2004). A framework for statistical clustering with a constant time approximation algorithm for K-median clustering. In Proceedings of the 17th Annual Conference on Learning Theory, COLT’04, Springer.
Buhmann, J. (1998). Empirical risk approximation: An induction principle for unsupervised learning. Technical Report IAI-TR-98-3, Institut für Informatik III, Universität Bonn.
Czumaj, A., & Sohler, C. (2004). Sublinear-time approximation for clustering via random samples. In Proceedings of the 31st International Colloquium on Automata, Languages and Programming (ICALP’04), LNCS 3142:396–407.
Mettu, R. R., & Plaxton, C. G. (2004). Optimal time bounds for approximate clustering. Machine Learning, 56, 35–60.
Meyerson, A., O’Callaghan, L., & Plotkin, S. (2004). A k-median algorithm with running time independent of data size. Journal of Machine Learning, Special Issue on Theoretical Advances in Data Clustering (MLJ).
Mishra, N., Oblinger, D., & Pitt, L. (2001). Sublinear time approximate clustering. In Proceedings of Symposium on Discrete Algorithms, SODA (pp. 439–447).
Pollard, D. (1982). Quantization and the method of k-means. IEEE Transactions on Information Theory, 28, 199–205.
Smola, A. J., Mika, S., & Schölkopf, B. (1998). Quantization functionals and regularized principal manifolds. NeuroCOLT Technical Report Series NC2-TR-1998-028.
de la Vega, F., Karpinski, M., Kenyon, C., & Rabani, Y. (2003). Approximation schemes for clustering problems. In Proceedings of Symposium on Theory of Computing, STOC’03.

Author information

Authors and Affiliations

School of Computer Science, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada
Shai Ben-David

Corresponding author

Correspondence to Shai Ben-David.

Additional information

Editors: Olivier Bousquet and André Elisseeff

A preliminary version of this work appeared in the proceedings of COLT’04 (Ben-David, 2004).

This work is supported in part by the Multidisciplinary University Research Initiative (MURI) under the Office of Naval Research Contract N00014-00-1-0564.

About this article

Cite this article

Ben-David, S. A framework for statistical clustering with constant time approximation algorithms for K-median and K-means clustering. Mach Learn 66, 243–257 (2007). https://doi.org/10.1007/s10994-006-0587-3

Received: 18 July 2005
Revised: 15 June 2006
Accepted: 14 August 2006
Published: 30 November 2006
Issue Date: March 2007
DOI: https://doi.org/10.1007/s10994-006-0587-3
