Faster Algorithms for the Constrained k-means Problem

Theory of Computing Systems

Abstract

Classical center-based clustering problems such as k-means, k-median, and k-center assume that the optimal clusters satisfy a locality property: points in the same cluster are close to each other. A number of clustering problems arising in machine learning have optimal clusters that do not follow such a locality property. For instance, the r-gather clustering problem adds the constraint that each cluster must contain at least r points, and the capacitated clustering problem places an upper bound on the cluster sizes. Consider a variant of the k-means problem that may be regarded as a general version of such problems. Here, the optimal clusters \(O_{1}, ..., O_{k}\) are an arbitrary partition of the dataset, and the goal is to output k-centers \(c_{1}, ..., c_{k}\) such that the objective function \({\sum }_{i = 1}^{k} {\sum }_{x \in O_{i}} ||x - c_{i}||^{2}\) is minimized. It is not difficult to argue that any algorithm that, without knowing the optimal clusters, outputs a single set of k centers will not perform well with respect to this objective function. However, this does not rule out the existence of algorithms that output a list of such k-centers, at least one of which performs well. Given an error parameter ε > 0, let \(\ell\) denote the size of the smallest list of k-centers such that at least one of the k-centers in the list gives a (1 + ε)-approximation with respect to the objective function above. In this paper, we show an upper bound on \(\ell\) by giving a randomized algorithm that outputs a list of \(2^{\tilde {O}(k/\varepsilon )}\) k-centers. We also give a closely matching lower bound of \(2^{\tilde {\Omega }(k/\sqrt {\varepsilon })}\). Moreover, our algorithm runs in time \(O \left (n d \cdot 2^{\tilde {O}(k/\varepsilon )} \right )\).
This is a significant improvement over the previous result of Ding and Xu (2015), who gave an algorithm with running time \(O \left (n d \cdot (\log n)^{k} \cdot 2^{\mathrm{poly}(k/\varepsilon )} \right )\) that outputs a list of size \(O \left ((\log n)^{k} \cdot 2^{\mathrm{poly}(k/\varepsilon )} \right )\). Our techniques generalize to the k-median problem and to many other settings where non-Euclidean distance measures are involved.
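
To make the objective concrete, the following is a minimal illustrative sketch in Python, not the algorithm of the paper. It evaluates the cost \({\sum }_{i} {\sum }_{x \in O_{i}} ||x - c_{i}||^{2}\) for a fixed partition and picks the cheapest entry from a list of candidate k-centers. The function names (`constrained_cost`, `centroid`, `best_in_list`) and the toy data are assumptions made for illustration; in the setting of the abstract the optimal partition is not known to the algorithm, so fixing it here only serves to show how a list of candidates would be scored.

```python
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(x: Point, c: Point) -> float:
    """Squared Euclidean distance ||x - c||^2."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def constrained_cost(clusters: List[List[Point]], centers: List[Point]) -> float:
    """Sum over clusters O_i and points x in O_i of ||x - c_i||^2.

    The partition is given explicitly; points are NOT reassigned to
    their nearest center, which is what distinguishes this objective
    from the classical k-means cost.
    """
    return sum(
        sq_dist(x, c)
        for cluster, c in zip(clusters, centers)
        for x in cluster
    )


def centroid(cluster: List[Point]) -> List[float]:
    """Mean of a non-empty cluster; it minimizes that cluster's cost."""
    dim = len(cluster[0])
    return [sum(x[j] for x in cluster) / len(cluster) for j in range(dim)]


def best_in_list(clusters: List[List[Point]],
                 candidates: List[List[Point]]) -> List[Point]:
    """Return the candidate k-centers with the smallest cost (hypothetical helper)."""
    return min(candidates, key=lambda centers: constrained_cost(clusters, centers))


# Toy partition without locality: the two clusters interleave on a line.
clusters = [[(0.0, 0.0), (4.0, 0.0)], [(1.0, 0.0), (3.0, 0.0)]]

# A hypothetical list of candidate k-centers (k = 2).
candidates = [
    [(2.0, 0.0), (2.0, 0.0)],  # both centers at the shared centroid
    [(0.0, 0.0), (4.0, 0.0)],  # centers at the two extreme points
]

best = best_in_list(clusters, candidates)
print(best, constrained_cost(clusters, best))
```

In this toy instance both clusters have the same centroid, so no assignment of points to nearest centers can recover the partition, yet the first candidate attains the minimum cost for it.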

Notes

  1. Ding and Xu [5] also gave a discussion on such partition algorithms for a number of clustering problems with side constraints.

  2. For any real numbers \(a_{1}, ..., a_{m}\), we have \(({\sum }_{r} a_{r})^{2}/m \leq {\sum }_{r} {a_{r}^{2}}\).

  3. Please see [9] for a discussion on such distance measures. This work shows how to extend such \(D^{2}\)-sampling based analysis to settings involving such distance measures.
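
The inequality stated in Note 2 is a standard consequence of the Cauchy-Schwarz inequality. A short derivation, supplied here for convenience and using only the symbols of that note, is:

```latex
\[
\Bigl(\sum_{r=1}^{m} a_{r}\Bigr)^{2}
  = \Bigl(\sum_{r=1}^{m} 1 \cdot a_{r}\Bigr)^{2}
  \le \Bigl(\sum_{r=1}^{m} 1^{2}\Bigr)\Bigl(\sum_{r=1}^{m} a_{r}^{2}\Bigr)
  = m \sum_{r=1}^{m} a_{r}^{2}.
\]
```

Dividing both sides by m gives the stated bound. For example, with m = 2 and \(a_{1} = 1, a_{2} = 3\), the left side is \(4^{2}/2 = 8\) and the right side is \(1 + 9 = 10\).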

References

  1. Ackermann, M.R., Blömer, J., Sohler, C.: Clustering for metric and nonmetric distance measures. ACM Trans. Algorithms 6, 59:1–59:26 (2010)

  2. Bādoiu, M., Har-Peled, S., Indyk, P.: Approximate clustering via core-sets. In: Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, STOC ’02, pp. 250–257. ACM, New York (2002)

  3. Chen, K.: On k-median clustering in high dimensions. In: Proceedings of the Seventeenth annual ACM-SIAM Symposium on Discrete Algorithm, SODA ’06, pp. 1177–1185. ACM, New York (2006)

  4. de la Vega, W.F., Karpinski, M., Kenyon, C., Rabani, Y.: Approximation schemes for clustering problems. In: Proceedings of the Thirty-fifth Annual ACM Symposium on Theory of Computing, STOC ’03, pp. 50–58. ACM, New York (2003)

  5. Ding, H., Xu, J.: A unified framework for clustering constrained data without locality property. In: Proceedings of the Twenty-sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’15, pp. 1471–1490 (2015)

  6. Feldman, D., Monemizadeh, M., Sohler, C.: A PTAS for k-means clustering based on weak coresets. In: Proceedings of the Twenty-third Annual Symposium on Computational Geometry, SCG ’07, pp. 11–18. ACM, New York (2007)

  7. Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing, STOC ’04, pp. 291–300. ACM, New York (2004)

  8. Inaba, M., Katoh, N., Imai, H.: Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering (extended abstract). In: Proceedings of the Tenth Annual Symposium on Computational Geometry, SCG ’94, pp. 332–339. ACM, New York (1994)

  9. Jaiswal, R., Kumar, A., Sen, S.: A simple \(D^{2}\)-sampling based PTAS for k-means and other clustering problems. Algorithmica 70(1), 22–46 (2014)

  10. Jaiswal, R., Kumar, M., Yadav, P.: Improved analysis of \(D^{2}\)-sampling based PTAS for k-means and other clustering problems. Inf. Process. Lett. 115(2), 100–103 (2015)

  11. Kumar, A., Sabharwal, Y., Sen, S.: Linear-time approximation schemes for clustering problems in any dimensions. J. ACM 57(2), 5:1–5:32 (2010)

  12. Matoušek, J.: On approximate geometric k-clustering. Discret. Comput. Geom. 24(1), 61–84 (2000)

Acknowledgments

Ragesh Jaiswal acknowledges the support of ISF-UGC India-Israel joint research grant 2014.

Author information

Corresponding author

Correspondence to Ragesh Jaiswal.

Additional information

This article is part of the Topical Collection on Theoretical Aspects of Computer Science

The Õ notation hides an \( O(\log {\frac {k}{\varepsilon}}) \) factor.

About this article

Cite this article

Bhattacharya, A., Jaiswal, R. & Kumar, A. Faster Algorithms for the Constrained k-means Problem. Theory Comput Syst 62, 93–115 (2018). https://doi.org/10.1007/s00224-017-9820-7

  • DOI: https://doi.org/10.1007/s00224-017-9820-7
