Faster Coreset Construction for Projective Clustering via Low-Rank Approximation

Pratap, Rameshwar; Sen, Sandeep

doi:10.1007/978-3-319-94667-2_28

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10979)

Included in the following conference series:

International Workshop on Combinatorial Algorithms

Abstract

In this work, we present a randomized coreset construction for projective clustering, which involves computing a set of k closest j-dimensional linear (affine) subspaces of a given set of n vectors in d dimensions. Let \(A \in \mathbb {R}^{n\times d}\) be an input matrix. An earlier deterministic coreset construction of Feldman et al. [10] relied on computing the SVD of A. The best known algorithms for SVD require \(\min \{nd^2, n^2d\}\) time, which may not be feasible for large values of n and d. We present a coreset construction that projects the matrix A onto a set of orthonormal vectors that closely approximate the right singular vectors of A. As a consequence, when the values of k and j are small, we achieve a faster algorithm than that of [10] while maintaining almost the same approximation. Our approach also uses less space and exploits the sparsity of the input dataset. Another advantage is that the coreset can be constructed efficiently in a streaming setting.
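To make the clustering objective concrete, the following is a minimal sketch of how the cost of a candidate solution could be evaluated. It assumes the cost is the sum of squared Euclidean distances from each row of A to its nearest subspace, and that each subspace is linear and given by a d x j matrix with orthonormal columns; the abstract does not fix either choice, and the function name `projective_clustering_cost` is hypothetical.

```python
import numpy as np

def projective_clustering_cost(A, bases):
    """Sum over rows of A of the squared distance to the nearest subspace.

    Assumptions (not taken from the paper): squared Euclidean distance,
    and each entry of `bases` is a d x j array with orthonormal columns
    spanning one j-dimensional linear subspace.
    """
    sq_norms = np.sum(A * A, axis=1)
    # For an orthonormal basis B, dist(a, span(B))^2 = ||a||^2 - ||a B||^2.
    dists = np.stack(
        [sq_norms - np.sum((A @ B) ** 2, axis=1) for B in bases], axis=1
    )
    # Clip tiny negative values caused by floating-point round-off.
    nearest = np.clip(dists.min(axis=1), 0.0, None)
    return float(nearest.sum())

# Three points in d = 3 and k = 2 candidate lines (j = 1): the x- and y-axes.
A = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
x_axis = np.array([[1.0], [0.0], [0.0]])
y_axis = np.array([[0.0], [1.0], [0.0]])
cost = projective_clustering_cost(A, [x_axis, y_axis])
print(cost)  # 9.0: only the third point lies off both lines, at distance 3
```

A coreset for this problem is a smaller (possibly weighted) point set on which such a cost stays close to the cost on A for every choice of k subspaces.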

R. Pratap: This work was done when the author was affiliated with TCS Innovation Labs.

Notes

1. Here, \(\tilde{O}\) is the asymptotic notation that ignores logarithmic factors.
2. Two passes are required as we first multiply A on the right with a Johnson-Lindenstrauss matrix \(\mathcal {S}\), and then we project the rows of A again onto the row span of \(\mathcal {S}A\).
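
The two passes described in note 2 can be sketched as follows. This is an illustrative sketch only: it assumes a dense Gaussian matrix as the Johnson-Lindenstrauss matrix \(\mathcal {S}\) and a QR factorization to obtain an orthonormal basis of the row span of \(\mathcal {S}A\). The function name `two_pass_projection`, the sketch size `r`, and the seed are hypothetical and not taken from the paper.

```python
import numpy as np

def two_pass_projection(A, r, seed=0):
    """Project the rows of A onto the row span of S @ A.

    Assumption: S is an r x n matrix with i.i.d. N(0, 1/r) entries, one
    common choice of Johnson-Lindenstrauss matrix, and r <= d.
    Returns (coords, Q): Q is d x r with orthonormal columns and
    coords = A @ Q holds the coordinates of each projected row.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape

    # Pass 1: form the sketch S @ A (r x d). In a stream of rows a_i this
    # is the running sum of outer products S[:, i] * a_i.
    S = rng.standard_normal((r, n)) / np.sqrt(r)
    SA = S @ A

    # Orthonormal basis for the row span of S @ A via reduced QR of its
    # transpose.
    Q, _ = np.linalg.qr(SA.T)

    # Pass 2: project every row of A onto that span.
    coords = A @ Q
    return coords, Q

# Usage on an exactly rank-5 matrix: a sketch of size r = 10 captures the
# whole row space, so projecting loses nothing up to round-off.
rng = np.random.default_rng(1)
A_low = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 30))
coords, Q = two_pass_projection(A_low, r=10)
residual = np.linalg.norm(A_low - coords @ Q.T)
```

For a general matrix the residual is not zero; how close it comes to the best low-rank error depends on the sketch size, which this example does not attempt to tune.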

References

Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Approximating extent measures of points. J. ACM 51(4), 606–635 (2004)
Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Geometric approximation via coresets. In: Welzl, E., (ed.) Current Trends in Combinatorial and Computational Geometry (2007)
Boutsidis, C., Mahoney, M.W., Drineas, P.: An improved approximation algorithm for the column subset selection problem. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2009, New York, NY, USA, 4–6 January 2009, pp. 968–977 (2009)
Clarkson, K.L., Woodruff, D.P.: Low rank approximation and regression in input sparsity time. In: Symposium on Theory of Computing Conference, STOC 2013, Palo Alto, CA, USA, 1–4 June 2013, pp. 81–90 (2013)
Cohen, M.B., Elder, S., Musco, C., Musco, C., Persu, M.: Dimensionality reduction for k-means clustering and low rank approximation. In: Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, 14–17 June 2015, pp. 163–172 (2015)
Ester, M., Kriegel, H., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD 1996), Portland, Oregon, USA, pp. 226–231 (1996)
Feldman, D., Fiat, A., Sharir, M.: Coresets for weighted facilities and their applications. In: 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), 21–24 October 2006, Berkeley, California, USA, Proceedings, pp. 315–324 (2006)
Feldman, D., Langberg, M.: A unified framework for approximating and clustering data. In: Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, 6–8 June 2011, pp. 569–578 (2011)
Feldman, D., Monemizadeh, M., Sohler, C., Woodruff, D.P.: Coresets and sketches for high dimensional subspace approximation problems. In: Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 630–649 (2010)
Feldman, D., Schmidt, M., Sohler, C.: Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In: Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1434–1453 (2013)
Har-Peled, S.: No, coreset, no cry. In: Lodaya, K., Mahajan, M. (eds.) FSTTCS 2004: Foundations of Software Technology and Theoretical Computer Science. Lecture Notes in Computer Science, vol. 3328, pp. 324–335. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30538-5_27
Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, 13–16 June 2004, pp. 291–300 (2004)
Henzinger, M.R., Raghavan, P., Rajagopalan, S.: Computing on data streams. In: Proceedings of a DIMACS Workshop on External Memory Algorithms, New Brunswick, New Jersey, USA, 20–22 May 1998, pp. 107–118 (1998)
Hinneburg, A., Aggarwal, C.C., Keim, D.A.: What is the nearest neighbor in high dimensional spaces? In: VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, 10–14 September 2000, Cairo, Egypt, pp. 506–515 (2000)
Mahoney, M.W.: Randomized algorithms for matrices and data. Found. Trends Mach. Learn. 3(2), 123–224 (2011)
Phillips, J.M.: Coresets and sketches. CoRR, abs/1601.00617 (2016)
Sarlós, T.: Improved approximation algorithms for large matrices via random projections. In: 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), 21–24 October 2006, Berkeley, California, USA, Proceedings, pp. 143–152 (2006)
Varadarajan, K.R., Xiao, X.: A near-linear algorithm for projective clustering integer points. In: Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2012, Kyoto, Japan, 17–19 January 2012, pp. 1329–1342 (2012)
Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: A new data clustering algorithm and its applications. Data Min. Knowl. Discov. 1(2), 141–182 (1997)

Author information

Authors and Affiliations

Wipro Technologies, Bangalore, India
Rameshwar Pratap
IIT Delhi, New Delhi, India
Sandeep Sen

Corresponding author

Correspondence to Rameshwar Pratap.

Editor information

Editors and Affiliations

Department of Informatics, King’s College London, London, United Kingdom
Costas Iliopoulos
Department of Computer Science, School of Computing, National University of Singapore, Singapore, Singapore
Hon Wai Leong
Department of Computer Science, School of Computing, National University of Singapore, Singapore, Singapore
Wing-Kin Sung

Cite this paper

Pratap, R., Sen, S. (2018). Faster Coreset Construction for Projective Clustering via Low-Rank Approximation. In: Iliopoulos, C., Leong, H., Sung, WK. (eds) Combinatorial Algorithms. IWOCA 2018. Lecture Notes in Computer Science, vol 10979. Springer, Cham. https://doi.org/10.1007/978-3-319-94667-2_28

DOI: https://doi.org/10.1007/978-3-319-94667-2_28
Published: 04 July 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-94666-5
Online ISBN: 978-3-319-94667-2
eBook Packages: Computer Science, Computer Science (R0)