Faster Coreset Construction for Projective Clustering via Low-Rank Approximation

  • Conference paper
Combinatorial Algorithms (IWOCA 2018)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10979)

Abstract

In this work, we present a randomized coreset construction for projective clustering, the problem of finding a set of k j-dimensional linear (affine) subspaces that are closest to a given set of n vectors in d dimensions. Let \(A \in \mathbb {R}^{n\times d}\) be the input matrix. An earlier deterministic coreset construction of Feldman et al. [10] relied on computing the SVD of A. The best known algorithms for SVD require \(\min \{nd^2, n^2d\}\) time, which may not be feasible for large values of n and d. We instead construct the coreset by projecting the matrix A onto a set of orthonormal vectors that closely approximate the right singular vectors of A. As a consequence, when the values of k and j are small, we obtain a faster algorithm than that of [10] while maintaining almost the same approximation. Our approach also benefits in terms of space and exploits the sparsity of the input dataset. Another advantage is that the coreset can be constructed efficiently in a streaming setting.
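The projection step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' construction: the dense Gaussian sketching matrix, the sketch size `r`, the function name `approx_right_singular_vectors`, and the synthetic test matrix are all assumptions made for illustration. The sketch computes m orthonormal vectors that approximate the top right singular vectors of A without taking an SVD of A itself, and compares the resulting projection error with the optimal rank-m error obtained from an exact SVD.

```python
import numpy as np

def approx_right_singular_vectors(A, m, r, rng):
    """Return a d x m matrix with orthonormal columns that approximates
    the top-m right singular vectors of A (illustrative sketch only)."""
    n, d = A.shape
    # Assumption: a dense Gaussian matrix plays the role of the random sketch.
    S = rng.standard_normal((r, n)) / np.sqrt(r)
    # Orthonormal basis (d x r) for the row span of the r x d sketch S @ A.
    Q, _ = np.linalg.qr((S @ A).T)
    # SVD of the small n x r matrix A @ Q instead of the n x d matrix A.
    _, _, Wt = np.linalg.svd(A @ Q, full_matrices=False)
    return Q @ Wt[:m].T

rng = np.random.default_rng(0)
n, d, m, r = 500, 40, 5, 20
# Synthetic input: a rank-m signal plus small noise (hypothetical data).
A = (rng.standard_normal((n, m)) @ rng.standard_normal((m, d))
     + 0.01 * rng.standard_normal((n, d)))

V = approx_right_singular_vectors(A, m, r, rng)
# Squared Frobenius error of projecting the rows of A onto span(V).
resid_approx = np.linalg.norm(A - (A @ V) @ V.T, "fro") ** 2

# Optimal rank-m error from the exact singular values, for comparison.
s = np.linalg.svd(A, compute_uv=False)
resid_best = float(np.sum(s[m:] ** 2))

print("approximate / optimal error ratio:", resid_approx / resid_best)
```

The ratio printed at the end is always at least 1, since no rank-m projection can beat the truncated SVD; how close it gets to 1 depends on the sketch size `r` and on the spectrum of A.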

R. Pratap: This work was done when the author was affiliated with TCS Innovation Labs.


Notes

  1. Here, \(\tilde{O}\) is the asymptotic notation that ignores logarithmic factors.

  2. Two passes are required because we first multiply A by a Johnson-Lindenstrauss matrix \(\mathcal {S}\) to form \(\mathcal {S}A\), and then project the rows of A onto the row span of \(\mathcal {S}A\).
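The two passes in the second note can be sketched over a stream of row blocks as follows. This is a hedged illustration rather than the paper's algorithm: the Gaussian choice for \(\mathcal {S}\), the function name `two_pass_projection`, the `row_blocks` callable, and the block size are assumptions. The first pass accumulates \(\mathcal {S}A\) block by block, generating the matching columns of \(\mathcal {S}\) on the fly so that \(\mathcal {S}\) is never stored in full. The second pass projects each row of A onto the row span of \(\mathcal {S}A\).

```python
import numpy as np

def two_pass_projection(row_blocks, r, seed=0):
    """row_blocks: callable returning a fresh iterator over row blocks of A.
    Returns (Q, coords), where the columns of Q form an orthonormal basis of
    the row span of S @ A and coords holds the coordinates A @ Q."""
    rng = np.random.default_rng(seed)

    # Pass 1: accumulate the r x d sketch S @ A one block at a time.
    SA = None
    for block in row_blocks():
        # Columns of S that correspond to the rows in this block
        # (assumption: Gaussian entries scaled by 1/sqrt(r)).
        S_cols = rng.standard_normal((r, block.shape[0])) / np.sqrt(r)
        part = S_cols @ block
        SA = part if SA is None else SA + part

    # Orthonormal basis (d x r) for the row span of S @ A.
    Q, _ = np.linalg.qr(SA.T)

    # Pass 2: project every row of A onto that row span.
    coords = np.vstack([block @ Q for block in row_blocks()])
    return Q, coords

# Hypothetical input, streamed in blocks of 32 rows.
A = np.random.default_rng(1).standard_normal((100, 12))

def blocks():
    for start in range(0, A.shape[0], 32):
        yield A[start:start + 32]

Q, coords = two_pass_projection(blocks, r=6, seed=0)
projected = coords @ Q.T  # rows of A projected onto the row span of S @ A
```

Only the r x d sketch and the d x r basis are held in memory between passes, which is why the row span must be fixed in the first pass before any row can be projected in the second.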

References

  1. Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Approximating extent measures of points. J. ACM 51(4), 606–635 (2004)

  2. Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Geometric approximation via coresets. In: Welzl, E. (ed.) Current Trends in Combinatorial and Computational Geometry (2007)

  3. Boutsidis, C., Mahoney, M.W., Drineas, P.: An improved approximation algorithm for the column subset selection problem. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2009, New York, NY, USA, 4–6 January 2009, pp. 968–977 (2009)

  4. Clarkson, K.L., Woodruff, D.P.: Low rank approximation and regression in input sparsity time. In: Symposium on Theory of Computing Conference, STOC 2013, Palo Alto, CA, USA, 1–4 June 2013, pp. 81–90 (2013)

  5. Cohen, M.B., Elder, S., Musco, C., Musco, C., Persu, M.: Dimensionality reduction for k-means clustering and low rank approximation. In: Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, 14–17 June 2015, pp. 163–172 (2015)

  6. Ester, M., Kriegel, H., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD 1996), Portland, Oregon, USA, pp. 226–231 (1996)

  7. Feldman, D., Fiat, A., Sharir, M.: Coresets for weighted facilities and their applications. In: 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), 21–24 October 2006, Berkeley, California, USA, Proceedings, pp. 315–324 (2006)

  8. Feldman, D., Langberg, M.: A unified framework for approximating and clustering data. In: Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, 6–8 June 2011, pp. 569–578 (2011)

  9. Feldman, D., Monemizadeh, M., Sohler, C., Woodruff, D.P.: Coresets and sketches for high dimensional subspace approximation problems. In: Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 630–649 (2010)

  10. Feldman, D., Schmidt, M., Sohler, C.: Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In: Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1434–1453 (2013)

  11. Har-Peled, S.: No, coreset, no cry. In: Lodaya, K., Mahajan, M. (eds.) FSTTCS 2004: Foundations of Software Technology and Theoretical Computer Science. Lecture Notes in Computer Science, vol. 3328, pp. 324–335. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30538-5_27

  12. Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, 13–16 June 2004, pp. 291–300 (2004)

  13. Henzinger, M.R., Raghavan, P., Rajagopalan, S.: Computing on data streams. In: Proceedings of a DIMACS Workshop on External Memory Algorithms, New Brunswick, New Jersey, USA, 20–22 May 1998, pp. 107–118 (1998)

  14. Hinneburg, A., Aggarwal, C.C., Keim, D.A.: What is the nearest neighbor in high dimensional spaces? In: VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, 10–14 September 2000, Cairo, Egypt, pp. 506–515 (2000)

  15. Mahoney, M.W.: Randomized algorithms for matrices and data. Found. Trends Mach. Learn. 3(2), 123–224 (2011)

  16. Phillips, J.M.: Coresets and sketches. CoRR, abs/1601.00617 (2016)

  17. Sarlós, T.: Improved approximation algorithms for large matrices via random projections. In: 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), 21–24 October 2006, Berkeley, California, USA, Proceedings, pp. 143–152 (2006)

  18. Varadarajan, K.R., Xiao, X.: A near-linear algorithm for projective clustering integer points. In: Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2012, Kyoto, Japan, 17–19 January 2012, pp. 1329–1342 (2012)

  19. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: A new data clustering algorithm and its applications. Data Min. Knowl. Discov. 1(2), 141–182 (1997)

Author information

Correspondence to Rameshwar Pratap.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Pratap, R., Sen, S. (2018). Faster Coreset Construction for Projective Clustering via Low-Rank Approximation. In: Iliopoulos, C., Leong, H., Sung, WK. (eds) Combinatorial Algorithms. IWOCA 2018. Lecture Notes in Computer Science, vol 10979. Springer, Cham. https://doi.org/10.1007/978-3-319-94667-2_28

  • DOI: https://doi.org/10.1007/978-3-319-94667-2_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-94666-5

  • Online ISBN: 978-3-319-94667-2

  • eBook Packages: Computer Science, Computer Science (R0)
