Abstract
In this work, we present a randomized coreset construction for projective clustering, which involves computing a set of k closest j-dimensional linear (affine) subspaces of a given set of n vectors in d dimensions. Let \(A \in \mathbb {R}^{n\times d}\) be an input matrix. An earlier deterministic coreset construction of Feldman et. al. [10] relied on computing the SVD of A. The best known algorithms for SVD require \(\min \{nd^2, n^2d\}\) time, which may not be feasible for large values of n and d. We present a coreset construction by projecting the matrix A on some orthonormal vectors that closely approximate the right singular vectors of A. As a consequence, when the values of k and j are small, we are able to achieve a faster algorithm, as compared to [10], while maintaining almost the same approximation. We also benefit in terms of space as well as exploit the sparsity of the input dataset. Another advantage of our approach is that it can be constructed in a streaming setting quite efficiently.
R. Pratap—This work done when author was affiliated with TCS Innovation Labs.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
Here, \(\tilde{O}\) is the asymptotic notation that ignores logarithmic factors.
- 2.
Two passes are required as we first multiply A on the right with a Johnson-Lindenstrauss matrix \(\mathcal {S}\), and then we project the rows of A again onto the row span of \(\mathcal {S}A\).
References
Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Approximating extent measures of points. J. ACM 51(4), 606–635 (2004)
Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Geometric approximation via coresets. In: Welzl, E., (ed.) Current Trends in Combinatorial and Computational Geometry (2007)
Boutsidis, C., Mahoney, M.W., Drineas, P.: An improved approximation algorithm for the column subset selection problem. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2009, New York, NY, USA, 4–6 January 2009, pp. 968–977 (2009)
Clarkson, K.L., Woodruff, D.P.: Low rank approximation and regression in input sparsity time. In: Symposium on Theory of Computing Conference, STOC 2013, Palo Alto, CA, USA, 1–4 June 2013, pp. 81–90 (2013)
Cohen, M.B., Elder, S., Musco, C., Musco, C., Persu, M.: Dimensionality reduction for k-means clustering and low rank approximation. In: Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, 14–17 June 2015, pp. 163–172 (2015)
Ester, M., Kriegel, H., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD 1996), Portland, Oregon, USA, pp. 226–231 (1996)
Feldman, D., Fiat, A., Sharir, M.: Coresets forweighted facilities and their applications. In: 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), 21–24 October 2006, Berkeley, California, USA, Proceedings, pp. 315–324 (2006)
Feldman, D., Langberg, M.: A unified framework for approximating and clustering data. In: Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, 6–8 June 2011, pp. 569–578 (2011)
Feldman, D., Monemizadeh, M., Sohler, C., Woodruff, D.P.: Coresets and sketches for high dimensional subspace approximation problems. In: Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 630–649 (2010)
Feldman, D., Schmidt, M., Sohler, C.: Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In: Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1434–1453 (2013)
Har-Peled, S.: No, coreset, no cry. In: Lodaya, K., Mahajan, M. (eds.) FSTTCS 2004: Foundations of Software Technology and Theoretical Computer Science. Lecture Notes in Computer Science, vol. 3328, pp. 324–335. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30538-5_27
Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, 13–16 June 2004, pp. 291–300 (2004)
Henzinger, M.R., Raghavan, P., Rajagopalan, S.: Computing on data streams. In: Proceedings of a DIMACS Workshop on External Memory Algorithms, New Brunswick, New Jersey, USA, 20–22 May 1998, pp. 107–118 (1998)
Hinneburg, A., Aggarwal, C.C., Keim, D.A.: What is the nearest neighbor in high dimensional spaces? In: VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, 10–14 September 2000, Cairo, Egypt, pp. 506–515 (2000)
Mahoney, M.W.: Randomized algorithms for matrices and data. Found. Trends Mach. Learn. 3(2), 123–224 (2011)
Phillips, J.M.: Coresets and sketches. CoRR, abs/1601.00617 (2016)
Sarlós, T.: Improved approximation algorithms for large matrices via random projections. In: 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), 21–24 October 2006, Berkeley, California, USA, Proceedings, pp. 143–152 (2006)
Varadarajan, K.R., Xiao, X.: A near-linear algorithm for projective clustering integer points. In: Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2012, Kyoto, Japan, 17–19 January 2012, pp. 1329–1342 (2012)
Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: A new data clustering algorithm and its applications. Data Min. Knowl. Discov. 1(2), 141–182 (1997)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Pratap, R., Sen, S. (2018). Faster Coreset Construction for Projective Clustering via Low-Rank Approximation. In: Iliopoulos, C., Leong, H., Sung, WK. (eds) Combinatorial Algorithms. IWOCA 2018. Lecture Notes in Computer Science(), vol 10979. Springer, Cham. https://doi.org/10.1007/978-3-319-94667-2_28
Download citation
DOI: https://doi.org/10.1007/978-3-319-94667-2_28
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-94666-5
Online ISBN: 978-3-319-94667-2
eBook Packages: Computer ScienceComputer Science (R0)