Core-Sets: Updated Survey

  • Dan Feldman
Chapter
Part of the Unsupervised and Semi-Supervised Learning book series (UNSESUL)

Abstract

In optimization or machine learning problems we are given a set of items, usually points in some metric space, and the goal is to minimize or maximize an objective function over some space of candidate solutions. For example, in clustering problems the input is a set of points in some metric space, and a common goal is to compute a set of centers in some other space (e.g., points or lines) that minimizes the sum of distances to the input points. In database queries, we may need to compute such a sum for a specific query set of k centers.
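As a minimal illustration of such a query cost (a sketch, not taken from the survey; names and data are illustrative), the sum of distances from a point set to its nearest center in a query set of k centers — the k-median objective — can be computed directly:

```python
import numpy as np

def query_cost(points, centers):
    """Sum over all points of the distance to the nearest of the k centers
    (the k-median objective for this query set)."""
    # pairwise distances: shape (n_points, n_centers)
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return d.min(axis=1).sum()

rng = np.random.default_rng(0)
P = rng.normal(size=(1000, 2))           # input: 1000 points in the plane
Q = np.array([[0.0, 0.0], [3.0, 3.0]])   # a query set of k = 2 centers
cost = query_cost(P, Q)
```

Evaluating this cost takes time linear in the number of input points per query, which motivates replacing the input with a much smaller summary.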

However, traditional algorithms cannot handle modern systems that require parallel, real-time computation over unbounded distributed streams from sensors such as GPS, audio, or video that arrive at a cloud, or over networks of weaker devices such as smartphones or robots.

A core-set is a “small data” summarization of the input “big data,” in which every possible query has approximately the same answer on both data sets. Generic techniques enable efficient coreset maintenance over streaming, distributed, and dynamic data. Traditional algorithms can then be applied to these coresets to maintain approximately optimal solutions.

The challenge is to design coresets with provable trade-offs between their size and approximation error. This survey summarizes such constructions retrospectively, aiming to unify and simplify the state of the art.
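The size/error trade-off can be sketched with the simplest possible construction: uniform sampling with weights n/m, so that the weighted sample stands in for the full input under any query. (This is an illustrative sketch only; the provable constructions surveyed here use more refined ideas such as sensitivity-based sampling, and uniform sampling can fail badly on unbalanced inputs.)

```python
import numpy as np

def weighted_cost(points, weights, centers):
    """Weighted k-median cost: each point contributes weight * distance
    to its nearest center."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return (weights * d.min(axis=1)).sum()

rng = np.random.default_rng(1)
P = rng.normal(size=(10_000, 2))          # "big data": 10,000 points
n = len(P)

m = 500                                   # coreset size
idx = rng.choice(n, size=m, replace=False)
C, w = P[idx], np.full(m, n / m)          # each sample represents n/m points

Q = rng.normal(size=(3, 2))               # an arbitrary query of k = 3 centers
full = weighted_cost(P, np.ones(n), Q)    # exact cost on the input
approx = weighted_cost(C, w, Q)           # cost on the coreset
rel_err = abs(full - approx) / full       # small with high probability here
```

On well-spread data like this, the relative error is small for a 5% sample; the point of the constructions in the survey is to make such a guarantee hold for *every* query simultaneously, with a provable bound on m.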

Keywords

Core-sets · Data summarization · Clustering · Query · Approximation error


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Dan Feldman
  1. Computer Science Department, University of Haifa, Haifa, Israel
