A Comparative Study of the Use of Coresets for Clustering Large Datasets

  • Conference paper
Future Data and Security Engineering (FDSE 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11814)

Abstract

Coresets can be described as compact subsets such that models trained on them provide a good fit with models trained on the full data set. By using coresets, we can scale a large data set down to a small one and thus reduce the computational cost of a machine learning problem. In recent years, data scientists have investigated various methods to create coresets. Two state-of-the-art algorithms proposed in 2018 are ProTraS by Ros & Guillaume and Lightweight Coreset by Bachem et al. In this paper, we briefly introduce these two algorithms and compare them to identify the benefits and drawbacks of each.
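
To make the idea of a weighted subset concrete, the following is a minimal sketch of importance sampling in the style of the lightweight coreset construction. The abstract does not spell the algorithm out, so the sampling distribution used here (an equal mix of uniform sampling and sampling proportional to squared distance from the data mean) is an assumption based on the published description by Bachem et al., and the function name `lightweight_coreset` and its parameters are hypothetical.

```python
import random


def lightweight_coreset(points, m, seed=0):
    """Sample m weighted points from `points` (a list of equal-length tuples).

    Assumed sampling distribution (not taken from this page):
        q(x) = 0.5 / n + 0.5 * d(x, mean)^2 / sum of d(x', mean)^2
    Each sampled point gets weight 1 / (m * q(x)).
    """
    rng = random.Random(seed)
    n = len(points)
    dim = len(points[0])

    # Mean of the full data set, one coordinate at a time.
    mean = [sum(p[j] for p in points) / n for j in range(dim)]

    # Squared Euclidean distance of every point to the mean.
    sq_dist = [sum((p[j] - mean[j]) ** 2 for j in range(dim)) for p in points]
    total = sum(sq_dist)

    if total == 0:
        # All points coincide, so fall back to uniform sampling.
        q = [1.0 / n] * n
    else:
        q = [0.5 / n + 0.5 * d / total for d in sq_dist]

    # Draw m indices with replacement according to q.
    idx = rng.choices(range(n), weights=q, k=m)
    sample = [points[i] for i in idx]
    weights = [1.0 / (m * q[i]) for i in idx]
    return sample, weights
```

A clustering algorithm that accepts per-point weights could then be run on `sample` and `weights` instead of the full data. Whether the result fits the full data well depends on the sample size `m` and on the data set.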

Notes

  1. https://cs.joensuu.fi/sipu/datasets

  2. https://github.com/deric/clustering-benchmark

References

  1. Agarwal, P.K., Procopiuc, C.M., Varadarajan, K.R.: Approximating extent measures of points. J. ACM (JACM) 51(4), 606–635 (2004)

  2. Agarwal, P.K., Procopiuc, C.M., Varadarajan, K.R.: Geometric approximation via coresets. Comb. Comput. Geom. 52, 1–30 (2005)

  3. Har-Peled, S., Kushal, A.: Smaller coresets for \(k\)-median and \(k\)-means clustering. In: Symposium on Computational Geometry (SoCG), pp. 126–134. ACM (2005)

  4. Har-Peled, S., Mazumdar, S.: On coresets for \(k\)-means and \(k\)-median clustering. In: Symposium on Theory of Computing (STOC), pp. 291–300. ACM (2004)

  5. Bachem, O., Lucic, M., Krause, A.: Scalable and distributed clustering via lightweight Coresets. In: International Conference on Knowledge Discovery and Data Mining (KDD) (2018)

  6. Bachem, O., Lucic, M., Krause, A.: Practical Coreset constructions for machine learning. arXiv preprint (2017)

  7. Phan, T.N., Dang, T.K.: A lightweight indexing approach for efficient batch similarity processing with MapReduce. SN Comput. Sci. 1(1) (2020)

  8. Dang, T.K., Tran, K.T.K.: The meeting of acquaintances: a cost-efficient authentication scheme for light-weight objects with transient trust level and plurality approach. Secur. Commun. Netw. (2019)

  9. Ros, F., Guillaume, S.: DENDIS: a new density-based sampling for clustering algorithm. In: Expert Systems with Applications, vol. 56, pp. 349–359 (2016)

  10. Ros, F., Guillaume, S.: DIDES: a fast and effective sampling for clustering algorithm. In: Knowledge and Information Systems, vol. 50, pp. 543–568 (2017)

  11. Ros, F., Guillaume, S.: ProTraS: a probabilistic traversing sampling algorithm. In: Expert Systems with Applications, vol. 105, pp. 65–76 (2018)

  12. Trang, L.H., Van Ngoan, P., Van Duc, N.: A sample-based algorithm for visual assessment of cluster tendency (VAT) with large datasets. In: Dang, T.K., Küng, J., Wagner, R., Thoai, N., Takizawa, M. (eds.) FDSE 2018. LNCS, vol. 11251, pp. 145–157. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-03192-3_11

  13. Trang, L.H., Bangui, H., Ge, M., Buhnova, B.: Scaling big data applications in smart city with Coresets. In: Proceedings of the 8th International Conference on Data Science, Technology and Applications (DATA 2019), pp. 357–363 (2019)

  14. Lloyd, S.P.: Least squares quantization in PCM. IEEE Trans. Inf. Theor. 28, 129–137 (1982)

  15. Inaba, M., Katoh, N., Imai, H.: Applications of weighted Voronoi diagrams and randomization to variance-based \(k\)-clustering. In: Proceeding of 10th Annual Symposium on Computational Geometry, pp. 332–339 (1994)

  16. de la Vega, W.F., Karpinski, M., Kenyon, C., Rabani, Y.: Approximation schemes for clustering problems. In: Proceedings of the 35th Annual ACM Symposium on Theory of Computing, pp. 50–58 (2003)

  17. Matousek, J.: On approximate geometric \(k\)-clustering. Discrete Comput. Geom. 24, 61–84 (2000)

  18. Arora, S.: Polynomial time approximation schemes for Euclidean traveling salesman and other geometric problems. J. Assoc. Comput. Mach. 45(5), 753–782 (1998)

  19. Charikar, M., O’Callaghan, L., Panigrahy, R.: Better streaming algorithms for clustering problems. In: Proceedings of the 35th Annual ACM Symposium on Theory of Computing, pp. 30–39 (2003)

  20. Arthur, D., Vassilvitskii, S.: \(k\)-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035 (2007)


Acknowledgment

This work is supported by a project with the Department of Science and Technology, Ho Chi Minh City, Vietnam (contract with HCMUT No. 42/2019/HD-QPTKHCN, dated 11/7/2019).

Corresponding author

Correspondence to Tran Khanh Dang.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Hoang, N.L., Dang, T.K., Trang, L.H. (2019). A Comparative Study of the Use of Coresets for Clustering Large Datasets. In: Dang, T., Küng, J., Takizawa, M., Bui, S. (eds) Future Data and Security Engineering. FDSE 2019. Lecture Notes in Computer Science, vol 11814. Springer, Cham. https://doi.org/10.1007/978-3-030-35653-8_4

  • DOI: https://doi.org/10.1007/978-3-030-35653-8_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-35652-1

  • Online ISBN: 978-3-030-35653-8

  • eBook Packages: Computer Science, Computer Science (R0)
