A Comparative Study of the Use of Coresets for Clustering Large Datasets

Hoang, Nguyen Le; Dang, Tran Khanh; Trang, Le Hong

doi:10.1007/978-3-030-35653-8_4


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11814)

Included in the following conference series:

International Conference on Future Data and Security Engineering


Abstract

A coreset is a compact subset of a data set such that models trained on the coreset provide a good fit compared with models trained on the full data set. By using coresets, we can scale a large data set down to a much smaller one and thereby reduce the computational cost of a machine learning problem. In recent years, data scientists have investigated various methods for constructing coresets. Two state-of-the-art algorithms, both proposed in 2018, are ProTraS by Ros & Guillaume and Lightweight Coreset by Bachem et al. In this paper, we briefly introduce these two algorithms and compare them to identify the benefits and drawbacks of each.
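
To make the idea of a coreset concrete, the following is a minimal sketch of importance sampling in the spirit of the lightweight coreset construction of Bachem et al. (2018): each point is sampled with a probability that mixes a uniform term with a term proportional to its squared distance from the data mean, and each sampled point receives the weight 1/(m·q). The function name `lightweight_coreset` and its parameters are illustrative assumptions, not code from this chapter, and ProTraS is not covered here.

```python
import random


def lightweight_coreset(points, m, seed=0):
    """Sample a weighted subset of size m from a list of points (tuples of floats).

    Illustrative sketch only; names and defaults are assumptions.
    """
    rng = random.Random(seed)
    n = len(points)
    dim = len(points[0])
    # Mean of the full data set.
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    # Squared Euclidean distance of every point to the mean.
    sq_dist = [sum((p[j] - mean[j]) ** 2 for j in range(dim)) for p in points]
    total = sum(sq_dist)
    if total > 0:
        # Half uniform, half proportional to squared distance from the mean.
        q = [0.5 / n + 0.5 * d / total for d in sq_dist]
    else:
        # All points coincide: fall back to uniform sampling.
        q = [1.0 / n] * n
    # Draw m indices independently (with replacement) according to q.
    idx = rng.choices(range(n), weights=q, k=m)
    # The weight 1 / (m * q) keeps weighted sums unbiased for full-data sums.
    return [(points[i], 1.0 / (m * q[i])) for i in idx]


# Usage: reduce 1000 synthetic 2-D points to 50 weighted points.
data_rng = random.Random(42)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(1000)]
coreset = lightweight_coreset(data, m=50)
```

The weighted points can then be passed to any clustering routine that accepts per-point weights, so the expensive step runs on 50 points instead of 1000.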

References

Agarwal, P.K., Procopiuc, C.M., Varadarajan, K.R.: Approximating extent measures of points. J. ACM (JACM) 51(4), 606–635 (2004)
Agarwal, P.K., Procopiuc, C.M., Varadarajan, K.R.: Geometric approximation via coresets. Comb. Comput. Geom. 52, 1–30 (2005)
Har-Peled, S., Kushal, A.: Smaller coresets for \(k\)-median and \(k\)-means clustering. In: Symposium on Computational Geometry (SoCG), pp. 126–134. ACM (2005)
Har-Peled, S., Mazumdar, S.: On coresets for \(k\)-means and \(k\)-median clustering. In: Symposium on Theory of Computing (STOC), pp. 291–300. ACM (2004)
Bachem, O., Lucic, M., Krause, A.: Scalable and distributed clustering via lightweight Coresets. In: International Conference on Knowledge Discovery and Data Mining (KDD) (2018)
Bachem, O., Lucic, M., Krause, A.: Practical Coreset constructions for machine learning. arXiv preprint (2017)
Phan, T.N., Dang, T.K.: A lightweight indexing approach for efficient batch similarity processing with MapReduce. SN Comput. Sci. 1(1) (2020)
Dang, T.K., Tran, K.T.K.: The meeting of acquaintances: a cost-efficient authentication scheme for light-weight objects with transient trust level and plurality approach. Secur. Commun. Netw. (2019)
Ros, F., Guillaume, S.: DENDIS: a new density-based sampling for clustering algorithm. In: Expert Systems with Applications, vol. 56, pp. 349–359 (2016)
Ros, F., Guillaume, S.: DIDES: a fast and effective sampling for clustering algorithm. In: Knowledge and Information Systems, vol. 50, pp. 543–568 (2017)
Ros, F., Guillaume, S.: ProTraS: a probabilistic traversing sampling algorithm. In: Expert Systems with Applications, vol. 105, pp. 65–76 (2018)
Trang, L.H., Van Ngoan, P., Van Duc, N.: A sample-based algorithm for visual assessment of cluster tendency (VAT) with large datasets. In: Dang, T.K., Küng, J., Wagner, R., Thoai, N., Takizawa, M. (eds.) FDSE 2018. LNCS, vol. 11251, pp. 145–157. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-03192-3_11
Trang, L.H., Bangui, H., Ge, M., Buhnova, B.: Scaling big data applications in smart city with Coresets. In: Proceedings of the 8th International Conference on Data Science, Technology and Applications (DATA 2019), pp. 357–363 (2019)
Lloyd, S.P.: Least squares quantization in PCM. IEEE Trans. Inf. Theor. 28, 129–137 (1982)
Inaba, M., Katoh, N., Imai, H.: Applications of weighted Voronoi diagrams and randomization to variance-based \(k\)-clustering. In: Proceeding of 10th Annual Symposium on Computational Geometry, pp. 332–339 (1994)
de la Vega, W.F., Karpinski, M., Kenyon, C., Rabani, Y.: Approximation schemes for clustering problems. In: Proceedings of the 35th Annual ACM Symposium on Theory of Computing, pp. 50–58 (2003)
Matousek, J.: On approximate geometric \(k\)-clustering. Discrete Comput. Geom. 24, 61–84 (2000)
Arora, S.: Polynomial time approximation schemes for Euclidean traveling salesman and other geometric problems. J. Assoc. Comput. Mach. 45(5), 753–782 (1998)
Charikar, M., O’Callaghan, L., Panigrahy, R.: Better streaming algorithms for clustering problems. In: Proceedings of the 35th Annual ACM Symposium on Theory of Computing, pp. 30–39 (2003)
Arthur, D., Vassilvitskii, S.: \(k\)-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035 (2007)

Acknowledgment

This work is supported by a project with the Department of Science and Technology, Ho Chi Minh City, Vietnam (contract with HCMUT No. 42/2019/HD-QPTKHCN, dated 11/7/2019).

Author information

Authors and Affiliations

Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, VNU-HCM, 268 Ly Thuong Kiet, District 10, Ho Chi Minh City, Vietnam
Nguyen Le Hoang, Tran Khanh Dang & Le Hong Trang

Corresponding author

Correspondence to Tran Khanh Dang.

Editor information

Editors and Affiliations

Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
Tran Khanh Dang
Johannes Kepler Universität Linz, Linz, Austria
Josef Küng
Hosei University, Tokyo, Japan
Makoto Takizawa
Telecommunications University, Nha Trang City, Vietnam
Son Ha Bui

About this paper

Cite this paper

Hoang, N.L., Dang, T.K., Trang, L.H. (2019). A Comparative Study of the Use of Coresets for Clustering Large Datasets. In: Dang, T., Küng, J., Takizawa, M., Bui, S. (eds) Future Data and Security Engineering. FDSE 2019. Lecture Notes in Computer Science, vol 11814. Springer, Cham. https://doi.org/10.1007/978-3-030-35653-8_4

DOI: https://doi.org/10.1007/978-3-030-35653-8_4
Published: 20 November 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-35652-1
Online ISBN: 978-3-030-35653-8
eBook Packages: Computer Science, Computer Science (R0)
