Abstract
A coreset is a compact subset of a data set such that a model trained on the coreset fits the data nearly as well as a model trained on the full data set. By using a coreset, we can reduce a large data set to a much smaller one and thereby lower the computational cost of a machine learning problem. In recent years, data scientists have investigated various methods for constructing coresets. Two state-of-the-art algorithms proposed in 2018 are ProTraS by Ros & Guillaume and Lightweight Coreset by Bachem et al. In this paper, we briefly introduce these two algorithms and compare them to identify the benefits and drawbacks of each.
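To make the idea of a coreset concrete, the following is a minimal sketch of importance sampling in the spirit of Lightweight Coreset: each point is drawn with a probability that mixes a uniform term and a term proportional to its squared distance from the data mean, and each drawn point receives the weight 1/(m q(x)). The function name `lightweight_coreset`, its parameters, and the pure-Python data layout are assumptions made for illustration; this is not the authors' implementation and it does not cover ProTraS.

```python
import random


def lightweight_coreset(points, m, seed=0):
    """Draw a weighted sample of size m from a list of equal-length tuples.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    n = len(points)
    dim = len(points[0])

    # Mean of the full data set.
    mean = [sum(p[j] for p in points) / n for j in range(dim)]

    # Squared Euclidean distance of every point to the mean.
    sq_dist = [sum((p[j] - mean[j]) ** 2 for j in range(dim)) for p in points]
    total = sum(sq_dist)

    # Sampling distribution: half uniform, half proportional to squared distance.
    if total > 0:
        q = [0.5 / n + 0.5 * d / total for d in sq_dist]
    else:
        # All points coincide with the mean; fall back to uniform sampling.
        q = [1.0 / n] * n

    # Sample m indices independently (with replacement) according to q.
    idx = rng.choices(range(n), weights=q, k=m)

    sample = [points[i] for i in idx]
    # Importance weights keep the weighted clustering cost unbiased.
    weights = [1.0 / (m * q[i]) for i in idx]
    return sample, weights


if __name__ == "__main__":
    data = [(float(i), float(i % 7)) for i in range(1000)]
    sample, weights = lightweight_coreset(data, m=50)
    print(len(sample), round(sum(weights), 1))
```

The weighted sample can then be passed to any clustering routine that accepts per-point weights, so the expensive step runs on m points instead of n.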
References
Agarwal, P.K., Procopiuc, C.M., Varadarajan, K.R.: Approximating extent measures of points. J. ACM (JACM) 51(4), 606–635 (2004)
Agarwal, P.K., Procopiuc, C.M., Varadarajan, K.R.: Geometric approximation via coresets. Comb. Comput. Geom. 52, 1–30 (2005)
Har-Peled, S., Kushal, A.: Smaller coresets for \(k\)-median and \(k\)-means clustering. In: Symposium on Computational Geometry (SoCG), pp. 126–134. ACM (2005)
Har-Peled, S., Mazumdar, S.: On coresets for \(k\)-means and \(k\)-median clustering. In: Symposium on Theory of Computing (STOC), pp. 291–300. ACM (2004)
Bachem, O., Lucic, M., Krause, A.: Scalable and distributed clustering via lightweight coresets. In: International Conference on Knowledge Discovery and Data Mining (KDD) (2018)
Bachem, O., Lucic, M., Krause, A.: Practical coreset constructions for machine learning. arXiv preprint (2017)
Phan, T.N., Dang, T.K.: A lightweight indexing approach for efficient batch similarity processing with MapReduce. SN Comput. Sci. 1(1) (2020)
Dang, T.K., Tran, K.T.K.: The meeting of acquaintances: a cost-efficient authentication scheme for light-weight objects with transient trust level and plurality approach. Secur. Commun. Netw. (2019)
Ros, F., Guillaume, S.: DENDIS: a new density-based sampling for clustering algorithm. In: Expert Systems with Applications, vol. 56, pp. 349–359 (2016)
Ros, F., Guillaume, S.: DIDES: a fast and effective sampling for clustering algorithm. In: Knowledge and Information Systems, vol. 50, pp. 543–568 (2017)
Ros, F., Guillaume, S.: ProTraS: a probabilistic traversing sampling algorithm. In: Expert Systems with Applications, vol. 105, pp. 65–76 (2018)
Trang, L.H., Van Ngoan, P., Van Duc, N.: A sample-based algorithm for visual assessment of cluster tendency (VAT) with large datasets. In: Dang, T.K., Küng, J., Wagner, R., Thoai, N., Takizawa, M. (eds.) FDSE 2018. LNCS, vol. 11251, pp. 145–157. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-03192-3_11
Trang, L.H., Bangui, H., Ge, M., Buhnova, B.: Scaling big data applications in smart city with Coresets. In: Proceedings of the 8th International Conference on Data Science, Technology and Applications (DATA 2019), pp. 357–363 (2019)
Lloyd, S.P.: Least squares quantization in PCM. IEEE Trans. Inf. Theor. 28, 129–137 (1982)
Inaba, M., Katoh, N., Imai, H.: Applications of weighted Voronoi diagrams and randomization to variance-based \(k\)-clustering. In: Proceeding of 10th Annual Symposium on Computational Geometry, pp. 332–339 (1994)
de la Vega, W.F., Karpinski, M., Kenyon, C., Rabani, Y.: Approximation schemes for clustering problems. In: Proceedings of the 35th Annual ACM Symposium on Theory of Computing, pp. 50–58 (2003)
Matousek, J.: On approximate geometric \(k\)-clustering. Discrete Comput. Geom. 24, 61–84 (2000)
Arora, S.: Polynomial time approximation schemes for Euclidean traveling salesman and other geometric problems. J. Assoc. Comput. Mach. 45(5), 753–782 (1998)
Charikar, M., O’Callaghan, L., Panigrahy, R.: Better streaming algorithms for clustering problems. In: Proceedings of the 35th Annual ACM Symposium on Theory of Computing, pp. 30–39 (2003)
Arthur, D., Vassilvitskii, S.: \(k\)-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035 (2007)
Acknowledgment
This work is supported by a project with the Department of Science and Technology, Ho Chi Minh City, Vietnam (contract with HCMUT No. 42/2019/HD-QPTKHCN, dated 11/7/2019).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Hoang, N.L., Dang, T.K., Trang, L.H. (2019). A Comparative Study of the Use of Coresets for Clustering Large Datasets. In: Dang, T., Küng, J., Takizawa, M., Bui, S. (eds) Future Data and Security Engineering. FDSE 2019. Lecture Notes in Computer Science(), vol 11814. Springer, Cham. https://doi.org/10.1007/978-3-030-35653-8_4
DOI: https://doi.org/10.1007/978-3-030-35653-8_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-35652-1
Online ISBN: 978-3-030-35653-8
eBook Packages: Computer Science, Computer Science (R0)