Clustering on Streams

Venkatasubramanian, Suresh

doi:10.1007/978-1-4614-8265-9_68

Definition

An instance of a clustering problem (see clustering) consists of a collection of points in a distance space, a measure of the cost of a clustering, and a measure of the size of a clustering. The goal is to compute a partitioning of the points into clusters such that the cost of this clustering is minimized, while the size is kept under some predefined threshold. Less commonly, a threshold for the cost is specified, while the goal is to minimize the size of the clustering.
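To make the roles of "cost" and "size" concrete, the following minimal sketch evaluates a clustering given as a list of centers. It assumes points in Euclidean space and uses two common cost measures, the k-center cost (largest distance from any point to its nearest center) and the k-median cost (sum of those distances). These measures, and the function names `kcenter_cost`, `kmedian_cost`, and `clustering_size`, are illustrative assumptions and are not prescribed by the definition above.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def nearest_center_distance(p: Point, centers: List[Point]) -> float:
    """Distance from p to the closest center (Euclidean distance assumed)."""
    return min(math.dist(p, c) for c in centers)


def kcenter_cost(points: List[Point], centers: List[Point]) -> float:
    """k-center cost: the largest distance from any point to its nearest center."""
    return max(nearest_center_distance(p, centers) for p in points)


def kmedian_cost(points: List[Point], centers: List[Point]) -> float:
    """k-median cost: the sum of distances from each point to its nearest center."""
    return sum(nearest_center_distance(p, centers) for p in points)


def clustering_size(centers: List[Point]) -> int:
    """Size of the clustering, taken here to be the number of clusters."""
    return len(centers)
```

Under these assumptions, the usual form of the problem fixes a bound k on `clustering_size` and minimizes one of the cost functions, while the less common form fixes a bound on the cost and minimizes `clustering_size`.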

A data stream (see data streams) is a sequence of data presented to an algorithm one item at a time. A stream algorithm, upon reading an item, must perform some action based on this item and the contents of its working space, which is sublinear in the size of the data sequence. After this action is performed (which might include copying the item to its working space), the item is discarded.

Clustering on streams refers to the problem of clustering a data set presented as a data stream.
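As a rough illustration of the streaming model applied to clustering, the sketch below handles the variant in which a cost threshold is given. It reads each item once, keeps only a list of centers as its working space, and opens a new center whenever an item lies farther than `radius` from every stored center. The Euclidean distance, the parameter `radius`, and the function name `threshold_stream_clustering` are assumptions made for this example. It is a simple greedy pass, not the method of any particular paper, and it does not claim to minimize the number of clusters.

```python
import math
from typing import Iterable, List, Sequence, Tuple

Point = Sequence[float]


def threshold_stream_clustering(
    stream: Iterable[Point], radius: float
) -> List[Tuple[float, ...]]:
    """One-pass greedy clustering under a distance threshold.

    Every item ends up within `radius` of some stored center.
    The working space is the list of centers, not the stream itself.
    """
    centers: List[Tuple[float, ...]] = []
    for item in stream:
        # Act using only the current item and the working space.
        if all(math.dist(item, c) > radius for c in centers):
            # Copy the item into the working space as a new center.
            centers.append(tuple(item))
        # The item is not referenced again after this point.
    return centers
```

For example, `threshold_stream_clustering(iter(points), radius=1.0)` consumes the iterator once and returns the stored centers. The working space grows with the number of centers, so it is sublinear in the stream length only when the data can be covered by few clusters at the chosen radius.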

Historical Background



Author information

Authors and Affiliations

University of Utah, Salt Lake City, UT, USA
Suresh Venkatasubramanian

Corresponding author

Correspondence to Suresh Venkatasubramanian.

Editor information

Editors and Affiliations

Georgia Institute of Technology College of Computing, Atlanta, GA, USA
Ling Liu
University of Waterloo School of Computer Science, Waterloo, ON, Canada
M. Tamer Özsu

Section Editor information

AT&T Labs - Research, AT&T, Bedminster, NJ, USA
Divesh Srivastava

About this entry

Cite this entry

Venkatasubramanian, S. (2018). Clustering on Streams. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_68

DOI: https://doi.org/10.1007/978-1-4614-8265-9_68
Published: 07 December 2018
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering
