Clustering on Streams
An instance of a clustering problem (see clustering) consists of a collection of points in a distance space, a measure of the cost of a clustering, and a measure of the size of a clustering (typically the number of clusters). The goal is to partition the points into clusters so that the cost of the clustering is minimized while its size stays under a predefined threshold. Less commonly, a threshold on the cost is specified and the goal is to minimize the size of the clustering.
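To make the notions of cost and size concrete, the following is a minimal sketch that evaluates a center-based clustering of points in Euclidean space. It rests on assumptions the entry does not state: each point is assigned to its nearest center, size is the number of centers, and cost is one of three common objectives (k-center, k-median, k-means). The function name `clustering_costs` is hypothetical.

```python
import math

def clustering_costs(points, centers):
    """Evaluate a center-based clustering (illustrative assumption:
    every point belongs to the cluster of its nearest center)."""
    # Distance from each point to its nearest center.
    dists = [min(math.dist(p, c) for c in centers) for p in points]
    return {
        "size": len(centers),                 # number of clusters
        "k_center": max(dists),               # largest point-to-center distance
        "k_median": sum(dists),               # sum of distances
        "k_means": sum(d * d for d in dists), # sum of squared distances
    }

points = [(0, 0), (0, 2), (10, 0)]
centers = [(0, 0), (10, 0)]
costs = clustering_costs(points, centers)
print(costs)
```

Under these assumptions, minimizing cost subject to a size threshold means choosing at most k centers that make one of these quantities as small as possible.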
A data stream (see data streams) is a sequence of data items presented to an algorithm one item at a time. Upon reading an item, a stream algorithm must perform some action based on that item and the contents of its working space, whose size is sublinear in the length of the data sequence. After the action is performed (it may include copying the item into the working space), the item is discarded.
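The read, act, discard cycle can be sketched with reservoir sampling, a standard one-pass technique that this entry does not itself mention and that is used here only as an illustration. The class name `ReservoirSampler` and its `capacity` and `seed` parameters are hypothetical. The working space holds at most `capacity` items regardless of how long the stream is.

```python
import random

class ReservoirSampler:
    """Keeps a uniform random sample of fixed size from a stream
    (illustrative sketch; names are assumptions, not from the entry)."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity      # bound on the working space
        self.sample = []              # the working space itself
        self.seen = 0                 # number of items read so far
        self._rng = random.Random(seed)

    def process(self, item):
        """Act on one item; the caller discards the item afterwards."""
        self.seen += 1
        if len(self.sample) < self.capacity:
            # Working space not yet full: copy the item into it.
            self.sample.append(item)
        else:
            # Keep the new item with probability capacity / seen,
            # overwriting a uniformly chosen slot.
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = item

sampler = ReservoirSampler(capacity=3, seed=0)
for item in range(100):   # the stream is read once, item by item
    sampler.process(item)
print(sampler.sample)
```

Each call to `process` looks only at the current item and the stored sample, which matches the constraint that a stream algorithm cannot revisit earlier items unless it has copied them into its working space.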
Clustering on streams refers to the problem of clustering a data set presented as a data stream.
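As a rough illustration of what such a computation can look like, the sketch below implements sequential (online) k-means: it stores only k centers and k counters, reads each point once, and then discards it. This is a simple heuristic with no approximation guarantee, chosen for brevity. It is not claimed to be the method of any particular work on stream clustering, and the class name `OnlineKMeans` is hypothetical.

```python
import math

class OnlineKMeans:
    """One-pass k-means heuristic using O(k) working space
    (illustrative sketch under the assumptions stated above)."""

    def __init__(self, k):
        self.k = k
        self.centers = []   # current cluster centers
        self.counts = []    # number of points absorbed by each center

    def process(self, point):
        """Assign one point to a cluster and update that cluster's center."""
        if len(self.centers) < self.k:
            # The first k points seed the centers.
            self.centers.append([float(x) for x in point])
            self.counts.append(1)
            return len(self.centers) - 1
        # Find the nearest existing center.
        i = min(range(self.k), key=lambda j: math.dist(point, self.centers[j]))
        self.counts[i] += 1
        step = 1.0 / self.counts[i]
        # Move the center toward the point so it stays the running mean.
        self.centers[i] = [c + step * (x - c)
                           for c, x in zip(self.centers[i], point)]
        return i

model = OnlineKMeans(k=2)
for p in [(0, 0), (10, 0), (0, 2), (10, 2)]:   # points arrive one at a time
    model.process(p)
print(model.centers)   # [[0.0, 1.0], [10.0, 1.0]]
```

The result depends on the order in which points arrive, which is typical of one-pass heuristics. Stream clustering algorithms with provable guarantees generally need more elaborate summaries than a single running mean per cluster.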