Stream sampling is the process of collecting a representative sample of the elements of a data stream. The sample is usually much smaller than the entire stream, but it can be designed to retain important characteristics of the stream and can be used to estimate many aggregates over it. Unlike sampling from a stored data set, stream sampling must be performed online, as the data arrives. Any element that is not stored in the sample is lost and cannot be retrieved later. This article discusses various methods of sampling from a data stream and applications of these methods.
An early algorithm for maintaining a random sample of a data stream is the reservoir sampling algorithm due to Vitter. More recent algorithms based on random sampling have been inspired by the work of Alon et al. Random sampling has long been used to process data within stored databases, and the reader is referred to the survey literature on that topic.
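The basic idea behind reservoir sampling can be illustrated with a minimal Python sketch. It shows the simplest form of the technique (often called Algorithm R), not the optimized variants, and it assumes a uniform sample of fixed size `k` without replacement. The names `reservoir_sample`, `stream`, `k`, and `rng` are illustrative choices and do not come from the original text.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return a uniform random sample of up to k elements from a stream.

    The stream is read once, and only k elements are held in memory.
    """
    if rng is None:
        rng = random.Random()

    reservoir: List[T] = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k elements fill the reservoir directly.
            reservoir.append(item)
        else:
            # The n-th element enters the sample with probability k/n,
            # replacing a uniformly chosen current member.
            j = rng.randrange(n)  # uniform integer in [0, n)
            if j < k:
                reservoir[j] = item
    return reservoir
```

For example, `reservoir_sample(range(1_000_000), 100)` would return 100 elements while storing only 100 at any time. An element that is skipped or evicted is never seen again, which matches the online constraint described above.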
- 2. Babcock B, Datar M, Motwani R. Sampling from a moving window over streaming data. In: Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms; 2002. p. 633–4.
- 3. Chakrabarti A, Cormode G, McGregor A. A near-optimal algorithm for computing the entropy of a stream. In: Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms; 2007. p. 328–35.
- 4. Cohen E, Strauss M. Maintaining time-decaying stream aggregates. In: Proceedings of the 22nd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems; 2003. p. 223–33.
- 5. Cormode G, Muthukrishnan S, Rozenbaum I. Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In: Proceedings of the 31st International Conference on Very Large Data Bases; 2005. p. 25–36.
- 6. Frahling G, Indyk P, Sohler C. Sampling in dynamic data streams and applications. In: Proceedings of the 21st Annual ACM Symposium on Computational Geometry; 2005. p. 142–49.
- 8. Gibbons P. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In: Proceedings of the 27th International Conference on Very Large Data Bases; 2001. p. 541–50.
- 9. Gibbons P, Tirthapura S. Estimating simple functions on the union of data streams. In: Proceedings of the ACM Symposium on Parallel Algorithms and Architectures; 2001. p. 281–91.
- 12. Manku GS, Rajagopalan S, Lindsay BG. Random sampling techniques for space efficient online computation of order statistics of large datasets. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 1999. p. 251–62.