Encyclopedia of Database Systems

2018 Edition | Editors: Ling Liu, M. Tamer Özsu

Stream Sampling

  • Bibudh Lahiri
  • Srikanta Tirthapura
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_372

Definition

Stream sampling is the process of collecting a representative sample of the elements of a data stream. The sample is usually much smaller than the entire stream, but it can be designed to retain many important characteristics of the stream and can be used to estimate many aggregates over it. Unlike sampling from a stored data set, stream sampling must be performed online, as the data arrive. Any element that is not stored in the sample is lost and cannot be retrieved later. This article discusses various methods of sampling from a data stream and applications of these methods.

Historical Background

An early algorithm for maintaining a random sample of a data stream is the reservoir sampling algorithm due to Vitter [15]. More recent algorithms based on random sampling have been inspired by the work of Alon et al. [1]. Random sampling has long been used to process data within stored databases; the reader is referred to [13] for a survey.
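
To make the online constraint concrete, the following is a minimal sketch of the basic single-pass reservoir scheme (often called Algorithm R). It is an illustration only, not the optimized variants analyzed in [15]. The function name `reservoir_sample` and its parameters are assumptions introduced here for the example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return a uniform random sample of up to k elements from one pass.

    Hypothetical helper for illustration: each element is examined once,
    in arrival order, and is either kept in the reservoir or discarded.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k elements fill the reservoir directly.
            reservoir.append(item)
        else:
            # Keep the n-th element with probability k/n; if kept, it
            # overwrites a uniformly chosen slot in the reservoir.
            j = rng.randrange(n)  # uniform integer in [0, n)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # A generator stands in for a stream that cannot be re-read.
    sample = reservoir_sample((x * x for x in range(100_000)), k=5)
    print(sample)
```

The sketch uses memory proportional to `k` regardless of the stream length, and an element that is skipped or overwritten is never seen again, which mirrors the constraint stated in the Definition.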

Foundations


Recommended Reading

  1. Alon N, Matias Y, Szegedy M. The space complexity of approximating the frequency moments. J Comput Syst Sci. 1999;58(1):137–47.
  2. Babcock B, Datar M, Motwani R. Sampling from a moving window over streaming data. In: Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms; 2002. p. 633–4.
  3. Chakrabarti A, Cormode G, McGregor A. A near-optimal algorithm for computing the entropy of a stream. In: Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms; 2007. p. 328–35.
  4. Cohen E, Strauss M. Maintaining time-decaying stream aggregates. In: Proceedings of the 22nd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems; 2003. p. 223–33.
  5. Cormode G, Muthukrishnan S, Rozenbaum I. Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In: Proceedings of the 31st International Conference on Very Large Data Bases; 2005. p. 25–36.
  6. Frahling G, Indyk P, Sohler C. Sampling in dynamic data streams and applications. In: Proceedings of the 21st Annual ACM Symposium on Computational Geometry; 2005. p. 142–49.
  7. Ganguly S. Counting distinct items over update streams. Theor Comput Sci. 2007;378(3):211–22.
  8. Gibbons P. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In: Proceedings of the 27th International Conference on Very Large Data Bases; 2001. p. 541–50.
  9. Gibbons P, Tirthapura S. Estimating simple functions on the union of data streams. In: Proceedings of the ACM Symposium on Parallel Algorithms and Architectures; 2001. p. 281–91.
  10. Gibbons P, Tirthapura S. Distributed streams algorithms for sliding windows. Theor Comput Syst. 2004;37(3):457–78.
  11. Manku GS, Motwani R. Approximate frequency counts over data streams. In: Proceedings of the 28th International Conference on Very Large Data Bases; 2002. p. 346–57.
  12. Manku GS, Rajagopalan S, Lindsay BG. Random sampling techniques for space efficient online computation of order statistics of large datasets. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 1999. p. 251–62.
  13. Olken F, Rotem D. Random sampling from databases – a survey. Stat Comput. 1995;5(1):43–57.
  14. Pavan A, Tirthapura S. Range-efficient counting of distinct elements in a massive data stream. SIAM J Comput. 2007;37(2):359–79.
  15. Vitter JS. Random sampling with a reservoir. ACM Trans Math Softw. 1985;11(1):37–57.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Iowa State University, Ames, USA

Section editors and affiliations

  • Divesh Srivastava (1)
  1. AT&T Labs - Research, AT&T, Bedminster, USA