Encyclopedia of Database Systems

2018 Edition
| Editors: Ling Liu, M. Tamer Özsu

Quantiles on Streams

  • Chiranjeeb Buragohain
  • Subhash Suri
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_290

Synonyms

Histogram; Median; Order statistics; Selection

Definition

Quantiles are order statistics of data: the φ-quantile (0 ≤ φ ≤ 1) of a set S is an element x such that φ|S| elements of S are less than or equal to x and the remaining (1 − φ)|S| are greater than x. This entry describes data stream (single-pass) algorithms for computing an approximation of such quantiles.
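The definition above can be made concrete with a small offline computation. The following is a minimal sketch, not one of the streaming algorithms this entry covers: it sorts the whole set, which is exactly the cost that single-pass methods try to avoid. The function name `exact_quantile` and the choice to round a non-integer φ|S| up to the next rank are assumptions made for illustration.

```python
import math


def exact_quantile(values, phi):
    """Return an element x of `values` whose rank is ceil(phi * |S|).

    Hypothetical helper for illustration: it stores and sorts all of S,
    so it is an offline baseline rather than a single-pass algorithm.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= phi <= 1.0:
        raise ValueError("phi must lie in [0, 1]")
    ordered = sorted(values)
    # 1-based rank of the phi-quantile. ceil() handles a non-integer
    # phi * |S| (an assumed convention), and max(1, ...) maps phi = 0
    # to the minimum element.
    rank = max(1, math.ceil(phi * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    data = [5, 1, 4, 2, 3]
    print(exact_quantile(data, 0.5))  # the median of the five values
    print(exact_quantile(data, 1.0))  # the maximum
```

With φ = 0.5 the function returns the median, and with φ = 1 the maximum. An approximate streaming algorithm would instead return an element whose rank is close to φ|S| while keeping far less than the full set in memory.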

Historical Background

Since the earliest days of data processing, there has been a need to summarize data. Large volumes of raw, unstructured data easily overwhelm the human ability to comprehend or digest them, so tools that help identify the major underlying trends or patterns in data have enormous value. Quantiles characterize distributions of real-world data sets in ways that are less sensitive to outliers than simpler alternatives such as the mean and the variance. Consequently, quantiles are of interest to both database implementers and users: for instance, they are a fundamental tool for query optimization, splitting...

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Amazon.com, Seattle, USA
  2. University of California-Santa Barbara, Santa Barbara, USA

Section editors and affiliations

  • Divesh Srivastava (1)
  1. AT&T Labs - Research, AT&T, Bedminster, USA