An Improved Data Stream Summary: The Count-Min Sketch and Its Applications

  • Graham Cormode
  • S. Muthukrishnan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2976)

Abstract

We introduce a new sublinear-space data structure, the Count-Min (CM) sketch, for summarizing data streams. The sketch allows fundamental queries in data stream summarization, such as point, range, and inner product queries, to be answered approximately and very quickly. It can also be applied to several important data stream problems, such as finding quantiles and frequent items. The time and space bounds we show for solving these problems with the CM sketch significantly improve on those previously known, typically reducing a factor of 1/ε² to 1/ε.
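The abstract does not spell out the construction, so the following is only a minimal sketch of the commonly described form of the structure: a table of d = ⌈ln(1/δ)⌉ rows and w = ⌈e/ε⌉ counters per row, one hash function per row, and a point query that returns the minimum counter over the rows. The class name, method names, the choice of hash family (a·x + b mod p with a Mersenne prime), and the restriction to non-negative integer items with non-negative counts are assumptions made for illustration, not details taken from this page.

```python
import math
import random


class CountMinSketch:
    """Illustrative Count-Min sketch for non-negative integer items (assumed API)."""

    PRIME = (1 << 61) - 1  # Mersenne prime used by the assumed hash family

    def __init__(self, epsilon, delta, seed=0):
        # Width controls the additive error (about epsilon * total count);
        # depth controls the failure probability (about delta).
        self.width = math.ceil(math.e / epsilon)
        self.depth = max(1, math.ceil(math.log(1 / delta)))
        rng = random.Random(seed)
        # One (a, b) pair per row defines h(x) = ((a*x + b) mod p) mod width.
        self.hashes = [
            (rng.randrange(1, self.PRIME), rng.randrange(0, self.PRIME))
            for _ in range(self.depth)
        ]
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0

    def _index(self, row, item):
        a, b = self.hashes[row]
        return ((a * item + b) % self.PRIME) % self.width

    def update(self, item, count=1):
        """Add `count` (assumed non-negative) to the item's counter in every row."""
        self.total += count
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def point_query(self, item):
        """Estimate the item's count as the minimum over its counters."""
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )


if __name__ == "__main__":
    sketch = CountMinSketch(epsilon=0.01, delta=0.01, seed=1)
    for i in range(1000):
        sketch.update(i % 50)  # each of 50 items arrives 20 times
    print(sketch.width, sketch.depth, sketch.point_query(7))
```

With non-negative updates, collisions can only add to a counter, so the estimate never falls below the true count; the width of order 1/ε, rather than 1/ε², is where the improvement mentioned in the abstract shows up in the space used.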

Keywords

Data Stream; Range Query; Point Query; Space Bound; Heavy Hitter
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Graham Cormode (1)
  • S. Muthukrishnan (2)
  1. Center for Discrete Mathematics and Computer Science (DIMACS), Rutgers University, Piscataway
  2. Division of Computer and Information Systems, Rutgers University