Skip to main content

Aggregate Computation over Data Streams

  • Conference paper
Progress in WWW Research and Development (APWeb 2008)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 4976))

Included in the following conference series:

Abstract

Nowadays, we have witnessed the widely recognized phenomenon of high speed data streams. Various statistics computation over data streams is often required by many applications, including processing of relational type queries, data mining and high speed network management. In this paper, we provide survey for three important kinds of aggregate computations over data streams: frequency moment, frequency count and order statistic.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Aduri, P., Tirthapura, S.: Range efficient computation of f\(_{\mbox{0}}\) over massive data streams. In: ICDE, pp. 32–43 (2005)

    Google Scholar 

  2. Ahmad, Y., Berg, B., Çetintemel, U., Humphrey, M., Hwang, J.-H., Jhingran, A., Maskey, A., Papaemmanouil, O., Rasin, A., Tatbul, N., Xing, W., Xing, Y., Zdonik, S.B.: Distributed operation in the borealis stream processing engine. In: SIGMOD, pp. 882–884 (2005)

    Google Scholar 

  3. Ajtai, M., Jayram, T.S., Kumar, R., Sivakumar, D.: Approximate counting of inversions in a data stream. In: STOC, pp. 370–379 (2002)

    Google Scholar 

  4. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating the frequency moments. In: STOCK, pp. 20–29 (1996)

    Google Scholar 

  5. Arasu, A., Manku, G.S.: Approximate counts and quantiles over sliding windows. In: PODS, pp. 286–296 (2004)

    Google Scholar 

  6. Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues in data stream systems. In: PODS (2002)

    Google Scholar 

  7. Babcock, B., Olston, C.: Distributed top-k monitoring. In: SIGMOD, pp. 28–39 (2003)

    Google Scholar 

  8. Bandi, N., Agrawal, D., Abbadi, A.E.: Fast algorithms for heavy distinct hitters using associative memories. In: IEEE International Conference on Distributed Computing Systems(ICDCS), p. 6 (2007)

    Google Scholar 

  9. Bandi, N., Metwally, A., Agrawal, D., Abbadi, A.E.: Fast data stream algorithms using associative memories. In: SIGMOD, pp. 247–256 (2007)

    Google Scholar 

  10. Bar-Yossef, Z., Jayram, T.S., Kumar, R., Sivakumar, D., Trevisan, L.: Counting distinct elements in a data stream. In: Randomization and Approximation Techniques, 6th International Workshop, RANDOM, pp. 1–10 (2002)

    Google Scholar 

  11. Bar-Yossef, Z., Kumar, R., Sivakumar, D.: Reductions in streaming algorithms, with an application to counting triangles in graphs. In: SODA, pp. 623–632 (2002)

    Google Scholar 

  12. Bawa, M., Molina, H.G., Gionis, A., Motwani, R.: Estimating aggregates on a peer-to-peer network. Technical report, Stanford University (2003)

    Google Scholar 

  13. Buriol, L.S., Frahling, G., Leonardi, S., Marchetti-Spaccamela, A., Sohler, C.: Counting triangles in data streams. In: PODS, pp. 253–262 (2006)

    Google Scholar 

  14. Carney, D., Çetintemel, U., Cherniack, M., Convey, C., Lee, S., Seidman, G., Stonebraker, M., Tatbul, N., Zdonik, S.B.: Monitoring streams - a new class of data management applications. In: Bressan, S., Chaudhri, A.B., Li Lee, M., Yu, J.X., Lacroix, Z. (eds.) CAiSE 2002 and VLDB 2002. LNCS, vol. 2590, pp. 215–226. Springer, Heidelberg (2003)

    Google Scholar 

  15. Chang, Y.-C., Bergman, L.D., Castelli, V., Li, C.-S., Lo, M.-L., Smith, J.R.: The onion technique: Indexing for linear optimization queries. In: SIGMOD, pp. 391–402 (2000)

    Google Scholar 

  16. Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: Widmayer, P., Triguero, F., Morales, R., Hennessy, M., Eidenbenz, S., Conejo, R. (eds.) ICALP 2002. LNCS, vol. 2380, pp. 693–703. Springer, Heidelberg (2002)

    Chapter  Google Scholar 

  17. Chen, J., DeWitt, D.J., Tian, F., Wang, Y.: Niagaracq: A scalable continuous query system for internet databases. In: SIGMOD, pp. 379–390 (2000)

    Google Scholar 

  18. Cohen, E.: Size-estimation framework with applications to transitive closure and reachability. J. Comput. Syst. Sci. 55(3), 441–453 (1997)

    Article  MATH  Google Scholar 

  19. Cohen, S., Matias, Y.: Spectral bloom filters. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 241–252 (2003)

    Google Scholar 

  20. Considine, J., Li, F., Kollios, G., Byers, J.W.: Approximate aggregation techniques for sensor databases. In: ICDE, pp. 449–460 (2004)

    Google Scholar 

  21. Coppersmith, D., Kumar, R.: An improved data stream algorithm for frequency moments. In: SODA, pp. 151–156 (2004)

    Google Scholar 

  22. Cormode, G., Garofalakis, M.N.: Sketching streams through the net: Distributed approximate query tracking. In: VLDB, pp. 13–24 (2005)

    Google Scholar 

  23. Cormode, G., Garofalakis, M.N., Muthukrishnan, S., Rastogi, R.: Holistic aggregates in a networked world: Distributed tracking of approximate quantiles. In: SIGMOD, pp. 25–36 (2005)

    Google Scholar 

  24. Cormode, G., Korn, F., Muthukrishnan, S., Srivastava, D.: Finding hierarchical heavy hitters in data streams. In: VLDB, pp. 464–475 (2003)

    Google Scholar 

  25. Cormode, G., Korn, F., Muthukrishnan, S., Srivastava, D.: Diamond in the rough: Finding hierarchical heavy hitters in multi-dimensional data. In: SIGMOD, pp. 155–166 (2004)

    Google Scholar 

  26. Cormode, G., Korn, F., Muthukrishnan, S., Srivastava, D.: Effective computation of biased quantiles over data streams. In: ICDE, pp. 20–31 (2005)

    Google Scholar 

  27. Cormode, G., Korn, F., Muthukrishnan, S., Srivastava, D.: Space- and time-efficient deterministic algorithms for biased quantiles over data streams. In: PODS, pp. 263–272 (2006)

    Google Scholar 

  28. Cormode, G., Muthukrishnan, S.: What’s hot and what’s not: tracking most frequent items dynamically. In: PODS, pp. 296–306 (2003)

    Google Scholar 

  29. Cormode, G., Muthukrishnan, S.: An improved data stream summary: The count-min sketch and its applications. In: Farach-Colton, M. (ed.) LATIN 2004. LNCS, vol. 2976, pp. 29–38. Springer, Heidelberg (2004)

    Google Scholar 

  30. Cormode, G., Muthukrishnan, S.: Space efficient mining of multigraph streams. In: PODS, pp. 271–282 (2005)

    Google Scholar 

  31. Cormode, G., Muthukrishnan, S., Zhuang, W.: What’s different: Distributed, continuous monitoring of duplicate-resilient aggregates on data streams. In: ICDE, p. 57 (2006)

    Google Scholar 

  32. Cranor, C.D., Johnson, T., Spatscheck, O., Shkapenyuk, V.: Gigascope: A stream database for network applications. In: SIGMOD, pp. 647–651 (2003)

    Google Scholar 

  33. Das, G., Gunoplulos, D., Koudas, N., Sarkas, N.: Ad-hoc top-k query answering for data streams. In: VLDB (2007)

    Google Scholar 

  34. Das, G., Gunopulos, D., Koudas, N., Tsirogiannis, D.: Answering top-k queries using views. In: VLDB, pp. 451–462 (2006)

    Google Scholar 

  35. Datar, M., Gionis, A., Indyk, P., Motwani, R.: Maintaining stream statistics over sliding windows (extended abstract). In: SODA, pp. 635–644 (2002)

    Google Scholar 

  36. Demaine, E.D., López-Ortiz, A., Munro, J.I.: Frequency estimation of internet packet streams with limited space. In: Möhring, R.H., Raman, R. (eds.) ESA 2002. LNCS, vol. 2461, pp. 348–360. Springer, Heidelberg (2002)

    Chapter  Google Scholar 

  37. Durand, M., Flajolet, P.: Loglog counting of large cardinalities (extended abstract). In: Di Battista, G., Zwick, U. (eds.) ESA 2003. LNCS, vol. 2832, pp. 605–617. Springer, Heidelberg (2003)

    Google Scholar 

  38. Dwork, C., Kumar, R., Naor, M., Sivakumar, D.: Rank aggregation methods for the web. In: WWW, pp. 613–622 (2001)

    Google Scholar 

  39. Estan, C., Varghese, G.: New directions in traffic measurement and accounting. In: Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communications(SIGCOMM) (2002)

    Google Scholar 

  40. Estan, C., Varghese, G.: New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst. 21(3), 270–313 (2003)

    Article  Google Scholar 

  41. Estan, C., Varghese, G., Fisk, M.: Bitmap algorithms for counting active flows on high speed links. In: ACM SIGCOMM Conference on Internet Measurement, pp. 153–166 (2003)

    Google Scholar 

  42. Fagin, R.: Combining fuzzy information from multiple systems. In: PODS, pp. 216–226 (1996)

    Google Scholar 

  43. Fagin, R.: Fuzzy queries in multimedia database systems. In: PODS, pp. 1–10 (1998)

    Google Scholar 

  44. Fagin, R.: Combining fuzzy information from multiple systems. J. Comput. Syst. Sci. 58(1), 83–99 (1999)

    Article  MATH  MathSciNet  Google Scholar 

  45. Fagin, R., Lotem, A., Naor, M.: Optimal aggregation algorithms for middleware. In: PODS (2001)

    Google Scholar 

  46. Flajolet, P., Martin, G.N.: Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci. 31(2), 182–209 (1985)

    Article  MATH  MathSciNet  Google Scholar 

  47. Ganguly, S., Cormode, G.: On Estimating Frequency Moments of Data Streams. In: Charikar, M., Jansen, K., Reingold, O., Rolim, J.D.P. (eds.) RANDOM 2007 and APPROX 2007. LNCS, vol. 4627, pp. 479–493. Springer, Heidelberg (2007)

    Chapter  Google Scholar 

  48. Gibbons, P.B.: Distinct sampling for highly-accurate answers to distinct values queries and event reports. In: VLDB, pp. 541–550 (2001)

    Google Scholar 

  49. Gibbons, P.B., Tirthapura, S.: Estimating simple functions on the union of data streams. In: SPAA, pp. 281–291 (2001)

    Google Scholar 

  50. Gibbons, P.B., Tirthapura, S.: Distributed streams algorithms for sliding windows. In: SPAA, pp. 63–72 (2002)

    Google Scholar 

  51. Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.: How to summarize the universe: Dynamic maintenance of quantiles. In: VLDB, pp. 454–465 (2002)

    Google Scholar 

  52. Golab, L., DeHaan, D., Demaine, E.D., López-Ortiz, A., Munro, J.I.: Identifying frequent items in sliding windows over on-line packet streams. In: ACM SIGCOMM Conference on Internet Measurement, pp. 173–178 (2003)

    Google Scholar 

  53. Govindaraju, N.K., Raghuvanshi, N., Manocha, D.: Fast and approximate stream mining of quantiles and frequencies using graphics processors. In: SIGMOD, pp. 611–622 (2005)

    Google Scholar 

  54. Greenwald, M., Khanna, S.: Space-efficient online computation of quantile summaries. In: SIGMOD, pp. 58–66 (2001)

    Google Scholar 

  55. Greenwald, M., Khanna, S.: Power-conserving computation of order-statistics over sensor networks. In: PODS, pp. 275–285 (2004)

    Google Scholar 

  56. Guha, S., McGregor, A.: Approximate quantiles and the order of the stream. In: PODS, pp. 273–279 (2006)

    Google Scholar 

  57. Gupta, A., Zane, F.: Counting inversions in lists. In: SODA, pp. 253–254 (2003)

    Google Scholar 

  58. Hadjieleftheriou, M., Byers, J.W., Kollios, G.: Robust sketching and aggregation of distributed data streams. Technical report. Boston University (2005)

    Google Scholar 

  59. Hellerstein, J.M., Franklin, M.J., Chandrasekaran, S., Deshpande, A., Hildrum, K., Madden, S., Raman, V., Shah, M.A.: Adaptive query processing: Technology in evolution. IEEE Data Eng. Bull. 23(2), 7–18 (2000)

    Google Scholar 

  60. Hershberger, J., Shrivastava, N., Suri, S., Tóth, C.D.: Space complexity of hierarchical heavy hitters in multi-dimensional data streams. In: PODS, pp. 338–347 (2005)

    Google Scholar 

  61. Hristidis, V., Koudas, N., Papakonstantinou, Y.: Prefer: A system for the efficient execution of multi-parametric ranked queries. In: SIGMOD, pp. 259–270 (2001)

    Google Scholar 

  62. Indyk, P., Woodruff, D.P.: Optimal approximations of the frequency moments of data streams. In: STOCK, pp. 202–208 (2005)

    Google Scholar 

  63. Jin, C., Qian, W., Sha, C., Yu, J.X., Zhou, A.: Dynamically maintaining frequent items over a data stream. In: CIKM, pp. 287–294 (2003)

    Google Scholar 

  64. Jin, W., Ester, M., Han, J.: Efficient processing of ranked queries with sweeping selection. In: Jorge, A.M., Torgo, L., Brazdil, P.B., Camacho, R., Gama, J. (eds.) PKDD 2005. LNCS (LNAI), vol. 3721, pp. 527–535. Springer, Heidelberg (2005)

    Chapter  Google Scholar 

  65. Karp, R.M., Shenker, S., Papadimitriou, C.H.: A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. 28, 51–55 (2003)

    Article  Google Scholar 

  66. Keralapura, R., Cormode, G., Ramamirtham, J.: Communication-efficient distributed monitoring of thresholded counts. In: SIGMOD, pp. 289–300 (2006)

    Google Scholar 

  67. Korn, F., Muthukrishnan, S., Srivastava, D.: Reverse nearest neighbor aggregates over data streams. In: Bressan, S., Chaudhri, A.B., Li Lee, M., Yu, J.X., Lacroix, Z. (eds.) CAiSE 2002 and VLDB 2002. LNCS, vol. 2590, pp. 814–825. Springer, Heidelberg (2003)

    Google Scholar 

  68. Lee, L.K., Ting, H.F.: A simpler and more efficient deterministic scheme for finding frequent items over sliding windows. In: PODS, pp. 290–297 (2006)

    Google Scholar 

  69. Lin, X., Lu, H., Xu, J., Yu, J.X.: Continuously maintaining quantile summaries of the most recent n elements over a data stream. In: ICDE, pp. 362–374 (2004)

    Google Scholar 

  70. Lin, X., Xu, J., Zhang, Q., Lu, H., Yu, J.X., Zhou, X., Yuan, Y.: Approximate processing of massive continuous quantile queries over high-speed data streams. IEEE Trans. Knowl. Data Eng. 18(5), 683–698 (2006)

    Article  Google Scholar 

  71. Manganelli, S., Engle, R.: Value at risk models in finance. In: European Central Bank Working Paper Series No. 75 (2001)

    Google Scholar 

  72. Manjhi, A., Nath, S., Gibbons, P.B.: Tributaries and deltas: Efficient and robust aggregation in sensor network streams. In: SIGMOD, pp. 287–298 (2005)

    Google Scholar 

  73. Manjhi, A., Shkapenyuk, V., Dhamdhere, K., Olston, C.: Finding (recently) frequent items in distributed data streams. In: ICDE, pp. 767–778 (2005)

    Google Scholar 

  74. Manku, G.S., Motwani, R.: Approximate frequency counts over data streams. In: Bressan, S., Chaudhri, A.B., Li Lee, M., Yu, J.X., Lacroix, Z. (eds.) CAiSE 2002 and VLDB 2002. LNCS, vol. 2590, pp. 346–357. Springer, Heidelberg (2003)

    Google Scholar 

  75. Manku, G.S., Rajagopalan, S., Lindsay, B.G.: Approximate medians and other quantiles in one pass and with limited memory. In: SIGMOD, pp. 426–435 (1998)

    Google Scholar 

  76. Manku, G.S., Rajagopalan, S., Lindsay, B.G.: Random sampling techniques for space efficient online computation of order statistics of large datasets. In: SIGMOD, pp. 251–262 (1999)

    Google Scholar 

  77. Metwally, A., Agrawal, D., Abbadi, A.E.: Efficient computation of frequent and top-k elements in data streams. In: Eiter, T., Libkin, L. (eds.) ICDT 2005. LNCS, vol. 3363, pp. 398–412. Springer, Heidelberg (2004)

    Google Scholar 

  78. Misra, J., Gries, D.: Finding repeated elements. Sci. Comput. Program. 2(2), 143–152 (1982)

    Article  MATH  MathSciNet  Google Scholar 

  79. Mouratidis, K., Bakiras, S., Papadias, D.: Continuous monitoring of top-k queries over sliding windows. In: SIGMOD, pp. 635–646 (2006)

    Google Scholar 

  80. Munro, J.I., Paterson, M.: Selection and sorting with limited storage. Theor. Comput. Sci. 12, 315–323 (1980)

    Article  MATH  MathSciNet  Google Scholar 

  81. Muthukrishnan, S.: Data streams: algorithms and applications. In: SODA, pp. 413–413 (2003)

    Google Scholar 

  82. Nath, S., Gibbons, P.B., Seshan, S., Anderson, Z.R.: Synopsis diffusion for robust aggregation in sensor networks. In: SenSys, pp. 250–262 (2004)

    Google Scholar 

  83. Papadias, D., Tao, Y., Fu, G., Seeger, B.: Progressive skyline computation in database systems. ACM Trans. Database Syst. 30(1), 41–82 (2005)

    Article  Google Scholar 

  84. Poosala, V., Ioannidis, Y.E.: Estimation of query-result distribution and its application in parallel-join load balancing. In: VLDB, pp. 448–459 (1996)

    Google Scholar 

  85. Shrivastava, N., Buragohain, C., Agrawal, D., Suri, S.: Medians and beyond: new aggregation techniques for sensor networks. In: SenSys, pp. 239–249 (2004)

    Google Scholar 

  86. STREAM stream data manager, http://www-db.stanford.edu/stream/sqr

  87. Tao, Y., Hadjieleftheriou, M.: Processing ranked queries with the minimum space. In: Dix, J., Hegner, S.J. (eds.) FoIKS 2006. LNCS, vol. 3861, pp. 294–312. Springer, Heidelberg (2006)

    Chapter  Google Scholar 

  88. Tao, Y., Hristidis, V., Papadias, D., Papakonstantinou, Y.: Branch-and-bound processing of ranked queries. Inf. Syst. 32(3), 424–445 (2007)

    Article  Google Scholar 

  89. Tao, Y., Xiao, X., Pei, J.: Efficient skyline and top-k retrieval in subspaces. IEEE Trans. Knowl. Data Eng (to appear, 2007)

    Google Scholar 

  90. Tsaparas, P., Palpanas, T., Kotidis, Y., Koudas, N., Srivastava, D.: Ranked join indices. In: ICDE, pp. 277–288 (2003)

    Google Scholar 

  91. Venkataraman, S., Song, D.X., Gibbons, P.B., Blum, A.: New streaming algorithms for fast detection of superspreaders. In: NDSS (2005)

    Google Scholar 

  92. Whang, K.-Y., Zanden, B.T.V., Taylor, H.M.: A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Syst. 15(2), 208–229 (1990)

    Article  Google Scholar 

  93. Xin, D., Chen, C., Han, J.: Towards robust indexing for ranked queries. In: VLDB, pp. 235–246 (2006)

    Google Scholar 

  94. Yao, Y., Gehrke, J.: The cougar approach to in-network query processing in sensor networks. SIGMOD Record 31(3), 9–18 (2002)

    Article  Google Scholar 

  95. Yi, K., Yu, H., Yang, J., Xia, G., Chen, Y.: Efficient maintenance of materialized top-k views. In: ICDE, pp. 189–200 (2003)

    Google Scholar 

  96. Zhang, Y., Lin, X., Xu, J., Korn, F., Wang, W.: Space-efficient relative error order sketch over data streams. In ICDE, page 51 (2006)

    Google Scholar 

  97. Zhang, Y., Lin, X., Yuan, Y., Kitsuregawa, M., Zhou, X., Yu, J.X.: Summarizing order statistics over data streams with duplicates. In: ICDE, pp. 1329–1333 (2007)

    Google Scholar 

  98. Zhu, Y., Shasha, D.: Statstream: Statistical monitoring of thousands of data streams in real time. In: Bressan, S., Chaudhri, A.B., Li Lee, M., Yu, J.X., Lacroix, Z. (eds.) CAiSE 2002 and VLDB 2002. LNCS, vol. 2590, pp. 358–369. Springer, Heidelberg (2003)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Yanchun Zhang Ge Yu Elisa Bertino Guandong Xu

Rights and permissions

Reprints and permissions

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lin, X., Zhang, Y. (2008). Aggregate Computation over Data Streams. In: Zhang, Y., Yu, G., Bertino, E., Xu, G. (eds) Progress in WWW Research and Development. APWeb 2008. Lecture Notes in Computer Science, vol 4976. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78849-2_3

Download citation

  • DOI: https://doi.org/10.1007/978-3-540-78849-2_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78848-5

  • Online ISBN: 978-3-540-78849-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics