Introduction to Stream Processing Algorithms

  • Reference work entry

Definitions

Data streaming focuses on estimating functions over streams, an important task in data-intensive applications. The goal is to approximate functions or statistical measures over massive (possibly distributed) streams using space that is poly-logarithmic in the stream size and/or the domain size (i.e., the number of distinct items).
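
The kind of approximation meant here is commonly formalized as an (ε, δ)-approximation. The statement below is a sketch of that standard formulation rather than a formula from this entry; the symbols σ (the stream), m (its size), n (the domain size), f (the target function), \hat{f} (the estimate), ε, and δ are notation assumed for illustration.

```latex
% Assumed notation: \sigma is a stream of m items drawn from a domain of n distinct items.
% \hat{f} is the estimate returned by the streaming algorithm for the target function f.
\Pr\!\left[\, \bigl|\hat{f}(\sigma) - f(\sigma)\bigr| > \varepsilon \, f(\sigma) \,\right] \le \delta,
\qquad
\text{space} = O\!\bigl(\operatorname{polylog}(m, n)\bigr)
```

In words: with probability at least 1 − δ the estimate is within a relative error ε of the true value, while the memory used grows only poly-logarithmically with m and n (typically with an additional dependence on 1/ε and log(1/δ)).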

Overview

The analysis of streams is relevant to many domains, including machine learning, data mining, databases, information retrieval, and network monitoring. All of these fields need to process huge amounts of data quickly and accurately. The same holds for data produced by distributed applications such as social networks or sensor networks. In these settings, real-time analysis of large streams with full-space algorithms is often infeasible. Two main approaches exist for monitoring massive data streams in real time with a small amount of resources: sampling and summaries.
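
To make the two approaches concrete, the following is a minimal Python sketch, not code from this entry. It assumes reservoir sampling as a representative sampling technique and a Count-Min-style sketch as a representative summary; the names `reservoir_sample`, `CountMinSketch`, `width`, `depth`, and `seed` are hypothetical and chosen only for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Sampling: keep a uniform random sample of k items in O(k) space."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                sample[j] = item
    return sample


class CountMinSketch:
    """Summary: a depth x width counter array for frequency estimation.

    Estimates can overcount because of hash collisions, but they never
    fall below the true frequency of an item.
    """

    _PRIME = (1 << 61) - 1  # modulus for the illustrative hash family

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]
        # One (a, b) pair per row defines h(x) = ((a*x + b) mod p) mod width.
        self.coeffs = [
            (rng.randrange(1, self._PRIME), rng.randrange(self._PRIME))
            for _ in range(depth)
        ]

    def _index(self, row, item):
        a, b = self.coeffs[row]
        return ((a * hash(item) + b) % self._PRIME) % self.width

    def update(self, item, count=1):
        # Increment one counter per row.
        for r in range(len(self.rows)):
            self.rows[r][self._index(r, item)] += count

    def estimate(self, item):
        # The minimum over rows is the least-inflated counter.
        return min(
            self.rows[r][self._index(r, item)] for r in range(len(self.rows))
        )


if __name__ == "__main__":
    stream = [i % 10 for i in range(1000)]  # each of 10 items appears 100 times
    sketch = CountMinSketch(width=64, depth=4, seed=1)
    for x in stream:
        sketch.update(x)
    print("estimated frequency of item 3:", sketch.estimate(3))
    print("sample of 5 items:", reservoir_sample(stream, 5, random.Random(42)))
```

The two structures trade accuracy for memory in different ways: the reservoir stores a few actual items and ignores the rest, whereas the sketch touches its counters for every item but stores no item explicitly. In both cases the memory footprint is fixed by the parameters (`k`, or `width` and `depth`) and does not grow with the stream length.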

Computing information...

Author information

Corresponding author

Correspondence to Nicoló Rivetti.

Copyright information

© 2019 Springer Nature Switzerland AG

About this entry

Cite this entry

Rivetti, N. (2019). Introduction to Stream Processing Algorithms. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_192
