Definitions
Data streaming focuses on estimating functions over streams, an important task in data-intensive applications. It aims at approximating functions or statistical measures over massive (possibly distributed) streams using space that is poly-logarithmic in the size of the stream(s) and/or in the domain size (i.e., the number of distinct items).
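As an illustration of approximating a statistic in small space, the sketch below estimates the number of distinct items in a stream with a k-minimum-values estimator. It is a minimal example written for this entry, not an algorithm taken from it: the function names, the choice of BLAKE2b as the hash function, and the default k = 256 are all assumptions. Memory depends only on k, not on the stream length.

```python
import hashlib
import heapq


def _unit_hash(item) -> float:
    """Map an item to a pseudo-random value in (0, 1] (hash choice is an assumption)."""
    digest = hashlib.blake2b(repr(item).encode("utf-8"), digest_size=8).digest()
    return (int.from_bytes(digest, "big") + 1) / 2**64


def estimate_distinct(stream, k=256) -> float:
    """Estimate the number of distinct items while storing at most k hash values."""
    heap = []        # max-heap (values negated) of the k smallest distinct hashes
    members = set()  # the same hash values, for duplicate detection
    for item in stream:
        h = _unit_hash(item)
        if h in members:
            continue  # item already represented among the k smallest
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct items seen: exact count
    # With k hashes spread uniformly over (0, 1], the k-th smallest is close to k/n.
    return (k - 1) / (-heap[0])
```

For example, `estimate_distinct(i % 10000 for i in range(30000))` keeps at most 256 hash values while reading 30,000 items. Increasing k trades memory for accuracy, since the relative error shrinks roughly as 1/sqrt(k).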
Overview
Many domains are concerned with the analysis of streams, including machine learning, data mining, databases, information retrieval, and network monitoring. All of these fields must process huge amounts of data quickly and precisely. The same holds for data produced by distributed applications such as social networks or sensor networks. In these settings, real-time analysis of large streams with full-space algorithms is often not feasible. Two main approaches exist for monitoring massive data streams in real time with a small amount of resources: sampling and summaries.
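The two approaches can be sketched in a few lines each. The code below is an illustrative assumption rather than material from the entry: it uses reservoir sampling as one representative sampling technique and a Count-Min sketch as one representative summary. The names `reservoir_sample` and `CountMinSketch`, the default width and depth, and the use of salted BLAKE2b hashes are all hypothetical choices.

```python
import hashlib
import random


def reservoir_sample(stream, k, seed=0):
    """Sampling: keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)  # seeded only to make the example reproducible
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)    # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)  # uniform over 0..i, inclusive
            if j < k:
                sample[j] = item   # replace a slot with probability k / (i + 1)
    return sample


class CountMinSketch:
    """Summary: approximate per-item frequencies in a fixed depth x width table."""

    def __init__(self, width=272, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, item):
        # One salted hash per row (an assumption standing in for independent hash functions).
        digest = hashlib.blake2b(
            repr(item).encode("utf-8"), digest_size=8, salt=row.to_bytes(2, "big")
        ).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._column(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum never underestimates.
        return min(self.table[row][self._column(row, item)] for row in range(self.depth))
```

A sample retains actual stream items and can answer many kinds of queries approximately, whereas a summary such as this one stores only counters and answers one kind of query (here, item frequency) with a one-sided error.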
Computing information...
Copyright information
© 2019 Springer Nature Switzerland AG
About this entry
Cite this entry
Rivetti, N. (2019). Introduction to Stream Processing Algorithms. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_192
DOI: https://doi.org/10.1007/978-3-319-77525-8_192
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77524-1
Online ISBN: 978-3-319-77525-8
eBook Packages: Computer Science; Reference Module Computer Science and Engineering