Definition
Just like any other software system, a data stream management system (DSMS) can experience failures of its different components. Failures are especially common in distributed DSMSs, where query operators are spread across multiple processing nodes, i.e., independent processes typically running on different physical machines in a local-area network (LAN) or in a wide-area network (WAN). Failures of processing nodes or failures in the underlying communication network can cause continuous queries (CQs) in a DSMS to stall or produce erroneous results. These failures can adversely affect critical client applications relying on these queries.
Traditionally, availability has been defined as the fraction of time that a system remains operational and properly services requests. In DSMSs, however, availability often also incorporates end-to-end latencies as applications need to quickly react to real-time events and thus can tolerate only small delays. A DSMS can handle failures using a...
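The two notions of availability described above can be contrasted with a small sketch. It is illustrative only: the `Interval` record, the function names, and the latency bound are assumptions introduced here, not definitions taken from the entry, and a real DSMS would measure latency per result tuple rather than per observation interval.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Interval:
    """One observation window of a running query (hypothetical record)."""
    duration_s: float   # length of the window in seconds
    operational: bool   # did the system service requests in this window?
    latency_s: float    # end-to-end latency observed in this window


def classic_availability(intervals: List[Interval]) -> float:
    """Fraction of time the system was operational."""
    total = sum(i.duration_s for i in intervals)
    up = sum(i.duration_s for i in intervals if i.operational)
    return up / total if total else 0.0


def latency_aware_availability(intervals: List[Interval],
                               max_latency_s: float) -> float:
    """Fraction of time the system was operational AND met the latency bound."""
    total = sum(i.duration_s for i in intervals)
    ok = sum(i.duration_s for i in intervals
             if i.operational and i.latency_s <= max_latency_s)
    return ok / total if total else 0.0


# Example trace: 60 s healthy, 30 s up but slow, 10 s down.
trace = [
    Interval(60.0, True, 0.1),
    Interval(30.0, True, 5.0),
    Interval(10.0, False, 0.0),
]
classic = classic_availability(trace)
strict = latency_aware_availability(trace, max_latency_s=1.0)
```

In this assumed trace the system is up for 90 of 100 seconds, but only 60 of those seconds meet a 1-second latency bound, so the latency-aware measure is noticeably lower than the classic one.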
Recommended Reading
Balazinska M. Fault-tolerance and load management in a distributed stream processing system. Ph.D. thesis, Massachusetts Institute of Technology; 2006.
Balazinska M, Balakrishnan H, Madden S, Stonebraker M. Fault-tolerance in the Borealis distributed stream processing system. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2005. p. 13–24.
Brewer EA. Lessons from giant-scale services. IEEE Internet Comput. 2001;5(4):46–55.
Elnozahy ENM, Alvisi L, Wang YM, Johnson DB. A survey of rollback-recovery protocols in message-passing systems. ACM Comput Surv. 2002;34(3):375–408.
Gray J. Why do computers stop and what can be done about it? Technical Report 85.7, Tandem Computers; 1985.
Gray J, Helland P, O’Neil P, Shasha D. The dangers of replication and a solution. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 1996. p. 173–82.
Hwang JH, Balazinska M, Rasin A, Çetintemel U, Stonebraker M, Zdonik S. High-availability algorithms for distributed stream processing. In: Proceedings of the 21st International Conference on Data Engineering; 2005. p. 779–90.
Hwang JH, Xing Y, Çetintemel U, Zdonik S. A cooperative, self-configuring high-availability solution for stream processing. In: Proceedings of the 23rd International Conference on Data Engineering; 2007. p. 176–85.
Kawell L, Beckhardt S, Halvorsen T, Ozzie R, Greif I. Replicated document management in a group communication system. In: Proceedings of the ACM Conference on Computer-Supported Cooperative Work; 1988.
Schiper A, Toueg S. From set membership to group membership: a separation of concerns. IEEE Trans Dependable Secure Comput. 2006;3(1):2–12.
Schneider FB. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Comput Surv. 1990;22(4):299–319.
Schneider FB. What good are models and what models are good? In: Distributed systems. 2nd ed. ACM/Addison-Wesley Publishing; 1993. p. 17–26.
Shah MA. Flux: a mechanism for building robust, scalable dataflows. Ph.D. thesis, University of California, Berkeley; 2004.
Shah M, Hellerstein J, Brewer E. Highly-available, fault-tolerant, parallel dataflows. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2004. p. 827–38.
Terry DB, Theimer M, Petersen K, Demers AJ, Spreitzer M, Hauser C. Managing update conflicts in Bayou, a weakly connected replicated storage system. In: Proceedings of the 15th ACM Symposium on Operating System Principles; 1995. p. 172–83.
© 2018 Springer Science+Business Media, LLC, part of Springer Nature
About this entry
Cite this entry
Balazinska, M., Hwang, JH., Shah, M.A. (2018). Fault Tolerance and High Availability in Data Stream Management Systems. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_160
DOI: https://doi.org/10.1007/978-1-4614-8265-9_160
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering