Encyclopedia of Database Systems

2018 Edition
| Editors: Ling Liu, M. Tamer Özsu

Fault Tolerance and High Availability in Data Stream Management Systems

  • Magdalena Balazinska
  • Jeong-Hyon Hwang
  • Mehul A. Shah
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_160


Like any other software system, a data stream management system (DSMS) can experience failures of its components. Failures are especially common in distributed DSMSs, where query operators are spread across multiple processing nodes, i.e., independent processes typically running on different physical machines in a local-area network (LAN) or a wide-area network (WAN). Failures of processing nodes or of the underlying communication network can cause continuous queries (CQs) in a DSMS to stall or produce erroneous results. These failures can adversely affect critical client applications that rely on these queries.

Traditionally, availability has been defined as the fraction of time that a system remains operational and properly services requests. In DSMSs, however, availability often also incorporates end-to-end latencies, because applications need to react quickly to real-time events and can thus tolerate only small delays. A DSMS can handle failures using a...
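
The two notions of availability described above can be made concrete with a small sketch. The Python below is illustrative only: the function names, the list of per-result latencies, and the 0.5-second bound are assumptions introduced here, not taken from the entry. It computes the traditional time-based fraction and one possible latency-aware variant, in which a result counts as available only if its end-to-end latency stays within a chosen bound.

```python
from typing import Sequence


def time_availability(uptime_s: float, total_s: float) -> float:
    """Fraction of the observation window during which the system was operational."""
    if total_s <= 0:
        raise ValueError("total_s must be positive")
    return uptime_s / total_s


def latency_aware_availability(latencies_s: Sequence[float], bound_s: float) -> float:
    """Fraction of results whose end-to-end latency is within bound_s.

    bound_s is a hypothetical, application-chosen latency bound.
    """
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    on_time = sum(1 for latency in latencies_s if latency <= bound_s)
    return on_time / len(latencies_s)


# Hypothetical numbers: a one-hour window with 36 seconds of downtime.
print(time_availability(3564.0, 3600.0))  # 0.99

# Hypothetical latencies (seconds): one result arrives far too late.
print(latency_aware_availability([0.1, 0.2, 3.0, 0.4], bound_s=0.5))  # 0.75
```

Under the second measure, a system that never goes down can still score poorly if failures or recovery delay its results beyond what the application tolerates.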



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Magdalena Balazinska (1)
  • Jeong-Hyon Hwang (2)
  • Mehul A. Shah (3)

  1. University of Washington, Seattle, USA
  2. Department of Computer Science, University at Albany – State University of New York, Albany, USA
  3. Amazon Web Services (AWS), Seattle, USA

Section editors and affiliations

  • Ugur Cetintemel (1)

  1. Department of Computer Science, Brown University, Providence, USA