Fault Tolerance and High Availability in Data Stream Management Systems
Like any other software system, a data stream management system (DSMS) can experience failures of its components. Failures are especially common in distributed DSMSs, where query operators are spread across multiple processing nodes, i.e., independent processes typically running on different physical machines in a local-area network (LAN) or a wide-area network (WAN). Failures of processing nodes or of the underlying communication network can cause continuous queries (CQs) in a DSMS to stall or produce erroneous results, which in turn can adversely affect critical client applications that rely on these queries.
Traditionally, availability has been defined as the fraction of time that a system remains operational and properly services requests. In DSMSs, however, availability often also incorporates end-to-end latency, because applications need to react quickly to real-time events and can therefore tolerate only small delays. A DSMS can handle failures using a...