A Gossip-Style Failure Detection Service
Failure detection is valuable for system management, replication, load balancing, and other distributed services. To date, failure detection services have scaled poorly with the number of members being monitored. This paper describes a new gossip-based protocol that scales well and provides timely detection. We analyze the protocol, and then extend it to discover and leverage the underlying network topology for much improved resource utilization. We then combine it with another protocol, based on broadcast, that is used to handle partition failures.
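The abstract does not spell out the mechanism, so the following is only a minimal sketch of one common way to build a gossip-based failure detector, assuming each member keeps a table of heartbeat counters and periodically sends it to one randomly chosen peer. The names `Member`, `Entry`, `t_fail`, and `gossip_round` are hypothetical and chosen for illustration; the topology-aware extension and the broadcast-based partition handling mentioned above are not modeled.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Entry:
    heartbeat: int      # highest heartbeat counter seen for this member
    last_update: float  # local time at which that counter last increased


@dataclass
class Member:
    name: str
    t_fail: float  # assumed timeout: longer silence means "suspected"
    table: dict = field(default_factory=dict)

    def __post_init__(self):
        # Every member starts out knowing only about itself.
        self.table[self.name] = Entry(0, 0.0)

    def tick(self, now):
        """Increment our own heartbeat counter before gossiping."""
        own = self.table[self.name]
        own.heartbeat += 1
        own.last_update = now

    def snapshot(self):
        """The gossip message: member name -> heartbeat counter."""
        return {name: e.heartbeat for name, e in self.table.items()}

    def merge(self, remote, now):
        """Adopt any counter that is strictly higher than the local one."""
        for name, hb in remote.items():
            local = self.table.get(name)
            if local is None or hb > local.heartbeat:
                self.table[name] = Entry(hb, now)

    def suspected(self, now):
        """Members whose counter has not increased within t_fail."""
        return sorted(
            name for name, e in self.table.items()
            if name != self.name and now - e.last_update > self.t_fail
        )


def gossip_round(members, alive, now, rng):
    """One round: each live member gossips to one random other member.

    Assumes at least two members; messages to crashed members are lost.
    """
    for m in members:
        if m.name not in alive:
            continue
        m.tick(now)
        peer = rng.choice([p for p in members if p is not m])
        if peer.name in alive:
            peer.merge(m.snapshot(), now)
```

In this sketch a member is suspected purely from local observation: stale gossip that repeats an old counter does not reset the timer, so only a fresh heartbeat from the member itself, relayed directly or through others, clears the suspicion. A driver would call `gossip_round(members, alive, now, random.Random())` once per gossip interval and poll `suspected(now)` on each member.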