A Gossip-Style Failure Detection Service

  • Robbert van Renesse
  • Yaron Minsky
  • Mark Hayden

Abstract

Failure detection is valuable for system management, replication, load balancing, and other distributed services. To date, failure detection services have scaled poorly with the number of members being monitored. This paper describes a new gossip-based protocol that scales well and provides timely detection. We analyze the protocol, then extend it to discover and exploit the underlying network topology for much better resource utilization. Finally, we combine it with a second, broadcast-based protocol that handles partition failures.
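The abstract does not spell out the mechanics, so the following is only a minimal sketch of one common way to build gossip-based failure detection, not the authors' implementation. It assumes each member keeps a table of heartbeat counters, periodically increments its own counter and sends the table to one randomly chosen peer, merges received tables by keeping the larger counter, and suspects any member whose counter has not increased for more than `t_fail` time units. The names `Member`, `simulate`, and `t_fail`, the round-based clock, and all parameter values are illustrative assumptions. The topology-aware extension and the broadcast protocol for partitions are not modeled.

```python
import random


class Member:
    """One participant in a hypothetical gossip-style failure detector."""

    def __init__(self, name, t_fail):
        self.name = name
        self.t_fail = t_fail  # assumed timeout, in simulated time units
        # name -> (heartbeat counter, local time the counter last increased)
        self.table = {name: (0, 0)}

    def tick(self, now):
        """Increment our own heartbeat counter."""
        hb, _ = self.table[self.name]
        self.table[self.name] = (hb + 1, now)

    def merge(self, remote_table, now):
        """Adopt any strictly larger heartbeat counter from a gossip message."""
        for name, (hb, _) in remote_table.items():
            local = self.table.get(name)
            if local is None or hb > local[0]:
                self.table[name] = (hb, now)

    def suspected(self, now):
        """Members whose counter has not increased for more than t_fail."""
        return sorted(
            name
            for name, (_, seen) in self.table.items()
            if name != self.name and now - seen > self.t_fail
        )


def simulate(n=8, rounds=40, crash_round=10, t_fail=15, seed=1):
    """Run a toy simulation in which the last member crashes at crash_round."""
    rng = random.Random(seed)
    members = [Member(f"m{i}", t_fail) for i in range(n)]
    crashed = members[-1]
    for now in range(1, rounds + 1):
        alive = [m for m in members if not (m is crashed and now >= crash_round)]
        for m in alive:
            m.tick(now)
            # Gossip the whole table to one random peer; messages sent to a
            # crashed peer are silently lost.
            peer = rng.choice([p for p in members if p is not m])
            if peer in alive:
                peer.merge(m.table, now)
    return members


if __name__ == "__main__":
    final_round = 40
    for m in simulate(rounds=final_round)[:-1]:
        print(m.name, "suspects", m.suspected(final_round))
```

In this sketch each member sends one message per round regardless of group size, which is the property that lets a gossip design avoid the all-to-all heartbeat traffic of a naive detector. The price is that detection is probabilistic, so `t_fail` has to be chosen large enough that a live member's counter almost always reaches everyone before the timeout expires.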

Keywords

Guaran 



Copyright information

© Springer-Verlag London Limited 1998

Authors and Affiliations

  • Robbert van Renesse (1)
  • Yaron Minsky (1)
  • Mark Hayden (1)

  1. Dept. of Computer Science, Cornell University, Ithaca, USA
