Failure Detection vs Group Membership in Fault-Tolerant Distributed Systems: Hidden Trade-Offs

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2399)

Abstract

Failure detection and group membership are two important components of fault-tolerant distributed systems. Understanding their role is essential when developing efficient solutions, not only in failure-free runs, but also in runs in which processes do crash. While group membership provides consistent information about the status of processes in the system, failure detectors provide inconsistent information. This paper discusses the trade-offs related to the use of these two components, and clarifies their roles using three examples. The first example shows a case where group membership may favourably be replaced by a failure detection mechanism. The second example illustrates a case where group membership is mandatory. Finally, the third example shows a case where neither group membership nor failure detectors are needed (they may be replaced by weak ordering oracles).
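
The contrast between inconsistent and consistent information can be illustrated with a minimal sketch. It is not taken from the paper: the class names, the timeout-based suspicion rule, the `timeout` parameter, and the process names p1, p2, q, r are all assumptions made for illustration, and the agreement step that a real membership service needs before installing a view is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Set


@dataclass
class TimeoutFailureDetector:
    """Local, unreliable failure detector (illustrative assumption):
    suspects any peer whose last heartbeat is older than `timeout`."""
    timeout: float
    last_heartbeat: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, peer: str, now: float) -> None:
        # Record the time at which a heartbeat from `peer` was received.
        self.last_heartbeat[peer] = now

    def suspects(self, now: float) -> Set[str]:
        # Purely local decision; no coordination with other processes.
        return {p for p, t in self.last_heartbeat.items()
                if now - t > self.timeout}


@dataclass(frozen=True)
class View:
    """A numbered membership view; all members see the same sequence."""
    view_id: int
    members: FrozenSet[str]


def next_view(current: View, excluded: str) -> View:
    # Successor view without `excluded`. The agreement among members
    # that would precede installation is omitted in this sketch.
    return View(current.view_id + 1, current.members - {excluded})


# Failure detectors at p1 and p2 both monitor q and r.
fd_p1 = TimeoutFailureDetector(timeout=2.0)
fd_p2 = TimeoutFailureDetector(timeout=2.0)
for fd in (fd_p1, fd_p2):
    fd.heartbeat("q", now=0.0)
    fd.heartbeat("r", now=0.0)

# A later heartbeat from q reaches p1 but is delayed on its way to p2.
fd_p1.heartbeat("q", now=2.5)
fd_p1.heartbeat("r", now=2.5)
fd_p2.heartbeat("r", now=2.5)

suspects_p1 = fd_p1.suspects(now=3.0)  # p1 suspects nobody
suspects_p2 = fd_p2.suspects(now=3.0)  # p2 suspects q

# Group membership: one agreed successor view, identical at p1 and p2.
v0 = View(0, frozenset({"p1", "p2", "q", "r"}))
v1 = next_view(v0, "q")
```

In this sketch p1 and p2 end up with different suspect sets at the same instant, whereas the view `v1` is a single value that every remaining member would install.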

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Schiper, A. (2002). Failure Detection vs Group Membership in Fault-Tolerant Distributed Systems: Hidden Trade-Offs. In: Hermanns, H., Segala, R. (eds) Process Algebra and Probabilistic Methods: Performance Modeling and Verification. PAPM-PROBMIV 2002. Lecture Notes in Computer Science, vol 2399. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45605-8_1

  • DOI: https://doi.org/10.1007/3-540-45605-8_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43913-4

  • Online ISBN: 978-3-540-45605-6

  • eBook Packages: Springer Book Archive
