
Failure Detection vs Group Membership in Fault-Tolerant Distributed Systems: Hidden Trade-Offs

  • André Schiper
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2399)

Abstract

Failure detection and group membership are two important components of fault-tolerant distributed systems. Understanding their role is essential when developing efficient solutions, not only in failure-free runs, but also in runs in which processes do crash. While group membership provides consistent information about the status of processes in the system, failure detectors provide inconsistent information. This paper discusses the trade-offs related to the use of these two components, and clarifies their roles using three examples. The first example shows a case where group membership may favourably be replaced by a failure detection mechanism. The second example illustrates a case where group membership is mandatory. Finally, the third example shows a case where neither group membership nor failure detectors are needed (they may be replaced by weak ordering oracles).
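The contrast between inconsistent and consistent information can be illustrated with a minimal sketch. It is not taken from the paper: the heartbeat-based detector, the class names, the timeout value, and the process names `p1`, `p2`, `p3` are all assumptions chosen for illustration, and the agreement protocol that a real group membership service would run to install a view is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """Hypothetical local failure detector: suspects peers whose last heartbeat is too old."""
    timeout: float
    last_heard: dict = field(default_factory=dict)

    def heartbeat(self, peer: str, now: float) -> None:
        # Record the local time at which a heartbeat from `peer` arrived.
        self.last_heard[peer] = now

    def suspects(self, now: float) -> set:
        # The result depends only on local observations, so it may differ per process.
        return {p for p, t in self.last_heard.items() if now - t > self.timeout}


@dataclass
class MembershipView:
    """Hypothetical agreed view: all members are assumed to install the same sequence of views."""
    view_id: int
    members: frozenset

    def exclude(self, peer: str) -> "MembershipView":
        # Excluding a process produces the next numbered view.
        return MembershipView(self.view_id + 1, self.members - {peer})


# Two observers receive heartbeats from p3 at different times.
fd_p1 = HeartbeatDetector(timeout=2.0)
fd_p2 = HeartbeatDetector(timeout=2.0)
fd_p1.heartbeat("p3", now=0.0)
fd_p2.heartbeat("p3", now=1.5)

suspects_p1 = fd_p1.suspects(now=3.0)  # 3.0 - 0.0 > 2.0, so p1 suspects p3
suspects_p2 = fd_p2.suspects(now=3.0)  # 3.0 - 1.5 <= 2.0, so p2 does not

# With membership, the exclusion is a single view change shared by all members.
v0 = MembershipView(0, frozenset({"p1", "p2", "p3"}))
v1 = v0.exclude("p3")
```

In this sketch the two detectors disagree about `p3` at the same instant, whereas the exclusion of `p3` from the group appears as one numbered view change that every remaining member would observe identically.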

Keywords

Group Membership, Correct Process, Failure Detection, Link Failure, Process Exclusion
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • André Schiper (1)
  1. Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
