Abstract
Failure detection and group membership are two important components of fault-tolerant distributed systems. Understanding their role is essential when developing efficient solutions, not only in failure-free runs, but also in runs in which processes do crash. While group membership provides consistent information about the status of processes in the system, failure detectors provide inconsistent information. This paper discusses the trade-offs related to the use of these two components, and clarifies their roles using three examples. The first example shows a case where group membership may favourably be replaced by a failure detection mechanism. The second example illustrates a case where group membership is mandatory. Finally, the third example shows a case where neither group membership nor failure detectors are needed (they may be replaced by weak ordering oracles).
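The contrast between inconsistent failure-detector output and consistent membership information can be illustrated with a minimal sketch. It assumes a simple timeout-based detector driven by heartbeats; the class `HeartbeatFailureDetector`, the `View` type, the process names `p`, `q`, `r`, and the timing values are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatFailureDetector:
    """Hypothetical timeout-based failure detector, local to one process."""
    timeout: float
    last_heard: dict = field(default_factory=dict)

    def heartbeat(self, sender: str, now: float) -> None:
        # Record the local arrival time of the latest heartbeat from `sender`.
        self.last_heard[sender] = now

    def suspects(self, now: float) -> set:
        # Suspect every process whose last heartbeat is older than the timeout.
        return {p for p, t in self.last_heard.items() if now - t > self.timeout}


@dataclass(frozen=True)
class View:
    """Hypothetical membership view: an identifier plus the agreed member set."""
    view_id: int
    members: frozenset


# Processes p and q both monitor r, but observe different message delays.
fd_p = HeartbeatFailureDetector(timeout=2.0)
fd_q = HeartbeatFailureDetector(timeout=2.0)
fd_p.heartbeat("r", now=0.0)
fd_q.heartbeat("r", now=0.0)
fd_q.heartbeat("r", now=2.5)  # reaches q on time, still in transit to p

suspects_p = fd_p.suspects(now=3.0)  # p suspects r
suspects_q = fd_q.suspects(now=3.0)  # q does not suspect r

# Group membership, by contrast, delivers the same view to every member.
# Here the agreement step is assumed to have happened already (not modelled).
agreed = View(view_id=1, members=frozenset({"p", "q", "r"}))
view_at_p = agreed
view_at_q = agreed
```

At the same instant the two detectors disagree about `r`, whereas both processes hold an identical view. The sketch leaves out the agreement protocol that a real group membership service would need in order to install that view.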
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Schiper, A. (2002). Failure Detection vs Group Membership in Fault-Tolerant Distributed Systems: Hidden Trade-Offs. In: Hermanns, H., Segala, R. (eds) Process Algebra and Probabilistic Methods: Performance Modeling and Verification. PAPM-PROBMIV 2002. Lecture Notes in Computer Science, vol 2399. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45605-8_1
DOI: https://doi.org/10.1007/3-540-45605-8_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43913-4
Online ISBN: 978-3-540-45605-6
eBook Packages: Springer Book Archive