Failure Detection vs Group Membership in Fault-Tolerant Distributed Systems: Hidden Trade-Offs

Schiper, André

doi:10.1007/3-540-45605-8_1

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2399)

Included in the following conference series:

Joint International Workshop on Process Algebra and Probabilistic Methods, Performance Modeling and Verification

Abstract

Failure detection and group membership are two important components of fault-tolerant distributed systems. Understanding their role is essential when developing efficient solutions, not only in failure-free runs, but also in runs in which processes do crash. While group membership provides consistent information about the status of processes in the system, failure detectors provide inconsistent information. This paper discusses the trade-offs related to the use of these two components, and clarifies their roles using three examples. The first example shows a case where group membership may favourably be replaced by a failure detection mechanism. The second example illustrates a case where group membership is mandatory. Finally, the third example shows a case where neither group membership nor failure detectors are needed (they may be replaced by weak ordering oracles).
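
The contrast between inconsistent and consistent failure information can be illustrated with a minimal sketch. It is not taken from the paper: the class names `HeartbeatDetector` and `MembershipLog`, the timeout value, and the process names `p`, `q`, `r` are all assumptions made for illustration, and the agreement protocol that a real group membership service needs is replaced here by a single shared log.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """Local, unreliable failure detector (hypothetical design): suspects a
    peer when no heartbeat has arrived for more than `timeout` time units."""
    timeout: float
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, peer: str, now: float) -> None:
        # Record the local arrival time of a heartbeat from `peer`.
        self.last_seen[peer] = now

    def suspects(self, now: float) -> set:
        # Purely local decision; other processes may reach a different one.
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}


@dataclass
class MembershipLog:
    """Stand-in for a group membership service: every process reads the same
    agreed sequence of views. The agreement protocol itself is assumed."""
    views: list = field(default_factory=list)

    def install(self, members: set) -> int:
        # Append a new view and return its identifier (its index in the log).
        self.views.append(frozenset(members))
        return len(self.views) - 1

    def current(self) -> frozenset:
        return self.views[-1]


# Two monitors, p and q, observe the same peer r over links with different delays.
p = HeartbeatDetector(timeout=3.0)
q = HeartbeatDetector(timeout=3.0)
p.heartbeat("r", now=0.0)
q.heartbeat("r", now=0.0)
q.heartbeat("r", now=4.0)  # this heartbeat reaches q but not (yet) p

p_suspects = p.suspects(now=5.0)  # p has heard nothing from r for 5 units
q_suspects = q.suspects(now=5.0)  # q heard from r 1 unit ago

# Group membership: all processes see the same ordered sequence of views.
log = MembershipLog()
first_view = log.install({"p", "q", "r"})
second_view = log.install({"p", "q"})  # r excluded by an agreed view change
```

In this sketch `p` suspects `r` while `q` does not, so the two detectors disagree at the same instant, whereas every reader of the membership log sees `r` excluded in the same view. The price of that consistency is the agreement step hidden behind `install`, which this sketch does not model.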

Author information

Authors and Affiliations

Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015, Lausanne, Switzerland
André Schiper

Editor information

Editors and Affiliations

Faculty of Computer Science Formal Methods and Tools Group, University of Twente, P.O. Box 217, 7500 AE, Enschede, The Netherlands
Holger Hermanns
Department of Computer Science, University of Verona, Strada Le Grazie 15, 37134, Verona, Italy
Roberto Segala

About this paper

Cite this paper

Schiper, A. (2002). Failure Detection vs Group Membership in Fault-Tolerant Distributed Systems: Hidden Trade-Offs. In: Hermanns, H., Segala, R. (eds) Process Algebra and Probabilistic Methods: Performance Modeling and Verification. PAPM-PROBMIV 2002. Lecture Notes in Computer Science, vol 2399. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45605-8_1

DOI: https://doi.org/10.1007/3-540-45605-8_1
Published: 04 July 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43913-4
Online ISBN: 978-3-540-45605-6
eBook Packages: Springer Book Archive
