Abstract
Improving the dependability of computer systems is a critical task. In this context, the paper surveys techniques for achieving fault tolerance in distributed systems through replication. The main replication techniques are explained first. Group communication is then introduced as the communication infrastructure on which the different replication techniques can be implemented. Finally, the difficulty of implementing group communication is discussed, and the most important algorithms are presented.
The same paper will appear under the title Dependable Systems in Dependable Information and Communication Systems, to be published in the Springer LNCS series, 2006. Research supported by the Hasler Stiftung under grant number DICS-1825.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Schiper, A. (2006). Group Communication: From Practice to Theory. In: Wiedermann, J., Tel, G., Pokorný, J., Bieliková, M., Štuller, J. (eds) SOFSEM 2006: Theory and Practice of Computer Science. SOFSEM 2006. Lecture Notes in Computer Science, vol 3831. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11611257_10
DOI: https://doi.org/10.1007/11611257_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31198-0
Online ISBN: 978-3-540-32217-7
eBook Packages: Computer Science, Computer Science (R0)