Fault-tolerance by replication in distributed systems

Guerraoui, Rachid; Schiper, André

doi:10.1007/BFb0013477

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1088)

Included in the following conference series:

International Conference on Reliable Software Technologies

Abstract

This paper is a tutorial on fault-tolerance by replication in distributed systems. We start by defining linearizability as the correctness criterion for replicated services (or objects), and present the two main classes of replication techniques: primary-backup replication and active replication. We then introduce group communication as the infrastructure that provides the multicast primitives needed to implement either primary-backup replication or active replication. Finally, we discuss the implementation of the two most fundamental group multicast primitives: total order multicast and view synchronous multicast.
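
As a rough illustration of why active replication relies on total order multicast, the following sketch simulates three deterministic replicas in a single process. It is not taken from the paper: the `Replica`, `Sequencer`, `apply_all`, and `run_demo` names are hypothetical, the fixed-sequencer scheme is just one simple way to obtain a total order, and failures, networking, and view changes are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Op = Tuple[str, int]  # (operation name, integer argument)


@dataclass
class Replica:
    """Deterministic state machine holding one integer (illustrative only)."""
    value: int = 0
    log: List[Op] = field(default_factory=list)

    def deliver(self, op: Op) -> None:
        # Apply a delivered operation; the result depends only on the
        # current state and the operation, never on timing or randomness.
        name, arg = op
        if name == "add":
            self.value += arg
        elif name == "mul":
            self.value *= arg
        else:
            raise ValueError(f"unknown operation: {name}")
        self.log.append(op)


class Sequencer:
    """Hypothetical failure-free sequencer standing in for total order multicast."""

    def __init__(self, replicas: List[Replica]) -> None:
        self.replicas = replicas
        self.next_seq = 0

    def to_multicast(self, op: Op) -> int:
        # Assign the next sequence number, then deliver to every replica.
        # Because one operation is fully delivered before the next begins,
        # all replicas see the same delivery order.
        seq = self.next_seq
        self.next_seq += 1
        for replica in self.replicas:
            replica.deliver(op)
        return seq


def apply_all(ops: List[Op]) -> int:
    """Apply ops in the given order to a fresh replica and return its value."""
    replica = Replica()
    for op in ops:
        replica.deliver(op)
    return replica.value


def run_demo() -> List[int]:
    """Send three operations through the sequencer; return each replica's value."""
    replicas = [Replica() for _ in range(3)]
    sequencer = Sequencer(replicas)
    for op in [("add", 2), ("mul", 5), ("add", 1)]:
        sequencer.to_multicast(op)
    return [r.value for r in replicas]
```

Under these assumptions, `run_demo()` returns `[11, 11, 11]`: every replica applies the same operations in the same order and ends in the same state. Reordering matters, since `apply_all([("add", 2), ("mul", 5)])` gives 10 while `apply_all([("mul", 5), ("add", 2)])` gives 2, which is the kind of divergence a total order on deliveries is meant to rule out.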

Author information

Authors and Affiliations

Département d'Informatique, Ecole Polytechnique Fédérale de Lausanne, 1015, Lausanne, Switzerland
Rachid Guerraoui & André Schiper

Editor information

Alfred Strohmeier

About this paper

Cite this paper

Guerraoui, R., Schiper, A. (1996). Fault-tolerance by replication in distributed systems. In: Strohmeier, A. (eds) Reliable Software Technologies — Ada-Europe '96. Ada-Europe 1996. Lecture Notes in Computer Science, vol 1088. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0013477

DOI: https://doi.org/10.1007/BFb0013477
Published: 09 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-61317-6
Online ISBN: 978-3-540-68457-2
eBook Packages: Springer Book Archive
