Abstract
This paper is a tutorial on fault-tolerance by replication in distributed systems. We start by defining linearizability as the correctness criterion for replicated services (or objects), and present the two main classes of replication techniques: primary-backup replication and active replication. We then introduce group communication as the infrastructure that provides the multicast primitives needed to implement either primary-backup replication or active replication. Finally, we discuss the implementation of the two most fundamental group multicast primitives: total order multicast and view synchronous multicast.
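To illustrate why active replication relies on total order multicast, here is a minimal Python sketch. It is not taken from the paper: the `Sequencer` and `Replica` classes and the `run_demo` function are hypothetical names. It assumes a single fixed sequencer that never fails and deterministic replicas, and it ignores failures, membership changes, and view synchrony entirely. Each replica holds back out-of-order messages and applies operations strictly by sequence number, so all replicas reach the same state even when the network reorders messages.

```python
from dataclasses import dataclass, field


@dataclass
class Sequencer:
    """Hypothetical fixed sequencer: stamps each message with a global sequence number."""
    next_seq: int = 0

    def stamp(self, op):
        seq = self.next_seq
        self.next_seq += 1
        return seq, op


@dataclass
class Replica:
    """Deterministic replica that applies operations in sequence-number order."""
    value: int = 0
    next_expected: int = 0
    pending: dict = field(default_factory=dict)  # hold-back buffer: seq -> op
    log: list = field(default_factory=list)      # operations in delivery order

    def receive(self, seq, op):
        # Buffer the message, then deliver every message that is now in order.
        self.pending[seq] = op
        while self.next_expected in self.pending:
            self._apply(self.pending.pop(self.next_expected))
            self.next_expected += 1

    def _apply(self, op):
        kind, arg = op
        if kind == "add":
            self.value += arg
        elif kind == "mul":
            self.value *= arg
        self.log.append(op)


def run_demo():
    sequencer = Sequencer()
    ops = [("add", 2), ("mul", 5), ("add", 1)]
    stamped = [sequencer.stamp(op) for op in ops]

    r1, r2 = Replica(), Replica()
    # Assumption: the network may deliver messages in a different order
    # to each replica; here r2 receives them in reverse.
    for seq, op in stamped:
        r1.receive(seq, op)
    for seq, op in reversed(stamped):
        r2.receive(seq, op)
    return r1, r2
```

Calling `run_demo()` returns two replicas with identical `value` and `log` fields. Because `add` and `mul` do not commute, applying the operations in arrival order instead would let the replicas diverge, which is the inconsistency that a total order is meant to rule out.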
Copyright information
© 1996 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Guerraoui, R., Schiper, A. (1996). Fault-tolerance by replication in distributed systems. In: Strohmeier, A. (eds) Reliable Software Technologies — Ada-Europe '96. Ada-Europe 1996. Lecture Notes in Computer Science, vol 1088. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0013477
DOI: https://doi.org/10.1007/BFb0013477
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-61317-6
Online ISBN: 978-3-540-68457-2
eBook Packages: Springer Book Archive