Abstract
This chapter is devoted to checkpointing in asynchronous message-passing systems. It first presents the notions of local and global checkpoints and a theorem stating a necessary and sufficient condition for a set of local checkpoints to belong to the same consistent global checkpoint.
Then, the chapter considers two consistency conditions that can be associated with a distributed computation enriched with local checkpoints (the corresponding execution is called a communication and checkpoint pattern). The first condition, z-cycle-freedom, ensures that every local checkpoint taken by a process belongs to a consistent global checkpoint. The second condition, rollback-dependency trackability, is stronger: it states that a consistent global checkpoint can be associated on the fly with each local checkpoint (i.e., without additional communication).
The chapter discusses these consistency conditions and presents algorithms that, once superimposed on a distributed execution, ensure that the corresponding consistency condition is satisfied. It also presents a message logging algorithm suited to uncoordinated checkpointing.
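As an illustration of the notion of a consistent global checkpoint used above, the following minimal sketch checks whether a choice of one local checkpoint per process is consistent, in the usual sense that no message is received before the receiver's checkpoint while being sent after the sender's checkpoint (an "orphan" message). The names `Message`, `is_consistent`, and the interval-numbering convention (interval k of a process is the period ending with its local checkpoint k) are assumptions made for this example; they are not taken from the chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    # Hypothetical record of one message in a communication and checkpoint pattern.
    sender: str
    send_interval: int  # checkpoint interval of the sender in which the message was sent
    receiver: str
    recv_interval: int  # checkpoint interval of the receiver in which it was received


def is_consistent(global_ckpt: dict[str, int], messages: list[Message]) -> bool:
    """Return True if the global checkpoint has no orphan message.

    global_ckpt maps each process name to the index of its chosen local checkpoint.
    Assumed convention: interval k ends with local checkpoint k, so an event in
    interval k happens before checkpoint x exactly when k <= x.
    """
    for m in messages:
        sent_after = m.send_interval > global_ckpt[m.sender]
        received_before = m.recv_interval <= global_ckpt[m.receiver]
        if sent_after and received_before:
            return False  # m is an orphan: recorded as received but not as sent
    return True


# One message from p (sent after p's checkpoint 1) to q (received before q's checkpoint 1).
pattern = [Message(sender="p", send_interval=2, receiver="q", recv_interval=1)]

print(is_consistent({"p": 1, "q": 1}, pattern))
print(is_consistent({"p": 2, "q": 1}, pattern))
print(is_consistent({"p": 1, "q": 0}, pattern))
```

In this toy pattern, pairing checkpoint 1 of `p` with checkpoint 1 of `q` is inconsistent, because `q` would have recorded the receipt of a message that `p` has not yet recorded as sent. Moving `p` forward or `q` backward removes the orphan.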
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Raynal, M. (2013). Asynchronous Distributed Checkpointing. In: Distributed Algorithms for Message-Passing Systems. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38123-2_8
DOI: https://doi.org/10.1007/978-3-642-38123-2_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38122-5
Online ISBN: 978-3-642-38123-2
eBook Packages: Computer Science (R0)