Asynchronous Distributed Checkpointing

Raynal, Michel

doi:10.1007/978-3-642-38123-2_8

Abstract

This chapter is devoted to checkpointing in asynchronous message-passing systems. It first presents the notions of local and global checkpoints and a theorem stating a necessary and sufficient condition for a set of local checkpoints to belong to the same consistent global checkpoint.
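The abstract does not spell out what makes a global checkpoint consistent. As a minimal sketch, assuming the usual definition (a global checkpoint, made up of one local checkpoint per process, is consistent if it contains no orphan message, that is, no message recorded as received but not as sent), a check could look as follows. The `Message` record, its field names, and the convention of locating an event by the number of checkpoints its process took before it are illustrative assumptions, not the chapter's notation.

```python
from typing import NamedTuple, Sequence


class Message(NamedTuple):
    """Hypothetical record of one application message."""
    sender: int     # index of the sending process
    send_pos: int   # checkpoints the sender had taken before sending
    receiver: int   # index of the receiving process
    recv_pos: int   # checkpoints the receiver had taken before receiving


def is_consistent(global_ckpt: Sequence[int], messages: Sequence[Message]) -> bool:
    """global_ckpt[i] is the index (from 0) of the local checkpoint chosen for process i.

    An event with position p occurs after checkpoints 0..p-1 of its process,
    so it is after checkpoint x when p > x and before it when p <= x.
    """
    for m in messages:
        sent_after = m.send_pos > global_ckpt[m.sender]
        received_before = m.recv_pos <= global_ckpt[m.receiver]
        if sent_after and received_before:
            return False  # orphan message: its receipt is recorded, its sending is not
    return True


# p0 sends before its checkpoint, p1 receives after its checkpoint:
# the message is only in transit, so the pair of checkpoints is consistent.
in_transit = [Message(sender=0, send_pos=0, receiver=1, recv_pos=1)]
# p0 sends after its checkpoint, p1 receives before its checkpoint: orphan.
orphan = [Message(sender=0, send_pos=1, receiver=1, recv_pos=0)]
```

With these two patterns, `is_consistent([0, 0], in_transit)` holds while `is_consistent([0, 0], orphan)` does not.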

The chapter then considers two consistency conditions that can be associated with a distributed computation enriched with local checkpoints (the corresponding execution is called a communication and checkpoint pattern). The first condition, z-cycle-freedom, ensures that every local checkpoint taken by a process belongs to a consistent global checkpoint. The second condition, rollback-dependency trackability, is stronger: it states that a consistent global checkpoint can be associated on the fly with each local checkpoint (i.e., without additional communication).
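The abstract names z-cycle-freedom without defining it. The sketch below assumes the zigzag-path notion due to Netzer and Xu: a zigzag path from checkpoint (i, x) to checkpoint (j, y) is a chain of messages whose first message is sent by process i after checkpoint x, where each next message is sent by the receiver of the previous one in the same or a later checkpoint interval, and whose last message is received by process j before checkpoint y. A z-cycle is a zigzag path from a checkpoint to itself. The `Message` record and the position convention (number of checkpoints taken by the process before the event) are illustrative assumptions, and the input is assumed to describe a real execution.

```python
from collections import deque
from typing import NamedTuple, Sequence, Tuple


class Message(NamedTuple):
    """Hypothetical record of one application message."""
    sender: int     # index of the sending process
    send_pos: int   # checkpoints the sender had taken before sending
    receiver: int   # index of the receiving process
    recv_pos: int   # checkpoints the receiver had taken before receiving


Checkpoint = Tuple[int, int]  # (process index, checkpoint index from 0)


def has_zigzag_path(messages: Sequence[Message], src: Checkpoint, dst: Checkpoint) -> bool:
    """Breadth-first search over messages for a zigzag path from src to dst."""
    (i, x), (j, y) = src, dst
    # Start from every message sent by process i after checkpoint x.
    frontier = deque(k for k, m in enumerate(messages)
                     if m.sender == i and m.send_pos > x)
    seen = set(frontier)
    while frontier:
        m = messages[frontier.popleft()]
        if m.receiver == j and m.recv_pos <= y:
            return True  # last message is received before checkpoint y
        for k, nxt in enumerate(messages):
            # The next message leaves the receiver in the same or a later interval,
            # possibly before the previous message arrived (the "zigzag").
            if k not in seen and nxt.sender == m.receiver and nxt.send_pos >= m.recv_pos:
                seen.add(k)
                frontier.append(k)
    return False


def on_z_cycle(messages: Sequence[Message], ckpt: Checkpoint) -> bool:
    """A checkpoint on a z-cycle belongs to no consistent global checkpoint."""
    return has_zigzag_path(messages, ckpt, ckpt)


# p0: receive m2, take checkpoint 0, send m1.  p1: send m2, receive m1.
cyclic = [Message(0, 1, 1, 0), Message(1, 0, 0, 0)]
# Same pattern, but p1 takes a checkpoint between sending m2 and receiving m1.
broken = [Message(0, 1, 1, 1), Message(1, 0, 0, 0)]
```

In the first pattern, `on_z_cycle(cyclic, (0, 0))` holds, so the checkpoint of p0 is useless. In the second pattern the extra checkpoint on p1 breaks the cycle, which is the kind of forced checkpoint a z-cycle-preventing algorithm would have to take.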

The chapter discusses these consistency conditions and presents algorithms that, once superimposed on a distributed execution, ensure that the corresponding consistency condition is satisfied. It also presents a message logging algorithm suited to uncoordinated checkpointing.

Author information

Authors and Affiliations

Institut Universitaire de France IRISA-ISTIC, Université de Rennes 1, Rennes Cedex, France
Michel Raynal

About this chapter

Cite this chapter

Raynal, M. (2013). Asynchronous Distributed Checkpointing. In: Distributed Algorithms for Message-Passing Systems. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38123-2_8

DOI: https://doi.org/10.1007/978-3-642-38123-2_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38122-5
Online ISBN: 978-3-642-38123-2
eBook Packages: Computer Science, Computer Science (R0)
