Skip to main content

Abstract

This chapter is devoted to checkpointing in asynchronous message-passing systems. It first presents the notions of local and global checkpoints and a theorem stating a necessary and sufficient condition for a set of local checkpoints to belong to the same consistent global checkpoint.

Then, the chapter considers two consistency conditions, which can be associated with a distributed computation enriched with local checkpoints (the corresponding execution is called a communication and checkpoint pattern). The first consistency condition (called z-cycle-freedom) ensures that any local checkpoint, which has been taken by a process, belongs to a consistent global checkpoint. The second consistency condition (called rollback-dependency trackability) is stronger. It states that a consistent global checkpoint can be associated on the fly with each local checkpoint (i.e., without additional communication).

The chapter discusses these consistency conditions and presents algorithms that, once superimposed on a distributed execution, ensure that the corresponding consistency condition is satisfied. It also presents a message logging algorithm suited to uncoordinated checkpointing.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 54.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 69.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD 99.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. A. Acharya, B.R. Badrinath, Checkpointing distributed application on mobile computers, in 3rd Int’l Conference on Parallel and Distributed Information Systems (IEEE Press, New York, 1994), pp. 73–80

    Chapter  Google Scholar 

  2. L. Alvisi, K. Marzullo, Message logging: pessimistic, optimistic, and causal. IEEE Trans. Softw. Eng. 24(2), 149–159 (1998)

    Article  Google Scholar 

  3. R. Baldoni, J.M. Hélary, A. Mostéfaoui, M. Raynal, Impossibility of scalar clock-based communication-induced checkpointing protocols ensuring the RDT property. Inf. Process. Lett. 80(2), 105–111 (2001)

    Article  MATH  Google Scholar 

  4. R. Baldoni, J.M. Hélary, A. Mostéfaoui, M. Raynal, A communication-induced checkpointing protocol that ensures rollback-dependency trackability, in Proc. 27th IEEE Symposium on Fault-Tolerant Computing (FTCS-27) (IEEE Press, New York, 1997), pp. 68–77

    Chapter  Google Scholar 

  5. R. Baldoni, J.-M. Hélary, M. Raynal, Consistent records in asynchronous computations. Acta Inform. 35(6), 441–455 (1998)

    Article  MathSciNet  MATH  Google Scholar 

  6. R. Baldoni, J.M. Hélary, M. Raynal, Rollback-dependency trackability: a minimal characterization and its protocol. Inf. Comput. 165(2), 144–173 (2001)

    Article  MATH  Google Scholar 

  7. B.K. Bhargava, S.-R. Lian, Independent checkpointing and concurrent rollback for recovery in distributed systems: an optimistic approach, in Proc. 7th IEEE Symposium on Reliable Distributed Systems (SRDS’88) (IEEE Press, New York, 1988), pp. 3–12

    Google Scholar 

  8. D. Briatico, A. Ciuffoletti, L.A. Simoncini, Distributed domino-effect free recovery algorithm, in 4th IEEE Symposium on Reliability in Distributed Software and Database Systems (IEEE Press, New York, 1984), pp. 207–215

    Google Scholar 

  9. K.M. Chandy, L. Lamport, Distributed snapshots: determining global states of distributed systems. ACM Trans. Comput. Syst. 3(1), 63–75 (1985)

    Article  Google Scholar 

  10. F. Cristian, F. Jahanian, A timestamping-based checkpointing protocol for long-lived distributed computations, in Proc. 10th IEEE Symposium on Reliable Distributed Systems (SRDS’91) (IEEE Press, New York, 1991), pp. 12–20

    Google Scholar 

  11. O.P. Damani, Y.-M. Wang, V.K. Garg, Distributed recovery with k-optimistic logging. J. Parallel Distrib. Comput. 63(12), 1193–1218 (2003)

    Article  MATH  Google Scholar 

  12. E.N. Elnozahy, L. Alvisi, Y.-M. Wang, D.B. Johnson, A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34(3), 375–408 (2002)

    Article  Google Scholar 

  13. J. Fowler, W. Zwaenepoel, Causal distributed breakpoints, in Proc. 10th Int’l IEEE Conference on Distributed Computing Systems (ICDCS’90) (IEEE Press, New York, 1990), pp. 134–141

    Google Scholar 

  14. I.C. Garcia, E. Buzato, Progressive construction of consistent global checkpoints, in Proc. 19th Int’l Conference on Distributed Computing Systems (ICDCS’99) (IEEE Press, New York, 1999), pp. 55–62

    Google Scholar 

  15. I.C. Garcia, L.E. Buzato, On the minimal characterization of the rollback-dependency trackability property, in Proc. 21st Int’l Conference on Distributed Computing Systems (ICDCS’01) (IEEE Press, New York, 2001), pp. 342–349

    Chapter  Google Scholar 

  16. I.C. Garcia, L.E. Buzato, An efficient checkpointing protocol for the minimal characterization of operational rollback-dependency trackability, in Proc. 23rd Int’l Symposium on Reliable Distributed Systems (SRDS’04) (IEEE Press, New York, 2004), pp. 126–135

    Chapter  Google Scholar 

  17. V.K. Garg, Principles of Distributed Systems (Kluwer Academic, Dordrecht, 1996), 274 pages

    Book  Google Scholar 

  18. A.P. Goldberg, A. Gopal, A. Lowry, R. Strom, Restoring consistent global states of distributed computations, in Proc. ACM/ONR Workshop on Parallel and Distributed Debugging (ACM Press, New York, 1991), pp. 144–156

    Google Scholar 

  19. J.-M. Hélary, A. Mostéfaoui, R.H.B. Netzer, M. Raynal, Communication-based prevention of useless checkpoints in distributed computations. Distrib. Comput. 13(1), 29–43 (2000)

    Article  Google Scholar 

  20. J.-M. Hélary, A. Mostéfaoui, M. Raynal, Communication-induced determination of consistent snapshots. IEEE Trans. Parallel Distrib. Syst. 10(9), 865–877 (1999)

    Article  Google Scholar 

  21. J.-M. Hélary, A. Mostéfaoui, M. Raynal, Interval consistency of asynchronous distributed computations. J. Comput. Syst. Sci. 64(2), 329–349 (2002)

    Article  MATH  Google Scholar 

  22. J.-M. Hélary, R.H.B. Netzer, M. Raynal, Consistency criteria for distributed checkpoints. IEEE Trans. Softw. Eng. 2(2), 274–281 (1999)

    Article  Google Scholar 

  23. D.B. Johnson, W. Zwaenepoel, Recovery in distributed systems using optimistic message logging and checkpointing. J. Algorithms 11(3), 462–491 (1990)

    Article  MathSciNet  MATH  Google Scholar 

  24. R. Koo, S. Toueg, Checkpointing and rollback-recovery for distributed systems. IEEE Trans. Softw. Eng. 13(1), 23–31 (1987)

    Article  MATH  Google Scholar 

  25. A.D. Kshemkalyani, M. Singhal, Distributed Computing: Principles, Algorithms and Systems (Cambridge University Press, Cambridge, 2008), 736 pages

    Book  MATH  Google Scholar 

  26. D. Manivannan, R.H.B. Netzer, M. Singhal, Finding consistent global checkpoints in a distributed computation. IEEE Trans. Parallel Distrib. Syst. 8(6), 623–627 (1997)

    Article  Google Scholar 

  27. D. Manivannan, M. Singhal, A low overhead recovery technique using quasi-synchronous checkpointing, in Proc. 16th IEEE Int’l Conference on Distributed Computing Systems (ICDCS’96) (IEEE Press, New York, 1996), pp. 100–107

    Chapter  Google Scholar 

  28. A. Mostéfaoui, M. Raynal, Efficient message logging for uncoordinated checkpointing protocols, in Proc. 2nd European Dependable Computing Conference (EDCC’96). LNCS, vol. 1150 (Springer, Berlin, 1996), pp. 353–364

    Google Scholar 

  29. R.H.B. Netzer, J. Xu, Necessary and sufficient conditions for consistent global snapshots. IEEE Trans. Parallel Distrib. Syst. 6(2), 165–169 (1995)

    Article  Google Scholar 

  30. N. Neves, W.K. Fuchs, Adaptive recovery for mobile environments. Commun. ACM 40(1), 68–74 (1997)

    Article  Google Scholar 

  31. R. Prakash, M. Singhal, Low-cost checkpointing and failure recovery in mobile computing systems. IEEE Trans. Parallel Distrib. Syst. 7(10), 1035–1048 (1996)

    Article  Google Scholar 

  32. B. Randell, System structure for software fault-tolerance. IEEE Trans. Softw. Eng. SE1(2), 220–232 (1975)

    Article  Google Scholar 

  33. D.L. Russell, State restoration in systems of communicating processes. IEEE Trans. Softw. Eng. SE6(2), 183–194 (1980)

    Article  Google Scholar 

  34. R. Schmid, I.C. Garcia, F. Pedone, L.E. Buzato, Optimal asynchronous garbage collection for RDT checkpointing protocols, in Proc. 25th Int’l Conference on Distributed Computing Systems (ICDCS’01) (IEEE Press, New York, 2005), pp. 167–176

    Google Scholar 

  35. L.M. Silva, J.G. Silva, Global checkpoints for distributed programs, in Proc. 11th Symposium on Reliable Distributed Systems (SRDS’92) (IEEE Press, New York, 1992), pp. 155–162

    Google Scholar 

  36. A.P. Sistla, J.L. Welch, Efficient distributed recovery using message logging, in Proc. 8th ACM Symposium on Principles of Distributed Computing (PODC’89) (ACM Press, New York, 1989), pp. 223–238

    Google Scholar 

  37. J. Tsai, S.-Y. Kuo, Y.-M. Wang, Theoretical analysis for communication-induced checkpointing protocols with rollback-dependency trackability. IEEE Trans. Parallel Distrib. Syst. 9(10), 963–971 (1998)

    Article  Google Scholar 

  38. J. Tsai, Y.-M. Wang, S.-Y. Kuo, Evaluations of domino-free communication-induced checkpointing protocols. Inf. Process. Lett. 69(1), 31–37 (1999)

    Article  MathSciNet  Google Scholar 

  39. Y.-M. Wang, Consistent global checkpoints that contain a given set of local checkpoints. IEEE Trans. Comput. 46(4), 456–468 (1997)

    Article  MathSciNet  Google Scholar 

  40. Y.-M. Wang, P.Y. Chung, I.J. Lin, W.K. Fuchs, Checkpointing space reclamation for uncoordinated checkpointing in message-passing systems. IEEE Trans. Parallel Distrib. Syst. 6(5), 546–554 (1995)

    Article  Google Scholar 

  41. Y.-M. Wang, W.K. Fuchs, Optimistic message logging for independent checkpointing in message-passing systems, in Proc. 11th Symposium on Reliable Distributed Systems (SRDS’92) (IEEE Press, New York, 1992), pp. 147–154

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Rights and permissions

Reprints and permissions

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Raynal, M. (2013). Asynchronous Distributed Checkpointing. In: Distributed Algorithms for Message-Passing Systems. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38123-2_8

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-38123-2_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38122-5

  • Online ISBN: 978-3-642-38123-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics