, Volume 25, Issue 5, pp 489–510 | Cite as

A survey of checkpointing algorithms for parallel and distributed computers

  • S. Kalaiselvi
  • V. Rajaraman


Checkpoint is defined as a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time.Checkpointing is the process of saving the status information. This paper surveys the algorithms which have been reported in the literature for checkpointing parallel/distributed systems. It has been observed that most of the algorithms published for checkpointing in message passing systems are based on the seminal article by Chandy and Lamport. A large number of articles have been published in this area by relaxing the assumptions made in this paper and by extending it to minimise the overheads of coordination and context saving. Checkpointing for shared memory systems primarily extend cache coherence protocols to maintain a consistent memory. All of them assume that the main memory is safe for storing the context. Recently algorithms have been published for distributed shared memory systems, which extend the cache coherence protocols used in shared memory systems. They however also include methods for storing the status of distributed memory in stable storage. Most of the algorithms assume that there is no knowledge about the programs being executed. It is however felt that in development of parallel programs the user has to do a fair amount of work in distributing tasks and this information can be effectively used to simplify checkpointing and rollback recovery.


Checkpointing algorithms parallel & distributed computing shared memory systems rollback recovery fault-tolerant systems 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. Ahmed R E, Frazier R C, Marinos P N 1990 Cache aided rollback error recovery (CARER) algorithms for shared memory multiprocessor systems.Proc. IEEE 20th Int. Symp. on Fault Tolerant Computing pp 82–88Google Scholar
  2. Alvisi L, Hoppe B, Marzullo K 1993 Nonblocking and orphan free message logging protocols.Proc. IEEE 23rd Int. Symp. Fault Tolerant Computing pp 145–154Google Scholar
  3. Archibald J, Baer J-L 1986 Cache coherence protocols: Evaluation using a multiprocessor simulation model.ACM Trans. Comput. Syst. 4: 273–298CrossRefGoogle Scholar
  4. Beck M, Plank J S, Kingsley G 1994 Compiler-assisted checkpointing. Technical Report, CS-94-269, University of Tennessee, Knoxville, TNGoogle Scholar
  5. Bhargava B, Lian S-R, Leu P-J 1990 Experimental evaluation of concurrent checkpointing and rollback-recovery algorithms.Proc. IEEE 6th Int. Conf. on Data Eng. pp 182–189Google Scholar
  6. Borg A, Blau W, Gratsch W, Herrman H, Oberle W 1989 Fault tolerance under UNIX.ACM Trans. Comput. Syst. 7: 1–24CrossRefGoogle Scholar
  7. Bowen N S, Pradhan K 1992 Virtual checkpoints: Architecture and performance.IEEE Trans. Comput. 41: 516–525CrossRefGoogle Scholar
  8. Brown L, Wu J 1994 Dynamic snooping in a fault-tolerant distributed shared memory.Proc. IEEE 14th Int. Conf. on Distributed Computing Syst. pp 218–226Google Scholar
  9. Chandy K M, Lamport L 1985 Distributed snapshots: Determining global states of distributed systems.ACM Trans. Comput. Syst. 3: 63–75CrossRefGoogle Scholar
  10. Chandy K M, Ramamoorthy C V 1972 Rollback and recovery strategies for computer programs.IEEE Trans. Comput. C-21: 546–556CrossRefMathSciNetGoogle Scholar
  11. Chen P M, Lee E K, Gibson G A, Katz R H, Patterson D A 1994 RAID-High performance, reliable secondary storage.ACM Comput. Surv. 26: 145–185CrossRefGoogle Scholar
  12. Chen S-K, Tsai W T, Thuraisingham M B 1989 Recovery point selection on a reverse binary tree task model.IEEE Trans. Software Eng. 15: 963–976CrossRefGoogle Scholar
  13. Cristian F, Jahanian F 1991 A timestamp-based checkpointing protocol for long-lived distributed computations.Proc. IEEE Conf. on Reliable Distributed Syst. pp 12–20Google Scholar
  14. Elnozahy E N, Zwaenepoel W 1992 Manetho: Transparent rollback recovery with low overhead, limited rollback and fast output commit.IEEE Trans. Comput. 41: 526–531CrossRefGoogle Scholar
  15. Elnozahy E N, Zwaenepoel W 1994 On the use and implementation of message logging.Proc. IEEE Int. Sytnp. on Fault Tolerant Computing pp 298–307Google Scholar
  16. Elnozahy E N, Johnson D B, Zwaenepoel W 1992 The performance of consistent checkpointing.Proc. IEEE 11th Symp. on Reliable Distributed Syst. pp 39–47Google Scholar
  17. Fitzgerald R, Rashid R F 1986 The integration of virtual memory management and interprocess communication in accent.ACM Trans. Comput. Syst. 4: 147–177CrossRefGoogle Scholar
  18. IEEE 1992 Std. 1596–1992.IEEE Scalable Coherent Interface (SCI) (Piscataway, NJ: IEEE)Google Scholar
  19. Janssens B, Fuchs W K 1994 The performance of cache based error recovery in multiprocessors.IEEE Trans. Parallel Distributed Syst. 5: 1033–1043CrossRefGoogle Scholar
  20. Johnson D B, Zwaenepoel W 1987 Sender-based message logging.Proc. IEEE Int. Symp. Fault Tolerant Comput. pp. 14–19Google Scholar
  21. Juang T T-Y, Venkatesan S 1991 Crash recovery with little overhead.Proc. IEEE 11th Int. Conf. on Distributed Comput. Syst. pp 454–461Google Scholar
  22. Kalaiselvi S, Rajaraman V 2000 Task graph based checkpointing in parallel/distributed systems.J. Parallel Distributed Comput. (submitted)Google Scholar
  23. Kim J L, Park T 1993 An efficient protocol for checkpointing recovery in distributed systems.IEEE Trans. Parallel Distributed Syst. 4: 955–960CrossRefGoogle Scholar
  24. Koo R, Toueg S 1987 Checkpointing and rollback recovery for distributed systems.IEEE Trans. Software Eng. SE-13: 23–31CrossRefGoogle Scholar
  25. Lai T H, Yang T H 1987 On distributed snapshots.Inf. Process. Lett. 25: 153–158MATHCrossRefMathSciNetGoogle Scholar
  26. Lamport L 1978 Time clocks and the ordering of events in a distributed system.Commun. ACM 21: 558–565MATHCrossRefGoogle Scholar
  27. Leong H V, Agrawal D 1994 Using message semantics to reduce rollback in optimistic message logging recovery schemes.Proc. IEEE 14th Conf. on Distributed Computing Syst. pp 227–234Google Scholar
  28. Leu P-J, Bhargava B 1988 Concurrent robust checkpointing and recovery in distributed systems.Proc. Int. Conf. on Data Engineering pp 154–163Google Scholar
  29. Li C-C J, Fuchs W K 1990 CATCH-Compiler-assisted techniques for checkpointing.Proc. 1990 Int. Symp. on Fault Tolerant Computing pp 74–81Google Scholar
  30. Li K, Naughton J F, Plank S 1991 Checkpointing multicomputer applications.Proc. IEEE Conf. on Reliable Distributed Syst. pp 2–11Google Scholar
  31. Manivannan D, Singhal M 1996 A low-overhead recovery technique using quasi-synchronous checkpointing.Proc. IEEE Int. Conf. on Distributed Computing Syst. pp 100–107Google Scholar
  32. Mishra S K, Raghavan V V, Tzeng N-F 1991 Efficient algorithms for selection of recovery points in task models.IEEE Trans. Software Eng. 17: 731–734CrossRefGoogle Scholar
  33. Netzer R H B, Xu J 1995 Necessary and sufficient conditions for consistent global snapshots.IEEE Trans. Parallel Distributed Syst. 6: 165–169CrossRefGoogle Scholar
  34. Plank J S 1993Efficient checkpointing on MIMD architectures. Doctoral dissertation, Dept. of Computer Science, Princeton University, Princeton, NJGoogle Scholar
  35. Plank J S, Li K 1994a ickp: A consistent checkpointer for multicomputers.IEEE Parallel Distributed Technol. 2: 62–67CrossRefGoogle Scholar
  36. Plank J S, Li K 1994b Faster checkpointing withN + 1 parity.Proc. IEEE Int. Symp. on Fault Tolerant Computing pp 288–297Google Scholar
  37. Plank J S, Beck M, Kingsley G, Li K 1995 Libckpt: Transparent checkpointing under unix.USENIX Winter Technical Conference, New Orleans, LouisianaGoogle Scholar
  38. Ralston A, Reily E D 1993Encyclopedia of computer science 3rd edn (New York: IEEE Press)Google Scholar
  39. Randell B 1975 System structure for software fault tolerance.IEEE Trans. Software Eng. SE-1: 220–232Google Scholar
  40. Reuter A 1980 A fast transaction-oriented logging scheme for undo recovery.IEEE Trans. Software Eng. SE-6: 348–356CrossRefGoogle Scholar
  41. Siewiorek D P, Swarz S 1982The theory and practice of reliable system design (Cambridge, MA: Digital Press)Google Scholar
  42. Silva L, Silva J 1992 Global checkpointing for distributed programs.Proc. IEEE 11th Sytnp. on Reliable Distributed Syst. pp 155–162Google Scholar
  43. Strom R E, Yemini S 1985 Optimistic recovery in distributed systems.ACM Trans. Comput. Syst. 3: 204–226CrossRefGoogle Scholar
  44. Tarn V-O, Hsu M 1990 Fast recovery in distributed shared virtual memory systems.Proc. IEEE 10th Int. Conf. on Distributed Computing Syst. pp 38–45Google Scholar
  45. Theimer M, Lantz K, Cheriton D R 1985 Preemptable remote execution facilities in the V-system.Proc. of the 10th ACM Symp. on Operating Syst. Principles pp 2–11Google Scholar
  46. Tong Z, Kain R Y, Tsai W T 1992 Rollback recovery in distributed systems using loosely synchronized clocks.IEEE Trans. Parallel Distributed Syst. 3: 246–251CrossRefGoogle Scholar
  47. Upadhyaya J S, Saluja K K 1986 A watchdog processor based general rollback technique with multiple retries.IEEE Trans. Software Eng. SE-12: 87–95Google Scholar
  48. Venkatesan S 1989 Message optimal incremental snapshots.Proc. IEEE 9th Int. Conf. Distributed Comput. Syst. pp 53–60Google Scholar
  49. Venkatesh K, Radhakrishnan T, Li H F 1987 Optimal checkpointing and local recording for domino-free rollback recovery.Inf. Process. Lett. 25: 295–303CrossRefGoogle Scholar
  50. Wang Y-M, Fuchs W K 1992 Optimistic message logging for independent checkpointing in message passing systems.Proc. IEEE 11th Symp. on Reliable Distributed Syst. pp 147–154Google Scholar
  51. Wang Y-M, Chung P-Y, Lin I-J, Fuchs W K 1995 Checkpoint space reclamation for uncoordinated checkpointing in message passing systems.IEEE Trans. Parallel Distributed Syst. 6: 546–554CrossRefGoogle Scholar
  52. Wu K-L, Fuchs W K 1989 Recoverable distributed shared virtual memory: Memory coherence and storage structures.Proc. 1989 Int. Symp. on Fault Tolerant Computing pp 520–527Google Scholar
  53. Wu K-L, Fuchs W K, Patel J H 1989 Cache based error recovery for shared memory multiprocessor systems.Proc. Int. Conf. on Parallel Processing pp I159–I166Google Scholar
  54. Young C, Chiu G 1994 A crash recovery technique in distributed computing systems.Proc. IEEE 14th Int. Conf. on Distributed Computing Syst. pp 235–242Google Scholar
  55. Zomaya A Y H 1996Parallel and distributed computing handbook (New York: McGraw-Hill)Google Scholar

Copyright information

© Indian Academy of Sciences 2000

Authors and Affiliations

  • S. Kalaiselvi
    • 1
  • V. Rajaraman
    • 2
  1. 1.Supercomputer Education and Research Centre (SERC)Indian Institute of ScienceBangaloreIndia
  2. 2.Jawaharlal Nehru Centre for Advanced Scientific ResearchIndian Institute of Science CampusBangaloreIndia

Personalised recommendations