Abstract
A checkpoint is a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. Checkpointing is the process of saving this status information. This paper surveys the algorithms reported in the literature for checkpointing parallel/distributed systems. Most of the algorithms published for checkpointing in message passing systems are based on the seminal article by Chandy and Lamport. A large number of articles have been published in this area, relaxing the assumptions made in that article and extending it to minimise the overheads of coordination and context saving. Checkpointing algorithms for shared memory systems primarily extend cache coherence protocols to maintain a consistent memory. All of them assume that the main memory is safe for storing the context. Recently, algorithms have been published for distributed shared memory systems which extend the cache coherence protocols used in shared memory systems. These algorithms also include methods for storing the status of distributed memory in stable storage. Most of the algorithms assume that there is no knowledge about the programs being executed. It is felt, however, that in the development of parallel programs the user has to do a fair amount of work in distributing tasks, and that this information can be used effectively to simplify checkpointing and rollback recovery.
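The definition above can be illustrated with a minimal single-process sketch. It is not taken from any of the surveyed algorithms: the function name `run_with_checkpoints`, its parameters, and the use of an in-memory `pickle` byte string as a stand-in for stable storage are all assumptions made for illustration. The loop saves its status information at regular intervals and, on a simulated fault, rolls back to the most recent checkpoint and resumes from there.

```python
import pickle


def run_with_checkpoints(n_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing periodically.

    Hypothetical illustration: `fail_at` simulates a single fault at that
    step, after which the computation rolls back to the last checkpoint.
    Returns (total, number_of_rollbacks).
    """
    state = {"step": 0, "total": 0}
    # Initial checkpoint; the byte string stands in for stable storage.
    saved = pickle.dumps(state)
    rollbacks = 0

    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            # Simulated fault: discard the current state and resume
            # from the status information saved at the last checkpoint.
            fail_at = None  # the fault occurs only once
            state = pickle.loads(saved)
            rollbacks += 1
            continue

        # Normal processing.
        state["total"] += state["step"]
        state["step"] += 1

        # Checkpoint: interrupt normal processing to save the status.
        if state["step"] % checkpoint_every == 0:
            saved = pickle.dumps(state)

    return state["total"], rollbacks
```

With a fault injected at step 7 and a checkpoint every 3 steps, the computation rolls back to the checkpoint taken at step 6, repeats one step, and still produces the same sum as a fault-free run. The distributed algorithms surveyed in the paper address the harder problem of keeping such checkpoints mutually consistent across several communicating processes, which this sketch does not attempt.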
References
Ahmed R E, Frazier R C, Marinos P N 1990 Cache aided rollback error recovery (CARER) algorithms for shared memory multiprocessor systems. Proc. IEEE 20th Int. Symp. on Fault Tolerant Computing pp 82–88
Alvisi L, Hoppe B, Marzullo K 1993 Nonblocking and orphan free message logging protocols. Proc. IEEE 23rd Int. Symp. Fault Tolerant Computing pp 145–154
Archibald J, Baer J-L 1986 Cache coherence protocols: Evaluation using a multiprocessor simulation model. ACM Trans. Comput. Syst. 4: 273–298
Beck M, Plank J S, Kingsley G 1994 Compiler-assisted checkpointing. Technical Report, CS-94-269, University of Tennessee, Knoxville, TN
Bhargava B, Lian S-R, Leu P-J 1990 Experimental evaluation of concurrent checkpointing and rollback-recovery algorithms. Proc. IEEE 6th Int. Conf. on Data Eng. pp 182–189
Borg A, Blau W, Gratsch W, Herrman H, Oberle W 1989 Fault tolerance under UNIX. ACM Trans. Comput. Syst. 7: 1–24
Bowen N S, Pradhan K 1992 Virtual checkpoints: Architecture and performance. IEEE Trans. Comput. 41: 516–525
Brown L, Wu J 1994 Dynamic snooping in a fault-tolerant distributed shared memory. Proc. IEEE 14th Int. Conf. on Distributed Computing Syst. pp 218–226
Chandy K M, Lamport L 1985 Distributed snapshots: Determining global states of distributed systems. ACM Trans. Comput. Syst. 3: 63–75
Chandy K M, Ramamoorthy C V 1972 Rollback and recovery strategies for computer programs. IEEE Trans. Comput. C-21: 546–556
Chen P M, Lee E K, Gibson G A, Katz R H, Patterson D A 1994 RAID: High-performance, reliable secondary storage. ACM Comput. Surv. 26: 145–185
Chen S-K, Tsai W T, Thuraisingham M B 1989 Recovery point selection on a reverse binary tree task model. IEEE Trans. Software Eng. 15: 963–976
Cristian F, Jahanian F 1991 A timestamp-based checkpointing protocol for long-lived distributed computations. Proc. IEEE Conf. on Reliable Distributed Syst. pp 12–20
Elnozahy E N, Zwaenepoel W 1992 Manetho: Transparent rollback recovery with low overhead, limited rollback and fast output commit. IEEE Trans. Comput. 41: 526–531
Elnozahy E N, Zwaenepoel W 1994 On the use and implementation of message logging. Proc. IEEE Int. Symp. on Fault Tolerant Computing pp 298–307
Elnozahy E N, Johnson D B, Zwaenepoel W 1992 The performance of consistent checkpointing. Proc. IEEE 11th Symp. on Reliable Distributed Syst. pp 39–47
Fitzgerald R, Rashid R F 1986 The integration of virtual memory management and interprocess communication in Accent. ACM Trans. Comput. Syst. 4: 147–177
IEEE 1992 Std. 1596–1992. IEEE Scalable Coherent Interface (SCI) (Piscataway, NJ: IEEE)
Janssens B, Fuchs W K 1994 The performance of cache based error recovery in multiprocessors. IEEE Trans. Parallel Distributed Syst. 5: 1033–1043
Johnson D B, Zwaenepoel W 1987 Sender-based message logging. Proc. IEEE Int. Symp. Fault Tolerant Comput. pp 14–19
Juang T T-Y, Venkatesan S 1991 Crash recovery with little overhead. Proc. IEEE 11th Int. Conf. on Distributed Comput. Syst. pp 454–461
Kalaiselvi S, Rajaraman V 2000 Task graph based checkpointing in parallel/distributed systems. J. Parallel Distributed Comput. (submitted)
Kim J L, Park T 1993 An efficient protocol for checkpointing recovery in distributed systems. IEEE Trans. Parallel Distributed Syst. 4: 955–960
Koo R, Toueg S 1987 Checkpointing and rollback recovery for distributed systems. IEEE Trans. Software Eng. SE-13: 23–31
Lai T H, Yang T H 1987 On distributed snapshots. Inf. Process. Lett. 25: 153–158
Lamport L 1978 Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21: 558–565
Leong H V, Agrawal D 1994 Using message semantics to reduce rollback in optimistic message logging recovery schemes. Proc. IEEE 14th Conf. on Distributed Computing Syst. pp 227–234
Leu P-J, Bhargava B 1988 Concurrent robust checkpointing and recovery in distributed systems. Proc. Int. Conf. on Data Engineering pp 154–163
Li C-C J, Fuchs W K 1990 CATCH: Compiler-assisted techniques for checkpointing. Proc. 1990 Int. Symp. on Fault Tolerant Computing pp 74–81
Li K, Naughton J F, Plank S 1991 Checkpointing multicomputer applications. Proc. IEEE Conf. on Reliable Distributed Syst. pp 2–11
Manivannan D, Singhal M 1996 A low-overhead recovery technique using quasi-synchronous checkpointing. Proc. IEEE Int. Conf. on Distributed Computing Syst. pp 100–107
Mishra S K, Raghavan V V, Tzeng N-F 1991 Efficient algorithms for selection of recovery points in task models. IEEE Trans. Software Eng. 17: 731–734
Netzer R H B, Xu J 1995 Necessary and sufficient conditions for consistent global snapshots. IEEE Trans. Parallel Distributed Syst. 6: 165–169
Plank J S 1993 Efficient checkpointing on MIMD architectures. Doctoral dissertation, Dept. of Computer Science, Princeton University, Princeton, NJ
Plank J S, Li K 1994a ickp: A consistent checkpointer for multicomputers. IEEE Parallel Distributed Technol. 2: 62–67
Plank J S, Li K 1994b Faster checkpointing with N + 1 parity. Proc. IEEE Int. Symp. on Fault Tolerant Computing pp 288–297
Plank J S, Beck M, Kingsley G, Li K 1995 Libckpt: Transparent checkpointing under Unix. USENIX Winter Technical Conference, New Orleans, Louisiana
Ralston A, Reilly E D 1993 Encyclopedia of computer science 3rd edn (New York: IEEE Press)
Randell B 1975 System structure for software fault tolerance. IEEE Trans. Software Eng. SE-1: 220–232
Reuter A 1980 A fast transaction-oriented logging scheme for undo recovery. IEEE Trans. Software Eng. SE-6: 348–356
Siewiorek D P, Swarz S 1982 The theory and practice of reliable system design (Cambridge, MA: Digital Press)
Silva L, Silva J 1992 Global checkpointing for distributed programs. Proc. IEEE 11th Symp. on Reliable Distributed Syst. pp 155–162
Strom R E, Yemini S 1985 Optimistic recovery in distributed systems. ACM Trans. Comput. Syst. 3: 204–226
Tam V-O, Hsu M 1990 Fast recovery in distributed shared virtual memory systems. Proc. IEEE 10th Int. Conf. on Distributed Computing Syst. pp 38–45
Theimer M, Lantz K, Cheriton D R 1985 Preemptable remote execution facilities in the V-system. Proc. of the 10th ACM Symp. on Operating Syst. Principles pp 2–11
Tong Z, Kain R Y, Tsai W T 1992 Rollback recovery in distributed systems using loosely synchronized clocks. IEEE Trans. Parallel Distributed Syst. 3: 246–251
Upadhyaya J S, Saluja K K 1986 A watchdog processor based general rollback technique with multiple retries. IEEE Trans. Software Eng. SE-12: 87–95
Venkatesan S 1989 Message optimal incremental snapshots. Proc. IEEE 9th Int. Conf. Distributed Comput. Syst. pp 53–60
Venkatesh K, Radhakrishnan T, Li H F 1987 Optimal checkpointing and local recording for domino-free rollback recovery. Inf. Process. Lett. 25: 295–303
Wang Y-M, Fuchs W K 1992 Optimistic message logging for independent checkpointing in message passing systems. Proc. IEEE 11th Symp. on Reliable Distributed Syst. pp 147–154
Wang Y-M, Chung P-Y, Lin I-J, Fuchs W K 1995 Checkpoint space reclamation for uncoordinated checkpointing in message passing systems. IEEE Trans. Parallel Distributed Syst. 6: 546–554
Wu K-L, Fuchs W K 1989 Recoverable distributed shared virtual memory: Memory coherence and storage structures. Proc. 1989 Int. Symp. on Fault Tolerant Computing pp 520–527
Wu K-L, Fuchs W K, Patel J H 1989 Cache based error recovery for shared memory multiprocessor systems. Proc. Int. Conf. on Parallel Processing pp I159–I166
Young C, Chiu G 1994 A crash recovery technique in distributed computing systems. Proc. IEEE 14th Int. Conf. on Distributed Computing Syst. pp 235–242
Zomaya A Y H 1996 Parallel and distributed computing handbook (New York: McGraw-Hill)
Cite this article
Kalaiselvi, S., Rajaraman, V. A survey of checkpointing algorithms for parallel and distributed computers. Sadhana 25, 489–510 (2000). https://doi.org/10.1007/BF02703630
DOI: https://doi.org/10.1007/BF02703630