A survey of checkpointing algorithms for parallel and distributed computers

Sadhana

Abstract

A checkpoint is a designated place in a program at which normal processing is interrupted specifically to preserve the status information needed to resume processing at a later time. Checkpointing is the process of saving this status information. This paper surveys the algorithms reported in the literature for checkpointing parallel/distributed systems. Most of the algorithms published for checkpointing in message passing systems are based on the seminal article by Chandy and Lamport. A large number of articles have extended that work by relaxing its assumptions and by minimising the overheads of coordination and context saving. Checkpointing algorithms for shared memory systems primarily extend cache coherence protocols to maintain a consistent memory; all of them assume that the main memory is safe for storing the context. More recently, algorithms have been published for distributed shared memory systems. These also extend the cache coherence protocols used in shared memory systems, but additionally include methods for storing the status of distributed memory in stable storage. Most of the algorithms assume no knowledge of the programs being executed. In developing parallel programs, however, the user has to do a fair amount of work in distributing tasks, and this information can be used effectively to simplify checkpointing and rollback recovery.
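The definition of a checkpoint given in the abstract can be illustrated for a single sequential process. The following is only a minimal sketch under stated assumptions, not one of the surveyed algorithms: the function names (`save_checkpoint`, `load_checkpoint`, `run_sum`), the `fail_at` hook that simulates a crash, and the use of a local file as a stand-in for stable storage are all hypothetical choices made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Save the state to a temporary file, then rename it into place.

    The rename keeps the previous checkpoint intact if a failure occurs
    while the new one is being written. A local file stands in for
    stable storage here (an assumption of this sketch).
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_sum(n, path, interval=10, fail_at=None):
    """Sum the integers 0..n-1, checkpointing every `interval` iterations.

    `fail_at` is a hypothetical hook that simulates a crash just before
    the given iteration is processed.
    """
    # Resume from the last checkpoint if one exists, else start fresh.
    state = load_checkpoint(path) or {"i": 0, "total": 0}
    while state["i"] < n:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If a run fails partway, calling `run_sum` again with the same `path` rolls back to the most recent checkpoint and repeats only the work done after it. The distributed algorithms surveyed in the paper address the harder problem of keeping such checkpoints consistent across communicating processes, which this single-process sketch does not attempt.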

References

  • Ahmed R E, Frazier R C, Marinos P N 1990 Cache aided rollback error recovery (CARER) algorithms for shared memory multiprocessor systems. Proc. IEEE 20th Int. Symp. on Fault Tolerant Computing pp 82–88

  • Alvisi L, Hoppe B, Marzullo K 1993 Nonblocking and orphan free message logging protocols. Proc. IEEE 23rd Int. Symp. Fault Tolerant Computing pp 145–154

  • Archibald J, Baer J-L 1986 Cache coherence protocols: Evaluation using a multiprocessor simulation model. ACM Trans. Comput. Syst. 4: 273–298

  • Beck M, Plank J S, Kingsley G 1994 Compiler-assisted checkpointing. Technical Report, CS-94-269, University of Tennessee, Knoxville, TN

  • Bhargava B, Lian S-R, Leu P-J 1990 Experimental evaluation of concurrent checkpointing and rollback-recovery algorithms. Proc. IEEE 6th Int. Conf. on Data Eng. pp 182–189

  • Borg A, Blau W, Gratsch W, Herrman H, Oberle W 1989 Fault tolerance under UNIX. ACM Trans. Comput. Syst. 7: 1–24

  • Bowen N S, Pradhan K 1992 Virtual checkpoints: Architecture and performance. IEEE Trans. Comput. 41: 516–525

  • Brown L, Wu J 1994 Dynamic snooping in a fault-tolerant distributed shared memory. Proc. IEEE 14th Int. Conf. on Distributed Computing Syst. pp 218–226

  • Chandy K M, Lamport L 1985 Distributed snapshots: Determining global states of distributed systems. ACM Trans. Comput. Syst. 3: 63–75

  • Chandy K M, Ramamoorthy C V 1972 Rollback and recovery strategies for computer programs. IEEE Trans. Comput. C-21: 546–556

  • Chen P M, Lee E K, Gibson G A, Katz R H, Patterson D A 1994 RAID: High-performance, reliable secondary storage. ACM Comput. Surv. 26: 145–185

  • Chen S-K, Tsai W T, Thuraisingham M B 1989 Recovery point selection on a reverse binary tree task model. IEEE Trans. Software Eng. 15: 963–976

  • Cristian F, Jahanian F 1991 A timestamp-based checkpointing protocol for long-lived distributed computations. Proc. IEEE Conf. on Reliable Distributed Syst. pp 12–20

  • Elnozahy E N, Zwaenepoel W 1992 Manetho: Transparent rollback recovery with low overhead, limited rollback and fast output commit. IEEE Trans. Comput. 41: 526–531

  • Elnozahy E N, Zwaenepoel W 1994 On the use and implementation of message logging. Proc. IEEE Int. Symp. on Fault Tolerant Computing pp 298–307

  • Elnozahy E N, Johnson D B, Zwaenepoel W 1992 The performance of consistent checkpointing. Proc. IEEE 11th Symp. on Reliable Distributed Syst. pp 39–47

  • Fitzgerald R, Rashid R F 1986 The integration of virtual memory management and interprocess communication in Accent. ACM Trans. Comput. Syst. 4: 147–177

  • IEEE 1992 Std. 1596–1992. IEEE Scalable Coherent Interface (SCI) (Piscataway, NJ: IEEE)

  • Janssens B, Fuchs W K 1994 The performance of cache based error recovery in multiprocessors. IEEE Trans. Parallel Distributed Syst. 5: 1033–1043

  • Johnson D B, Zwaenepoel W 1987 Sender-based message logging. Proc. IEEE Int. Symp. Fault Tolerant Comput. pp 14–19

  • Juang T T-Y, Venkatesan S 1991 Crash recovery with little overhead. Proc. IEEE 11th Int. Conf. on Distributed Comput. Syst. pp 454–461

  • Kalaiselvi S, Rajaraman V 2000 Task graph based checkpointing in parallel/distributed systems. J. Parallel Distributed Comput. (submitted)

  • Kim J L, Park T 1993 An efficient protocol for checkpointing recovery in distributed systems. IEEE Trans. Parallel Distributed Syst. 4: 955–960

  • Koo R, Toueg S 1987 Checkpointing and rollback recovery for distributed systems. IEEE Trans. Software Eng. SE-13: 23–31

  • Lai T H, Yang T H 1987 On distributed snapshots. Inf. Process. Lett. 25: 153–158

  • Lamport L 1978 Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21: 558–565

  • Leong H V, Agrawal D 1994 Using message semantics to reduce rollback in optimistic message logging recovery schemes. Proc. IEEE 14th Conf. on Distributed Computing Syst. pp 227–234

  • Leu P-J, Bhargava B 1988 Concurrent robust checkpointing and recovery in distributed systems. Proc. Int. Conf. on Data Engineering pp 154–163

  • Li C-C J, Fuchs W K 1990 CATCH: Compiler-assisted techniques for checkpointing. Proc. 1990 Int. Symp. on Fault Tolerant Computing pp 74–81

  • Li K, Naughton J F, Plank S 1991 Checkpointing multicomputer applications. Proc. IEEE Conf. on Reliable Distributed Syst. pp 2–11

  • Manivannan D, Singhal M 1996 A low-overhead recovery technique using quasi-synchronous checkpointing. Proc. IEEE Int. Conf. on Distributed Computing Syst. pp 100–107

  • Mishra S K, Raghavan V V, Tzeng N-F 1991 Efficient algorithms for selection of recovery points in task models. IEEE Trans. Software Eng. 17: 731–734

  • Netzer R H B, Xu J 1995 Necessary and sufficient conditions for consistent global snapshots. IEEE Trans. Parallel Distributed Syst. 6: 165–169

  • Plank J S 1993 Efficient checkpointing on MIMD architectures. Doctoral dissertation, Dept. of Computer Science, Princeton University, Princeton, NJ

  • Plank J S, Li K 1994a ickp: A consistent checkpointer for multicomputers. IEEE Parallel Distributed Technol. 2: 62–67

  • Plank J S, Li K 1994b Faster checkpointing with N + 1 parity. Proc. IEEE Int. Symp. on Fault Tolerant Computing pp 288–297

  • Plank J S, Beck M, Kingsley G, Li K 1995 Libckpt: Transparent checkpointing under Unix. USENIX Winter Technical Conference, New Orleans, Louisiana

  • Ralston A, Reilly E D 1993 Encyclopedia of computer science 3rd edn (New York: IEEE Press)

  • Randell B 1975 System structure for software fault tolerance. IEEE Trans. Software Eng. SE-1: 220–232

  • Reuter A 1980 A fast transaction-oriented logging scheme for undo recovery. IEEE Trans. Software Eng. SE-6: 348–356

  • Siewiorek D P, Swarz S 1982 The theory and practice of reliable system design (Cambridge, MA: Digital Press)

  • Silva L, Silva J 1992 Global checkpointing for distributed programs. Proc. IEEE 11th Symp. on Reliable Distributed Syst. pp 155–162

  • Strom R E, Yemini S 1985 Optimistic recovery in distributed systems. ACM Trans. Comput. Syst. 3: 204–226

  • Tam V-O, Hsu M 1990 Fast recovery in distributed shared virtual memory systems. Proc. IEEE 10th Int. Conf. on Distributed Computing Syst. pp 38–45

  • Theimer M, Lantz K, Cheriton D R 1985 Preemptable remote execution facilities in the V-system. Proc. of the 10th ACM Symp. on Operating Syst. Principles pp 2–11

  • Tong Z, Kain R Y, Tsai W T 1992 Rollback recovery in distributed systems using loosely synchronized clocks. IEEE Trans. Parallel Distributed Syst. 3: 246–251

  • Upadhyaya J S, Saluja K K 1986 A watchdog processor based general rollback technique with multiple retries. IEEE Trans. Software Eng. SE-12: 87–95

  • Venkatesan S 1989 Message optimal incremental snapshots. Proc. IEEE 9th Int. Conf. Distributed Comput. Syst. pp 53–60

  • Venkatesh K, Radhakrishnan T, Li H F 1987 Optimal checkpointing and local recording for domino-free rollback recovery. Inf. Process. Lett. 25: 295–303

  • Wang Y-M, Fuchs W K 1992 Optimistic message logging for independent checkpointing in message passing systems. Proc. IEEE 11th Symp. on Reliable Distributed Syst. pp 147–154

  • Wang Y-M, Chung P-Y, Lin I-J, Fuchs W K 1995 Checkpoint space reclamation for uncoordinated checkpointing in message passing systems. IEEE Trans. Parallel Distributed Syst. 6: 546–554

  • Wu K-L, Fuchs W K 1989 Recoverable distributed shared virtual memory: Memory coherence and storage structures. Proc. 1989 Int. Symp. on Fault Tolerant Computing pp 520–527

  • Wu K-L, Fuchs W K, Patel J H 1989 Cache based error recovery for shared memory multiprocessor systems. Proc. Int. Conf. on Parallel Processing pp I159–I166

  • Young C, Chiu G 1994 A crash recovery technique in distributed computing systems. Proc. IEEE 14th Int. Conf. on Distributed Computing Syst. pp 235–242

  • Zomaya A Y H 1996 Parallel and distributed computing handbook (New York: McGraw-Hill)

Kalaiselvi, S., Rajaraman, V. A survey of checkpointing algorithms for parallel and distributed computers. Sadhana 25, 489–510 (2000). https://doi.org/10.1007/BF02703630
