A survey of checkpointing algorithms for parallel and distributed computers

Kalaiselvi, S.; Rajaraman, V.

doi:10.1007/BF02703630

A survey of checkpointing algorithms for parallel and distributed computers

Published: October 2000

Volume 25, pages 489–510, (2000)
Cite this article

Sadhana Aims and scope Submit manuscript

S. Kalaiselvi¹ &
V. Rajaraman²

207 Accesses
28 Citations
Explore all metrics

Abstract

Checkpoint is defined as a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time.Checkpointing is the process of saving the status information. This paper surveys the algorithms which have been reported in the literature for checkpointing parallel/distributed systems. It has been observed that most of the algorithms published for checkpointing in message passing systems are based on the seminal article by Chandy and Lamport. A large number of articles have been published in this area by relaxing the assumptions made in this paper and by extending it to minimise the overheads of coordination and context saving. Checkpointing for shared memory systems primarily extend cache coherence protocols to maintain a consistent memory. All of them assume that the main memory is safe for storing the context. Recently algorithms have been published for distributed shared memory systems, which extend the cache coherence protocols used in shared memory systems. They however also include methods for storing the status of distributed memory in stable storage. Most of the algorithms assume that there is no knowledge about the programs being executed. It is however felt that in development of parallel programs the user has to do a fair amount of work in distributing tasks and this information can be effectively used to simplify checkpointing and rollback recovery.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

References

Ahmed R E, Frazier R C, Marinos P N 1990 Cache aided rollback error recovery (CARER) algorithms for shared memory multiprocessor systems.Proc. IEEE 20th Int. Symp. on Fault Tolerant Computing pp 82–88
Alvisi L, Hoppe B, Marzullo K 1993 Nonblocking and orphan free message logging protocols.Proc. IEEE 23rd Int. Symp. Fault Tolerant Computing pp 145–154
Archibald J, Baer J-L 1986 Cache coherence protocols: Evaluation using a multiprocessor simulation model.ACM Trans. Comput. Syst. 4: 273–298
Article Google Scholar
Beck M, Plank J S, Kingsley G 1994 Compiler-assisted checkpointing. Technical Report, CS-94-269, University of Tennessee, Knoxville, TN
Google Scholar
Bhargava B, Lian S-R, Leu P-J 1990 Experimental evaluation of concurrent checkpointing and rollback-recovery algorithms.Proc. IEEE 6th Int. Conf. on Data Eng. pp 182–189
Borg A, Blau W, Gratsch W, Herrman H, Oberle W 1989 Fault tolerance under UNIX.ACM Trans. Comput. Syst. 7: 1–24
Article Google Scholar
Bowen N S, Pradhan K 1992 Virtual checkpoints: Architecture and performance.IEEE Trans. Comput. 41: 516–525
Article Google Scholar
Brown L, Wu J 1994 Dynamic snooping in a fault-tolerant distributed shared memory.Proc. IEEE 14th Int. Conf. on Distributed Computing Syst. pp 218–226
Chandy K M, Lamport L 1985 Distributed snapshots: Determining global states of distributed systems.ACM Trans. Comput. Syst. 3: 63–75
Article Google Scholar
Chandy K M, Ramamoorthy C V 1972 Rollback and recovery strategies for computer programs.IEEE Trans. Comput. C-21: 546–556
Article MathSciNet Google Scholar
Chen P M, Lee E K, Gibson G A, Katz R H, Patterson D A 1994 RAID-High performance, reliable secondary storage.ACM Comput. Surv. 26: 145–185
Article Google Scholar
Chen S-K, Tsai W T, Thuraisingham M B 1989 Recovery point selection on a reverse binary tree task model.IEEE Trans. Software Eng. 15: 963–976
Article Google Scholar
Cristian F, Jahanian F 1991 A timestamp-based checkpointing protocol for long-lived distributed computations.Proc. IEEE Conf. on Reliable Distributed Syst. pp 12–20
Elnozahy E N, Zwaenepoel W 1992 Manetho: Transparent rollback recovery with low overhead, limited rollback and fast output commit.IEEE Trans. Comput. 41: 526–531
Article Google Scholar
Elnozahy E N, Zwaenepoel W 1994 On the use and implementation of message logging.Proc. IEEE Int. Sytnp. on Fault Tolerant Computing pp 298–307
Elnozahy E N, Johnson D B, Zwaenepoel W 1992 The performance of consistent checkpointing.Proc. IEEE 11th Symp. on Reliable Distributed Syst. pp 39–47
Fitzgerald R, Rashid R F 1986 The integration of virtual memory management and interprocess communication in accent.ACM Trans. Comput. Syst. 4: 147–177
Article Google Scholar
IEEE 1992 Std. 1596–1992.IEEE Scalable Coherent Interface (SCI) (Piscataway, NJ: IEEE)
Google Scholar
Janssens B, Fuchs W K 1994 The performance of cache based error recovery in multiprocessors.IEEE Trans. Parallel Distributed Syst. 5: 1033–1043
Article Google Scholar
Johnson D B, Zwaenepoel W 1987 Sender-based message logging.Proc. IEEE Int. Symp. Fault Tolerant Comput. pp. 14–19
Juang T T-Y, Venkatesan S 1991 Crash recovery with little overhead.Proc. IEEE 11th Int. Conf. on Distributed Comput. Syst. pp 454–461
Kalaiselvi S, Rajaraman V 2000 Task graph based checkpointing in parallel/distributed systems.J. Parallel Distributed Comput. (submitted)
Kim J L, Park T 1993 An efficient protocol for checkpointing recovery in distributed systems.IEEE Trans. Parallel Distributed Syst. 4: 955–960
Article Google Scholar
Koo R, Toueg S 1987 Checkpointing and rollback recovery for distributed systems.IEEE Trans. Software Eng. SE-13: 23–31
Article Google Scholar
Lai T H, Yang T H 1987 On distributed snapshots.Inf. Process. Lett. 25: 153–158
Article MATH MathSciNet Google Scholar
Lamport L 1978 Time clocks and the ordering of events in a distributed system.Commun. ACM 21: 558–565
Article MATH Google Scholar
Leong H V, Agrawal D 1994 Using message semantics to reduce rollback in optimistic message logging recovery schemes.Proc. IEEE 14th Conf. on Distributed Computing Syst. pp 227–234
Leu P-J, Bhargava B 1988 Concurrent robust checkpointing and recovery in distributed systems.Proc. Int. Conf. on Data Engineering pp 154–163
Li C-C J, Fuchs W K 1990 CATCH-Compiler-assisted techniques for checkpointing.Proc. 1990 Int. Symp. on Fault Tolerant Computing pp 74–81
Li K, Naughton J F, Plank S 1991 Checkpointing multicomputer applications.Proc. IEEE Conf. on Reliable Distributed Syst. pp 2–11
Manivannan D, Singhal M 1996 A low-overhead recovery technique using quasi-synchronous checkpointing.Proc. IEEE Int. Conf. on Distributed Computing Syst. pp 100–107
Mishra S K, Raghavan V V, Tzeng N-F 1991 Efficient algorithms for selection of recovery points in task models.IEEE Trans. Software Eng. 17: 731–734
Article Google Scholar
Netzer R H B, Xu J 1995 Necessary and sufficient conditions for consistent global snapshots.IEEE Trans. Parallel Distributed Syst. 6: 165–169
Article Google Scholar
Plank J S 1993Efficient checkpointing on MIMD architectures. Doctoral dissertation, Dept. of Computer Science, Princeton University, Princeton, NJ
Google Scholar
Plank J S, Li K 1994a ickp: A consistent checkpointer for multicomputers.IEEE Parallel Distributed Technol. 2: 62–67
Article Google Scholar
Plank J S, Li K 1994b Faster checkpointing withN + 1 parity.Proc. IEEE Int. Symp. on Fault Tolerant Computing pp 288–297
Plank J S, Beck M, Kingsley G, Li K 1995 Libckpt: Transparent checkpointing under unix.USENIX Winter Technical Conference, New Orleans, Louisiana
Google Scholar
Ralston A, Reily E D 1993Encyclopedia of computer science 3rd edn (New York: IEEE Press)
Google Scholar
Randell B 1975 System structure for software fault tolerance.IEEE Trans. Software Eng. SE-1: 220–232
Google Scholar
Reuter A 1980 A fast transaction-oriented logging scheme for undo recovery.IEEE Trans. Software Eng. SE-6: 348–356
Article Google Scholar
Siewiorek D P, Swarz S 1982The theory and practice of reliable system design (Cambridge, MA: Digital Press)
Google Scholar
Silva L, Silva J 1992 Global checkpointing for distributed programs.Proc. IEEE 11th Sytnp. on Reliable Distributed Syst. pp 155–162
Strom R E, Yemini S 1985 Optimistic recovery in distributed systems.ACM Trans. Comput. Syst. 3: 204–226
Article Google Scholar
Tarn V-O, Hsu M 1990 Fast recovery in distributed shared virtual memory systems.Proc. IEEE 10th Int. Conf. on Distributed Computing Syst. pp 38–45
Theimer M, Lantz K, Cheriton D R 1985 Preemptable remote execution facilities in the V-system.Proc. of the 10th ACM Symp. on Operating Syst. Principles pp 2–11
Tong Z, Kain R Y, Tsai W T 1992 Rollback recovery in distributed systems using loosely synchronized clocks.IEEE Trans. Parallel Distributed Syst. 3: 246–251
Article Google Scholar
Upadhyaya J S, Saluja K K 1986 A watchdog processor based general rollback technique with multiple retries.IEEE Trans. Software Eng. SE-12: 87–95
Google Scholar
Venkatesan S 1989 Message optimal incremental snapshots.Proc. IEEE 9th Int. Conf. Distributed Comput. Syst. pp 53–60
Venkatesh K, Radhakrishnan T, Li H F 1987 Optimal checkpointing and local recording for domino-free rollback recovery.Inf. Process. Lett. 25: 295–303
Article Google Scholar
Wang Y-M, Fuchs W K 1992 Optimistic message logging for independent checkpointing in message passing systems.Proc. IEEE 11th Symp. on Reliable Distributed Syst. pp 147–154
Wang Y-M, Chung P-Y, Lin I-J, Fuchs W K 1995 Checkpoint space reclamation for uncoordinated checkpointing in message passing systems.IEEE Trans. Parallel Distributed Syst. 6: 546–554
Article Google Scholar
Wu K-L, Fuchs W K 1989 Recoverable distributed shared virtual memory: Memory coherence and storage structures.Proc. 1989 Int. Symp. on Fault Tolerant Computing pp 520–527
Wu K-L, Fuchs W K, Patel J H 1989 Cache based error recovery for shared memory multiprocessor systems.Proc. Int. Conf. on Parallel Processing pp I159–I166
Young C, Chiu G 1994 A crash recovery technique in distributed computing systems.Proc. IEEE 14th Int. Conf. on Distributed Computing Syst. pp 235–242
Zomaya A Y H 1996Parallel and distributed computing handbook (New York: McGraw-Hill)
Google Scholar

Download references

Author information

Authors and Affiliations

Supercomputer Education and Research Centre (SERC), Indian Institute of Science, 560012, Bangalore, India
S. Kalaiselvi
Jawaharlal Nehru Centre for Advanced Scientific Research, Indian Institute of Science Campus, 560012, Bangalore, India
V. Rajaraman

Authors

S. Kalaiselvi
View author publications
You can also search for this author in PubMed Google Scholar
V. Rajaraman
View author publications
You can also search for this author in PubMed Google Scholar

Rights and permissions

Reprints and permissions

About this article

Cite this article

Kalaiselvi, S., Rajaraman, V. A survey of checkpointing algorithms for parallel and distributed computers. Sadhana 25, 489–510 (2000). https://doi.org/10.1007/BF02703630

Download citation

Received: 27 August 1998
Revised: 08 June 2000
Issue Date: October 2000
DOI: https://doi.org/10.1007/BF02703630

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A survey of checkpointing algorithms for parallel and distributed computers

Abstract

Access this article

Similar content being viewed by others

Checkpointing Tools in a Supercomputer Center

Analysis of parallel application checkpoint storage for system configuration

Checkpointing in Parallel State-Machine Replication

References

Author information

Authors and Affiliations

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A survey of checkpointing algorithms for parallel and distributed computers

Abstract

Access this article

Similar content being viewed by others

Checkpointing Tools in a Supercomputer Center

Analysis of parallel application checkpoint storage for system configuration

Checkpointing in Parallel State-Machine Replication

References

Author information

Authors and Affiliations

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation