Implementing Forward Recovery Using Checkpoints in Distributed Systems
This paper describes the implementation of a forward recovery strategy in a Sun NFS environment. The implementation is based on lookahead execution with rollback validation: replicated tasks executing on different processors provide forward recovery, and checkpoint comparison provides error detection. In the experiment described, the strategy's execution time is close to the error-free execution time, and its average redundancy is lower than that of triple modular redundancy (TMR).
Keywords: Execution Time; Executable File; Concurrent Error Detection; Validation Task; Network File System
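The general idea named in the abstract (duplicated tasks, checkpoint comparison, lookahead execution, and a validation re-execution) can be illustrated with a small sequential simulation. This is a minimal sketch under stated assumptions, not the paper's implementation: the names `work`, `with_fault`, and `run` are hypothetical, the "processors" are plain function calls, faults are injected deterministically, and the retry budget `max_rollbacks` is an illustrative addition.

```python
from typing import Callable, List, Tuple

State = int
# (checkpoint state, interval index) -> next checkpoint state
Step = Callable[[State, int], State]


def work(state: State, interval: int) -> State:
    """Hypothetical fault-free computation for one checkpoint interval."""
    return state * 31 + interval


def with_fault(step: Step, bad_interval: int) -> Step:
    """Return a copy of `step` that corrupts its result in one interval."""
    def wrapped(state: State, interval: int) -> State:
        result = step(state, interval)
        return result + 1 if interval == bad_interval else result
    return wrapped


def run(replica_a: Step, replica_b: Step, validator: Step,
        initial: State, intervals: int,
        max_rollbacks: int = 3) -> Tuple[State, List[str]]:
    """Simulate duplicated execution with lookahead and validation."""
    committed, log, i, rollbacks = initial, [], 0, 0
    while i < intervals:
        a, b = replica_a(committed, i), replica_b(committed, i)
        if a == b:  # checkpoints agree: commit and move on
            committed, i = a, i + 1
            log.append("commit")
            continue
        # Checkpoints disagree. Lookahead: compute interval i+1 from both
        # candidate states, duplicated so the results can be compared later.
        ahead = {}
        if i + 1 < intervals:
            ahead = {c: (replica_a(c, i + 1), replica_b(c, i + 1))
                     for c in (a, b)}
        # Validation: re-execute interval i from the last committed checkpoint.
        v = validator(committed, i)
        if v not in (a, b):  # no two results agree: roll back and retry
            rollbacks += 1
            if rollbacks > max_rollbacks:
                raise RuntimeError("retry budget exhausted in this sketch")
            log.append("rollback")
            continue
        committed, i = v, i + 1
        log.append("validated")
        pair = ahead.get(v)
        if pair and pair[0] == pair[1]:  # lookahead from the valid state is usable
            committed, i = pair[0], i + 1
            log.append("lookahead-commit")
    return committed, log


if __name__ == "__main__":
    # One replica suffers a fault in interval 1; the validator is fault-free.
    final, log = run(with_fault(work, 1), work, work, initial=1, intervals=4)
    print(final, log)
```

In this sketch only two copies of the task run while checkpoints agree, and extra copies run only after a mismatch, which is one way to read the abstract's claim of average redundancy below TMR. When validation confirms one candidate, the lookahead work started from that candidate is kept, so the interval following the error does not have to be recomputed.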