Implementing Forward Recovery Using Checkpoints in Distributed Systems

  • Junsheng Long
  • W. Kent Fuchs
  • Jacob A. Abraham
Part of the Dependable Computing and Fault-Tolerant Systems book series (DEPENDABLECOMP, volume 6)


This paper describes the implementation of a forward recovery strategy in a Sun NFS environment. The implementation is based on the concept of lookahead execution with rollback validation. It uses replicated tasks executing on different processors for forward recovery, and it detects errors by comparing the checkpoints of those replicas. In the experiment described, the recovery strategy achieves an execution time close to that of error-free execution, with an average redundancy lower than that of triple modular redundancy (TMR).
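
As a rough illustration of checkpoint comparison with rollback validation, the following Python sketch shows one way the decision at a checkpoint could be made. It is a sketch under stated assumptions, not the paper's implementation: the names `digest`, `resolve_checkpoint`, and `step` are hypothetical, task state is modeled as a byte string, and the validation run happens sequentially here, whereas the strategy described overlaps validation with lookahead execution on other processors.

```python
import hashlib
from typing import Callable, Optional, Tuple


def digest(state: bytes) -> str:
    """Return a fixed-size fingerprint of a checkpoint (assumed: SHA-256)."""
    return hashlib.sha256(state).hexdigest()


def resolve_checkpoint(
    prev: bytes,
    step: Callable[[bytes], bytes],
    replica_a: bytes,
    replica_b: bytes,
) -> Tuple[Optional[bytes], str]:
    """Decide which checkpoint to commit after two replicas finish an interval.

    prev      -- last committed checkpoint (assumed correct)
    step      -- hypothetical function that re-executes one interval from a checkpoint
    replica_a -- checkpoint produced by the first replica
    replica_b -- checkpoint produced by the second replica
    """
    # Error detection: the two replicas' checkpoints are compared.
    if digest(replica_a) == digest(replica_b):
        return replica_a, "agree"

    # Mismatch: a validation task re-executes the interval from the
    # previous checkpoint (assumption: this run is itself error-free).
    validated = step(prev)

    # Forward recovery: keep whichever replica matches the validation run,
    # so work continued from that checkpoint does not have to be redone.
    for candidate in (replica_a, replica_b):
        if digest(candidate) == digest(validated):
            return candidate, "forward"

    # No two results agree: fall back to rolling back to prev.
    return None, "rollback"
```

In this sketch a single faulty replica costs one extra validation run rather than a rollback of both replicas, which is the intuition behind an average redundancy below that of running three copies all the time.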


Keywords (machine-generated, not supplied by the authors): Execution Time; Executable File; Concurrent Error Detection; Validation Task; Network File System





Copyright information

© Springer-Verlag/Wien 1992

Authors and Affiliations

  • Junsheng Long (1)
  • W. Kent Fuchs (1)
  • Jacob A. Abraham (2)
  1. Center for Reliable and High-Performance Computing, Coordinated Science Laboratory, University of Illinois, Urbana, USA
  2. Computer Engineering Research Center, Department of Electrical and Computer Engineering, University of Texas, Austin, USA
