A distributed error recovery technique and its implementation and application on UNIX

  • Regular Papers
  • Journal of Computer Science and Technology

Abstract

This paper presents a checkpoint-setting technique that eliminates the domino effect in backward recovery in distributed systems. The technique is efficient, powerful, widely applicable, and easy to implement. Besides the theoretical analysis, an implementation on the UNIX system and a package for software fault tolerance are introduced. The problems of checkpoint management and process termination are then discussed.

About this article

Cite this article

Zhou, D., Xu, X. A distributed error recovery technique and its implementation and application on UNIX. J. of Compt. Sci. & Technol. 5, 127–138 (1990). https://doi.org/10.1007/BF02943419

  • DOI: https://doi.org/10.1007/BF02943419