Abstract
This paper presents a checkpoint-setting technique that eliminates the domino effect in backward recovery in distributed systems. The technique is efficient, powerful, widely applicable, and easy to implement. Besides the theoretical analysis, an implementation on the UNIX system and a package for software fault tolerance are introduced. The problems of checkpoint management and process termination are then discussed.
Cite this article
Zhou, D., Xu, X. A distributed error recovery technique and its implementation and application on UNIX. J. of Compt. Sci. & Technol. 5, 127–138 (1990). https://doi.org/10.1007/BF02943419