A distributed error recovery technique and its implementation and application on UNIX

Zhou, Di; Xu, Xiangwen

doi:10.1007/BF02943419

Regular Papers
Published: April 1990

Volume 5, pages 127–138, (1990)
Journal of Computer Science and Technology

Zhou Di¹ &
Xu Xiangwen¹

Abstract

This paper presents a checkpoint-setting technique that eliminates the domino effect in backward recovery in distributed systems. The technique is efficient, powerful, widely applicable, and easy to implement. Besides the theoretical analysis, an implementation on the UNIX system and a package for software fault tolerance are introduced. The problems of checkpoint management and process termination are then discussed.

Author information

Zhou Di
Present address: Tsinghua University, Beijing, China

Authors and Affiliations

Lehrstuhl für Prozessrechner, Technische Universität München, Franz-Josephstr. 38/III, 8000 München 40, West Germany
Zhou Di & Xu Xiangwen

About this article

Cite this article

Zhou, D., Xu, X. A distributed error recovery technique and its implementation and application on UNIX. J. of Compt. Sci. & Technol. 5, 127–138 (1990). https://doi.org/10.1007/BF02943419

Received: 20 August 1988
Issue Date: April 1990
DOI: https://doi.org/10.1007/BF02943419
