State Restoration in Distributed Systems
This paper concerns an important aspect of the problem of designing fault-tolerant distributed computing systems. The concepts involved in “backward error recovery”, i.e. restoring a system, or some part of a system, to a previous state which it is hoped or believed preceded the occurrence of any existing errors, are formalised and generalised so as to apply to concurrent, e.g. distributed, systems. Since there may be a great deal of independence between activities in a distributed system, the system can be restored to a state that could have existed rather than to one that actually existed.
The formalisation is based on the use of what we term “Occurrence Graphs” to represent the cause-effect relationships that exist between the events that occur when a system is operational, and to indicate existing possibilities for state restoration. A protocol is presented which could be used in each of the nodes in a distributed computing system in order to provide system recoverability even in the face of multiple faults.
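The idea of restoring to a state that could have existed can be illustrated with a small sketch. It is not the Occurrence Graph formalism or the protocol of the paper; it assumes only that events and their cause-effect relationships are given as a directed acyclic graph, and the names `dependents`, `restore`, and the event labels are hypothetical. Undoing an erroneous event requires undoing every event that causally follows it, while independent events may be kept.

```python
from collections import defaultdict


def dependents(edges, failed):
    """Return `failed` together with every event that causally follows it."""
    successors = defaultdict(set)
    for cause, effect in edges:
        successors[cause].add(effect)
    undone, stack = set(), [failed]
    while stack:
        event = stack.pop()
        if event not in undone:
            undone.add(event)
            stack.extend(successors[event])
    return undone


def restore(events, edges, failed):
    """Return the events that survive once `failed` and its effects are undone."""
    return set(events) - dependents(edges, failed)


# Hypothetical run: two nodes, A and B. Event a2 sends a message consumed by b3.
events = ["a1", "a2", "a3", "b1", "b2", "b3"]
edges = [("a1", "a2"), ("a2", "a3"),   # sequential order on node A
         ("b1", "b2"), ("b2", "b3"),   # sequential order on node B
         ("a2", "b3")]                 # message from A to B

print(sorted(restore(events, edges, "a2")))  # ['a1', 'b1', 'b2']
```

In this run, b1 and b2 are independent of a2 and so are retained. The restored state contains a1, b1 and b2, which is causally consistent even if the system never passed through exactly that global state during the original execution.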
Keywords: Error Recovery; Virtual Link; Distributed Computing System; Active Place; Ignorable Activity