State Restoration in Distributed Systems

  • P. M. Merlin
  • B. Randell
Part of the Texts and Monographs in Computer Science book series (MCS)

Abstract

This paper concerns an important aspect of the problem of designing fault-tolerant distributed computing systems. The concepts involved in “backward error recovery” (i.e. restoring a system, or some part of a system, to a previous state which it is hoped or believed preceded the occurrence of any existing errors) are formalised, and generalised so as to apply to concurrent, e.g. distributed, systems. Since a great deal of independence may exist between the activities in a distributed system, the system can be restored to a state that could have existed rather than to a state that actually existed.

The formalisation is based on the use of what we term “Occurrence Graphs” to represent the cause-effect relationships that exist between the events that occur when a system is operational, and to indicate existing possibilities for state restoration. A protocol is presented which could be used in each of the nodes in a distributed computing system in order to provide system recoverability even in the face of multiple faults.
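
The idea of restoring to a state that "could have existed" can be illustrated with a minimal sketch. It assumes an occurrence graph is modelled as a directed acyclic graph whose edges point from a cause to its effects; the class `OccurrenceGraph`, its methods, and the event names are hypothetical illustrations, not the paper's notation or its protocol.

```python
from collections import defaultdict


class OccurrenceGraph:
    """Hypothetical model: an edge cause -> effect records a cause-effect relationship."""

    def __init__(self):
        self.events = set()
        self.effects = defaultdict(set)  # event -> events it directly caused

    def add_event(self, event, causes=()):
        """Record an event together with the events that directly caused it."""
        self.events.add(event)
        for cause in causes:
            self.events.add(cause)
            self.effects[cause].add(event)

    def affected_by(self, event):
        """Return the event plus every event that causally depends on it."""
        seen = set()
        stack = [event]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            stack.extend(self.effects[current])
        return seen

    def restore_before(self, faulty_event):
        """Return the events that survive once the faulty event and its effects are undone."""
        return self.events - self.affected_by(faulty_event)


# Three nodes A, B and C (assumed example). Node A sends a message after a2,
# which node B consumes in b2; node C runs independently of both.
graph = OccurrenceGraph()
graph.add_event("a1")
graph.add_event("a2", causes=["a1"])
graph.add_event("a3", causes=["a2"])
graph.add_event("b1")
graph.add_event("b2", causes=["b1", "a2"])
graph.add_event("c1")

# Undoing a2 also undoes a3 and b2, but leaves b1 and c1 untouched.
surviving = graph.restore_before("a2")
print(sorted(surviving))
```

In this sketch `c1` survives even if it happened after `a2` in real time, because nothing links the two causally. The restored set of events may therefore describe a state the system never actually passed through, yet one that is consistent with every remaining cause-effect relationship.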

Copyright information

© Springer-Verlag Berlin Heidelberg 1985
