
Application Transparent Fault Management in Fault Tolerant Mach

  • Mark Russinovich
  • Zary Segall
  • Dan Siewiorek
Part of The Kluwer International Series in Engineering and Computer Science book series (SECS, volume 285)

Abstract

Fault detection and fault tolerance have become increasingly important aspects of all computer system designs, from PCs to high-end workstations and embedded critical systems. Since operating systems are common to all computers, and since the operating system level offers maximum system visibility and control, it is appropriate for the operating system to provide policies that detect, contain and tolerate faults. These policies form an operating system’s “fault management.” A mechanism to support operating system fault management has been designed and implemented for a UNIX 4.3 BSD server running on the Mach 3.0 microkernel. The mechanism, called the sentry mechanism, consists of fault management control placed at all operating system entry and exit points. The suitability of the mechanism is determined by demonstrating its ability to support diverse, commonly accepted policies efficiently, where efficiency is measured in terms of implementation complexity and performance. Several sentry policies have been implemented, including monitoring, assertions, checkpointing/checkpoint recovery and journaling/journal replay. This paper presents the sentry mechanism, its implementation, and the design and implementation of these policies.
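The idea of placing fault management control at every entry and exit point can be illustrated with a minimal sketch. This is not the Fault Tolerant Mach implementation, which lives inside a UNIX server on the Mach microkernel; it is a user-level Python analogy under stated assumptions. The names `Policy`, `MonitorPolicy`, `AssertionPolicy`, `JournalPolicy`, `Sentry`, `guard` and `sys_read` are all hypothetical and chosen only for illustration.

```python
import functools
from typing import Any, Callable, Dict, List, Tuple


class Policy:
    """Hypothetical base class: a policy sees each call on entry and on exit."""

    def on_entry(self, name: str, args: Tuple[Any, ...]) -> None:
        pass

    def on_exit(self, name: str, args: Tuple[Any, ...], result: Any) -> None:
        pass


class MonitorPolicy(Policy):
    """Monitoring: count how often each guarded call is entered."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def on_entry(self, name, args):
        self.counts[name] = self.counts.get(name, 0) + 1


class AssertionPolicy(Policy):
    """Assertions: check a per-call predicate on the result at the exit point."""

    def __init__(self, checks: Dict[str, Callable[[Any], bool]]) -> None:
        self.checks = checks
        self.violations: List[str] = []

    def on_exit(self, name, args, result):
        check = self.checks.get(name)
        if check is not None and not check(result):
            self.violations.append(name)


class JournalPolicy(Policy):
    """Journaling: record each call and its result in order."""

    def __init__(self) -> None:
        self.journal: List[Tuple[str, Tuple[Any, ...], Any]] = []

    def on_exit(self, name, args, result):
        self.journal.append((name, args, result))


class Sentry:
    """Runs every installed policy before and after each guarded call."""

    def __init__(self, policies: List[Policy]) -> None:
        self.policies = policies

    def guard(self, func):
        @functools.wraps(func)
        def wrapper(*args):
            for policy in self.policies:      # entry sentry point
                policy.on_entry(func.__name__, args)
            result = func(*args)              # the guarded "system call"
            for policy in self.policies:      # exit sentry point
                policy.on_exit(func.__name__, args, result)
            return result
        return wrapper


monitor = MonitorPolicy()
asserts = AssertionPolicy({"sys_read": lambda n: n >= 0})
journal = JournalPolicy()
sentry = Sentry([monitor, asserts, journal])


@sentry.guard
def sys_read(fd, nbytes):
    # Stand-in for a real system call: fails on a negative descriptor.
    return -1 if fd < 0 else nbytes


sys_read(3, 16)
sys_read(-1, 16)
```

Because the guarded function itself is unchanged, policies can be added or removed without touching it, which is the sense in which such a scheme is transparent to the application. Checkpointing and recovery are omitted here because they depend on capturing process state, which this sketch does not model.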

Keywords

Fault Detection; External Input; Dependency Graph; System Call; Fault Management
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Kluwer Academic Publishers 1994
