Design of Real-Time Fault-Tolerant Computing Stations

  • K. H. Kim
Conference paper
Part of the NATO ASI Series book series (NATO ASI F, volume 127)

Abstract

The steady increase over the past decade in the use of distributed computer systems (DCSs) in safety-critical real-time applications is expected to continue through the 1990s. For example, DCSs have been increasingly adopted in applications such as space navigation, air-traffic control, hospital automation, and national defense [11, 16, 21, 43, 47]. To attain the desired level of reliability, such DCSs must be designed with effective fault-tolerance capabilities.

References

  1. Ancona, M., Dodero, G., Gianuzzi, V., Clematis, A., and Fernandez, E.B., A System Architecture for Fault Tolerance in Concurrent Software, IEEE Computer, October 1990, 23–32.
  2. Anderson, T. and Knight, J.C., A Framework for Software Fault Tolerance in Real-Time Systems, IEEE TSE, May 1983, 355–364.
  3. Armstrong, L.T. and Lawrence, T.F., Adaptive Fault Tolerance, Proc. 1991 Systems Design Synthesis Technology Workshop, September 1991.
  4. Avizienis, A., Gilley, G.C., Mathur, F.P., Rennels, D.A., Rohr, J.A., and Rubin, D.K., The STAR (Self-Testing and Repairing) Computer: An Investigation of the Theory and Practice of Fault-Tolerant Computer Design, IEEE Trans. on Computers, Vol. C-20, No. 11, November 1971, 1312–1321.
  5. Avizienis, A., Fault tolerance and fault intolerance: Complementary approaches to reliable computing, Proc. 1975 Int’l. Conf. Rel. Software, Los Angeles, CA, April 1975, 458–464.
  6. Avizienis, A., The N-Version Approach to Fault-Tolerant Software, IEEE Trans. on Software Engineering, Vol. SE-11, No. 12, December 1985, 1491–1501.
  7. Avizienis, A., Lyu, M.R., and Schutz, W., In Search of Effective Diversity: A Six-Language Study of Fault-Tolerant Flight Control Software, Proc. FTCS-18, 15–22.
  8. Bagchi, A., and Hakimi, S. L., An Optimal Algorithm for Distributed System Level Diagnosis, Proc. IEEE Computer Society’s 21st Int’l. Symp. on Fault-Tolerant Computing, June 1991, 214–221.
  9. Best, E., and Randell, B., A Formal Model of Atomicity in Asynchronous Systems, Acta Informatica 16, 1981, 93–124.
  10. Bianchini, R. and Buskens, R., An Adaptive Distributed System-Level Diagnosis Algorithm and Its Implementation, Proc. IEEE Computer Society’s 21st Int’l. Symp. on Fault-Tolerant Computing, June 1991, 222–229.
  11. Bhargava, B., editor, Concurrency and Reliability in Distributed Systems, Van Nostrand Reinhold, 1987.
  12. Carter, W.C., Hardware Fault Tolerance, Chapter 2 in Anderson, T., ed., Resilient Computing Systems, Vol. 1, Wiley-Interscience, 1985, 11–63.
  13. Chandy, K.M., and Ramamoorthy, C.V., Rollback and recovery strategies for computer programs, IEEE Trans. on Computers, Vol. C-21, June 1972, 546–556.
  14. Chu, W.W., Kim, K.H., and McDonald, W.C., Testbed-based Evaluation of Design Techniques for Fault-Tolerant Real-Time Distributed Computer Systems, Proceedings of the IEEE, Vol. 75, No. 5, Special Issue on Distributed Databases, May 1987, 649–667.
  15. Cristian, F., Agreeing on Who is Present and Who is Absent in a Synchronous Distributed System, Proc. IEEE Computer Society’s 18th Int. Symp. of Fault-Tolerant Computing, Tokyo, Japan, June 1988, 206–211.
  16. Davis, C.G. and Couch, R.L., Ballistic Missile Defense: A Supercomputer Challenge, IEEE Computer, November 1980, 37–46.
  17. Ezhilchelvan, P. D., and de Lemos, R., A Robust Group Membership Algorithm for Distributed Real-Time Systems, Proc. IEEE Computer Society’s Real-Time Systems Symp., December 1990, 173–179.
  18. Fraga, J.S., Rodrigues, V., and Silva, E.S., A Language Approach to Implementation of the Distributed Recovery Block Schemes, Proc. 13th. CBC Conf. on Computer Sciences, Gramado, Brazil, August 1991.
  19. Garcia-Molina, H., Elections in a distributed computing system, IEEE Trans. on Computers, January 1982, 48–59.
  20. Gruensteidl, G., and Kopetz, H., A Reliable Multicast Protocol for Distributed Real-Time Systems, Proc. IEEE Computer Society’s Workshop on Real-Time Operating Systems, May 1991.
  21. Hecht, H., Fault-Tolerant Software for Real-Time Applications, Computing Surveys, December 1976, 391–407.
  22. Hecht, M., Agron, J., and Hochhauser, S., A Distributed Fault Tolerant Architecture for Nuclear Reactor Control and Safety Functions, Proc. IEEE Computer Society’s 1989 Real-Time Systems Symp., December 1989, 214–221.
  23. Hecht, M., Agron, J., Hecht, H., and Kim, K.H., A Distributed Fault Tolerant Architecture for Nuclear Reactor and Other Critical Process Control Applications, Proc. IEEE Computer Society’s 21st Int’l Symp. on Fault-Tolerant Computing, June 1991, Montreal, 462–469.
  24. Hopkins, A.L., Smith, T.B., and Lala, J.H., FTMP–A Highly Reliable Fault-Tolerant Multiprocessor For Aircraft, Proceedings of the IEEE, Vol. 66, No. 10, October 1978, 1221–1240.
  25. Horning, J.J., Lauer, H.C., Melliar-Smith, P.M., and Randell, B., A Program Structure for Error Detection and Recovery, Lecture Notes in Computer Science, Vol. 16, Springer-Verlag, New York, 1974, 171–187.
  26. Ihara, H. and Mori, K., Autonomous Decentralized Computer Control Systems, Computer, Vol. 17, No. 8, August 1984, 57–66.
  27. Jensen, D. and Northcutt, J.D., Alpha: An Open Operating System for Mission-Critical Real-Time Distributed Systems - An Overview, Proc. 1989 Workshop on Operating Systems for Mission-Critical Computing, ACM Press, 1991.
  28. Katsuki, D., et al., Pluribus–An Operational Fault-Tolerant Multiprocessor, Proc. of the IEEE, October 1978, 1146–1159.
  29. Kelly, J.P.J. et al., A Large Scale Second Generation Experiment in Multi-Version Software: Description and Early Results, Proc. FTCS-18, 9–14.
  30. Kim, K.H., Error Detection, Reconfiguration and Recovery in Distributed Processing Systems, Proc. IEEE Computer Society’s 1st. Int’l. Conf. on Distributed Computing Systems, October 1979, 284–295.
  31. Kim, K.H., Approaches to Mechanization of the Conversation Scheme Based on Monitor, IEEE Trans. on Software Eng., Vol. SE-8, No. 3, May 1982, 189–197.
  32. Kim, K.H. and Welch, H.O., Distributed Execution of Recovery Blocks: An Approach to Uniform Treatment of Hardware and Software Faults in Real-Time Applications, IEEE Trans. on Computers, May 1989, 626–636.
  33. Kim, K.H., Approaches to System-Level Fault Tolerance in Distributed Real-Time Computer Systems, Proc. 4th Int’l Conf. on Fault-Tolerant Computing Systems, Baden-Baden, W. Germany, September 1989, 268–281 (Invited paper), Informatik-Fachberichte 214, Springer-Verlag 1989.
  34. Kim, K.H., Guan, W.J., Damm, A., and Rohr, J.A., Approaches to Design of Temporary Blackout Handling Capabilities and an Evaluation with a Real-Time Tightly Coupled Network Testbed, Proc. IEEE Computer Society’s 21st Int’l Symp. on Fault-Tolerant Computing, June 1991, Montreal, 470–477.
  35. Kim, K.H. and Min, B.J., Approaches to Implementation of Multiple DRB Stations in Tightly Coupled Computer Networks and an Experimental Validation, Proc. IEEE Computer Society’s 15th Int’l Computer Software and Applications Conf. (COMPSAC ’91), Tokyo, September 1991, 550–557.
  36. Kim, K.H., Kopetz, H., Mori, K., Shokri, E.H., and Gruensteidl, G., An Efficient Approach to Decentralized Network Diagnosis and Reconfiguration in Real-Time LAN Systems: The PRHB/ED Scheme, To appear in Proc. IEEE Computer Society’s 11th Symp. on Reliable Distributed Systems, October 1992, Houston, TX.
  37. Kim, K.H., and Shokri, E.H., An Approach to Decentralized Maintenance of the Processor-Group Membership with Minimal Detection Latency Bounds in TDMA-Bus LAN Systems, Tech. Rept. UCI-ECE-92-07, Dept. of Electrical & Computer Engineering, UCI, May 1992.
  38. Kopetz, H., Damm, A., Koza, C., Mulazzani, M., Wolfgang, S., Senft, C., and Zainlinger, R., Distributed Fault-Tolerant Real-Time Systems: The Mars Approach, IEEE Micro, February 1989, 25–39.
  39. Kopetz, H., Gruensteidl, G., and Reisinger, J., Fault-Tolerant Membership Service in a Synchronous Distributed Real-Time System, Proc. IFIP WG 10.4 Int’l Working Conf. on Dependable Computing for Critical Applications, Santa Barbara, August 1989, 167–174.
  40. Kopetz, H. and Kim, K.H., Temporal Uncertainties in Interactions among Real-Time Objects, Proc. IEEE Computer Society’s 9th Symp. on Reliable Distributed Systems, Huntsville, AL, October 1990, 165–174.
  41. Lamport, L., Shostak, R., and Pease, M., The Byzantine Generals Problem, ACM Trans. Prog. Lang. Syst., Vol. 4, No. 3, July 1982, 382–401.
  42. Lee, P.A., A Reconsideration of the Recovery Block Scheme, Computer Journal, Vol. 21, No. 4, November 1978, 306–310.
  43. McDonald, W.C. and Smith, R.W., A flexible distributed testbed for real time applications, Computer, Vol. 15, No. 10, October 1982, 25–39.
  44. Mori, K., et al., Autonomous Decentralized Software Structure and Its Application, Proc. Fall Joint Computer Conference, Dallas, TX, November 1986, 1056–1063.
  45. Nett, E., Supporting Fault Tolerant Computations in Distributed Systems, Habilitation Thesis, Univ. of Bonn, Germany, 1991.
  46. Powell, D. et al., The Delta-4 Approach to Dependability in Open Distributed Computing Systems, Proc. IEEE Computer Society’s 18th Int’l. Symp. on Fault-Tolerant Computing, June 1988, 246–251.
  47. Ramamoorthy, C.V. et al., Application of a Methodology for the Development and Validation of Reliable Process Control Software, IEEE Trans. on Software Engr., Vol. SE-7, No. 6, November 1981, 537–555.
  48. Randell, B., System Structure for Software Fault Tolerance, IEEE Transactions on Software Engineering, June 1975, 220–232.
  49. Rohr, J.A., STAREX Self-Repair Routines: Software Recovery in the JPL-STAR Computer, Digest of Papers FTCS-3, International Symposium on Fault-Tolerant Computing, Palo Alto, CA, June 1973, 11–16.
  50. Stankovic, J.A., Misconceptions About Real-Time Computing: A Serious Problem for Next-Generation Systems, Computer, Vol. 21, No. 10, October 1988, 10–19.
  51. Strong, R., Problems in Maintaining Agreement, Proc. IEEE Computer Society’s 5th Symp. on Reliability in Distributed Software and Database Systems, Washington, DC, 1986, 20–27.
  52. Taylor, D.J., Morgan, D.E., and Black, J.P., Redundancy in Data Structures: Improving Software Fault Tolerance, IEEE Trans. on Software Engineering, Vol. SE-6, No. 6, November 1980, 585–594.
  53. Taylor, D., and Wilson, G., Stratus, in Dependability of Resilient Computers, ed. T. Anderson, BSP Professional Books, Oxford, 1989, 222–236.
  54. Tong, Z., Kain, R.Y., and Tsai, W.T., A Lower Overhead Checkpointing and Rollback Recovery Scheme for Distributed Systems, Proc. IEEE Computer Society’s 8th Symp. on Reliable Distributed Systems, October 1988, 12–20.
  55. Toy, W.N., Fault-Tolerant Design of Local ESS Processors, Proceedings of the IEEE, Vol. 66, No. 10, October 1978, 1126–1145.
  56. Toy, W.N., Fault-Tolerant Computing, in Advances in Computers, Vol. 26, Academic Press, 1987, 201–279.
  57. Wensley, J.H., et al., SIFT: Design and Analysis of a Fault-Tolerant Computer for Aircraft Control, Proc. of the IEEE, October 1978, 1240–1255.
  58. Wensley, J.H., An Operating System for a TMR Fault-Tolerant System, Digest of Papers FTCS-13, Thirteenth Annual International Symposium on Fault-Tolerant Computing, Milano, June 1983, 452–455.
  59. Wilson, D., The STRATUS computer system, Chapter 12 in T. Anderson ed., Resilient Computing Systems Volume I, John Wiley & Sons, 1985, 45–67.
  60. Yau, S.S. and Cheung, R.C., Design of Self-checking Software, Proc. Int’l Conf. on Reliable Software, 1975, 450–457.

Copyright information

© Springer-Verlag Berlin Heidelberg 1994

Authors and Affiliations

  • K. H. Kim (1)
  1. Department of Electrical & Computer Engineering, University of California, Irvine, USA
