An architecture for rapid distributed fault tolerance

  • Workshop on Embedded HPC Systems and Applications (Devesh Bhatt, Honeywell Technology Center, USA; Viktor Prasanna, Univ. of Southern California, USA)
  • Conference paper
Parallel and Distributed Processing (IPPS 1998)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1388)

Abstract

Embedded high-performance computing is increasingly called upon to provide critical computing resources. The ability to tolerate faults during operation, both maintaining operational capability and ensuring that correct results continue to be produced, is an important ingredient in mission-critical systems. An architecture for such a system is proposed that withstands faults with graceful degradation in performance and complete transparency to the applications programmer. The final system will offer fault-tolerant computing transparently to MPI applications and will draw heavily on existing, demonstrated successes.

This work was funded in part by NSF Grant No. EEC-8907070 Amendment 021 and by ONR Grant No. N00014-97-1-0116.

Editor information

José Rolim

Copyright information

© 1998 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Russ, S.H. (1998). An architecture for rapid distributed fault tolerance. In: Rolim, J. (eds) Parallel and Distributed Processing. IPPS 1998. Lecture Notes in Computer Science, vol 1388. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-64359-1_757

  • DOI: https://doi.org/10.1007/3-540-64359-1_757

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-64359-3

  • Online ISBN: 978-3-540-69756-5

  • eBook Packages: Springer Book Archive
