An Experimental Evaluation of Coordinated Checkpointing in a Parallel Machine

  • Luis Moura Silva
  • João Gabriel Silva
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1667)

Abstract

Coordinated checkpointing is a very effective way to ensure that distributed and parallel applications can continue in the presence of failures. Previous studies have shown that this approach achieves better results than independent checkpointing and message logging. However, we need to know more about the real overhead of coordinated checkpointing and to gain well-founded insight into the best way to implement this fault-tolerance technique. This paper presents an experimental evaluation of coordinated checkpointing in a parallel machine. It describes some optimization techniques and presents some performance results.

Keywords

Parallel Machine, Stable Storage, Performance Overhead, Host Machine, Application Benchmark
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 1999

Authors and Affiliations

  • Luis Moura Silva (1)
  • João Gabriel Silva (1)
  1. Departamento Engenharia Informática, Universidade de Coimbra, Coimbra, Portugal
