Distributed Checkpointing on Clusters with Dynamic Striping and Staggering

  • Conference paper
Advances in Computing Science — ASIAN 2002 (ASIAN 2002)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2550)

Abstract

This paper presents a new striped and staggered checkpointing (SSC) scheme for multicomputer clusters. We consider serverless clusters, where local disks attached to cluster nodes collectively form a distributed RAID (redundant array of inexpensive disks) with a single I/O space. The distributed RAID is used to save the checkpoint files periodically. Striping enables parallel I/O on distributed disks. Staggering avoids the network bottleneck in distributed disk I/O operations. With a fixed cluster size, we reveal the tradeoffs between these two speedup techniques. Our SSC approach allows dynamic reconfiguration to minimize message-logging requirements among concurrent software processes. We demonstrate how to reduce the checkpointing overhead by striping and staggering dynamically. For communication-intensive programs, our SSC scheme can significantly reduce the checkpointing overhead. Benchmark results demonstrate the benefits of trading off stripe parallelism against distributed staggering. These results are useful for designing efficient checkpointing schemes for fast rollback recovery from any single node (disk) failure in a cluster of computers.
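
To make the two techniques in the abstract concrete, the following is a minimal toy sketch, not the authors' SSC algorithm. It assumes a checkpoint image is split round-robin into fixed-size blocks across a stripe group, and that nodes take turns writing in consecutive groups of the same width. The function names `stripe` and `stagger_schedule`, the `block` parameter, and the grouping rule are all hypothetical and chosen only for illustration.

```python
from typing import List


def stripe(data: bytes, width: int, block: int = 4) -> List[bytes]:
    """Split a checkpoint image round-robin into `width` stripes.

    Hypothetical helper: each `block`-byte unit goes to the next disk
    in the stripe group, so the stripes could be written in parallel.
    """
    if width < 1 or block < 1:
        raise ValueError("width and block must be positive")
    stripes = [bytearray() for _ in range(width)]
    for offset in range(0, len(data), block):
        stripes[(offset // block) % width] += data[offset:offset + block]
    return [bytes(s) for s in stripes]


def stagger_schedule(n_nodes: int, width: int) -> List[List[int]]:
    """Group node ids into consecutive stagger steps of `width` nodes.

    Assumption for this sketch: nodes in one step write at the same
    time, and steps run one after another to limit network contention.
    """
    if n_nodes < 1 or width < 1:
        raise ValueError("n_nodes and width must be positive")
    return [
        list(range(start, min(start + width, n_nodes)))
        for start in range(0, n_nodes, width)
    ]
```

In this toy model, a cluster of fixed size n with stripe width w needs about n / w stagger steps, so a wider stripe gives more parallel I/O per step but fewer steps over which to spread network traffic. That is one way to picture the tradeoff the abstract mentions; the paper's actual cost model may differ.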

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Jin, H., Hwang, K. (2002). Distributed Checkpointing on Clusters with Dynamic Striping and Staggering. In: Jean-Marie, A. (eds) Advances in Computing Science — ASIAN 2002. ASIAN 2002. Lecture Notes in Computer Science, vol 2550. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36184-7_4

  • DOI: https://doi.org/10.1007/3-540-36184-7_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-00195-9

  • Online ISBN: 978-3-540-36184-8

  • eBook Packages: Springer Book Archive
