C³: A System for Automating Application-Level Checkpointing of MPI Programs

  • Conference paper
Languages and Compilers for Parallel Computing (LCPC 2003)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2958))

Abstract

Fault-tolerance is becoming necessary on high-performance platforms. Checkpointing techniques make programs fault-tolerant by saving their state periodically and restoring this state after failure. System-level checkpointing saves the state of the entire machine on stable storage, but this usually has too much overhead. In practice, programmers do manual application-level checkpointing by writing code to (i) save the values of key program variables at critical points in the program, and (ii) restore the entire computational state from these values during recovery. However, this can be difficult to do in general MPI programs.
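The manual approach described above can be illustrated with a minimal, single-process sketch in C. It is not code from the paper: the `AppState` struct, the function names, and the `fail_at` parameter (which simulates a failure) are all hypothetical, and the sketch ignores everything that makes the MPI case hard, such as messages in flight. It only shows the two steps: (i) saving key variables at chosen points, and (ii) resuming from the saved values.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define N 8

/* Hypothetical application state: the "key program variables". */
typedef struct {
    int    next_iter;  /* iteration to resume from */
    double data[N];    /* working array */
} AppState;

/* Step (i): save the state. Writes to a temporary file and renames it,
 * so a failure during the write leaves the previous checkpoint intact
 * (assumes POSIX rename semantics). Returns 0 on success. */
static int save_checkpoint(const char *path, const AppState *s) {
    char tmp[256];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);
    FILE *f = fopen(tmp, "wb");
    if (!f) return -1;
    size_t written = fwrite(s, sizeof *s, 1, f);
    int close_rc = fclose(f);
    if (written != 1 || close_rc != 0) { remove(tmp); return -1; }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Step (ii): restore the state. Returns 0 if a checkpoint was loaded. */
static int load_checkpoint(const char *path, AppState *s) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t got = fread(s, sizeof *s, 1, f);
    fclose(f);
    return got == 1 ? 0 : -1;
}

/* Runs the loop up to `total` iterations, checkpointing every `interval`
 * iterations. Resumes from the checkpoint at `path` if one exists.
 * `fail_at` simulates a failure before that iteration (-1 = never).
 * Returns 0 on completion, -1 on simulated failure, -2 on I/O error. */
static int run(const char *path, int total, int interval, int fail_at,
               AppState *s) {
    assert(interval > 0);
    if (load_checkpoint(path, s) != 0)
        memset(s, 0, sizeof *s);           /* no checkpoint: fresh start */
    for (int i = s->next_iter; i < total; i++) {
        if (i == fail_at) return -1;       /* simulated failure */
        for (int k = 0; k < N; k++)
            s->data[k] += (double)(i + k); /* stand-in for real work */
        s->next_iter = i + 1;
        if (s->next_iter % interval == 0 && save_checkpoint(path, s) != 0)
            return -2;
    }
    return 0;
}
```

Calling `run` once with a failure injected and then again without one resumes from the last saved iteration rather than from zero, and the final array matches an uninterrupted run. The programmer had to decide by hand which variables belong in `AppState` and where the checkpoint is safe to take, which is the burden the abstract refers to.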

In [1, 2], we presented a distributed checkpoint coordination protocol that handles MPI’s point-to-point and collective constructs while dealing with the unique challenges of application-level checkpointing. We have implemented our protocols as part of a thin software layer that sits between the application program and the MPI library, so no modifications to the MPI library are required. This thin layer is used by C³ (the Cornell Checkpoint (pre-)Compiler), a tool that automatically converts an MPI application into an equivalent fault-tolerant version. In this paper, we summarize our work on this system to date. We also present experimental results showing that the overhead introduced by the protocols is small, and we discuss a number of areas for future research.
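The idea of a thin layer between the application and the communication library can be sketched as a wrapper that does its own bookkeeping and then forwards each call unchanged. The C sketch below is an illustration only and makes several assumptions: `lib_send` is a stand-in for the library routine so that the example runs without MPI, and the epoch number and per-epoch message counter are hypothetical bookkeeping, not the coordination protocol of [1, 2]. In a real MPI program, one standard way to interpose like this is the MPI profiling interface, where a wrapper named `MPI_Send` forwards to `PMPI_Send`; whether the system described here uses that mechanism is not stated in the abstract.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the underlying library (hypothetical). It copies the
 * message into a fake "wire" buffer so the example is observable. */
static char   wire[64];
static size_t wire_len;

static int lib_send(const void *buf, size_t len) {
    if (len > sizeof wire) return -1;
    memcpy(wire, buf, len);
    wire_len = len;
    return 0;
}

/* State owned by the layer, invisible to both the application and the
 * library. The fields are illustrative assumptions. */
typedef struct {
    int  epoch;          /* number of checkpoints taken so far */
    long sent_in_epoch;  /* messages sent since the last checkpoint */
} LayerState;

static LayerState layer = {0, 0};

/* Application-facing wrapper: forwards to the unmodified library and
 * records the send only if the library accepted it. */
static int layer_send(const void *buf, size_t len) {
    int rc = lib_send(buf, len);
    if (rc == 0) layer.sent_in_epoch++;
    return rc;
}

/* Called when the layer takes a checkpoint: start a new epoch. */
static void layer_checkpoint(void) {
    layer.epoch++;
    layer.sent_in_epoch = 0;
}
```

Because the application calls only `layer_send` and the library sees only ordinary `lib_send` calls, neither side needs to change when the layer's bookkeeping changes, which is the property the abstract highlights.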

This work was supported by NSF grants ACI-9870687, EIA-9972853, ACI-0085969, ACI-0090217, ACI-0103723, and ACI-0121401.

References

  1. Bronevetsky, G., Marques, D., Pingali, K., Stodghill, P.: Automated application-level checkpointing of MPI programs. In: Principles and Practices of Parallel Programming, San Diego, CA (2003)

  2. Bronevetsky, G., Marques, D., Pingali, K., Stodghill, P.: Collective operations in an application-level fault tolerant MPI system. In: International Conference on Supercomputing (ICS) 2003, San Francisco, CA (2003)

  3. Elnozahy, M., Alvisi, L., Wang, Y.M., Johnson, D.B.: A survey of rollback-recovery protocols in message passing systems. Technical Report CMU-CS-96-181, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA (1996)

  4. Chandy, M., Lamport, L.: Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computer Systems 3, 63–75 (1985)

  5. Graham, R., Choi, S.E., Daniel, D., Desai, N., Minnich, R., Rasmussen, C., Risinger, D., Sukalski, M.: A network-failure-tolerant message-passing system for tera-scale clusters. In: Proceedings of the International Conference on Supercomputing (2002)

  6. Gupta, I., Chandra, T., Goldszmidt, G.: On scalable and efficient distributed failure detectors. In: Proc. 20th Annual ACM Symp. on Principles of Distributed Computing, pp. 170–179 (2001)

  7. Litzkow, M., Tannenbaum, T., Basney, J., Livny, M.: Checkpoint and migration of UNIX processes in the Condor distributed processing system. Technical Report 1346, University of Wisconsin-Madison (1997)

  8. Plank, J.S., Beck, M., Kingsley, G., Li, K.: Libckpt: Transparent checkpointing under UNIX. Technical Report UT-CS-94-242, Dept. of Computer Science, University of Tennessee (1994)

  9. Ramkumar, B., Strumpen, V.: Portable checkpointing for heterogeneous architectures. In: Symposium on Fault-Tolerant Computing, pp. 58–67 (1997)

  10. Stellner, G.: CoCheck: Checkpointing and Process Migration for MPI. In: Proceedings of the 10th International Parallel Processing Symposium (IPPS 1996), Honolulu, Hawaii (1996)

  11. Elnozahy, E.N., Zwaenepoel, W.: Manetho: Transparent rollback-recovery with low overhead, limited rollback and fast output. IEEE Transactions on Computers 41 (1992)

  12. Rao, S., Alvisi, L., Vin, H.M.: Egida: An extensible toolkit for low-overhead fault-tolerance. In: Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, Madison, Wisconsin, June 15-18 (1999)

  13. Beck, M., Plank, J.S., Kingsley, G.: Compiler-assisted checkpointing. Technical Report UT-CS-94-269, Dept. of Computer Science, University of Tennessee (1994)

  14. OpenMP: Overview of the OpenMP standard (2003). Online at http://www.openmp.org/

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bronevetsky, G., Marques, D., Pingali, K., Stodghill, P. (2004). C³: A System for Automating Application-Level Checkpointing of MPI Programs. In: Rauchwerger, L. (ed.) Languages and Compilers for Parallel Computing. LCPC 2003. Lecture Notes in Computer Science, vol 2958. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24644-2_23

  • DOI: https://doi.org/10.1007/978-3-540-24644-2_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-21199-0

  • Online ISBN: 978-3-540-24644-2

  • eBook Packages: Springer Book Archive
