New User-Guided and ckpt-Based Checkpointing Libraries for Parallel MPI Applications

  • Paweł Czarnul
  • Marcin Frączak
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3666)

Abstract

We present design and implementation details as well as performance results for two new parallel checkpointing libraries that we developed for parallel MPI applications. The first, a user-guided library, requires the programmer to supply packing and unpacking code through an easy-to-use API based on MPI constants. It uses MPI-2 collective I/O calls or a dedicated master process for checkpointing. The second is a technically advanced parallel implementation of checkpointing based on the user-level ckpt library. It wraps the MPI calls in the user program, which makes it possible to run a shadow MPI application solely for communication purposes. The original processes and the shadow MPI code communicate via shared memory segments to which the communication buffers are mapped. We compare checkpoint/restart times for the two approaches, and for the variants of them that we propose, with an available LAM/MPI/BLCR checkpointing solution for MPI applications. We discuss the performance of all versions and the I/O optimizations for a 4-node, 16-processor cluster with NFS and, specifically, for single SMP nodes with a local file system.
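The user-guided approach can be illustrated with a minimal sketch. It is not the API of the library described here: every name below (`TYPE_FORMATS`, `pack_state`, `unpack_state`, `write_checkpoint`, `read_checkpoint`, `demo`) is hypothetical, the type tags merely stand in for MPI constants such as `MPI_INT` and `MPI_DOUBLE`, and plain file seeks stand in for parallel I/O. Each simulated rank packs its state and writes it into its own fixed-size slot of one shared checkpoint file. In a real MPI program the write at a rank-dependent offset would presumably be done with a collective call such as `MPI_File_write_at_all`.

```python
import os
import struct
import tempfile

# Hypothetical type tags standing in for MPI constants (MPI_INT, MPI_DOUBLE).
TYPE_FORMATS = {"INT": "i", "DOUBLE": "d"}


def pack_state(fields):
    """Pack a list of (type_tag, values) pairs into one bytes buffer."""
    chunks = []
    for tag, values in fields:
        # "<" selects little-endian layout without padding.
        fmt = "<%d%s" % (len(values), TYPE_FORMATS[tag])
        chunks.append(struct.pack(fmt, *values))
    return b"".join(chunks)


def unpack_state(buf, layout):
    """Inverse of pack_state; layout is a list of (type_tag, count) pairs."""
    fields = []
    offset = 0
    for tag, count in layout:
        fmt = "<%d%s" % (count, TYPE_FORMATS[tag])
        values = struct.unpack_from(fmt, buf, offset)
        fields.append((tag, list(values)))
        offset += struct.calcsize(fmt)
    return fields


def write_checkpoint(path, rank, slot_size, buf):
    """Write one rank's packed buffer into its slot of the shared file."""
    if len(buf) > slot_size:
        raise ValueError("packed state does not fit into the slot")
    mode = "r+b" if os.path.exists(path) else "w+b"
    with open(path, mode) as f:
        f.seek(rank * slot_size)  # rank-dependent offset
        f.write(buf)


def read_checkpoint(path, rank, slot_size):
    """Read back the slot that belongs to the given rank."""
    with open(path, "rb") as f:
        f.seek(rank * slot_size)
        return f.read(slot_size)


def demo(num_ranks=4, slot_size=64):
    """Checkpoint and restore the state of several simulated ranks."""
    layout = [("INT", 2), ("DOUBLE", 3)]
    path = os.path.join(tempfile.mkdtemp(), "ckpt.bin")
    states = {
        r: [("INT", [r, 10 * r]), ("DOUBLE", [0.5 * r, 1.5, 2.5])]
        for r in range(num_ranks)
    }
    for r, state in states.items():
        write_checkpoint(path, r, slot_size, pack_state(state))
    restored = {
        r: unpack_state(read_checkpoint(path, r, slot_size), layout)
        for r in range(num_ranks)
    }
    return states == restored
```

Calling `demo()` checks that every simulated rank restores exactly the state it packed. The fixed-size slot keeps the offset computation trivial; a real library would more likely record per-rank sizes, which this sketch does not attempt.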

Keywords

Shared Memory, Master Process, Communication Buffer, Checkpoint Data, Checkpointing Method
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Paweł Czarnul (1)
  • Marcin Frączak (1)
  1. Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Poland