Checkpoint/Restart-Enabled Parallel Debugging

  • Conference paper
Recent Advances in the Message Passing Interface (EuroMPI 2010)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 6305)

Abstract

Debugging is often the most time-consuming part of software development. HPC applications prolong the debugging process by adding more processes that interact in dynamic ways for longer periods of time. Checkpoint/restart-enabled parallel debugging returns the developer to an intermediate state closer to the bug. This focuses the debugging process and saves developers considerable time, but it requires parallel debuggers to cooperate with MPI implementations and checkpointers. This paper presents a design specification for such a cooperative relationship and discusses the application of this design to the GDB and DDT debuggers and to the Open MPI and BLCR projects.

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hursey, J. et al. (2010). Checkpoint/Restart-Enabled Parallel Debugging. In: Keller, R., Gabriel, E., Resch, M., Dongarra, J. (eds) Recent Advances in the Message Passing Interface. EuroMPI 2010. Lecture Notes in Computer Science, vol 6305. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15646-5_23

  • DOI: https://doi.org/10.1007/978-3-642-15646-5_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15645-8

  • Online ISBN: 978-3-642-15646-5

  • eBook Packages: Computer Science, Computer Science (R0)
