Abstract
Debugging is often the most time-consuming part of software development. HPC applications prolong the debugging process because they involve more processes, interacting in dynamic ways, over longer periods of time. Checkpoint/restart-enabled parallel debugging returns the developer to an intermediate state closer to the bug. This focuses the debugging process and saves developers considerable time, but it requires parallel debuggers to cooperate with MPI implementations and checkpointers. This paper presents a design specification for such a cooperative relationship. Additionally, this paper discusses the application of this design to the GDB and DDT debuggers and to the Open MPI and BLCR projects.
© 2010 Springer-Verlag Berlin Heidelberg
Cite this paper
Hursey, J. et al. (2010). Checkpoint/Restart-Enabled Parallel Debugging. In: Keller, R., Gabriel, E., Resch, M., Dongarra, J. (eds) Recent Advances in the Message Passing Interface. EuroMPI 2010. Lecture Notes in Computer Science, vol 6305. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15646-5_23
DOI: https://doi.org/10.1007/978-3-642-15646-5_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15645-8
Online ISBN: 978-3-642-15646-5
eBook Packages: Computer Science (R0)