Retrospect: Deterministic Replay of MPI Applications for Interactive Distributed Debugging

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 4757)

Abstract

Although high-performance computing has been eagerly adopted by users as a way to satisfy a growing demand for computational power, some areas remain poorly explored. The MPI paradigm is considered the keystone of the large-scale development of HPC infrastructure over the last decade. Even today, however, users face a lack of tools that help increase the stability of the software stack, the applications, or both. In this paper we present and evaluate a tool that lets developers investigate the execution of parallel applications further by moving dynamically back and forth along the execution timeline. Deterministic replay is enforced through an unobtrusive message logging mechanism, leading to a simpler and more efficient way to debug parallel software.

Editor information

Franck Cappello, Thomas Herault, Jack Dongarra

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bouteiller, A., Bosilca, G., Dongarra, J. (2007). Retrospect: Deterministic Replay of MPI Applications for Interactive Distributed Debugging. In: Cappello, F., Herault, T., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2007. Lecture Notes in Computer Science, vol 4757. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75416-9_41

  • DOI: https://doi.org/10.1007/978-3-540-75416-9_41

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-75415-2

  • Online ISBN: 978-3-540-75416-9

  • eBook Packages: Computer Science, Computer Science (R0)
