MUST: A Scalable Approach to Runtime Error Detection in MPI Programs

  • Conference paper

Abstract

The Message-Passing Interface (MPI) is large and complex. Therefore, programming MPI is error-prone. Several MPI runtime correctness tools address classes of usage errors, such as deadlocks or non-portable constructs. To our knowledge, none of these tools scales to more than about 100 processes. However, some current HPC systems use more than 100,000 cores, and future systems are expected to use far more. Since errors often depend on the task count used, we need correctness tools that scale to the full system size. We present a novel framework for scalable MPI correctness tools to address this need. Our fine-grained, module-based approach supports rapid prototyping and allows correctness tools built upon it to adapt to different architectures and use cases. The design uses PnMPI to instantiate a tool from a set of individual modules. We present an overview of our design, along with first performance results for a proof-of-concept implementation.
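The module-based idea in the abstract can be illustrated with a minimal sketch. It is not MUST's or PnMPI's actual API: the names `Call`, `Tool`, `build_tool`, `check_rank_range`, and `check_negative_count` are hypothetical, and the "MPI calls" are plain Python records rather than real MPI bindings. It only shows how a correctness tool might be assembled from independent check modules that each inspect an intercepted call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Call:
    """Hypothetical record of one intercepted MPI-like call (no real MPI involved)."""
    name: str
    rank: int
    args: Dict[str, int]


# In this sketch a "module" is a function that inspects one call
# and returns a list of error messages (empty if the call looks fine).
Check = Callable[[Call, int], List[str]]


def check_rank_range(call: Call, world_size: int) -> List[str]:
    """Flag a destination rank outside [0, world_size)."""
    dest = call.args.get("dest")
    if dest is not None and not 0 <= dest < world_size:
        return [f"rank {call.rank}: {call.name} uses invalid dest {dest}"]
    return []


def check_negative_count(call: Call, world_size: int) -> List[str]:
    """Flag a negative element count."""
    count = call.args.get("count", 0)
    if count < 0:
        return [f"rank {call.rank}: {call.name} uses negative count {count}"]
    return []


@dataclass
class Tool:
    """A tool instance is just the ordered list of modules it was built from."""
    world_size: int
    modules: List[Check] = field(default_factory=list)

    def intercept(self, call: Call) -> List[str]:
        # Pass the call through every module and collect all reported errors.
        errors: List[str] = []
        for module in self.modules:
            errors.extend(module(call, self.world_size))
        return errors


def build_tool(world_size: int, modules: List[Check]) -> Tool:
    """Instantiate a tool from a chosen set of modules (assumed composition step)."""
    return Tool(world_size, list(modules))


if __name__ == "__main__":
    tool = build_tool(4, [check_rank_range, check_negative_count])
    for message in tool.intercept(Call("MPI_Send", 2, {"dest": 7, "count": -1})):
        print(message)
```

Choosing a different module list in `build_tool` yields a different tool without touching the modules themselves, which is the property the abstract attributes to a fine-grained, module-based design.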

Author information

Corresponding author

Correspondence to Tobias Hilbrich.

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hilbrich, T., Schulz, M., de Supinski, B.R., Müller, M.S. (2010). MUST: A Scalable Approach to Runtime Error Detection in MPI Programs. In: Müller, M., Resch, M., Schulz, A., Nagel, W. (eds) Tools for High Performance Computing 2009. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11261-4_5

  • DOI: https://doi.org/10.1007/978-3-642-11261-4_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-11260-7

  • Online ISBN: 978-3-642-11261-4

  • eBook Packages: Computer Science (R0)
