MUST: A Scalable Approach to Runtime Error Detection in MPI Programs
- Cite this paper as:
- Hilbrich T., Schulz M., de Supinski B.R., Müller M.S. (2010) MUST: A Scalable Approach to Runtime Error Detection in MPI Programs. In: Müller M., Resch M., Schulz A., Nagel W. (eds) Tools for High Performance Computing 2009. Springer, Berlin, Heidelberg
The Message-Passing Interface (MPI) is large and complex, which makes programming with MPI error-prone. Several MPI runtime correctness tools address classes of usage errors, such as deadlocks or non-portable constructs. To our knowledge, none of these tools scales to more than about 100 processes. However, some current HPC systems use more than 100,000 cores, and future systems are expected to use far more. Since errors often depend on the task count used, we need correctness tools that scale to the full system size. We present a novel framework for scalable MPI correctness tools to address this need. Our fine-grained, module-based approach supports rapid prototyping and allows correctness tools built upon it to adapt to different architectures and use cases. The design uses PnMPI to instantiate a tool from a set of individual modules. We present an overview of our design, along with first performance results for a proof-of-concept implementation.
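The module-based idea described above can be pictured with a small sketch. This is only an illustration under stated assumptions: it does not use PnMPI or MPI, and every name in it (`Call`, `check_negative_count`, `check_dest_in_range`, `build_tool`) is hypothetical rather than taken from MUST. It shows how a tool might be assembled from independent check modules, and why a call that is valid at one task count can be an error at another.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Call:
    """Hypothetical record of one intercepted MPI call (not a MUST type)."""
    name: str
    rank: int
    count: int = 0
    dest: Optional[int] = None


# A check module takes a call and the number of tasks and returns
# an error message, or None if it finds nothing wrong.
Module = Callable[[Call, int], Optional[str]]


def check_negative_count(call: Call, world_size: int) -> Optional[str]:
    """Flag a negative element count."""
    if call.count < 0:
        return f"rank {call.rank}: {call.name} has negative count {call.count}"
    return None


def check_dest_in_range(call: Call, world_size: int) -> Optional[str]:
    """Flag a destination rank outside 0..world_size-1.

    Simplified assumption: special values such as MPI_PROC_NULL are ignored.
    """
    if call.dest is not None and not (0 <= call.dest < world_size):
        return (f"rank {call.rank}: {call.name} targets rank {call.dest}, "
                f"but only {world_size} tasks exist")
    return None


def build_tool(modules: List[Module]) -> Callable[[Call, int], List[str]]:
    """Compose a tool from a list of modules; each module runs independently."""
    def tool(call: Call, world_size: int) -> List[str]:
        errors = []
        for module in modules:
            message = module(call, world_size)
            if message is not None:
                errors.append(message)
        return errors
    return tool
```

For example, with `tool = build_tool([check_negative_count, check_dest_in_range])`, the call `Call("MPI_Send", rank=0, count=1, dest=4)` yields no errors when checked against 8 tasks but one error against 4 tasks. Adding or removing a check only changes the list passed to `build_tool`, which is the kind of flexibility a module-based design aims for.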