TAUoverSupermon: Low-Overhead Online Parallel Performance Monitoring

Nataraj, Aroon; Sottile, Matthew; Morris, Alan; Malony, Allen D.; Shende, Sameer

doi:10.1007/978-3-540-74466-5_11

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4641)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

Online application performance monitoring tracks performance characteristics during execution rather than post-mortem. This opens up possibilities that are otherwise unavailable, such as real-time visualization and application performance steering, which are useful for long-running applications. As HPC systems grow in size and complexity, the key challenge is to keep the online performance monitor scalable and low-overhead while still providing a useful performance reporting capability. Two fundamental components of such a monitor are the measurement system and the transport system. We adapt and combine two existing, mature systems, TAU and Supermon, to address this problem. TAU performs the measurement, while Supermon collects the distributed measurement state. Our experiments show that this novel approach leads to very low-overhead application monitoring, as well as other benefits unavailable from using a transport such as NFS.

Author information

Authors and Affiliations

Department of Computer and Information Science, University of Oregon, Eugene, OR, USA
Aroon Nataraj, Alan Morris, Allen D. Malony & Sameer Shende
Los Alamos National Laboratory, Los Alamos, NM, USA
Matthew Sottile

Editor information

Anne-Marie Kermarrec, Luc Bougé, Thierry Priol

Cite this paper

Nataraj, A., Sottile, M., Morris, A., Malony, A.D., Shende, S. (2007). TAUoverSupermon: Low-Overhead Online Parallel Performance Monitoring. In: Kermarrec, AM., Bougé, L., Priol, T. (eds) Euro-Par 2007 Parallel Processing. Euro-Par 2007. Lecture Notes in Computer Science, vol 4641. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74466-5_11

DOI: https://doi.org/10.1007/978-3-540-74466-5_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74465-8
Online ISBN: 978-3-540-74466-5
eBook Packages: Computer Science, Computer Science (R0)
