TAUoverSupermon: Low-Overhead Online Parallel Performance Monitoring

  • Aroon Nataraj
  • Matthew Sottile
  • Alan Morris
  • Allen D. Malony
  • Sameer Shende
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4641)


Online application performance monitoring tracks performance characteristics during execution rather than post-mortem. This opens up possibilities that are otherwise unavailable, such as real-time visualization and application performance steering, which can be useful for long-running applications. As HPC systems grow in size and complexity, the key challenge is to keep the online performance monitor scalable and low-overhead while still providing a useful performance reporting capability. Two fundamental components of such a performance monitor are the measurement system and the transport system. We adapt and combine two existing, mature systems, TAU and Supermon, to address this problem. TAU performs the measurement, while Supermon collects the distributed measurement state. Our experiments show that this novel approach leads to very low-overhead application monitoring, as well as other benefits unavailable from using a transport such as NFS.


Keywords: Online performance measurement, cluster monitoring


Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Aroon Nataraj (1)
  • Matthew Sottile (2)
  • Alan Morris (1)
  • Allen D. Malony (1)
  • Sameer Shende (1)
  1. Department of Computer and Information Science, University of Oregon, Eugene, OR, USA
  2. Los Alamos National Laboratory, Los Alamos, NM, USA
