A New Scalable Monitoring Tool Using Performance Properties of HPC Systems
We present a prototype monitoring and analysis tool for system-wide monitoring of high-performance computers. The tool uses formally specified properties based on hardware counters. These properties evaluate performance at three granularities: core, application, and partition. The information obtained is aimed at characterising both single-node performance and parallel execution performance. The goal is to identify performance bottlenecks in running applications and to characterise general system behaviour. The prototype scales to highly parallel machines through a distributed software architecture: one analysis agent runs on each partition, and these agents communicate with a high-level agent over a protocol based on TCP/IP. The main task of the high-level agent is to synchronise the other agents. In addition, the analysis agents can use OpenMP within each partition to parallelise their monitoring tasks. We handle the large volume of monitoring data through data reduction: only the properties that detect a bottleneck are stored, so the quality of the monitoring information needed is not compromised.
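The data-reduction idea can be illustrated with a minimal sketch. It is written in Python for brevity and is not the tool's actual property specification or implementation. The counter names (`cycles`, `stall_cycles`), the 0.5 threshold, and the names `Property`, `Finding`, and `reduce_samples` are all hypothetical, chosen only to show a property defined over hardware counters and a filter that keeps only the properties that detect a bottleneck.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representation: one dict of hardware counter values per core.
Counters = Dict[str, float]


@dataclass(frozen=True)
class Property:
    """A performance property: a severity function over counters plus a threshold."""
    name: str
    severity: Callable[[Counters], float]
    threshold: float  # assumed cut-off above which the property counts as a bottleneck


@dataclass(frozen=True)
class Finding:
    """The only record that is kept: a property that detected a bottleneck."""
    property_name: str
    core_id: int
    severity: float


def stall_fraction(counters: Counters) -> float:
    """Fraction of cycles spent stalled (counter names are illustrative)."""
    cycles = counters.get("cycles", 0.0)
    if cycles <= 0:
        return 0.0
    return counters.get("stall_cycles", 0.0) / cycles


# Example property with an assumed threshold of 50% stalled cycles.
HIGH_STALLS = Property("high_stall_fraction", stall_fraction, 0.5)


def reduce_samples(samples: Dict[int, Counters],
                   properties: List[Property]) -> List[Finding]:
    """Evaluate every property on every core and keep only detected bottlenecks."""
    findings: List[Finding] = []
    for core_id, counters in sorted(samples.items()):
        for prop in properties:
            value = prop.severity(counters)
            if value > prop.threshold:
                findings.append(Finding(prop.name, core_id, value))
    return findings


if __name__ == "__main__":
    samples = {
        0: {"cycles": 1000.0, "stall_cycles": 100.0},
        1: {"cycles": 1000.0, "stall_cycles": 800.0},
    }
    for finding in reduce_samples(samples, [HIGH_STALLS]):
        print(finding)
```

In this sketch the raw counter samples are discarded after evaluation and only the `Finding` records remain, which is one way to read the statement that only bottleneck-detecting properties are stored. Properties at application or partition granularity would need an aggregation step over cores, which is not shown here.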
This work is funded by BMBF under the ISAR project, grant 01IH08005.