Using Sampling to Understand Parallel Program Performance

Tallent, Nathan R.; Mellor-Crummey, John

doi:10.1007/978-3-642-31476-6_2

Conference paper
First Online: 01 January 2012

Abstract

Developing scalable parallel applications for extreme-scale systems is challenging, and existing languages, compilers, and autotuners address that challenge only partially. As a result, manual performance tuning is often necessary to obtain high application performance. Rice University’s HPCToolkit is a suite of performance tools that supports innovative techniques for pinpointing and quantifying performance bottlenecks in fully optimized parallel programs, with a measurement overhead of only a few percent. Many of these techniques were designed to leverage sampling for performance measurement, attribution, analysis, and presentation. This paper surveys some of HPCToolkit’s most interesting techniques and argues that sampling-based performance analysis is surprisingly versatile and effective.

Acknowledgements

HPCToolkit would not be what it is without the efforts of Mark Krentel, Laksono Adhianto, and Mike Fagan. Xu Liu developed our data-centric analysis.

Author information

Authors and Affiliations

Pacific Northwest National Laboratory, Richland, WA, 99352, USA
Nathan R. Tallent
Department of Computer Science, Rice University, Houston, TX, 77005, USA
John Mellor-Crummey

Corresponding author

Correspondence to Nathan R. Tallent.

Editor information

Editors and Affiliations

Zentrum für Informationsdienste, Technische Universität Dresden, Dresden, 01062, Germany
Holger Brunst
Zentrum für Informationsdienste, Technische Universität Dresden, Dresden, 01062, Germany
Matthias S. Müller
Zentrum für Informationsdienste, Technische Universität Dresden, Dresden, 01062, Germany
Wolfgang E. Nagel
Höchstleistungsrechenzentrum, Universität Stuttgart, Nobelstraße 19, Stuttgart, 70569, Germany
Michael M. Resch

About this paper

Cite this paper

Tallent, N.R., Mellor-Crummey, J. (2012). Using Sampling to Understand Parallel Program Performance. In: Brunst, H., Müller, M., Nagel, W., Resch, M. (eds) Tools for High Performance Computing 2011. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31476-6_2

DOI: https://doi.org/10.1007/978-3-642-31476-6_2
Published: 02 August 2012
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31475-9
Online ISBN: 978-3-642-31476-6
eBook Packages: Computer Science, Computer Science (R0)
