
Counter Inspection Toolkit: Making Sense Out of Hardware Performance Events

  • Conference paper

Abstract

Hardware counters play an essential role in understanding the behavior of performance-critical applications, and inform any effort to identify opportunities for performance optimization. However, because modern hardware is becoming increasingly complex, the number of counters offered by the vendors increases and, in some cases, so does their complexity. In this paper we present a toolkit that aims to assist application developers invested in performance analysis by automatically categorizing and disambiguating performance counters. We present and discuss the set of microbenchmarks and analyses that we developed as part of our toolkit. We explain why they work and discuss the non-obvious reasons why some of our early benchmarks and analyses did not work, in an effort to share with the rest of the community the wisdom we acquired from negative results.
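To make the raw material of such an analysis concrete, the short C sketch below (not part of the paper itself; it assumes only the PAPI library cited as [1]) enumerates PAPI's preset events and reports which of them are actually supported on the host. This inventory of vendor-exposed counters is what a toolkit like the one described here sets out to categorize and disambiguate.

    #include <stdio.h>
    #include <papi.h>

    int main(void)
    {
        /* Initialize PAPI; the return value must match the header version. */
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
            fprintf(stderr, "PAPI_library_init failed\n");
            return 1;
        }

        /* Walk the list of preset events (PAPI_TOT_INS, PAPI_BR_MSP, ...). */
        int code = 0 | PAPI_PRESET_MASK;
        if (PAPI_enum_event(&code, PAPI_ENUM_FIRST) != PAPI_OK)
            return 1;

        do {
            PAPI_event_info_t info;
            if (PAPI_get_event_info(code, &info) == PAPI_OK) {
                /* info.count > 0 means the preset maps to at least one
                   native counter on this machine, i.e., it is available. */
                printf("%-18s %-12s %s\n", info.symbol,
                       info.count ? "available" : "unavailable",
                       info.long_descr);
            }
        } while (PAPI_enum_event(&code, PAPI_ENUM_EVENTS) == PAPI_OK);

        return 0;
    }

Building this against PAPI (e.g., cc list_presets.c -lpapi) and running it on different machines already shows how much the exposed counter set varies from one processor to the next, which is part of the motivation for automated categorization.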


Notes

  1. The actual count is not zero, but rather a small number due to noise caused by code not shown in the figures, such as the calls to PAPI_start() and PAPI_stop(). However, in our experiments this number did not grow when the size variable was varied, so for large iteration counts the fraction of mispredicted branches approaches zero (see the measurement sketch after these notes).

  2. Other, more sophisticated goodness functions, such as Pearson's \(\chi^2\) test [7] (reproduced after these notes), could be used to assist in the analysis of the measurements, but in our experiments we found that the simple formula in Eq. 1 is sufficient.
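The measurement described in note 1 can be illustrated with the following C sketch (not taken from the paper; the event choice, the loop body, and the size parameter are assumptions). It counts mispredicted branches with PAPI around a loop whose only branch is perfectly predictable, so the reported count should stay near a small, roughly constant noise floor no matter how large size becomes:

    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    int main(int argc, char **argv)
    {
        /* Hypothetical iteration count; note 1 refers to varying it. */
        long long size = (argc > 1) ? atoll(argv[1]) : 100000000LL;
        long long mispredicted = 0;
        volatile long long sum = 0;
        volatile int always_true = 1;  /* keeps the branch from being optimized away */
        int evset = PAPI_NULL;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
        if (PAPI_create_eventset(&evset) != PAPI_OK) return 1;
        /* PAPI_BR_MSP: mispredicted conditional branch instructions. */
        if (PAPI_add_named_event(evset, "PAPI_BR_MSP") != PAPI_OK) return 1;

        PAPI_start(evset);
        for (long long i = 0; i < size; i++) {
            if (always_true)      /* always taken: trivially predictable */
                sum += i;
        }
        PAPI_stop(evset, &mispredicted);

        printf("iterations=%lld  mispredicted=%lld  fraction=%g\n",
               size, mispredicted, (double)mispredicted / (double)size);
        return 0;
    }

If the counter behaves as its name suggests, the printed fraction shrinks toward zero as size grows, which is exactly the effect the note describes.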
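For completeness, the Pearson \(\chi^2\) statistic mentioned in note 2 as a more sophisticated alternative has the standard form below (this is the textbook definition from [7]; the paper's own goodness function, Eq. 1, is not reproduced on this page), where \(O_i\) are observed and \(E_i\) expected values over \(n\) measurements:

\[
  \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}
\]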

References

  1. Browne, S., Dongarra, J., Garner, N., Ho, G., Mucci, P.: A portable programming interface for performance evaluation on modern processors. Int. J. High Perform. Comput. Appl. 14(3), 189–204 (2000)

  2. Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3B: System Programming Guide, Part 2 (2017)

  3. Pearson, K.: Notes on regression and inheritance in the case of two parents. Proc. R. Soc. Lond. 58, 240–242 (1895)

  4. Danalis, A., Luszczek, P., Marin, G., Vetter, J.S., Dongarra, J.: Blackjackbench: portable hardware characterization with automated results analysis. Comput. J. 57(7), 1002 (2014)

  5. McVoy, L., Staelin, C.: lmbench: portable tools for performance analysis. In: Proceedings of the USENIX 1996 Annual Technical Conference (ATEC '96), pp. 23–23. USENIX Association, Berkeley, CA, USA, 24–26 Jan 1996

  6. Mucci, P.J., London, K.: The CacheBench Report. Technical report, Computer Science Department, University of Tennessee, Knoxville, TN (1998)

  7. Pearson, K.: On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philos. Mag. 5(50), 157–175 (1900)

  8. Molnar, I.: perf: Linux profiling with performance counters (2009). https://perf.wiki.kernel.org/

  9. Wolf III, J.H.: Programming Methods for the Pentium III Processor's Streaming SIMD Extensions Using the VTune™ Performance Enhancement Environment. Intel Corporation (1999)

  10. Intel Performance Tuning Utility. http://software.intel.com/en-us/articles/intel-performance-tuning-utility/

  11. Drongowski, P.J.: An introduction to analysis and optimization with AMD CodeAnalyst™ Performance Analyzer. Advanced Micro Devices, Inc. (2008)

  12. Treibig, J., Hager, G., Wellein, G.: LIKWID: a lightweight performance-oriented tool suite for x86 multicore environments. In: Proceedings of the First International Workshop on Parallel Software Tools and Tool Infrastructures, September 2010

  13. Dongarra, J., Moore, S., Mucci, P., Seymour, K., You, H.: Accurate cache and TLB characterization using hardware counters. In: Bubak, M., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) International Conference on Computational Science, Lecture Notes in Computer Science, vol. 3036, Part III, pp. 432–439. Springer, Heidelberg, Kraków, Poland, June 2004. ISBN 3-540-22114-X

  14. Duchateau, A.X., Sidelnik, A., Garzarán, M.J., Padua, D.A.: P-ray: a suite of micro-benchmarks for multi-core architectures. In: Proceedings of the 21st International Workshop on Languages and Compilers for Parallel Computing (LCPC '08) (2008)

  15. González-Domínguez, J., Taboada, G.L., Fraguela, B.B., Martín, M.J., Touriño, J.: Servet: a benchmark suite for autotuning on multicore clusters. In: IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pp. 1–10. IEEE Computer Society, Atlanta, GA, 19–23 Apr 2010. https://doi.org/10.1109/IPDPS.2010.5470358

  16. Molka, D., Hackenberg, D., Schöne, R., Müller, M.S.: Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system. In: Proceedings of the 18th International Conference on Parallel Architectures and Compilation Techniques (PACT '09), pp. 261–270, Raleigh, North Carolina, September 12–16. IEEE Computer Society, Washington, DC, USA (2009)

  17. Staelin, C., McVoy, L.: mhz: Anatomy of a micro-benchmark. In: USENIX 1998 Annual Technical Conference, pp. 155–166. USENIX Association, New Orleans, Louisiana, 15–18 Jan 1998

  18. Yotov, K., Jackson, S., Steele, T., Pingali, K., Stodghill, P.: Automatic measurement of instruction cache capacity. In: Proceedings of the 18th Workshop on Languages and Compilers for Parallel Computing (LCPC), pp. 230–243. Springer, Hawthorne, New York, 20–22 Oct 2005

  19. Yotov, K., Pingali, K., Stodghill, P.: Automatic measurement of memory hierarchy parameters. SIGMETRICS Perform. Eval. Rev. 33(1), 181–192 (2005)


Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 1450429.

Author information

Correspondence to Anthony Danalis.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Danalis, A., Jagode, H., Hanumantharayappa, Ragate, S., Dongarra, J. (2019). Counter Inspection Toolkit: Making Sense Out of Hardware Performance Events. In: Niethammer, C., Resch, M., Nagel, W., Brunst, H., Mix, H. (eds.) Tools for High Performance Computing 2017. PTHPC 2017. Springer, Cham. https://doi.org/10.1007/978-3-030-11987-4_2
