Lightweight Instrumentation and Analysis Using OpenSHMEM Performance Counters

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 11283)

Abstract

Partitioned Global Address Space (PGAS) programming models, such as OpenSHMEM, are popular methods of parallel programming; however, performance monitoring and analysis tools for these models have remained elusive. In this work, we propose a performance counter extension to the OpenSHMEM interfaces to expose internal communication state as lightweight performance data to tools. We implement our interface in the open source Sandia OpenSHMEM library and demonstrate its mapping to libfabric primitives. Next, we design a simple collector tool to record the behavior of OpenSHMEM processes at execution time. We analyze the Integer Sort (ISx) benchmark and use the resulting data to investigate several common performance issues—including communication schedule, poor overlap, and load imbalance—and visualize the impact of optimizations to correct these issues. Through this study, our tool uncovered a performance bug in this popular benchmark. Finally, by using our tool to guide the application of several pipelining optimizations, we were able to improve the ISx key exchange performance by more than 30%.

Notes

  1. Other names and brands may be claimed as the property of others.

  2. Intel and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

    Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as “Spectre” and “Meltdown”. Implementation of these updates may make these results inapplicable to your device or system.

    Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

    For more information go to http://www.intel.com/benchmarks.

Author information

Corresponding author

Correspondence to Md. Wasi-ur-Rahman.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Rahman, M.Wu., Ozog, D., Dinan, J. (2019). Lightweight Instrumentation and Analysis Using OpenSHMEM Performance Counters. In: Pophale, S., Imam, N., Aderholdt, F., Gorentla Venkata, M. (eds) OpenSHMEM and Related Technologies. OpenSHMEM in the Era of Extreme Heterogeneity. OpenSHMEM 2018. Lecture Notes in Computer Science, vol 11283. Springer, Cham. https://doi.org/10.1007/978-3-030-04918-8_12

  • DOI: https://doi.org/10.1007/978-3-030-04918-8_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-04917-1

  • Online ISBN: 978-3-030-04918-8

  • eBook Packages: Computer Science, Computer Science (R0)
