Profiling and Debugging Support for the Kokkos Programming Model

  • Simon D. HammondEmail author
  • Christian R. Trott
  • Daniel Ibanez
  • Daniel Sunderland
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11203)


Supercomputing hardware is undergoing a period of significant change. In order to cope with the rapid pace of hardware and, in many cases, programming model innovation, we have developed the Kokkos Programming Model – a C++-based abstraction that permits performance portability across diverse architectures. Our experience has shown that the abstractions developed can significantly frustrate debugging and profiling activities because they break expected code proximity and layout assumptions. In this paper we present the Kokkos Profiling interface, a lightweight, suite of hooks to which debugging and profiling tools can attach to gain deep insights into the execution and data structure behaviors of parallel programs written to the Kokkos interface.


Profiling Performance-portable Programming models 



Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.


  1. 1.
    Bartlett, R.A.: Teuchos C++ memory management classes, idioms, and related topics, the complete reference: a comprehensive strategy for safe and efficient memory management in C++ for high performance computing. Technical report, SAND2010-2234, Sandia National Laboratories (2010)Google Scholar
  2. 2.
    Bell, N., Hoberock, J.: Thrust: a productivity-oriented library for CUDA. In: GPU Computing Gems Jade Edition, pp. 359–371. Elsevier (2011)Google Scholar
  3. 3.
    Cownie, J., DelSignore, J., de Supinski, B.R., Warren, K.: DMPL: an OpenMP DLL debugging interface. In: Voss, M.J. (ed.) WOMPAT 2003. LNCS, vol. 2716, pp. 137–146. Springer, Heidelberg (2003). Scholar
  4. 4.
    Dagum, L., Menon, R.: OpenMP: an industry standard API for shared-memory programming. IEEE Comput. Sci. Eng. 5(1), 46–55 (1998)CrossRefGoogle Scholar
  5. 5.
    Edwards, H.C., Sunderland, D., Porter, V., Amsler, C., Mish, S.: Manycore performance-portability: kokkos multidimensional array library. Sci. Program. 20(2), 89–114 (2012)Google Scholar
  6. 6.
    Edwards, H.C., Trott, C.R.: Kokkos: enabling performance portability across manycore architectures. In: Extreme Scaling Workshop (XSW), pp. 18–24. IEEE (2013)Google Scholar
  7. 7.
    Edwards, H.C., Trott, C.R., Sunderland, D.: Kokkos: enabling manycore performance portability through polymorphic memory access patterns. J. Parallel Distrib. Comput. 74(12), 3202–3216 (2014)CrossRefGoogle Scholar
  8. 8.
    Eichenberger, A., et al.: OMPT and OMPD: OpenMP tools application programming interfaces for performance analysis and debugging. In: International Workshop on OpenMP (IWOMP 2013) (2013)Google Scholar
  9. 9.
    Eichenberger, A.E., et al.: OMPT: an OpenMP tools application programming interface for performance analysis. In: Rendell, A.P., Chapman, B.M., Müller, M.S. (eds.) IWOMP 2013. LNCS, vol. 8122, pp. 171–185. Springer, Heidelberg (2013). Scholar
  10. 10.
    Foley, D., Danskin, J.: Ultra-performance pascal GPU and NVLink interconnect. IEEE Micro 37(2), 7–17 (2017)CrossRefGoogle Scholar
  11. 11.
    Hammarlund, P., et al.: Haswell: the fourth-generation intel core processor. IEEE Micro 34(2), 6–20 (2014)CrossRefGoogle Scholar
  12. 12.
    Heroux, M.A., et al.: An overview of the trilinos project. ACM Trans. Math. Softw. (TOMS) 31(3), 397–423 (2005)MathSciNetCrossRefGoogle Scholar
  13. 13.
    Jain, T., Agrawal, T.: The haswell microarchitecture - 4th generation processor. Int. J. Comput. Sci. Inf. Technol. 4(3), 477–480 (2013)Google Scholar
  14. 14.
    Killian, W., Scogland, T., Kunen, A., Cavazos, J.: The design and implementation of OpenMP 4.5 and OpenACC backends for the RAJA C++ performance portability layer. In: Chandrasekaran, S., Juckeland, G. (eds.) WACCPD 2017. LNCS, vol. 10732, pp. 63–82. Springer, Cham (2018). Scholar
  15. 15.
    Knüpfer, A., et al.: Score-P: a joint performance measurement run-time infrastructure for Periscope, Scalasca, TAU, and Vampir. In: Brunst, H., Müller, M., Nagel, W., Resch, M. (eds.) Tools for High Performance Computing 2011, pp. 79–91. Springer, Heidelberg (2012). Scholar
  16. 16.
    Messina, P.: The U.S. D.O.E. Exascale Computing Project – Goals and Challenges, February 2017Google Scholar
  17. 17.
    Nalamalpu, A., et al.: Broadwell: a family of IA 14nm processors. In: 2015 Symposium on VLSI Circuits (VLSI Circuits), pp. C314–C315. IEEE (2015)Google Scholar
  18. 18.
    Pheatt, C.: Intel threading building blocks. J. Comput. Sci. Coll. 23(4), 298–298 (2008)Google Scholar
  19. 19.
    Sadasivam, S.K., Thompto, B.W., Kalla, R., Starke, W.J.: IBM Power9 processor architecture. IEEE Micro 37(2), 40–51 (2017). Scholar
  20. 20.
    Schulz, M., Galarowicz, J., Maghrak, D., Hachfeld, W., Montoya, D., Cranford, S.: Open|SpeedShop: an open source infrastructure for parallel performance analysis. Sci. Programm. 16(2–3), 105–121 (2008)Google Scholar
  21. 21.
    Shende, S.S., Malony, A.D.: The TAU parallel performance system. Int. J. High Perform. Comput. Appl. 20(2), 287–311 (2006)CrossRefGoogle Scholar
  22. 22.
    Sodani, A.: Knights landing (KNL): 2nd generation Intel Xeon Phi processor. In: 2015 IEEE Hot Chips 27 Symposium (HCS), pp. 1–24. IEEE (2015)Google Scholar
  23. 23.
    Sodani, A., et al.: Knights landing: second-generation Intel Xeon Phi product. IEEE Micro 36(2), 34–46 (2016)CrossRefGoogle Scholar
  24. 24.
    Tallent, N., Mellor-Crummey, J., Adhianto, L., Fagan, M., Krentel, M.: HPCToolkit: performance tools for scientific computing. In: Journal of Physics: Conference Series, vol. 125, p. 012088. IOP Publishing (2008)Google Scholar
  25. 25.
    Thompto, B.: POWER9: processor for the cognitive era. In: 2016 IEEE Hot Chips 28 Symposium (HCS), pp. 1–19. IEEE (2016)Google Scholar
  26. 26.
    Zenker, E., et al.: Alpaka-an abstraction library for parallel kernel acceleration. In: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, pp. 631–640. IEEE (2016)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Simon D. Hammond
    • 1
    Email author
  • Christian R. Trott
    • 1
  • Daniel Ibanez
    • 1
  • Daniel Sunderland
    • 1
  1. 1.Center for Computing ResearchSandia National LaboratoriesAlbuquerqueUSA

Personalised recommendations