Path-Synchronous Performance Monitoring in HPC Interconnection Networks with Source-Code Attribution

  • Adarsh YogaEmail author
  • Milind Chabbi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10724)


Performance anomalies involving interconnection networks have largely remained a “black box” for developers relying on traditional CPU profilers. Network-side profilers collect aggregate statistics and lack source-code attribution. We have incorporated an effective protocol extension in the Gen-Z communication protocol for tagging network packets in an interconnection network; additionally, we have backed the protocol extension with hardware and software enhancements that allow tracking the flow of a network transaction through every hop in the interconnection network and associate it back to the application source code. The result is a first-of-its-kind hardware-assisted telemetry of disparate, autonomous interconnection networking components with application source code association that offers better developer insights. Our scheme works on a sampling basis to ensure low runtime overhead and generates modest volumes of data. Simulation of our methods in the open-source Structural Simulation Toolkit (SST/Macro) shows its effectiveness—deep insights into the underlying network details to the developer at minimal overheads.



This work was supported (in part) by the US Department of Energy (DOE) under Cooperative Agreement DE-SC0012199, the Blackcomb 2 Project.

Supplementary material


  1. 1.
    Oliker, L., Canning, A., Carter, J., Shalf, J., Ethier, S.: Scientific application performance on leading scalar and vector supercomputing platforms. Int. J. High Perform. Comput. Appl. 22(1), 5–20 (2006)CrossRefGoogle Scholar
  2. 2.
    Dongarra, J., Heroux, M.A.: Toward a new metric for ranking high performance computing systems. Sandia report, SAND2013-4744 312, p. 150 (2013)Google Scholar
  3. 3.
    Egawa, R., Komatsu, K., Momose, S., Isobe, Y., Musa, A., Takizawa, H., Kobayashi, H.: Potential of a modern vector supercomputer for practical applications: performance evaluation of SX-ACE. J. Supercomput., March 2017Google Scholar
  4. 4.
  5. 5.
    Adhianto, L., Banerjee, S., Fagan, M., Krentel, M., Marin, G., Mellor-Crummey, J., Tallent, N.R.: HPCToolkit: tools for performance analysis of optimized parallel programs. Concurr. Comput. Pract. Exp. 22(6), 685–701 (2010)Google Scholar
  6. 6.
    Geimer, M., Wolf, F., Wylie, B.J.N., Ábrahám, E., Becker, D., Mohr, B.: The Scalasca performance toolset architecture. Concurr. Comput. Pract. Exp. 22(6), 702–719 (2010)Google Scholar
  7. 7.
    Shende, S.S., Malony, A.D.: The Tau parallel performance system. Int. J. High Perform. Comput. Appl. 20(2), 287–311 (2006)CrossRefGoogle Scholar
  8. 8.
  9. 9.
    Intel Inc.: Intel Trace Analyzer and Collector, October 2017.
  10. 10.
    Allinea Inc.: Allinea MAP - C/C++ profiler and Fortran profiler for high performance Linux code, October 2017.
  11. 11.
    Liu, X., Mellor-Crummey, J.: A data-centric profiler for parallel programs. In: Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, vol. 28 (2013)Google Scholar
  12. 12.
    Rane, A., Browne, J.: Enhancing performance optimization of multicore chips and multichip nodes with data structure metrics. In: Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques, Minneapolis, MN, USA. IEEE Computer Society (2012)Google Scholar
  13. 13.
    Böhme, D., Geimer, M., Arnold, L., Voigtlaender, F., Wolf, F.: Identifying the root causes of wait states in large-scale parallel applications. ACM Trans. Parallel Comput. 3(2), 11:1–11:24 (2016)CrossRefGoogle Scholar
  14. 14.
    Isaacs, K.E., Gamblin, T., Bhatele, A., Schulz, M., Hamann, B., Bremer, P.T.: Ordering traces logically to identify lateness in message passing programs. IEEE Trans. Parallel Distrib. Syst. 27(3), 829–840 (2016)CrossRefGoogle Scholar
  15. 15.
    Weber, M., Brendel, R., Hilbrich, T., Mohror, K., Schulz, M., Brunst, H.: Structural clustering: a new approach to support performance analysis at scale. In: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 484–493, May 2016Google Scholar
  16. 16.
    Isaacs, K.E., Giménez, A., Jusufi, I., Gamblin, T., Bhatele, A., Schulz, M., Hamann, B., Bremer, P.T.: State of the art of performance visualization. In: Borgo, R., Maciejewski, R., Viola, I. (eds.) EuroVis - STARs. The Eurographics Association (2014)Google Scholar
  17. 17.
    Valiev, M., Bylaska, E., Govind, N., Kowalski, K., Straatsma, T., Dam, H.V., Wang, D., Nieplocha, J., Apra, E., Windus, T., de Jong, W.: NWChem: a comprehensive and scalable open-source solution for large scale molecular simulations. Comput. Phys. Commun. 181(9), 1477–1489 (2010)CrossRefzbMATHGoogle Scholar
  18. 18.
    Kim, J., Dally, W.J., Scott, S., Abts, D.: Technology-driven, highly-scalable dragonfly topology. In: Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA 2008, Washington, DC, USA, pp. 77–88. IEEE Computer Society (2008)Google Scholar
  19. 19.
    Alverson, B., Kaplan, L., Roweth, D.: Cray XC Series Network.
  20. 20.
    National Energy Research Scientific Computing Center: Edison.
  21. 21.
    Mellor-Crummey, J.M., Scott, M.L.: Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst. 9(1), 21–65 (1991)CrossRefGoogle Scholar
  22. 22.
    Gen-Z Consortium: Gen-Z: Draft Core Specification, July 2017.
  23. 23.
  24. 24.
    Zaki, O., Lusk, E., Gropp, W., Swider, D.: Toward scalable performance visualization with jumpshot. High Perf. Comput. Appl. 13(2), 277–288 (1999)CrossRefGoogle Scholar
  25. 25.
    Karrels, E., Lusk, E.: Performance analysis of MPI programs. In: Dongarra, J., Tourancheau, B. (eds.) Proceedings of the Workshop on Environments and Tools For Parallel Scientific Computing, pp. 195–200. SIAM Publications (1994)Google Scholar
  26. 26.
    Knüpfer, A., Brunst, H., Doleschal, J., Jurenz, M., Lieber, M., Mickler, H., Müller, M.S., Nagel, W.E.: The vampir performance analysis tool-set. Tools High Perf. Comput. 139–155 (2008)Google Scholar
  27. 27.
    McDonald, N.: SuperSim: a flexible event-driven cycle-accurate network simulator.
  28. 28.
    Carothers, C.: ROSS: Rensselaer’s Optimistic Simulation System.
  29. 29.
    Carothers, C.D., Bauer, D., Pearce, S.: ROSS: a high-performance, low memory, modular time warp system. In: Proceedings of the Fourteenth Workshop on Parallel and Distributed Simulation, PADS 2000, Washington, DC, USA, pp. 53–60. IEEE Computer Society (2000)Google Scholar
  30. 30.
    Liu, N., Carothers, C., Cope, J., Carns, P., Ross, R.: Model and simulation of exascale communication networks. J. Simul. 6(4), 227–236 (2012)CrossRefGoogle Scholar
  31. 31.
    Jain, N., Bhatele, A., White, S., Gamblin, T., Kale, L.V.: Evaluating hpc networks via simulation of parallel workloads. In: SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 154–165, November 2016Google Scholar
  32. 32.
    Rodrigues, A.F., Hemmert, K.S., Barrett, B.W., Kersey, C., Oldfield, R., Weston, M., Risen, R., Cook, J., Rosenfeld, P., CooperBalls, E., Jacob, B.: The structural simulation toolkit. SIGMETRICS Perform. Eval. Rev. 38(4), 37–42 (2011)CrossRefGoogle Scholar
  33. 33.
    So-In, C.: A survey of network traffic monitoring and analysis tools.
  34. 34.
  35. 35.
    sFlow organization: sFlow.
  36. 36.
    Hewlett Packard Labs: Network Performance Monitoring (NWPM) Tool.
  37. 37.
    Plotly Technologies Inc.: Collaborative data science (2015).

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. 1.Hewlett Packard LabsPalo AltoUSA
  2. 2.Rutgers UniversityPiscatawayUSA
  3. 3.Scalable Machines ResearchCupertinoUSA

Personalised recommendations