NUMAPROF, A NUMA Memory Profiler

  • Sébastien Valat
  • Othman Bouizi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11339)


The number of cores in HPC systems and servers has increased considerably over the last few years. To increase the available memory bandwidth and capacity accordingly, most systems have become NUMA (Non-Uniform Memory Access), meaning that each processor has its own memory and can share it with the others. Although access to remote memory is transparent to the developer, it comes with lower bandwidth and higher latency, and it can heavily impact application performance if it happens too often. Handling this memory locality in multi-threaded applications is a challenging task. To help the developer, we developed NUMAPROF, a memory profiling tool that pinpoints local and remote memory accesses in the source code, using the same approach as MALT, a memory allocation profiling tool. The paper offers a full review of the capabilities of NUMAPROF on mainstream HPC workloads. In addition to its dedicated interface, the tool provides hints about unpinned memory accesses (unpinned thread or unpinned page), which can help the developer find portions of code that do not safely handle NUMA binding. The tool also provides dedicated metrics to track accesses to the MCDRAM of the Intel Xeon Phi codenamed Knights Landing. To operate, the tool instruments the application using Pin, a parallel binary instrumentation framework from Intel. NUMAPROF also has the particularity of using the OS memory mapping without relying on hardware counters or OS simulation. This makes it possible to understand what really happened on the system without requiring dedicated hardware support.
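The classification described above (local, remote, unpinned thread, unpinned page) can be illustrated with a small sketch. It is not NUMAPROF's implementation: the function names, the category labels, the 4 KiB page size, and the dictionary-based page table and thread binding are all assumptions made for illustration, and the real tool obtains this information from Pin instrumentation and the OS memory mapping rather than from hand-written tables.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical, for illustration)


def classify_access(thread_node: Optional[int],
                    page_info: Optional[Tuple[int, bool]]) -> str:
    """Classify one access.

    thread_node: NUMA node the thread is bound to, or None if it is unpinned.
    page_info:   (node holding the page, True if the page was placed by a
                 pinned thread), or None if the page is unknown.
    """
    if thread_node is None:
        return "unpinned_thread"
    if page_info is None:
        return "unmapped"
    page_node, placed_by_pinned_thread = page_info
    if not placed_by_pinned_thread:
        return "unpinned_page"
    return "local" if thread_node == page_node else "remote"


def profile(accesses: Iterable[Tuple[int, int]],
            thread_binding: Dict[int, Optional[int]],
            page_table: Dict[int, Tuple[int, bool]]) -> Counter:
    """Count access categories for a stream of (thread_id, address) pairs."""
    counts: Counter = Counter()
    for thread_id, address in accesses:
        page = address // PAGE_SIZE
        category = classify_access(thread_binding.get(thread_id),
                                   page_table.get(page))
        counts[category] += 1
    return counts


# Hypothetical example: threads 0 and 1 are pinned, thread 2 is not.
thread_binding = {0: 0, 1: 1, 2: None}
# Pages 0 and 1 were placed by pinned threads, page 2 was not.
page_table = {0: (0, True), 1: (1, True), 2: (0, False)}
accesses = [(0, 100), (0, 4104), (1, 4100), (2, 0), (1, 8192)]

result = profile(accesses, thread_binding, page_table)
```

In this toy trace, thread 0 touching page 1 counts as a remote access because the page resides on node 1, while the access from thread 2 is reported as unpinned because no node can be attributed to the thread.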


NUMA Memory Profiler Instrumentation Pin Access Remote MCDRAM KNL 



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. CERN, Meyrin, Switzerland
  2. Intel, Meudon, France
