Evaluating OpenMP Affinity on the POWER8 Architecture
As we move toward pre-Exascale systems, two of the DOE leadership class systems will consist of very powerful OpenPOWER compute nodes which will be more complex to program. These systems will have massive amounts of parallelism, where threads may be running on POWER9 cores as well as on accelerators. Advances in memory interconnects, such as NVLINK, will provide a unified shared memory address space for different types of memory (HBM, DRAM, etc.). In preparation for such systems, we need to improve our understanding of how OpenMP supports the concept of affinity as well as memory placement on POWER8 systems. Data locality and affinity are key program optimizations for exploiting the compute and memory capabilities of a node: they achieve good performance by minimizing data motion across NUMA domains and by accessing the cache efficiently. This paper is a first step toward evaluating the current features of OpenMP 4.0 on POWER8 processors and toward measuring their effects on a system with two POWER8 sockets. We experiment with the different affinity settings provided by OpenMP 4.0 to quantify the cost of having good data locality versus not, and measure their effects via hardware counters. We also identify which affinity settings benefit most from data locality. Based on this study, we describe the current state of the art, the challenges we faced in quantifying the effects of affinity, and ideas on how OpenMP 5.0 should be improved to address affinity in the context of NUMA domains and accelerators.
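To make the notions of thread affinity and memory placement concrete, the following is a minimal sketch in C, not the benchmark code used in this study. It assumes an operating system with a first-touch page placement policy (the usual Linux default), which the text above does not state; the function names `alloc_first_touch` and `sum_local` are hypothetical. The `proc_bind` clause is the OpenMP 4.0 mechanism for requesting a binding policy.

```c
#include <stddef.h>
#include <stdlib.h>

/* Allocate an array and initialize it in parallel.
 * Assumption: under a first-touch policy, each page is placed in the
 * NUMA domain of the thread that first writes it, so this loop decides
 * where the data lives. */
double *alloc_first_touch(size_t n)
{
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;

    /* proc_bind(spread) asks the runtime to distribute threads across
     * the available places (OpenMP 4.0). */
    #pragma omp parallel for schedule(static) proc_bind(spread)
    for (size_t i = 0; i < n; ++i)
        a[i] = 1.0;

    return a;
}

/* Sum the array with the same static schedule and binding policy, so
 * each thread reads the chunk it initialized. */
double sum_local(const double *a, size_t n)
{
    double s = 0.0;

    #pragma omp parallel for schedule(static) proc_bind(spread) reduction(+:s)
    for (size_t i = 0; i < n; ++i)
        s += a[i];

    return s;
}
```

When built with OpenMP enabled, the places themselves can be chosen at launch time through the standard OpenMP 4.0 environment variables, for example `OMP_PLACES=sockets` or `OMP_PLACES=cores`, with `OMP_PROC_BIND` set to `close` or `spread`. Comparing such settings against an unbound run is one way to observe the locality effects discussed above. Without OpenMP support the pragmas are ignored and the functions run serially.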
Keywords: Cache Line, NUMA Domain, POWER8 Processor, Memory Subsystem, Memory Placement
This material is based upon work supported by the U.S. Department of Energy, Office of Science under the Advanced Scientific Computing Research (ASCR) program. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.