Data Partitioning Strategies for Stencil Computations on NUMA Systems

  • Frank Feinbube
  • Max Plauth
  • Marius Knaust
  • Andreas Polze
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10659)


Many scientific problems rely on the efficient execution of stencil computations, which are usually memory-bound. In this paper, stencils on two-dimensional data are executed on NUMA architectures. Each node of a NUMA system processes a distinct partition of the input data independently of the other nodes. At the edges of the partitions, however, processors may need to access the memory of other nodes. This paper demonstrates two machine-learning-based techniques for identifying partitioning strategies that reduce the occurrence of remote memory accesses. The first approach is generally applicable and is based on an uninformed search. The second approach bounds the search space by employing geometric decomposition. The partitioning strategies obtained with these techniques are analyzed theoretically. Finally, an evaluation on a real NUMA machine demonstrates that the expected reduction of remote memory accesses can be achieved.
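As a rough illustration of why the choice of partitioning strategy matters, the following sketch (not taken from the paper; all names and numbers are illustrative) counts partition-boundary cells, a simple proxy for remote memory accesses under a 5-point stencil, for a 1-D strip decomposition versus a 2-D geometric decomposition of an n × n grid among p NUMA nodes.

```python
# Sketch: compare partition-boundary cell counts (a proxy for remote
# memory accesses with a 5-point stencil) of two partitioning strategies
# for an n x n grid split among p NUMA nodes. Corner cells that lie on
# both a horizontal and a vertical cut are counted once per cut.

def strip_boundary_cells(n, p):
    # p horizontal strips: p - 1 internal cuts, each exposing two
    # rows of n cells (one on each side) to remote accesses.
    return 2 * n * (p - 1)

def block_boundary_cells(n, p_rows, p_cols):
    # p_rows x p_cols rectangular blocks: horizontal cuts expose rows,
    # vertical cuts expose columns, each cut spanning the full grid.
    horizontal = 2 * n * (p_rows - 1)
    vertical = 2 * n * (p_cols - 1)
    return horizontal + vertical

if __name__ == "__main__":
    n, p = 1024, 8
    strips = strip_boundary_cells(n, p)     # 1-D decomposition into 8 strips
    blocks = block_boundary_cells(n, 4, 2)  # 2-D decomposition into 4 x 2 blocks
    print(strips, blocks)  # prints "14336 8192"
```

Even this crude model shows the effect the paper exploits: the 2-D decomposition touches noticeably fewer boundary cells than strips for the same node count, which is why constraining the search to geometric decompositions remains effective.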


Keywords: NUMA, Stencil computation, Data partitioning


Acknowledgement and Disclaimer

This paper has received funding from the European Union’s Horizon 2020 research and innovation programme 2014–2018 under grant agreement No. 644866. This paper reflects only the authors’ views and the European Commission is not responsible for any use that may be made of the information it contains.



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Frank Feinbube (1)
  • Max Plauth (1)
  • Marius Knaust (1)
  • Andreas Polze (1)

  1. Operating Systems and Middleware Group, Hasso Plattner Institute for Software Systems Engineering, University of Potsdam, Potsdam, Germany
