Abstract
Kernels with low arithmetic intensity and a memory footprint exceeding cache sizes are typically categorised as memory bandwidth bound: their performance is limited by the hardware memory bandwidth. In this work we contribute a simple memory access pattern, derived from a widely used upwinded stencil-style benchmark, which presents significant challenges for cache-based architectures. The problem appears to grow worse as CPU core counts increase, and the pattern in its initial form shows no benefit from the high-bandwidth memory now appearing on the Intel Xeon Phi (Knights Landing) family of processors. We describe the memory access scenarios which appear to be causing lower than expected cache performance, before presenting optimisations to mitigate the problem. These optimisations improve both effective memory bandwidth and runtime by up to 4X on cache-based architectures. Results are presented on the Intel Xeon (Broadwell) and Xeon Phi (Knights Landing) processors.
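As an illustration of the "memory bandwidth bound" classification used above, the following minimal sketch applies the standard roofline bound (attainable performance is the smaller of peak compute and bandwidth multiplied by arithmetic intensity) to a STREAM-style triad. The function names and the peak and bandwidth figures are hypothetical placeholders chosen for the example, not measurements from this paper, and the byte count ignores write-allocate traffic.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound: the lower of the compute peak and bandwidth * intensity."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def is_bandwidth_bound(peak_gflops, bandwidth_gbs, intensity):
    """True when the memory roof lies below the compute roof for this kernel."""
    return bandwidth_gbs * intensity < peak_gflops


# Triad a[i] = b[i] + s * c[i] on doubles: one multiply and one add per element,
# two 8-byte loads and one 8-byte store (write-allocate traffic ignored).
TRIAD_FLOPS = 2
TRIAD_BYTES = 3 * 8
triad_intensity = TRIAD_FLOPS / TRIAD_BYTES  # FLOP per byte

# Hypothetical machine, for illustration only (not figures from the paper).
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

triad_bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, triad_intensity)
triad_is_bw_bound = is_bandwidth_bound(PEAK_GFLOPS, BANDWIDTH_GBS, triad_intensity)
```

On this hypothetical machine the triad's bound is set entirely by the bandwidth term, which is why the effective memory bandwidth a kernel achieves is the relevant figure of merit for this class of kernel.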
Acknowledgement
We would like to thank John Pennycook and Andrew Mallinson of Intel Corporation for their assistance with this work. The mega-stream code is made available from the UK Mini-App Consortium on GitHub at https://github.com/UK-MAC/mega-stream. The University of Bristol is an Intel Parallel Computing Center, and the authors would like to thank Intel Corporation for the provision of the Intel Xeon Phi (Knights Landing) Processor. The authors would like to thank Cray Inc. for providing access to the Cray XC40 supercomputer, “Swan”.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Deakin, T., Gaudin, W., McIntosh-Smith, S. (2017). On the Mitigation of Cache Hostile Memory Access Patterns on Many-Core CPU Architectures. In: Kunkel, J., Yokota, R., Taufer, M., Shalf, J. (eds) High Performance Computing. ISC High Performance 2017. Lecture Notes in Computer Science, vol 10524. Springer, Cham. https://doi.org/10.1007/978-3-319-67630-2_26
DOI: https://doi.org/10.1007/978-3-319-67630-2_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67629-6
Online ISBN: 978-3-319-67630-2
eBook Packages: Computer Science (R0)