On the Mitigation of Cache Hostile Memory Access Patterns on Many-Core CPU Architectures

Deakin, Tom; Gaudin, Wayne; McIntosh-Smith, Simon

doi:10.1007/978-3-319-67630-2_26

Conference paper
First Online: 20 October 2017

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10524)

Abstract

Kernels with low arithmetic intensity and a memory footprint exceeding cache sizes are typically categorised as memory bandwidth bound: their performance is limited by the memory bandwidth the hardware can deliver. In this work we contribute a simple memory access pattern, derived from a widely used upwinded stencil-style benchmark, which presents significant challenges for cache-based architectures. The problem appears to grow worse as CPU core counts increase, and the pattern in its initial form shows no benefit from the new high-bandwidth memory now appearing on the Intel Xeon Phi (Knights Landing) family of processors. We describe the memory access scenarios which appear to cause the lower-than-expected cache performance, before presenting optimisations to mitigate the problem. These optimisations improve effective memory bandwidth and runtime by up to 4x on cache-based architectures. Results are presented on the Intel Xeon (Broadwell) and Xeon Phi (Knights Landing) processors.
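
The two quantities the abstract relies on, arithmetic intensity and effective memory bandwidth, can be illustrated with a minimal sketch. The kernel below is an assumed triad-style example chosen for simplicity, not the access pattern studied in the paper, and the runtime is a placeholder rather than a measurement; the function names are hypothetical.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Floating-point operations performed per byte of memory traffic."""
    return flops / bytes_moved


def effective_bandwidth_gbs(bytes_moved: float, seconds: float) -> float:
    """Bytes the kernel must move divided by the runtime, in GB/s."""
    return bytes_moved / seconds / 1e9


# Assumed example kernel (not from the paper): a[i] = b[i] + s * c[i]
# on 8-byte doubles, i.e. 2 flops and 2 loads + 1 store per element.
n = 2**27                  # elements per array; 1 GiB each, far larger than cache
bytes_moved = 3 * 8 * n    # three arrays of doubles touched once each
flops = 2 * n              # one multiply and one add per element
runtime_s = 0.05           # placeholder timing, NOT a measured result

ai = arithmetic_intensity(flops, bytes_moved)
bw = effective_bandwidth_gbs(bytes_moved, runtime_s)
print(f"arithmetic intensity: {ai:.3f} flop/byte")
print(f"effective bandwidth:  {bw:.1f} GB/s")
```

With so few flops per byte, a kernel of this kind would be expected to run at a rate set by how fast memory can supply data, which is why effective bandwidth is the natural figure of merit for it.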

Acknowledgement

We would like to thank John Pennycook and Andrew Mallinson of Intel Corporation for their assistance with this work. The mega-stream code is made available from the UK Mini-App Consortium on GitHub at https://github.com/UK-MAC/mega-stream. The University of Bristol is an Intel Parallel Computing Center, and the authors would like to thank Intel Corporation for the provision of the Intel Xeon Phi (Knights Landing) Processor. The authors would like to thank Cray Inc. for providing access to the Cray XC40 supercomputer, “Swan”.

Author information

Authors and Affiliations

Department of Computer Science, University of Bristol, Bristol, UK
Tom Deakin & Simon McIntosh-Smith
UK Atomic Weapons Establishment, Aldermaston, UK
Wayne Gaudin

Corresponding author

Correspondence to Tom Deakin.

Editor information

Editors and Affiliations

Deutsches Klimarechenzentrum (DKRZ), Hamburg, Hamburg, Germany
Julian M. Kunkel
TITECH, Tokyo, Japan
Rio Yokota
Department of Computer Science, University of Delaware, Newark, Delaware, USA
Michela Taufer
Lawrence Berkeley National Laboratory, Berkeley, California, USA
John Shalf

Cite this paper

Deakin, T., Gaudin, W., McIntosh-Smith, S. (2017). On the Mitigation of Cache Hostile Memory Access Patterns on Many-Core CPU Architectures. In: Kunkel, J., Yokota, R., Taufer, M., Shalf, J. (eds) High Performance Computing. ISC High Performance 2017. Lecture Notes in Computer Science, vol 10524. Springer, Cham. https://doi.org/10.1007/978-3-319-67630-2_26

DOI: https://doi.org/10.1007/978-3-319-67630-2_26
Published: 20 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67629-6
Online ISBN: 978-3-319-67630-2
eBook Packages: Computer Science, Computer Science (R0)
