On the Mitigation of Cache Hostile Memory Access Patterns on Many-Core CPU Architectures

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10524)

Abstract

Kernels with low arithmetic intensity whose memory footprint exceeds cache capacity are typically categorised as memory bandwidth bound: their performance is limited by the memory bandwidth the hardware can deliver. In this work we contribute a simple memory access pattern, derived from a widely used upwinded stencil-style benchmark, which presents significant challenges for cache-based architectures. The problem appears to grow worse as CPU core counts increase, and the pattern in its initial form shows no benefit from the high-bandwidth memory now appearing on the Intel Xeon Phi (Knights Landing) family of processors. We describe the memory access scenarios which appear to cause the lower-than-expected cache performance, before presenting optimisations to mitigate the problem. These optimisations improve effective memory bandwidth and runtime by up to 4X on cache-based architectures. Results are presented for the Intel Xeon (Broadwell) and Xeon Phi (Knights Landing) processors.
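
As a rough illustration of the "memory bandwidth bound" classification used in the abstract, the following minimal sketch computes the arithmetic intensity of a kernel and compares it with a machine's balance point, in the spirit of a roofline estimate. The kernel is the well-known STREAM triad, not the paper's mega-stream kernel, and the peak compute and bandwidth figures are hypothetical placeholders rather than measurements from the paper. The function names are likewise assumptions made for this example.

```python
# Minimal sketch: classify a kernel as memory bandwidth bound by comparing
# its arithmetic intensity with the machine balance (roofline-style estimate).
# All hardware numbers below are HYPOTHETICAL placeholders, not paper results.

BYTES_PER_DOUBLE = 8


def arithmetic_intensity(flops_per_iter, bytes_per_iter):
    """Floating-point operations performed per byte moved to or from memory."""
    return flops_per_iter / bytes_per_iter


def attainable_gflops(intensity, peak_gflops, peak_bandwidth_gbs):
    """Upper bound on performance: the lower of the compute and memory limits."""
    return min(peak_gflops, intensity * peak_bandwidth_gbs)


def is_bandwidth_bound(intensity, peak_gflops, peak_bandwidth_gbs):
    """True when the memory limit, not the compute limit, caps performance."""
    machine_balance = peak_gflops / peak_bandwidth_gbs  # flop per byte
    return intensity < machine_balance


# STREAM triad, a[i] = b[i] + scalar * c[i]:
# 2 flops per iteration (one multiply, one add); 3 doubles of memory traffic
# (read b, read c, write a), ignoring any write-allocate traffic.
triad_intensity = arithmetic_intensity(2, 3 * BYTES_PER_DOUBLE)

# Hypothetical machine: 1000 GFLOP/s peak compute, 100 GB/s memory bandwidth.
HYPOTHETICAL_PEAK_GFLOPS = 1000.0
HYPOTHETICAL_PEAK_BANDWIDTH_GBS = 100.0

triad_bound = is_bandwidth_bound(
    triad_intensity, HYPOTHETICAL_PEAK_GFLOPS, HYPOTHETICAL_PEAK_BANDWIDTH_GBS
)
triad_limit = attainable_gflops(
    triad_intensity, HYPOTHETICAL_PEAK_GFLOPS, HYPOTHETICAL_PEAK_BANDWIDTH_GBS
)

print(f"intensity = {triad_intensity:.4f} flop/byte")
print(f"bandwidth bound = {triad_bound}")
print(f"attainable = {triad_limit:.2f} GFLOP/s")
```

Under these assumed figures the triad sits far below the machine balance, so memory bandwidth alone sets its performance ceiling. A cache-hostile access pattern can fall short of even that ceiling, which is the gap the abstract describes.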

Acknowledgement

We would like to thank John Pennycook and Andrew Mallinson of Intel Corporation for their assistance with this work. The mega-stream code is made available from the UK Mini-App Consortium on GitHub at https://github.com/UK-MAC/mega-stream. The University of Bristol is an Intel Parallel Computing Center, and the authors would like to thank Intel Corporation for the provision of the Intel Xeon Phi (Knights Landing) Processor. The authors would like to thank Cray Inc. for providing access to the Cray XC40 supercomputer, “Swan”.

Corresponding author

Correspondence to Tom Deakin.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Deakin, T., Gaudin, W., McIntosh-Smith, S. (2017). On the Mitigation of Cache Hostile Memory Access Patterns on Many-Core CPU Architectures. In: Kunkel, J., Yokota, R., Taufer, M., Shalf, J. (eds) High Performance Computing. ISC High Performance 2017. Lecture Notes in Computer Science, vol 10524. Springer, Cham. https://doi.org/10.1007/978-3-319-67630-2_26

  • DOI: https://doi.org/10.1007/978-3-319-67630-2_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67629-6

  • Online ISBN: 978-3-319-67630-2

  • eBook Packages: Computer Science, Computer Science (R0)
