An Analysis of Core- and Chip-Level Architectural Features in Four Generations of Intel Server Processors

Hofmann, Johannes; Hager, Georg; Wellein, Gerhard; Fey, Dietmar

doi:10.1007/978-3-319-58667-0_16

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10266)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

This paper presents a survey of architectural features across four generations of Intel server processors (Sandy Bridge, Ivy Bridge, Haswell, and Broadwell), with a focus on the performance of floating-point workloads. Starting at the core level and going down the memory hierarchy, we cover instruction throughput for floating-point instructions, L1 cache, address generation capabilities, core clock speed and its limitations, L2 and L3 cache bandwidth and latency, the impact of Cluster on Die (CoD) and cache snoop modes, and the Uncore clock speed. Using microbenchmarks, we study the influence of these factors on code performance. We show that the energy efficiency of the LINPACK and HPCG benchmarks can be improved significantly by tuning the Uncore clock speed without sacrificing performance, and that Graph500 benchmark performance may benefit from a suitable choice of cache snoop mode settings.
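
The energy-efficiency claim can be made concrete with a short worked relation. This is a sketch under stated assumptions, not the paper's own derivation: the symbols $W$ (work done by the benchmark, e.g. floating-point operations), $T$ (runtime), $\bar{P}$ (average power draw), $E$ (energy to solution), and $\eta$ (energy efficiency) are introduced here for illustration, and primed quantities refer to a run with a lowered Uncore clock.

```latex
E = \bar{P}\,T, \qquad \eta = \frac{W}{E} = \frac{W/T}{\bar{P}}
% Assumption: the lower Uncore clock leaves the runtime unchanged (T' = T)
% and reduces the average power to \bar{P}' < \bar{P}.
\frac{\eta'}{\eta} = \frac{W/(\bar{P}'\,T)}{W/(\bar{P}\,T)} = \frac{\bar{P}}{\bar{P}'} > 1
```

In words: if the runtime is unaffected, any reduction in average power translates into a proportional gain in energy efficiency.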

Notes

1. http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/.
2. http://tiny.cc/LIKWID.
3. http://www.hpc.rrze.fau.de/systeme/meggie-cluster.shtml.
4. The latencies of some instructions (e.g., FP division) depend on their operands. When working with “trivial” denominators, such as whole numbers, latency can be significantly lower than when operating on non-trivial floating-point numbers.
5. CLs are mapped to L3 segments based on their addresses according to a hashing function. Thus, each CA knows which CA in other NUMA domains is responsible for a certain CL.
6. Investigations using the HITME_* performance counter events indicate this cache is exclusively used in DIR mode.
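
The address-based mapping described in note 5 can be illustrated with a small sketch. The hash below is a hypothetical XOR-fold chosen only for illustration; it is not the hashing function used by the processors, which the notes do not specify. The function name `l3_segment` and the parameter `num_segments` are likewise assumptions. The point it shows is that the mapping is a pure function of the address, so any agent can compute which segment is responsible for a given CL without communication.

```python
CACHE_LINE_BYTES = 64  # cache line size on the x86 processors discussed


def l3_segment(phys_addr: int, num_segments: int) -> int:
    """Map a physical address to an L3 segment index.

    Toy hash for illustration only (NOT the real hardware hash):
    XOR-fold the cache-line number in 16-bit chunks, then reduce
    modulo the (hypothetical) number of segments.
    """
    if phys_addr < 0 or num_segments <= 0:
        raise ValueError("address must be >= 0 and num_segments > 0")
    line = phys_addr // CACHE_LINE_BYTES  # all bytes of one CL share a segment
    folded = 0
    while line:
        folded ^= line & 0xFFFF  # fold in the next 16-bit chunk
        line >>= 16
    return folded % num_segments


if __name__ == "__main__":
    # Every caller evaluating the same function on the same address
    # obtains the same segment, which is what lets each CA locate the
    # responsible CA for a CL.
    for addr in (0x0, 0x40, 0x80, 0x12345040):
        print(hex(addr), "->", l3_segment(addr, num_segments=12))
```

Because the result depends only on the address and the segment count, two addresses within the same 64-byte cache line always resolve to the same segment.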

Author information

Authors and Affiliations

Computer Architecture, University of Erlangen-Nuremberg, 91058, Erlangen, Germany
Johannes Hofmann & Dietmar Fey
Erlangen Regional Computing Center (RRZE), 91058, Erlangen, Germany
Georg Hager & Gerhard Wellein

Corresponding author

Correspondence to Johannes Hofmann.

Editor information

Editors and Affiliations

Deutsches Klimarechenzentrum (DKRZ), Hamburg, Germany
Julian M. Kunkel
Tokyo Institute of Technology, Tokyo, Japan
Rio Yokota
Argonne National Laboratory, Argonne, IL, USA
Pavan Balaji
KAUST, Thuwal, Saudi Arabia
David Keyes

About this paper

Cite this paper

Hofmann, J., Hager, G., Wellein, G., Fey, D. (2017). An Analysis of Core- and Chip-Level Architectural Features in Four Generations of Intel Server Processors. In: Kunkel, J.M., Yokota, R., Balaji, P., Keyes, D. (eds) High Performance Computing. ISC High Performance 2017. Lecture Notes in Computer Science, vol 10266. Springer, Cham. https://doi.org/10.1007/978-3-319-58667-0_16

DOI: https://doi.org/10.1007/978-3-319-58667-0_16
Published: 12 May 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-58666-3
Online ISBN: 978-3-319-58667-0
eBook Packages: Computer Science, Computer Science (R0)