An Analysis of Core- and Chip-Level Architectural Features in Four Generations of Intel Server Processors

  • Conference paper
High Performance Computing (ISC High Performance 2017)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10266)

Abstract

This paper presents a survey of architectural features across four generations of Intel server processors (Sandy Bridge, Ivy Bridge, Haswell, and Broadwell) with a focus on the performance of floating-point workloads. Starting at the core level and going down the memory hierarchy, we cover instruction throughput for floating-point instructions, L1 cache, address generation capabilities, core clock speed and its limitations, L2 and L3 cache bandwidth and latency, the impact of Cluster on Die (CoD) and cache snoop modes, and the Uncore clock speed. Using microbenchmarks, we study the influence of these factors on code performance. We show that the energy efficiency of the LINPACK and HPCG benchmarks can be improved significantly by tuning the Uncore clock speed without sacrificing performance, and that Graph500 benchmark performance may benefit from a suitable choice of cache snoop mode.

Notes

  1. http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/.

  2. http://tiny.cc/LIKWID.

  3. http://www.hpc.rrze.fau.de/systeme/meggie-cluster.shtml.

  4. The latencies of some instructions (e.g., FP division) depend on their operands. When working with “trivial” denominators, such as whole numbers, latency can be significantly lower than when operating on non-trivial floating-point numbers.

  5. CLs are mapped to L3 segments based on their addresses according to a hashing function. Thus, each CA knows which CA in other NUMA domains is responsible for a certain CL.

  6. Investigations using the HITME_* performance counter events indicate this cache is exclusively used in DIR mode.
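
The address-based mapping described in note 5 can be illustrated with a minimal sketch. The hash below (an XOR fold of the cache line number) is purely hypothetical and chosen for readability; the actual hashing function used by the processors is not given in the text. The function name `l3_slice_for_address` and the parameter `num_slices` are likewise assumptions made for this illustration. The only point shown is that a deterministic function of the address lets every agent arrive at the same responsible L3 segment without communication.

```python
CACHE_LINE_BYTES = 64  # cache line (CL) size in bytes


def l3_slice_for_address(phys_addr: int, num_slices: int) -> int:
    """Map a physical address to an L3 segment index.

    Hypothetical hash for illustration only: XOR-fold the cache line
    number into 16 bits, then reduce modulo the number of segments.
    This is NOT the real hardware hash function.
    """
    line = phys_addr // CACHE_LINE_BYTES  # all bytes of one CL share a line number
    folded = 0
    while line:
        folded ^= line & 0xFFFF
        line >>= 16
    return folded % num_slices


# Two addresses inside the same cache line map to the same segment,
# and any agent evaluating the function gets the same answer.
first_byte = 0x12345640
last_byte = 0x1234567F
same_segment = l3_slice_for_address(first_byte, 8) == l3_slice_for_address(last_byte, 8)
```

Because the mapping depends only on the address, no lookup table or message exchange is needed to find the responsible segment, which is the property note 5 relies on.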

Author information

Corresponding author

Correspondence to Johannes Hofmann.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Hofmann, J., Hager, G., Wellein, G., Fey, D. (2017). An Analysis of Core- and Chip-Level Architectural Features in Four Generations of Intel Server Processors. In: Kunkel, J.M., Yokota, R., Balaji, P., Keyes, D. (eds) High Performance Computing. ISC High Performance 2017. Lecture Notes in Computer Science, vol 10266. Springer, Cham. https://doi.org/10.1007/978-3-319-58667-0_16

  • DOI: https://doi.org/10.1007/978-3-319-58667-0_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-58666-3

  • Online ISBN: 978-3-319-58667-0

  • eBook Packages: Computer Science, Computer Science (R0)
