Efficiency analysis of modern vector architectures: vector ALU sizes, core counts and clock frequencies

Abstract

Moore’s Law predicted that the number of transistors on a chip would double approximately every 2 years. However, this trend is reaching an impasse. Making the best use of the available transistors within the thermal dissipation capabilities of the packaging remains an open problem. Multi-core processors exploit coarse-grain parallelism to improve energy efficiency. Vectorization allows developers to exploit data-level parallelism by operating on several elements per instruction, thus reducing the pressure on the fetch and decode pipeline stages. In this paper, we analyze different resource optimization strategies for vector architectures. In particular, we expose the need for separate voltage and frequency domains for the last-level cache (LLC), the ALUs and the vector ALUs in order to optimize the energy efficiency and performance of the system. We also show the need for a dynamic reconfiguration strategy that adapts the vector register length at runtime.

Acknowledgements

Funding was provided by the RoMoL ERC Advanced Grant (Grant No. GA 321253), Juan de la Cierva (Grant No. JCI-2012-15047), and Marie Curie (Grant No. 2013 BP_B 00243).

Author information

Corresponding author

Correspondence to Adrian Barredo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Barredo, A., Cebrian, J.M., Valero, M. et al. Efficiency analysis of modern vector architectures: vector ALU sizes, core counts and clock frequencies. J Supercomput 76, 1960–1979 (2020). https://doi.org/10.1007/s11227-019-02841-6

Keywords

  • Vector
  • Efficiency
  • DVFS
  • Power wall