Efficiency analysis of modern vector architectures: vector ALU sizes, core counts and clock frequencies

Barredo, Adrian; Cebrian, Juan M.; Valero, Mateo; Casas, Marc; Moreto, Miquel

doi:10.1007/s11227-019-02841-6

Published: 04 April 2019

Volume 76, pages 1960–1979, (2020)

The Journal of Supercomputing

Adrian Barredo¹ (ORCID: orcid.org/0000-0001-9435-3234), Juan M. Cebrian¹, Mateo Valero¹, Marc Casas¹ & Miquel Moreto¹

Abstract

Moore’s Law predicted that the number of transistors on a chip would double approximately every 2 years. However, this trend is reaching an impasse. Making the best use of the available transistors within the thermal dissipation limits of the packaging remains an open problem. Multi-core processors exploit coarse-grain parallelism to improve energy efficiency. Vectorization lets developers exploit data-level parallelism by operating on several elements per instruction, thus reducing the pressure on the fetch and decode pipeline stages. In this paper, we analyze different resource optimization strategies for vector architectures. In particular, we expose the need to split the last-level cache (LLC), ALUs and vector ALUs into separate voltage and frequency domains if we aim to optimize the energy efficiency and performance of the system. We also show the need for a dynamic reconfiguration strategy that adapts the vector register length at runtime.
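
The abstract's point about vectorization easing pressure on the fetch and decode stages can be illustrated with a small counting sketch. It is not taken from the paper: the function name `dynamic_instructions` and the 1024-element workload are assumptions chosen for illustration, and the model ignores loop overhead, memory operations and masking.

```python
def dynamic_instructions(n_elements: int, lanes: int) -> int:
    """Instructions needed to apply one operation to n_elements when
    each instruction processes `lanes` elements (lanes=1 is scalar)."""
    if n_elements < 0 or lanes < 1:
        raise ValueError("n_elements must be >= 0 and lanes must be >= 1")
    # Ceiling division: a final, partially filled vector still costs one instruction.
    return (n_elements + lanes - 1) // lanes


if __name__ == "__main__":
    n = 1024  # hypothetical workload size
    for lanes in (1, 4, 8, 16):
        print(f"lanes={lanes:2d} -> {dynamic_instructions(n, lanes)} instructions")
```

In this simplified model, each instruction is fetched and decoded once regardless of how many elements it processes, so wider vectors cut front-end work roughly in proportion to the number of lanes.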

Acknowledgements

Funding was provided by the RoMoL ERC Advanced Grant (Grant No. GA 321253), Juan de la Cierva (Grant No. JCI-2012-15047) and Marie Curie (Grant No. 2013 BP_B 00243).

Author information

Authors and Affiliations

Centro Nacional de Supercomputacion, Barcelona, Spain
Adrian Barredo, Juan M. Cebrian, Mateo Valero, Marc Casas & Miquel Moreto

Corresponding author

Correspondence to Adrian Barredo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Barredo, A., Cebrian, J.M., Valero, M. et al. Efficiency analysis of modern vector architectures: vector ALU sizes, core counts and clock frequencies. J Supercomput 76, 1960–1979 (2020). https://doi.org/10.1007/s11227-019-02841-6

Published: 04 April 2019
Issue Date: March 2020
DOI: https://doi.org/10.1007/s11227-019-02841-6
