Efficiency analysis of modern vector architectures: vector ALU sizes, core counts and clock frequencies

  • Adrian Barredo
  • Juan M. Cebrian
  • Mateo Valero
  • Marc Casas
  • Miquel Moreto

Abstract

Moore’s Law predicted that the number of transistors on a chip would double approximately every 2 years. However, this trend is reaching an impasse. Making the best use of the available transistors within the thermal dissipation limits of the packaging remains an open problem. Multi-core processors exploit coarse-grain parallelism to improve energy efficiency. Vectorization allows developers to exploit data-level parallelism by operating on several elements per instruction, thus reducing the pressure on the fetch and decode pipeline stages. In this paper, we analyze different resource optimization strategies for vector architectures. In particular, we show that the last-level cache (LLC), ALUs and vector ALUs need separate voltage and frequency domains if we aim to optimize the energy efficiency and performance of the system. We also show the need for a dynamic reconfiguration strategy that adapts the vector register length at runtime.
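
As a rough illustration of the fetch and decode argument above, the following sketch counts how many instructions are issued to apply one operation to an array when each instruction handles a given number of elements. It is a toy model under stated assumptions, not the methodology of the paper: the function name `dynamic_instructions`, the array size and the lane counts are hypothetical, and loop overhead, memory instructions and masking are ignored.

```python
def dynamic_instructions(n_elements: int, lanes: int) -> int:
    """Instructions needed to apply one operation to n_elements
    when each instruction processes `lanes` elements (lanes=1 is scalar).

    Toy model (assumption): one instruction per group of `lanes` elements,
    with a final partial group still costing one instruction.
    """
    if n_elements < 0 or lanes < 1:
        raise ValueError("n_elements must be >= 0 and lanes must be >= 1")
    # Integer ceiling division: round up so a partial group is counted.
    return (n_elements + lanes - 1) // lanes


if __name__ == "__main__":
    n = 1024  # hypothetical array size
    for lanes in (1, 4, 8, 16):  # hypothetical elements per instruction
        count = dynamic_instructions(n, lanes)
        print(f"lanes={lanes:2d} -> {count} instructions")
```

In this simplified model, widening the vector unit divides the number of instructions that pass through fetch and decode by the lane count, which is the effect the abstract refers to; it says nothing about energy, which also depends on voltage, frequency and the width of the functional units.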

Keywords

Vector, Efficiency, DVFS, Power wall

Notes

Acknowledgements

Funding was provided by the RoMoL ERC Advanced Grant (Grant No. GA 321253), Juan de la Cierva (Grant No. JCI-2012-15047), and Marie Curie (Grant No. 2013 BP_B 00243).

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Centro Nacional de Supercomputacion, Barcelona, Spain
