Performance and Power Evaluation of Clustered VLIW Processors with Wide Functional Units

  • Miquel Pericàs
  • Eduard Ayguadé
  • Javier Zalamea
  • Josep Llosa
  • Mateo Valero
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3133)


Architectural resources and program recurrences are the main limitations to the amount of Instruction-Level Parallelism (ILP) exploitable from loops. To increase the number of operations per second, current designs use high degrees of resource replication for memory ports and functional units. But the high costs of this technique in terms of power and cycle time limit the degree of replication.

Clustering is a technique aimed at decentralizing the design of future wide-issue cores and enabling them to meet the technology constraints in terms of cycle time, area and power. Another way to reduce the complexity of recent cores is to use wide functional units. This technique only requires minor modifications to the underlying hardware, but also imposes a penalty on the exploitable parallelism.

In this paper we evaluate a broad range of VLIW configurations that make use of these two techniques. From this study we conclude that applying both techniques yields configurations with very good power-performance efficiency.


Keywords: Execution Time, Functional Unit, Clock Cycle, Power Evaluation, Processor Frequency
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Miquel Pericàs
    • 1
  • Eduard Ayguadé
    • 1
  • Javier Zalamea
    • 1
  • Josep Llosa
    • 1
  • Mateo Valero
    • 1
  1. Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain
