Tiled Multicore Processors

  • Michael B. Taylor
  • Walter Lee
  • Jason E. Miller
  • David Wentzlaff
  • Ian Bratt
  • Ben Greenwald
  • Henry Hoffmann
  • Paul R. Johnson
  • Jason S. Kim
  • James Psota
  • Arvind Saraf
  • Nathan Shnidman
  • Volker Strumpen
  • Matthew I. Frank
  • Saman Amarasinghe
  • Anant Agarwal
Chapter
Part of the Integrated Circuits and Systems book series (ICIR)

Abstract

For the last few decades Moore’s Law has continually provided exponential growth in the number of transistors on a single chip. This chapter describes a class of architectures, called tiled multicore architectures, that are designed to exploit massive quantities of on-chip resources in an efficient, scalable manner. Tiled multicore architectures combine each processor core with a switch to create a modular element called a tile. Tiles are replicated on a chip as needed to create multicores with any number of tiles. The Raw processor, a pioneering example of a tiled multicore processor, is examined in detail to explain the philosophy, design, and strengths of such architectures. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance. Central to achieving this goal is Raw’s ability to exploit all forms of parallelism, including ILP, DLP, TLP, and Stream parallelism. Raw approaches this challenge by implementing plenty of on-chip resources – including logic, wires, and pins – in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Compared to a traditional superscalar processor, Raw performs within a factor of 2x for sequential applications with a very low degree of ILP, about 2x–9x better for higher levels of ILP, and 10x–100x better when highly parallel applications are coded in a stream language or optimized by hand.
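The tile idea described above can be illustrated with a minimal sketch. It is a hypothetical model, not Raw's actual design or ISA: the `Tile` class, the function names, and the assumption that switches are wired to their nearest neighbors in a 2D mesh are all introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    """One modular element: a processor core paired with a switch (hypothetical model)."""
    row: int
    col: int


def make_chip(rows: int, cols: int) -> list[Tile]:
    """Replicate the tile to build a multicore with rows * cols tiles."""
    return [Tile(r, c) for r in range(rows) for c in range(cols)]


def neighbors(tile: Tile, rows: int, cols: int) -> list[Tile]:
    """Tiles directly wired to this tile's switch, ASSUMING a 2D mesh layout."""
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    return [
        Tile(tile.row + dr, tile.col + dc)
        for dr, dc in steps
        if 0 <= tile.row + dr < rows and 0 <= tile.col + dc < cols
    ]


def hops(src: Tile, dst: Tile) -> int:
    """Switch-to-switch hops between two tiles under the mesh assumption."""
    return abs(src.row - dst.row) + abs(src.col - dst.col)
```

Under this assumption, growing the chip only means calling `make_chip` with larger dimensions; each tile still connects to at most four neighbors, which is one way to picture why replicating tiles can scale without lengthening any individual wire.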

Keywords

Data Cache, Multicore Processor, Annual International Symposium, Stream Algorithm, Wire Delay
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgments

We thank our StreamIt collaborators, specifically M. Gordon, J. Lin, and B. Thies for the StreamIt backend and the corresponding section of this chapter. We are grateful to our collaborators from ISI East including C. Chen, S. Crago, M. French, L. Wang and J. Suh for developing the Raw motherboard, firmware components, and several applications. T. Konstantakopoulos, L. Jakab, F. Ghodrat, M. Seneski, A. Saraswat, R. Barua, A. Ma, J. Babb, M. Stephenson, S. Larsen, V. Sarkar, and several others too numerous to list also contributed to the success of Raw. The Raw chip was fabricated in cooperation with IBM. Raw is funded by DARPA, NSF, ITRI, and the Oxygen Alliance.

Copyright information

© Springer-Verlag US 2009

Authors and Affiliations

  • Michael B. Taylor (1)
  • Walter Lee (2)
  • Jason E. Miller (3)
  • David Wentzlaff (3)
  • Ian Bratt (2)
  • Ben Greenwald (4)
  • Henry Hoffmann (3)
  • Paul R. Johnson (3)
  • Jason S. Kim (3)
  • James Psota (3)
  • Arvind Saraf (5)
  • Nathan Shnidman (6)
  • Volker Strumpen (7)
  • Matthew I. Frank (8)
  • Saman Amarasinghe (3)
  • Anant Agarwal (3)
  1. University of California, San Diego, USA
  2. Tilera Corporation, Westborough, USA
  3. MIT CSAIL, Cambridge, USA
  4. Veracode, Burlington, USA
  5. Swasth Foundation, Bangalore, India
  6. The MITRE Corporation, Bedford, USA
  7. IBM Austin Research Laboratory, Austin, USA
  8. University of Illinois at Urbana-Champaign, Urbana, USA
