Tiled Multicore Processors

Taylor, Michael B.; Lee, Walter; Miller, Jason E.; Wentzlaff, David; Bratt, Ian; Greenwald, Ben; Hoffmann, Henry; Johnson, Paul R.; Kim, Jason S.; Psota, James; Saraf, Arvind; Shnidman, Nathan; Strumpen, Volker; Frank, Matthew I.; Amarasinghe, Saman; Agarwal, Anant

doi:10.1007/978-1-4419-0263-4_1

Part of the book series: Integrated Circuits and Systems (ICIR)

Abstract

For the last few decades, Moore’s Law has continually provided exponential growth in the number of transistors on a single chip. This chapter describes a class of architectures, called tiled multicore architectures, that are designed to exploit massive quantities of on-chip resources in an efficient, scalable manner. Tiled multicore architectures combine each processor core with a switch to create a modular element called a tile. Tiles are replicated on a chip as needed to create multicores with any number of tiles. The Raw processor, a pioneering example of a tiled multicore processor, is examined in detail to explain the philosophy, design, and strengths of such architectures. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance. Central to achieving this goal is Raw’s ability to exploit all forms of parallelism, including ILP, DLP, TLP, and stream parallelism. Raw approaches this challenge by implementing plenty of on-chip resources – including logic, wires, and pins – in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Compared to a traditional superscalar processor, Raw performs within a factor of 2x for sequential applications with a very low degree of ILP, about 2x–9x better for higher levels of ILP, and 10x–100x better when highly parallel applications are coded in a stream language or optimized by hand.
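
The tile abstraction described in the abstract (a core paired with a switch, replicated as needed) can be illustrated with a small sketch. Everything below is an illustrative assumption rather than the chapter's design: the names `Tile`, `build_mesh`, `route`, and `hops` are hypothetical, the tiles are assumed to sit in a two-dimensional mesh, and messages are assumed to travel along columns first and then rows.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Tile:
    """One modular element: a processor core paired with a switch.

    Only the grid position is modeled here (an assumption for illustration).
    """
    row: int
    col: int


def build_mesh(rows: int, cols: int) -> List[Tile]:
    """Replicate the tile to form a rows x cols multicore."""
    return [Tile(r, c) for r in range(rows) for c in range(cols)]


def route(src: Tile, dst: Tile) -> List[Tuple[int, int]]:
    """Return the switch coordinates visited from src to dst.

    Assumes simple dimension-ordered routing: move along the row
    (changing the column) first, then along the column.
    """
    r, c = src.row, src.col
    path = [(r, c)]
    while c != dst.col:
        c += 1 if dst.col > c else -1
        path.append((r, c))
    while r != dst.row:
        r += 1 if dst.row > r else -1
        path.append((r, c))
    return path


def hops(src: Tile, dst: Tile) -> int:
    """Number of switch-to-switch links crossed between two tiles."""
    return len(route(src, dst)) - 1


if __name__ == "__main__":
    mesh = build_mesh(4, 4)
    print(len(mesh), "tiles")
    print("corner-to-corner hops:", hops(mesh[0], mesh[-1]))
```

Under these assumptions, adding tiles only grows the grid; each tile still talks to its nearest neighbors, which is the scalability property the abstract emphasizes.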

Based on “Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams”, by M.B. Taylor, W. Lee, J.E. Miller, et al., which appeared in the 31st Annual International Symposium on Computer Architecture (ISCA). © 2004 IEEE. [46]

References

A. Agarwal and M. Levy. Going multicore presents challenges and opportunities. Embedded Systems Design, 20(4), April 2007.
V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger. Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures. In ISCA ’00: Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 248–259, 2000.
E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Hammerling, J. Demmel, C. Bischof, and D. Sorensen. LAPACK: A Portable Linear Algebra Library for High-Performance Computers. In Supercomputing ’90: Proceedings of the 1990 ACM/IEEE Conference on Supercomputing, pages 2–11, 1990.
M. Annaratone, E. Arnould, T. Gross, H. T. Kung, M. Lam, O. Menzilicioglu, and J. A. Webb. The Warp Computer: Architecture, Implementation and Performance. IEEE Transactions on Computers, 36(12):1523–1538, December 1987.
J. Babb, M. Frank, V. Lee, E. Waingold, R. Barua, M. Taylor, J. Kim, S. Devabhaktuni, and A. Agarwal. The RAW Benchmark Suite: Computation Structures for General Purpose Computing. In Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines (FCCM), pages 134–143, 1997.
M. Baron. Low-key Intel 80-core Intro: The tip of the iceberg. Microprocessor Report, April 2007.
M. Baron. Tilera’s cores communicate better. Microprocessor Report, November 2007.
R. Barua, W. Lee, S. Amarasinghe, and A. Agarwal. Maps: A Compiler-Managed Memory System for Raw Machines. In ISCA ’99: Proceedings of the 26th Annual International Symposium on Computer Architecture, pages 4–15, 1999.
M. Bohr. Interconnect Scaling – The Real Limiter to High Performance ULSI. In 1995 IEDM, pages 241–244, 1995.
P. Bose, D. H. Albonesi, and D. Marculescu. Power and complexity aware design. IEEE Micro: Guest Editor’s Introduction for Special Issue on Power and Complexity Aware Design, 23(5):8–11, Sept/Oct 2003.
S. Goldstein, H. Schmit, M. Moe, M. Budiu, S. Cadambi, R. R. Taylor, and R. Laufer. PipeRench: A Coprocessor for Streaming Multimedia Acceleration. In ISCA ’99: Proceedings of the 26th Annual International Symposium on Computer Architecture, pages 28–39, 1999.
M. Gordon, W. Thies, and S. Amarasinghe. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In ASPLOS-XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 75–86, October 2006.
M. I. Gordon, W. Thies, M. Karczmarek, J. Lin, A. S. Meli, A. A. Lamb, C. Leger, J. Wong, H. Hoffmann, D. Maze, and S. Amarasinghe. A Stream Compiler for Communication-Exposed Architectures. In ASPLOS-X: Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 291–303, 2002.
T. Gross and D. R. O’Hallaron. iWarp, Anatomy of a Parallel Computing System. The MIT Press, Cambridge, MA, 1998.
J. R. Hauser and J. Wawrzynek. Garp: A MIPS Processor with Reconfigurable Coprocessor. In Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines (FCCM), pages 12–21, 1997.
R. Ho, K. W. Mai, and M. A. Horowitz. The Future of Wires. Proceedings of the IEEE, 89(4):490–504, April 2001.
H. Hoffmann, V. Strumpen, and A. Agarwal. Stream Algorithms and Architecture. Technical Memo MIT-LCS-TM-636, MIT Laboratory for Computer Science, 2003.
H. P. Hofstee. Power efficient processor architecture and the Cell processor. In HPCA ’05: Proceedings of the 11th International Symposium on High Performance Computer Architecture, pages 258–262, 2005.
U. Kapasi, W. J. Dally, S. Rixner, J. D. Owens, and B. Khailany. The Imagine Stream Processor. In ICCD ’02: Proceedings of the 2002 IEEE International Conference on Computer Design, pages 282–288, 2002.
A. KleinOsowski and D. Lilja. MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research. Computer Architecture Letters, 1, June 2002.
P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-Way Multithreaded Sparc Processor. IEEE Micro, 25(2):21–29, 2005.
C. Kozyrakis and D. Patterson. A New Direction for Computer Architecture Research. IEEE Computer, 30(9):24–32, September 1997.
R. Krashinsky, C. Batten, M. Hampton, S. Gerding, B. Pharris, J. Casper, and K. Asanovic. The Vector-Thread Architecture. In ISCA ’04: Proceedings of the 31st Annual International Symposium on Computer Architecture, June 2004.
J. Kubiatowicz. Integrated Shared-Memory and Message-Passing Communication in the Alewife Multiprocessor. PhD thesis, Massachusetts Institute of Technology, 1998.
W. Lee, R. Barua, M. Frank, D. Srikrishna, J. Babb, V. Sarkar, and S. Amarasinghe. Space-Time Scheduling of Instruction-Level Parallelism on a Raw Machine. In ASPLOS-VIII: Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 46–54, 1998.
W. Lee, D. Puppin, S. Swenson, and S. Amarasinghe. Convergent Scheduling. In MICRO-35: Proceedings of the 35th Annual International Symposium on Microarchitecture, pages 111–122, 2002.
K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and M. Horowitz. Smart Memories: A Modular Reconfigurable Architecture. In ISCA ’00: Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 161–171, 2000.
D. Matzke. Will Physical Scalability Sabotage Performance Gains? IEEE Computer, 30(9):37–39, September 1997.
J. McCalpin. STREAM: Sustainable Memory Bandwidth in High Performance Computers. http://www.cs.virginia.edu/stream.
J. E. Miller. Software Instruction Caching. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, June 2007. http://hdl.handle.net/1721.1/40317.
C. A. Moritz, D. Yeung, and A. Agarwal. SimpleFit: A Framework for Analyzing Design Tradeoffs in Raw Architectures. IEEE Transactions on Parallel and Distributed Systems, pages 730–742, July 2001.
S. Naffziger and G. Hammond. The Implementation of the Next-Generation 64b Itanium Microprocessor. In Proceedings of the IEEE International Solid-State Circuits Conference, pages 344–345, 472, 2002.
R. Nagarajan, K. Sankaralingam, D. Burger, and S. W. Keckler. A Design Space Evaluation of Grid Processor Architectures. In MICRO-34: Proceedings of the 34th Annual International Symposium on Microarchitecture, pages 40–51, 2001.
M. Narayanan and K. A. Yelick. Generating Permutation Instructions from a High-Level Description. TR UCB-CS-03-1287, UC Berkeley, 2003.
S. Palacharla. Complexity-Effective Superscalar Processors. PhD thesis, University of Wisconsin–Madison, 1998.
J. Sanchez and A. Gonzalez. Modulo Scheduling for a Fully-Distributed Clustered VLIW Architecture. In MICRO-33: Proceedings of the 33rd Annual International Symposium on Microarchitecture, pages 124–133, December 2000.
K. Sankaralingam, R. Nagarajan, R. McDonald, R. Desikan, S. Drolia, M. S. Govindan, P. Gratz, D. Gulati, H. Hanson, C. Kim, H. Liu, N. Ranganathan, S. Sethumadhavan, S. Sharif, P. Shivakumar, S. W. Keckler, and D. Burger. Distributed microarchitectural protocols in the TRIPS prototype processor. In MICRO-39: Proceedings of the 39th Annual International Symposium on Microarchitecture, pages 480–491, Dec 2006.
D. Shoemaker, F. Honore, C. Metcalf, and S. Ward. NuMesh: An Architecture Optimized for Scheduled Communication. Journal of Supercomputing, 10(3):285–302, 1996.
G. Sohi, S. Breach, and T. Vijaykumar. Multiscalar Processors. In ISCA ’95: Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 414–425, 1995.
J. Suh, E.-G. Kim, S. P. Crago, L. Srinivasan, and M. C. French. A Performance Analysis of PIM, Stream Processing, and Tiled Processing on Memory-Intensive Signal Processing Kernels. In ISCA ’03: Proceedings of the 30th Annual International Symposium on Computer Architecture, pages 410–419, June 2003.
M. B. Taylor. Deionizer: A Tool for Capturing and Embedding I/O Calls. Technical Report MIT-CSAIL-TR-2004-037, MIT CSAIL/Laboratory for Computer Science, 2004. http://cag.csail.mit.edu/~mtaylor/deionizer.html.
M. B. Taylor. Tiled Processors. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, Feb 2007.
M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, J.-W. Lee, P. Johnson, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal. The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs. IEEE Micro, pages 25–35, Mar 2002.
M. B. Taylor, W. Lee, S. Amarasinghe, and A. Agarwal. Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures. In HPCA ’03: Proceedings of the 9th International Symposium on High Performance Computer Architecture, pages 341–353, 2003.
M. B. Taylor, W. Lee, S. Amarasinghe, and A. Agarwal. Scalar Operand Networks. IEEE Transactions on Parallel and Distributed Systems (Special Issue on On-chip Networks), Feb 2005.
M. B. Taylor, W. Lee, J. E. Miller, D. Wentzlaff, I. Bratt, B. Greenwald, H. Hoffmann, P. Johnson, J. Kim, J. Psota, A. Saraf, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal. Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams. In ISCA ’04: Proceedings of the 31st Annual International Symposium on Computer Architecture, pages 2–13, June 2004.
W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A Language for Streaming Applications. In 2002 Compiler Construction, pages 179–196, 2002.
E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring it All to Software: Raw Machines. IEEE Computer, 30(9):86–93, Sep 1997.
D. Wentzlaff. Architectural Implications of Bit-level Computation in Communication Applications. Master’s thesis, Massachusetts Institute of Technology, 2002.
D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C.-C. Miao, J. F. Brown, and A. Agarwal. On-Chip Interconnection Architecture of the Tile Processor. IEEE Micro, 27(5):15–31, Sept–Oct 2007.
R. Whaley, A. Petitet, and J. J. Dongarra. Automated Empirical Optimizations of Software and the ATLAS Project. Parallel Computing, 27(1–2):3–35, 2001.

Acknowledgments

We thank our StreamIt collaborators, specifically M. Gordon, J. Lin, and B. Thies for the StreamIt backend and the corresponding section of this chapter. We are grateful to our collaborators from ISI East including C. Chen, S. Crago, M. French, L. Wang and J. Suh for developing the Raw motherboard, firmware components, and several applications. T. Konstantakopoulos, L. Jakab, F. Ghodrat, M. Seneski, A. Saraswat, R. Barua, A. Ma, J. Babb, M. Stephenson, S. Larsen, V. Sarkar, and several others too numerous to list also contributed to the success of Raw. The Raw chip was fabricated in cooperation with IBM. Raw is funded by DARPA, NSF, ITRI, and the Oxygen Alliance.

Author information

Authors and Affiliations

University of California, San Diego, CA, USA
Michael B. Taylor
Tilera Corporation, Westborough, MA, USA
Walter Lee & Ian Bratt
MIT CSAIL, Cambridge, MA, USA
Jason E. Miller, David Wentzlaff, Henry Hoffmann, Paul R. Johnson, Jason S. Kim, James Psota, Saman Amarasinghe & Anant Agarwal
Veracode, Burlington, MA, USA
Ben Greenwald
Swasth Foundation, Bangalore, India
Arvind Saraf
The MITRE Corporation, Bedford, MA, USA
Nathan Shnidman
IBM Austin Research Laboratory, Austin, TX, USA
Volker Strumpen
University of Illinois at Urbana - Champaign, Urbana, IL, USA
Matthew I. Frank

Editor information

Editors and Affiliations

College of Natural Sciences, University of Texas, Austin, University Station 1, Austin, 78712-0233, U.S.A.
Stephen W. Keckler
Dept. Electrical Engineering, Stanford University, Stanford, 94305-9510, U.S.A.
Kunle Olukotun
IBM Software Group, Burnet Rd. 11501, Austin, 78758, U.S.A.
H. Peter Hofstee

About this chapter

Cite this chapter

Taylor, M.B. et al. (2009). Tiled Multicore Processors. In: Keckler, S., Olukotun, K., Hofstee, H. (eds) Multicore Processors and Systems. Integrated Circuits and Systems. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-0263-4_1

DOI: https://doi.org/10.1007/978-1-4419-0263-4_1
Published: 03 August 2009
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-0262-7
Online ISBN: 978-1-4419-0263-4
eBook Packages: Computer Science (R0)