Tiled Multicore Processors

Chapter in Multicore Processors and Systems

Abstract

For the last few decades, Moore’s Law has continually provided exponential growth in the number of transistors on a single chip. This chapter describes a class of architectures, called tiled multicore architectures, that are designed to exploit massive quantities of on-chip resources in an efficient, scalable manner. Tiled multicore architectures combine each processor core with a switch to create a modular element called a tile. Tiles are replicated on a chip as needed to create multicores with any number of tiles. The Raw processor, a pioneering example of a tiled multicore processor, is examined in detail to explain the philosophy, design, and strengths of such architectures. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing sequential programs that rely on instruction-level parallelism (ILP) with reasonable performance. Central to achieving this goal is Raw’s ability to exploit all forms of parallelism, including ILP, data-level parallelism (DLP), thread-level parallelism (TLP), and stream parallelism. Raw approaches this challenge by implementing plenty of on-chip resources (logic, wires, and pins) in a tiled arrangement and exposing them through a new ISA, so that software can take advantage of these resources for parallel applications. Compared to a traditional superscalar processor, Raw performs within a factor of 2 for sequential applications with a very low degree of ILP, about 2x–9x better for higher levels of ILP, and 10x–100x better when highly parallel applications are coded in a stream language or optimized by hand.

Based on “Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams”, by M.B. Taylor, W. Lee, J.E. Miller, et al., which appeared in the 31st Annual International Symposium on Computer Architecture (ISCA). © 2004 IEEE. [46]
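
As a rough illustration of the tile abstraction described in the abstract, the sketch below models a tile as a core paired with a switch and replicates it to build a chip of any size. The 2D mesh layout, the nearest-neighbor wiring, and the names `Tile`, `build_grid`, `neighbors`, and `hops` are assumptions made for this example; they are not taken from the chapter and do not model Raw's actual networks or ISA.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Tile:
    """One modular element: a processor core paired with a switch."""
    row: int
    col: int


def build_grid(rows: int, cols: int) -> list[Tile]:
    """Replicate the tile to fill a rows x cols array (assumed 2D mesh)."""
    return [Tile(r, c) for r, c in product(range(rows), range(cols))]


def neighbors(tile: Tile, rows: int, cols: int) -> list[Tile]:
    """Tiles whose switches are directly wired to this tile in the assumed mesh."""
    candidates = [
        (tile.row - 1, tile.col),
        (tile.row + 1, tile.col),
        (tile.row, tile.col - 1),
        (tile.row, tile.col + 1),
    ]
    return [Tile(r, c) for r, c in candidates if 0 <= r < rows and 0 <= c < cols]


def hops(src: Tile, dst: Tile) -> int:
    """Switch-to-switch hops on a shortest mesh path (Manhattan distance)."""
    return abs(src.row - dst.row) + abs(src.col - dst.col)


if __name__ == "__main__":
    grid = build_grid(4, 4)
    print(len(grid))                          # 16
    print(len(neighbors(Tile(0, 0), 4, 4)))   # 2
    print(hops(Tile(0, 0), Tile(3, 3)))       # 6
```

In this model, growing the grid adds tiles without changing how many links any one switch has (at most four), which is one way to picture why replication of a fixed tile is described as scalable.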

References

  1. A. Agarwal and M. Levy. Going multicore presents challenges and opportunities. Embedded Systems Design, 20(4), April 2007.

  2. V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger. Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures. In ISCA ’00: Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 248–259, 2000.

  3. E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Hammarling, J. Demmel, C. Bischof, and D. Sorensen. LAPACK: A Portable Linear Algebra Library for High-Performance Computers. In Supercomputing ’90: Proceedings of the 1990 ACM/IEEE Conference on Supercomputing, pages 2–11, 1990.

  4. M. Annaratone, E. Arnould, T. Gross, H. T. Kung, M. Lam, O. Menzilcioglu, and J. A. Webb. The Warp Computer: Architecture, Implementation and Performance. IEEE Transactions on Computers, 36(12):1523–1538, December 1987.

  5. J. Babb, M. Frank, V. Lee, E. Waingold, R. Barua, M. Taylor, J. Kim, S. Devabhaktuni, and A. Agarwal. The RAW Benchmark Suite: Computation Structures for General Purpose Computing. In Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines (FCCM), pages 134–143, 1997.

  6. M. Baron. Low-key Intel 80-core Intro: The tip of the iceberg. Microprocessor Report, April 2007.

  7. M. Baron. Tilera’s cores communicate better. Microprocessor Report, November 2007.

  8. R. Barua, W. Lee, S. Amarasinghe, and A. Agarwal. Maps: A Compiler-Managed Memory System for Raw Machines. In ISCA ’99: Proceedings of the 26th Annual International Symposium on Computer Architecture, pages 4–15, 1999.

  9. M. Bohr. Interconnect Scaling – The Real Limiter to High Performance ULSI. In 1995 IEDM, pages 241–244, 1995.

  10. P. Bose, D. H. Albonesi, and D. Marculescu. Power and complexity aware design. IEEE Micro: Guest Editor’s Introduction for Special Issue on Power and Complexity Aware Design, 23(5):8–11, Sept/Oct 2003.

  11. S. Goldstein, H. Schmit, M. Moe, M. Budiu, S. Cadambi, R. R. Taylor, and R. Laufer. PipeRench: A Coprocessor for Streaming Multimedia Acceleration. In ISCA ’99: Proceedings of the 26th Annual International Symposium on Computer Architecture, pages 28–39, 1999.

  12. M. Gordon, W. Thies, and S. Amarasinghe. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In ASPLOS-XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 75–86, October 2006.

  13. M. I. Gordon, W. Thies, M. Karczmarek, J. Lin, A. S. Meli, A. A. Lamb, C. Leger, J. Wong, H. Hoffmann, D. Maze, and S. Amarasinghe. A Stream Compiler for Communication-Exposed Architectures. In ASPLOS-X: Proceedings of the Tenth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 291–303, 2002.

  14. T. Gross and D. R. O’Hallaron. iWarp: Anatomy of a Parallel Computing System. The MIT Press, Cambridge, MA, 1998.

  15. J. R. Hauser and J. Wawrzynek. Garp: A MIPS Processor with Reconfigurable Coprocessor. In Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines (FCCM), pages 12–21, 1997.

  16. R. Ho, K. W. Mai, and M. A. Horowitz. The Future of Wires. Proceedings of the IEEE, 89(4):490–504, April 2001.

  17. H. Hoffmann, V. Strumpen, and A. Agarwal. Stream Algorithms and Architecture. Technical Memo MIT-LCS-TM-636, MIT Laboratory for Computer Science, 2003.

  18. H. P. Hofstee. Power efficient processor architecture and the Cell processor. In HPCA ’05: Proceedings of the 11th International Symposium on High Performance Computer Architecture, pages 258–262, 2005.

  19. U. Kapasi, W. J. Dally, S. Rixner, J. D. Owens, and B. Khailany. The Imagine Stream Processor. In ICCD ’02: Proceedings of the 2002 IEEE International Conference on Computer Design, pages 282–288, 2002.

  20. A. KleinOsowski and D. Lilja. MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research. Computer Architecture Letters, 1, June 2002.

  21. P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-Way Multithreaded Sparc Processor. IEEE Micro, 25(2):21–29, 2005.

  22. C. Kozyrakis and D. Patterson. A New Direction for Computer Architecture Research. IEEE Computer, 30(9):24–32, September 1997.

  23. R. Krashinsky, C. Batten, M. Hampton, S. Gerding, B. Pharris, J. Casper, and K. Asanovic. The Vector-Thread Architecture. In ISCA ’04: Proceedings of the 31st Annual International Symposium on Computer Architecture, June 2004.

  24. J. Kubiatowicz. Integrated Shared-Memory and Message-Passing Communication in the Alewife Multiprocessor. PhD thesis, Massachusetts Institute of Technology, 1998.

  25. W. Lee, R. Barua, M. Frank, D. Srikrishna, J. Babb, V. Sarkar, and S. Amarasinghe. Space-Time Scheduling of Instruction-Level Parallelism on a Raw Machine. In ASPLOS-VIII: Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 46–54, 1998.

  26. W. Lee, D. Puppin, S. Swenson, and S. Amarasinghe. Convergent Scheduling. In MICRO-35: Proceedings of the 35th Annual International Symposium on Microarchitecture, pages 111–122, 2002.

  27. K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and M. Horowitz. Smart Memories: A Modular Reconfigurable Architecture. In ISCA ’00: Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 161–171, 2000.

  28. D. Matzke. Will Physical Scalability Sabotage Performance Gains? IEEE Computer, 30(9):37–39, September 1997.

  29. J. McCalpin. STREAM: Sustainable Memory Bandwidth in High Performance Computers. http://www.cs.virginia.edu/stream.

  30. J. E. Miller. Software Instruction Caching. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, June 2007. http://hdl.handle.net/1721.1/40317.

  31. C. A. Moritz, D. Yeung, and A. Agarwal. SimpleFit: A Framework for Analyzing Design Tradeoffs in Raw Architectures. IEEE Transactions on Parallel and Distributed Systems, pages 730–742, July 2001.

  32. S. Naffziger and G. Hammond. The Implementation of the Next-Generation 64b Itanium Microprocessor. In Proceedings of the IEEE International Solid-State Circuits Conference, pages 344–345, 472, 2002.

  33. R. Nagarajan, K. Sankaralingam, D. Burger, and S. W. Keckler. A Design Space Evaluation of Grid Processor Architectures. In MICRO-34: Proceedings of the 34th Annual International Symposium on Microarchitecture, pages 40–51, 2001.

  34. M. Narayanan and K. A. Yelick. Generating Permutation Instructions from a High-Level Description. TR UCB-CS-03-1287, UC Berkeley, 2003.

  35. S. Palacharla. Complexity-Effective Superscalar Processors. PhD thesis, University of Wisconsin–Madison, 1998.

  36. J. Sanchez and A. Gonzalez. Modulo Scheduling for a Fully-Distributed Clustered VLIW Architecture. In MICRO-33: Proceedings of the 33rd Annual International Symposium on Microarchitecture, pages 124–133, December 2000.

  37. K. Sankaralingam, R. Nagarajan, R. McDonald, R. Desikan, S. Drolia, M. S. Govindan, P. Gratz, D. Gulati, H. Hanson, C. Kim, H. Liu, N. Ranganathan, S. Sethumadhavan, S. Sharif, P. Shivakumar, S. W. Keckler, and D. Burger. Distributed microarchitectural protocols in the TRIPS prototype processor. In MICRO-39: Proceedings of the 39th Annual International Symposium on Microarchitecture, pages 480–491, Dec 2006.

  38. D. Shoemaker, F. Honore, C. Metcalf, and S. Ward. NuMesh: An Architecture Optimized for Scheduled Communication. Journal of Supercomputing, 10(3):285–302, 1996.

  39. G. Sohi, S. Breach, and T. Vijaykumar. Multiscalar Processors. In ISCA ’95: Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 414–425, 1995.

  40. J. Suh, E.-G. Kim, S. P. Crago, L. Srinivasan, and M. C. French. A Performance Analysis of PIM, Stream Processing, and Tiled Processing on Memory-Intensive Signal Processing Kernels. In ISCA ’03: Proceedings of the 30th Annual International Symposium on Computer Architecture, pages 410–419, June 2003.

  41. M. B. Taylor. Deionizer: A Tool for Capturing and Embedding I/O Calls. Technical Report MIT-CSAIL-TR-2004-037, MIT CSAIL/Laboratory for Computer Science, 2004. http://cag.csail.mit.edu/~mtaylor/deionizer.html.

  42. M. B. Taylor. Tiled Processors. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, Feb 2007.

  43. M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, J.-W. Lee, P. Johnson, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal. The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs. IEEE Micro, pages 25–35, Mar 2002.

  44. M. B. Taylor, W. Lee, S. Amarasinghe, and A. Agarwal. Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures. In HPCA ’03: Proceedings of the 9th International Symposium on High Performance Computer Architecture, pages 341–353, 2003.

  45. M. B. Taylor, W. Lee, S. Amarasinghe, and A. Agarwal. Scalar Operand Networks. IEEE Transactions on Parallel and Distributed Systems (Special Issue on On-chip Networks), Feb 2005.

  46. M. B. Taylor, W. Lee, J. E. Miller, D. Wentzlaff, I. Bratt, B. Greenwald, H. Hoffmann, P. Johnson, J. Kim, J. Psota, A. Saraf, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal. Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams. In ISCA ’04: Proceedings of the 31st Annual International Symposium on Computer Architecture, pages 2–13, June 2004.

  47. W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A Language for Streaming Applications. In 2002 Compiler Construction, pages 179–196, 2002.

  48. E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring it All to Software: Raw Machines. IEEE Computer, 30(9):86–93, Sep 1997.

  49. D. Wentzlaff. Architectural Implications of Bit-level Computation in Communication Applications. Master’s thesis, Massachusetts Institute of Technology, 2002.

  50. D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C.-C. Miao, J. F. Brown, and A. Agarwal. On-Chip Interconnection Architecture of the Tile Processor. IEEE Micro, 27(5):15–31, Sept–Oct 2007.

  51. R. Whaley, A. Petitet, and J. J. Dongarra. Automated Empirical Optimizations of Software and the ATLAS Project. Parallel Computing, 27(1–2):3–35, 2001.

Acknowledgments

We thank our StreamIt collaborators, specifically M. Gordon, J. Lin, and B. Thies, for the StreamIt backend and the corresponding section of this chapter. We are grateful to our collaborators from ISI East, including C. Chen, S. Crago, M. French, L. Wang, and J. Suh, for developing the Raw motherboard, firmware components, and several applications. T. Konstantakopoulos, L. Jakab, F. Ghodrat, M. Seneski, A. Saraswat, R. Barua, A. Ma, J. Babb, M. Stephenson, S. Larsen, V. Sarkar, and several others too numerous to list also contributed to the success of Raw. The Raw chip was fabricated in cooperation with IBM. Raw is funded by DARPA, NSF, ITRI, and the Oxygen Alliance.

Copyright information

© 2009 Springer-Verlag US

About this chapter

Cite this chapter

Taylor, M.B. et al. (2009). Tiled Multicore Processors. In: Keckler, S., Olukotun, K., Hofstee, H. (eds) Multicore Processors and Systems. Integrated Circuits and Systems. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-0263-4_1

  • DOI: https://doi.org/10.1007/978-1-4419-0263-4_1

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4419-0262-7

  • Online ISBN: 978-1-4419-0263-4

  • eBook Packages: Computer Science, Computer Science (R0)
