Landing Stencil Code on Godson-T

Cui, Hui-Min; Wang, Lei; Fan, Dong-Rui; Feng, Xiao-Bing

doi:10.1007/s11390-010-9373-6

Short Paper
Published: 11 July 2010

Volume 25, pages 886–894 (2010)

Journal of Computer Science and Technology

Hui-Min Cui^1,2, Lei Wang^1,2, Dong-Rui Fan^1 & Xiao-Bing Feng^1

Abstract

The advent of multi-core/many-core chip technology offers both an extraordinary opportunity and a profound challenge. In particular, computer architects and system software designers have a unique opportunity to introduce new architectural features together with adequate compiler technology, and the combination may have a profound impact. This paper presents a case study, using the 1-D Jacobi computation, of compiler-amenable performance optimization techniques on the many-core architecture Godson-T. Two distinctive features of the Godson-T architecture are chosen for this study: 1) chip-level globally addressable memory, in particular the scratchpad memories (SPMs) local to the processing cores; 2) fine-grain memory-based synchronization (e.g., full-empty bits). Leveraging state-of-the-art performance optimization methods for 1-D stencil parallelization (e.g., timed tiling and its variants), we developed and implemented a number of many-core-specific optimizations for Godson-T. Our experimental study shows good performance in both execution-time speedup and scalability, validates the value of the globally accessible SPM and the fine-grain synchronization mechanism (full-empty bits) on Godson-T, and provides some useful guidelines for future compiler technology for many-core chip architectures.
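
As an illustration of the kind of computation the abstract refers to, the following is a minimal sketch of a 3-point 1-D Jacobi sweep and one simple time-tiled variant (overlapped tiling with a halo of redundant cells). It is not the authors' implementation and does not model Godson-T hardware: the function names, the fixed-boundary convention, the averaging weights, and the use of a private Python list to stand in for a core's SPM are all assumptions made for illustration.

```python
def jacobi_step(a):
    """One 3-point Jacobi sweep; the two boundary cells are held fixed (assumed)."""
    n = len(a)
    b = a[:]
    for i in range(1, n - 1):
        b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return b


def jacobi_reference(a, steps):
    """Plain time-stepping: every sweep reads and writes the whole array."""
    for _ in range(steps):
        a = jacobi_step(a)
    return a


def jacobi_overlapped_tiles(a, steps, tile):
    """Hypothetical time-tiled variant using overlapped tiles.

    Each tile copies its cells plus a halo of `steps` cells per side into a
    private buffer (standing in for a scratchpad), advances `steps` time
    steps locally, and writes back only the cells it owns. Stale halo values
    creep inward one cell per step, so they never reach the owned cells.
    """
    n = len(a)
    out = a[:]
    for lo in range(1, n - 1, tile):
        hi = min(lo + tile, n - 1)       # tile owns cells [lo, hi)
        glo = max(lo - steps, 0)         # halo-extended range [glo, ghi)
        ghi = min(hi + steps, n)
        buf = a[glo:ghi]
        for _ in range(steps):
            new = buf[:]
            for j in range(1, len(buf) - 1):
                new[j] = (buf[j - 1] + buf[j] + buf[j + 1]) / 3.0
            buf = new
        out[lo:hi] = buf[lo - glo:hi - glo]
    return out


if __name__ == "__main__":
    data = [float((i * 7) % 11) for i in range(40)]
    ref = jacobi_reference(data, 4)
    tiled = jacobi_overlapped_tiles(data, 4, 8)
    print(max(abs(x - y) for x, y in zip(ref, tiled)))
```

In this sketch the tiles are independent within a block of time steps, at the cost of recomputing halo cells. Schemes that avoid the redundant work instead exchange tile-boundary values between neighbouring cores, which is where globally addressable SPM and fine-grain synchronization such as full-empty bits would come into play.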

Author information

Authors and Affiliations

Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Hui-Min Cui, Lei Wang, Dong-Rui Fan (Member CCF, IEEE) & Xiao-Bing Feng
Graduate University of Chinese Academy of Sciences, Beijing, 100039, China
Hui-Min Cui & Lei Wang

Corresponding author

Correspondence to Hui-Min Cui.

Additional information

Supported by the National Basic Research 973 Program of China under Grant No. 2005CB321602, the National Natural Science Foundation of China under Grant No. 60736012, and the National High Technology Research and Development 863 Program of China under Grant Nos. 2007AA01Z110 and 2009AA01Z103.

About this article

Cite this article

Cui, HM., Wang, L., Fan, DR. et al. Landing Stencil Code on Godson-T. J. Comput. Sci. Technol. 25, 886–894 (2010). https://doi.org/10.1007/s11390-010-9373-6

Received: 12 June 2009
Revised: 21 May 2010
Published: 11 July 2010
Issue Date: July 2010
DOI: https://doi.org/10.1007/s11390-010-9373-6
