
Landing Stencil Code on Godson-T

  • Short Paper
  • Published in: Journal of Computer Science and Technology

Abstract

The advent of multi-core/many-core chip technology offers both an extraordinary opportunity and a profound challenge. In particular, computer architects and system software designers have a unique opportunity to introduce new architectural features together with adequate compiler technology; combined, the two may have a profound impact. This paper presents a case study (using the 1-D Jacobi computation) of compiler-amenable performance optimization techniques on the many-core architecture Godson-T. Two of Godson-T's distinctive features are chosen for this study: 1) chip-level globally addressable memory, in particular the scratchpad memories (SPM) local to the processing cores; and 2) fine-grain memory-based synchronization (e.g., full-empty bits). Leveraging state-of-the-art performance optimization methods for 1-D stencil parallelization (e.g., timed tiling and variants), we developed and implemented a number of many-core-based optimizations for Godson-T. Our experimental study shows good performance in both execution-time speedup and scalability, validates the value of the globally accessible SPM and the fine-grain synchronization mechanism (full-empty bits) on Godson-T, and provides some useful guidelines for future compiler technology for many-core chip architectures.
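To make the 1-D Jacobi computation and the idea of tiling across time steps concrete, the following is a minimal sketch in Python. It is not the paper's implementation: the three-point averaging stencil, the function names `jacobi_naive` and `jacobi_overlapped_tiles`, and the parameters `tile` and `tb` are all assumptions chosen for illustration. The tiled variant uses overlapped (ghost-zone) tiles, one simple member of the time-tiling family, where each tile copies its data plus a halo into a private buffer (standing in for a core's SPM) and advances several time steps before writing back. Fine-grain full-empty-bit synchronization is not modeled here.

```python
def jacobi_naive(a, steps):
    """Reference 1-D Jacobi: 3-point average, end points held fixed."""
    cur = list(a)
    for _ in range(steps):
        nxt = cur[:]
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur = nxt
    return cur


def jacobi_overlapped_tiles(a, steps, tile=8, tb=4):
    """Same computation, tiled in space and time with overlapped halos.

    tile: number of points each tile owns (assumed parameter).
    tb:   number of time steps fused per tile (assumed parameter).
    """
    cur = list(a)
    n = len(cur)
    done = 0
    while done < steps:
        h = min(tb, steps - done)          # time steps fused in this pass
        out = cur[:]
        for lo in range(0, n, tile):
            hi = min(lo + tile, n)
            # Halo of width h on each side, clipped at the array ends.
            glo = max(lo - h, 0)
            ghi = min(hi + h, n)
            local = cur[glo:ghi]           # stands in for a copy into SPM
            for _ in range(h):
                nxt = local[:]
                for j in range(1, len(local) - 1):
                    nxt[j] = (local[j - 1] + local[j] + local[j + 1]) / 3.0
                local = nxt
            # Halo cells may be stale after h steps; owned cells are exact.
            out[lo:hi] = local[lo - glo:hi - glo]
        cur = out
        done += h
    return cur


if __name__ == "__main__":
    data = [float((i * 7) % 11) for i in range(37)]
    ref = jacobi_naive(data, 10)
    got = jacobi_overlapped_tiles(data, 10, tile=8, tb=4)
    print(max(abs(x - y) for x, y in zip(ref, got)))
```

The overlapped scheme trades redundant halo computation for independence between tiles within a pass: tiles only need to synchronize once every `tb` time steps rather than after every step. Skewed or diamond-shaped tiles avoid the redundant work but require finer-grained synchronization between neighboring tiles.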

Author information

Corresponding author

Correspondence to Hui-Min Cui.

Additional information

Supported by the National Basic Research 973 Program of China under Grant No. 2005CB321602, the National Natural Science Foundation of China under Grant No. 60736012, the National High Technology Research and Development 863 Program of China under Grant Nos. 2007AA01Z110 and 2009AA01Z103.

About this article

Cite this article

Cui, HM., Wang, L., Fan, DR. et al. Landing Stencil Code on Godson-T. J. Comput. Sci. Technol. 25, 886–894 (2010). https://doi.org/10.1007/s11390-010-9373-6

  • DOI: https://doi.org/10.1007/s11390-010-9373-6
