TL-DAE: Thread-Level Decoupled Access/Execution for OpenMP on the Cyclops-64 Many-Core Processor

Gan, Ge; Manzano, Joseph

doi:10.1007/978-3-642-13374-9_6

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5898)

Included in the following conference series:

International Workshop on Languages and Compilers for Parallel Computing

Abstract

Cyclops-64 is a many-core processor with a software-managed memory hierarchy. OpenMP programs running on this processor frequently follow a three-step computing paradigm: (i) copy data into on-chip memory; (ii) perform computations on the chip; (iii) copy results back to off-chip memory. Hiding memory-copy latency is therefore crucial to the performance of this paradigm. The traditional solution is asynchronous DMA transfer, but the Cyclops-64 processor does not support DMA. In this paper, we propose a software solution called Thread-Level Decoupled Access/Execution (TL-DAE for short), a data-driven execution model for OpenMP programs running on the Cyclops-64 processor. The TL-DAE execution model is inspired by the canonical decoupled architecture. In our design, data movements and computations are decoupled implicitly by the OpenMP compiler. At runtime, two groups of threads are spawned: computation threads, which execute computation code, and percolation threads, which execute data movement code. The execution of computation threads and percolation threads can slip with respect to each other, so a percolation thread can run ahead of a computation thread and fetch data for it. We develop the runtime techniques needed to implement the TL-DAE execution model and propose the TL-DAE programming interface that the OpenMP compiler uses to generate the decoupled code. We evaluate the TL-DAE execution model with two OpenMP task benchmarks; the experimental results show a significant performance improvement.
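The decoupling described in the abstract can be illustrated with a minimal sketch. This is not the paper's runtime or its programming interface: it uses ordinary Python threads and a bounded queue as stand-ins for Cyclops-64 thread units and on-chip buffers, and every name in it (`run_tl_dae`, `percolation_thread`, `computation_thread`, `num_buffers`) is hypothetical. The point is only to show how a percolation thread can run ahead of a computation thread and stage data for it.

```python
import queue
import threading


def run_tl_dae(off_chip, tile_size, num_buffers=2):
    """Sum the squares of each tile with one percolation and one computation thread.

    Illustrative only: `off_chip` plays the role of off-chip memory and the
    bounded queue plays the role of a small set of on-chip staging buffers.
    """
    ready = queue.Queue(maxsize=num_buffers)  # staged tiles, at most num_buffers at once
    results = []

    def percolation_thread():
        # Access side: copy tiles out of "off-chip" memory ahead of the consumer.
        for start in range(0, len(off_chip), tile_size):
            tile = list(off_chip[start:start + tile_size])  # the copy models the transfer
            ready.put(tile)  # blocks only when every staging buffer is full
        ready.put(None)  # sentinel: no more tiles

    def computation_thread():
        # Execute side: consume each tile as soon as it has been staged.
        while True:
            tile = ready.get()
            if tile is None:
                break
            results.append(sum(x * x for x in tile))

    threads = [
        threading.Thread(target=percolation_thread),
        threading.Thread(target=computation_thread),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In this sketch `num_buffers` bounds how far the percolation thread may slip ahead of the computation thread: with a value of 1 the two run nearly in lockstep, while larger values let more tiles be staged before they are needed.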

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware, 19716, U.S.A.
Ge Gan & Joseph Manzano

Editor information

Editors and Affiliations

Department of Electrical and Computer Engineering, University of Delaware, Newark, DE 19716, USA
Guang R. Gao & Xiaoming Li
Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA
Lori L. Pollock & John Cavazos

About this paper

Cite this paper

Gan, G., Manzano, J. (2010). TL-DAE: Thread-Level Decoupled Access/Execution for OpenMP on the Cyclops-64 Many-Core Processor. In: Gao, G.R., Pollock, L.L., Cavazos, J., Li, X. (eds) Languages and Compilers for Parallel Computing. LCPC 2009. Lecture Notes in Computer Science, vol 5898. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13374-9_6

DOI: https://doi.org/10.1007/978-3-642-13374-9_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13373-2
Online ISBN: 978-3-642-13374-9
eBook Packages: Computer Science, Computer Science (R0)
