
Automatic and Efficient Data Host-Device Communication for Many-Core Coprocessors

  • Conference paper
Languages and Compilers for Parallel Computing (LCPC 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9519)

Abstract

Orchestrating data transfers between a CPU and a coprocessor manually is cumbersome, particularly for multi-dimensional arrays and other data structures with multi-level pointers that are common in scientific computations. This paper describes a system that includes both compile-time and runtime solutions for this problem, with the overarching goal of improving programmer productivity while maintaining performance.

We find that the standard linearization method performs poorly for arrays with non-uniform dimensions on the coprocessor, due to redundant data transfers and the suppression of important compiler optimizations such as vectorization. The key contribution of this paper is a novel approach to heap linearization that avoids modifying memory accesses, thereby preserving vectorization; we refer to it as partial linearization with pointer reset.

We implement partial linearization with pointer reset as the compile-time solution, whereas the runtime solution is implemented as an enhancement to the MYO library. We evaluate our approach on multiple C benchmarks. Experimental results demonstrate that our best compile-time solution can perform 2.5x-5x faster than the original runtime solution, and that the CPU-MIC code using it can achieve a 1.5x-2.5x speedup over the 16-thread CPU version.


Notes

  1.

    Intel C++ Compiler. http://www.intel.com/Compilers.

  2.

    Due to the page limitation, we omit some details of the runtime optimization and the source-to-source transformation to integrate two approaches, and all of our code examples in this version. Please refer to our LCPC’15 conference version for more details: http://www.csc2.ncsu.edu/workshops/lcpc2015/lcpc15proc.pdf.

  3.

    OpenACC: Directives for Accelerators. http://www.openacc-standard.org/.

  4.

    As described in http://www.csc2.ncsu.edu/workshops/lcpc2015/lcpc15proc.pdf.


Acknowledgements

We would like to thank Ravindra Ganapathi from Intel for guiding us through the MYO library.

Author information

Correspondence to Yi Yang.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Ren, B., Ravi, N., Yang, Y., Feng, M., Agrawal, G., Chakradhar, S. (2016). Automatic and Efficient Data Host-Device Communication for Many-Core Coprocessors. In: Shen, X., Mueller, F., Tuck, J. (eds) Languages and Compilers for Parallel Computing. LCPC 2015. Lecture Notes in Computer Science, vol 9519. Springer, Cham. https://doi.org/10.1007/978-3-319-29778-1_11

  • DOI: https://doi.org/10.1007/978-3-319-29778-1_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-29777-4

  • Online ISBN: 978-3-319-29778-1

  • eBook Packages: Computer Science (R0)
