
Automatic and Efficient Data Host-Device Communication for Many-Core Coprocessors

  • Conference paper
Languages and Compilers for Parallel Computing (LCPC 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9519)

Abstract

Orchestrating data transfers between a CPU and a coprocessor manually is cumbersome, particularly for multi-dimensional arrays and other data structures with multi-level pointers that are common in scientific computations. This paper describes a system that includes both compile-time and runtime solutions for this problem, with the overarching goal of improving programmer productivity while maintaining performance.

We find that the standard linearization method performs poorly for arrays with non-uniform dimensions on the coprocessor, due to redundant data transfers and the suppression of important compiler optimizations such as vectorization. The key contribution of this paper is a novel approach to heap linearization that avoids modifying memory accesses, thereby preserving vectorization; we refer to it as partial linearization with pointer reset.

We implement partial linearization with pointer reset as the compile-time solution, whereas the runtime solution is implemented as an enhancement to the MYO library. We evaluate our approach on multiple C benchmarks. Experimental results demonstrate that our best compile-time solution can perform 2.5x-5x faster than the original runtime solution, and that the CPU-MIC code using it can achieve a 1.5x-2.5x speedup over the 16-thread CPU version.


Notes

  1.

    Intel C++ Compiler. http://www.intel.com/Compilers.

  2.

    Due to the page limitation, we omit some details of the runtime optimization and the source-to-source transformation to integrate two approaches, and all of our code examples in this version. Please refer to our LCPC’15 conference version for more details: http://www.csc2.ncsu.edu/workshops/lcpc2015/lcpc15proc.pdf.

  3.

    OpenACC: Directives for Accelerators. http://www.openacc-standard.org/.

  4.

    As described in http://www.csc2.ncsu.edu/workshops/lcpc2015/lcpc15proc.pdf.


Acknowledgements

We would like to thank Ravindra Ganapathi from Intel for guiding us through the MYO library.

Author information

Correspondence to Yi Yang.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Ren, B., Ravi, N., Yang, Y., Feng, M., Agrawal, G., Chakradhar, S. (2016). Automatic and Efficient Data Host-Device Communication for Many-Core Coprocessors. In: Shen, X., Mueller, F., Tuck, J. (eds) Languages and Compilers for Parallel Computing. LCPC 2015. Lecture Notes in Computer Science, vol 9519. Springer, Cham. https://doi.org/10.1007/978-3-319-29778-1_11

  • DOI: https://doi.org/10.1007/978-3-319-29778-1_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-29777-4

  • Online ISBN: 978-3-319-29778-1

  • eBook Packages: Computer Science (R0)
