The Importance of Efficient Fine-Grain Synchronization for Many-Core Systems

  • Conference paper
  • Languages and Compilers for Parallel Computing (LCPC 2016)

Abstract

Current shared-memory systems can feature tens of processing elements. The old assumption that coarse-grain synchronization is enough in a shared-memory system thus becomes invalid. To take advantage of such systems efficiently, we propose using fine-grain synchronization with event-driven multithreading. To illustrate our point, we study a naïve 5-point 2D stencil kernel. We provide several synchronization variants using our fine-grain multithreading environment, and compare them to a naïve coarse-grain implementation using OpenMP. We conducted experiments on three different many-core compute nodes, with speedups ranging from 1.2× to 1.75×.
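The contrast drawn in the abstract can be illustrated with a minimal sketch. This is not the authors' code and does not use their runtime or OpenMP: it is a sequential Python simulation, and every name in it (`sweep_rows`, `run_coarse`, `run_fine`, `block_rows`) is hypothetical. It assumes a Jacobi-style update with two buffers and fixed boundary values. The coarse-grain version finishes a whole time step before starting the next, which is where a global barrier would sit in a parallel version. The fine-grain version lets a block of rows start step t+1 as soon as it and its two neighbouring blocks have finished step t, tracked with per-task dependency counters.

```python
from collections import deque


def sweep_rows(src, dst, lo, hi):
    """Apply the 5-point average to rows lo..hi-1, reading src and writing dst."""
    cols = len(src[0])
    for i in range(lo, hi):
        for j in range(1, cols - 1):
            dst[i][j] = (src[i][j] + src[i - 1][j] + src[i + 1][j]
                         + src[i][j - 1] + src[i][j + 1]) / 5.0


def run_coarse(grid, steps):
    """Coarse-grain schedule: all rows finish step t before any row starts t+1."""
    bufs = [[row[:] for row in grid], [row[:] for row in grid]]
    for t in range(steps):
        sweep_rows(bufs[t % 2], bufs[(t + 1) % 2], 1, len(grid) - 1)
        # In a parallel version, a global barrier would separate time steps here.
    return bufs[steps % 2]


def run_fine(grid, steps, block_rows=2):
    """Fine-grain schedule: task (block, t) fires once its neighbours finished t-1."""
    bufs = [[row[:] for row in grid], [row[:] for row in grid]]
    n = len(grid)
    # Split the interior rows into blocks; hi is exclusive.
    blocks = [(lo, min(lo + block_rows, n - 1)) for lo in range(1, n - 1, block_rows)]
    nb = len(blocks)
    # Each task at step t >= 1 waits for itself and its existing neighbours at t-1.
    pending = {(b, t): 1 + (b > 0) + (b < nb - 1)
               for b in range(nb) for t in range(1, steps)}
    ready = deque((b, 0) for b in range(nb)) if steps > 0 else deque()
    while ready:
        b, t = ready.pop()  # LIFO on purpose: blocks run ahead in time when allowed
        lo, hi = blocks[b]
        sweep_rows(bufs[t % 2], bufs[(t + 1) % 2], lo, hi)
        if t + 1 < steps:
            # Signal the three tasks of the next step that depend on this one.
            for nbr in (b - 1, b, b + 1):
                if 0 <= nbr < nb:
                    pending[(nbr, t + 1)] -= 1
                    if pending[(nbr, t + 1)] == 0:
                        ready.append((nbr, t + 1))
    return bufs[steps % 2]


if __name__ == "__main__":
    # 8x8 grid with a hot top edge; both schedules should agree exactly.
    grid = [[100.0] * 8] + [[0.0] * 8 for _ in range(7)]
    print(run_fine(grid, 5) == run_coarse(grid, 5))
```

Two buffers remain sufficient in the fine-grain version because a block only overwrites its rows of the older buffer after every task that read those rows has completed. In a real many-core setting the counters would be updated atomically by concurrent workers, which this sequential sketch does not attempt to model.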

Notes

  1. Note that we do not claim that our own environment is better than OpenMP 4.

  2. Obviously, as we are writing directly using a runtime system API, the code has to be more verbose than its OpenMP counterpart.

Acknowledgments

This research is based upon work supported by the National Science Foundation, under awards XPS-1439165 and XPS-1439097.

Author information

Correspondence to Tongsheng Geng.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Geng, T. et al. (2017). The Importance of Efficient Fine-Grain Synchronization for Many-Core Systems. In: Ding, C., Criswell, J., Wu, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2016. Lecture Notes in Computer Science, vol 10136. Springer, Cham. https://doi.org/10.1007/978-3-319-52709-3_16

  • DOI: https://doi.org/10.1007/978-3-319-52709-3_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-52708-6

  • Online ISBN: 978-3-319-52709-3

  • eBook Packages: Computer Science (R0)
