The Importance of Efficient Fine-Grain Synchronization for Many-Core Systems

Geng, Tongsheng; Zuckerman, Stéphane; Monsalve, José; Goldman, Alfredo; Habib, Sami; Gaudiot, Jean-Luc; Gao, Guang R.

doi:10.1007/978-3-319-52709-3_16

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10136)

Included in the following conference series:

International Workshop on Languages and Compilers for Parallel Computing

Abstract

Current shared-memory systems can feature tens of processing elements. The old assumption that coarse-grain synchronization is sufficient in a shared-memory system therefore no longer holds. To take advantage of such systems efficiently, we propose to use fine-grain synchronization with event-driven multithreading. To illustrate our point, we study a naïve 5-point 2D stencil kernel. We provide several synchronization variants using our fine-grain multithreading environment and compare them to a naïve coarse-grain implementation using OpenMP. We conducted experiments on three different many-core compute nodes, with speedups ranging from 1.2× to 1.75×.
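
For readers unfamiliar with the baseline, the C sketch below shows what a naïve coarse-grain OpenMP version of a 5-point 2D stencil can look like. It assumes a row-major n×n grid of doubles with fixed boundary values, a Jacobi (two-buffer) update, and equal 0.2 weights on the five points; `stencil_coarse` is a hypothetical name, and this is not the authors' benchmark code.

```c
#include <assert.h>

/* Hypothetical naive coarse-grain baseline (not the authors' code).
   a and b are row-major n x n grids with identical boundary values.
   Returns the buffer holding the result after `steps` Jacobi sweeps. */
const double *stencil_coarse(double *a, double *b, int n, int steps) {
    assert(n >= 3 && steps >= 0);
    for (int s = 0; s < steps; s++) {
        const double *src = (s % 2 == 0) ? a : b;
        double *dst = (s % 2 == 0) ? b : a;
        /* The implicit barrier at the end of this loop is the only
           synchronization: every thread waits for the slowest one
           before any thread may start the next time step. */
        #pragma omp parallel for schedule(static)
        for (int i = 1; i < n - 1; i++)
            for (int j = 1; j < n - 1; j++)
                dst[i * n + j] = 0.2 * (src[i * n + j] + src[(i - 1) * n + j] +
                                        src[(i + 1) * n + j] + src[i * n + (j - 1)] +
                                        src[i * n + (j + 1)]);
    }
    return (steps % 2 == 0) ? a : b;
}
```

All synchronization here is one global barrier per time step. Compiled without OpenMP support, the pragma is ignored and the loop simply runs sequentially with the same result.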

Notes

1.
Note that we do not claim that our own environment is better than OpenMP 4.
2.
Because we write directly against a runtime-system API, the code is necessarily more verbose than its OpenMP counterpart.
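
OpenMP 4.0 itself can express neighbour-only synchronization for a stencil through task dependences (the `depend` clause). The C sketch below is an illustration of that idea, not the authors' event-driven runtime or their synchronization variants. It assumes a row-major n×n grid of doubles with fixed boundaries, a two-buffer Jacobi update with equal 0.2 weights, and interior rows split into blocks of `block_rows` rows; `stencil_dataflow`, `sweep_rows`, and `MAX_BLOCKS` are hypothetical names. Each (time step, block) pair becomes a task that waits only for its own block and its two neighbouring blocks, rather than for a barrier across all threads.

```c
#include <assert.h>

#define MAX_BLOCKS 1024 /* hypothetical cap on the number of row blocks */

/* Jacobi update of interior rows [r0, r1) from src into dst. */
static void sweep_rows(const double *src, double *dst, int n, int r0, int r1) {
    for (int i = r0; i < r1; i++)
        for (int j = 1; j < n - 1; j++)
            dst[i * n + j] = 0.2 * (src[i * n + j] + src[(i - 1) * n + j] +
                                    src[(i + 1) * n + j] + src[i * n + (j - 1)] +
                                    src[i * n + (j + 1)]);
}

/* Hypothetical fine-grain variant: no global barrier between time steps.
   Returns the buffer holding the result after `steps` sweeps. */
const double *stencil_dataflow(double *a, double *b, int n, int steps, int block_rows) {
    assert(n >= 3 && steps >= 0 && block_rows >= 1);
    int nblocks = (n - 2 + block_rows - 1) / block_rows;
    assert(nblocks <= MAX_BLOCKS);
    /* deps[k][blk + 1] is a sentinel standing for "block blk of buffer k"
       (buffer 0 = a, buffer 1 = b); the padding lets the first and last
       blocks name a neighbour without going out of bounds. */
    char deps[2][MAX_BLOCKS + 2];
    (void)deps; /* only referenced by pragmas; silences warnings without OpenMP */
    #pragma omp parallel
    #pragma omp single
    for (int s = 0; s < steps; s++) {
        for (int blk = 0; blk < nblocks; blk++) {
            int cur = s % 2, nxt = 1 - cur;
            /* RAW: wait for the previous step's writers of this block and
               both neighbours. WAR: the out dependence also orders this task
               after the previous step's readers of the block it overwrites. */
            #pragma omp task depend(in: deps[cur][blk], deps[cur][blk + 1], deps[cur][blk + 2]) \
                             depend(out: deps[nxt][blk + 1])
            {
                const double *src = (cur == 0) ? a : b;
                double *dst = (cur == 0) ? b : a;
                int r0 = 1 + blk * block_rows;
                int r1 = (r0 + block_rows < n - 1) ? r0 + block_rows : n - 1;
                sweep_rows(src, dst, n, r0, r1);
            }
        }
    }
    return (steps % 2 == 0) ? a : b;
}
```

With this scheme a block may start step s+1 as soon as it and its two neighbours have finished step s, even if distant blocks are still behind. Without OpenMP support the pragmas are ignored and the tasks run in generation order, which is already a valid sequential schedule.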

Acknowledgments

This research is based upon work supported by the National Science Foundation, under awards XPS-1439165 and XPS-1439097.

Author information

Authors and Affiliations

PArallel Systems and Computer Architecture Lab, Department of Electrical Engineering and Computer Science, University of California, Irvine, USA
Tongsheng Geng, Alfredo Goldman & Jean-Luc Gaudiot
Computer Architecture and Parallel Systems Laboratory, Department of Electrical and Computer Engineering, University of Delaware, Newark, USA
Stéphane Zuckerman, José Monsalve & Guang R. Gao
Computer Engineering Department, Kuwait University, Al-khalidiya, Kuwait
Sami Habib

Corresponding author

Correspondence to Tongsheng Geng.

Editor information

Editors and Affiliations

University of Rochester, Rochester, New York, USA
Chen Ding
University of Rochester, Rochester, New York, USA
John Criswell
Huawei Inc., Santa Clara, California, USA
Peng Wu

About this paper

Cite this paper

Geng, T. et al. (2017). The Importance of Efficient Fine-Grain Synchronization for Many-Core Systems. In: Ding, C., Criswell, J., Wu, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2016. Lecture Notes in Computer Science, vol 10136. Springer, Cham. https://doi.org/10.1007/978-3-319-52709-3_16

DOI: https://doi.org/10.1007/978-3-319-52709-3_16
Published: 24 January 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-52708-6
Online ISBN: 978-3-319-52709-3
eBook Packages: Computer Science, Computer Science (R0)
