Work-Stealing for NUMA-enabled Architecture

Chen, Quan; Guo, Minyi

doi:10.1007/978-981-10-6238-4_4

Quan Chen & Minyi Guo

Chapter
First Online: 25 November 2017

Abstract

Modern high-performance computers adopt not only a multi-socket multi-core CPU architecture but also a Non-Uniform Memory Access (NUMA)-based memory architecture. Although the CAB scheduler introduced in Chap. 3 effectively improves shared cache utilization, it still incurs frequent remote memory accesses on these computers, which significantly degrade the performance of memory-bound applications. To solve this problem, this chapter introduces scheduling techniques that make better use of both the shared caches in CPUs and the NUMA-based memory system.

Parts of this chapter, including Figures 4.1, 4.5, 4.7, 4.8 and 4.9, have been published in ACM Transactions on Architecture and Code Optimization. Reprinted from Ref. [31], with permission from ACM.

References

U. Acar, G. Blelloch, and R. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3):321–347, 2002.
AMD. BIOS and Kernel Developer Guide (BKDG) For AMD Family 10h Processors. AMD (2010).
E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, and G. Zhang. The design of OpenMP tasks. IEEE TPDS, 20(3):404–418, 2009.
R. D. Blumofe. Executing Multithreaded Programs Efficiently. Ph.D. thesis, MIT, September 1995.
R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55–69, 1996.
M. Castro, L. G. Fernandes, C. Pousa, J.-F. Méhaut, and M. S. de Aguiar. NUMA-ICTM: A parallel version of ICTM exploiting memory placement strategies for NUMA machines. In IPDPS, pp. 1–8 (2009).
Q. Chen and M. Guo. Adaptive workload aware task scheduling for single-ISA multi-core architectures. ACM Transactions on Architecture and Code Optimization, 11(1) (2014).
Q. Chen, Y. Chen, Z. Huang, and M. Guo. WATS: Workload-aware task scheduling in asymmetric multi-core architectures. In IPDPS, pp. 249–260 (2012).
Q. Chen, M. Guo, and Z. Huang. CATS: Cache aware task-stealing based on online profiling in multi-socket multi-core architectures. In ICS, pp. 163–172 (2012).
Q. Chen, Z. Huang, M. Guo, and J. Zhou. CAB: Cache-aware bi-tier task-stealing in multi-socket multi-core architecture. In ICPP, pp. 722–732 (2011).
Q. Chen and M. Guo. Locality-aware work stealing based on online profiling and auto-tuning for multisocket multicore architectures. ACM Transactions on Architecture and Code Optimization, 12(2):22, 2015.
R. Cole and V. Ramachandran. Analysis of randomized work stealing with false sharing. In IPDPS, pp. 985–989 (2013).
M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In PLDI, pp. 212–223 (1998).
T. Gautier, J. V. Lima, N. Maillard, and B. Raffin. XKaapi: A runtime system for data-flow task programming on heterogeneous architectures. In IPDPS, pp. 1299–1308 (2013).
T. Gautier, J. V. F. Lima, N. Maillard, B. Raffin, et al. Locality-aware work stealing on Multi-CPU and Multi-GPU architectures. In MULTIPROG (2013).
A. Gerasoulis and T. Yang. A comparison of clustering heuristics for scheduling directed acyclic graphs on multiprocessors. Journal of Parallel and Distributed Computing, 16(4):276–291, 1992.
Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-first and help-first scheduling policies for async-finish task parallelism. In IPDPS, pp. 1–12 (2009).
Y. Guo, J. Zhao, V. Cave, and V. Sarkar. SLAW: a scalable locality-aware adaptive work-stealing scheduler. In IPDPS, pp. 1–12 (2010).
L. V. Kale and S. Krishnan. CHARM++: a portable concurrent object oriented system based on C++. ACM (1993).
G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1998.
T. Kielmann, R. F. Hofman, H. E. Bal, A. Plaat, and R. A. Bhoedjang. MagPIe: MPI's collective communication operations for clustered wide area systems. In Proceedings of the 7th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Atlanta, GA. Citeseer (1999).
J. Lee and J. Palsberg. Featherweight X10: a core calculus for async-finish parallelism. In PPoPP, pp. 25–36 (2010).
C. Leiserson. The Cilk++ concurrency platform. In DAC, pp. 522–527 (2009).
A. Muddukrishna, P. A. Jonsson, V. Vlassov, and M. Brorsson. Locality-aware task scheduling and data distribution on NUMA systems. In OpenMP in the Era of Low Power Devices and Accelerators, pp. 156–170. Springer (2013).
L. L. Pilla, C. P. Ribeiro, D. Cordeiro, A. Bhatele, P. O. Navaux, J.-F. Méhaut, L. V. Kalé, et al. Improving parallel system performance with a NUMA-aware load balancer. TR-JLPC-11-02 (2011).
J.-N. Quintin and F. Wagner. Hierarchical work-stealing. In EuroPar, pp. 217–229 (2010).
J. Reinders. Intel threading building blocks. Intel (2007).
M. Shaheen and R. Strzodka. NUMA aware iterative stencil computations on many-core systems. In IPDPS, pp. 461–473 (2012).
S. Sridharan, G. Gupta, and G. S. Sohi. Holistic run-time parallelism management for time and energy efficiency. In ICS, pp. 337–348 (2013).
B. Vikranth, R. Wankar, and C. R. Rao. Topology aware task stealing for on-chip NUMA multi-core processors. Procedia Computer Science, 18:379–388, 2013.
R. Yang, J. Antony, A. Rendell, D. Robson, and P. Strazdins. Profiling directed NUMA optimization on Linux systems: A case study of the Gaussian computational chemistry code. In Proceedings of the International Parallel and Distributed Processing Symposium, pp. 1046–1057, Anchorage, Alaska, USA. IEEE (2011).
R. M. Yoo, C. J. Hughes, C. Kim, Y.-K. Chen, and C. Kozyrakis. Locality-aware task management for unstructured parallelism: a quantitative limit study. In SPAA, pp. 315–325 (2013).
R. Van Nieuwpoort, T. Kielmann, and H. E. Bal. Efficient load balancing for wide-area divide-and-conquer applications. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Citeseer (2001).

Author information

Authors and Affiliations

Shanghai Jiao Tong University, Shanghai, China
Quan Chen & Minyi Guo

Corresponding author

Correspondence to Quan Chen.

About this chapter

Cite this chapter

Chen, Q., Guo, M. (2017). Work-Stealing for NUMA-enabled Architecture. In: Task Scheduling for Multi-core and Parallel Architectures. Springer, Singapore. https://doi.org/10.1007/978-981-10-6238-4_4

DOI: https://doi.org/10.1007/978-981-10-6238-4_4
Published: 25 November 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6237-7
Online ISBN: 978-981-10-6238-4
eBook Packages: Computer Science, Computer Science (R0)