Abstract
Powerful modern mainstream computers adopt not only a multi-socket multi-core CPU architecture but also a Non-Uniform Memory Access (NUMA)-based memory architecture. Although the CAB scheduler introduced in Chap. 3 effectively improves shared cache utilization, it still incurs severe remote memory accesses on these computers, which significantly degrade the performance of memory-bound applications. To solve this problem, this chapter introduces scheduling techniques that better utilize both the shared cache in CPUs and the NUMA-based memory system.
Part of the content of this chapter, including Figures 4.1, 4.5, 4.7, 4.8 and 4.9, has been published in ACM Transactions on Architecture and Code Optimization. Reprinted from Ref. [31], with permission from ACM.
References
U. Acar, G. Blelloch, and R. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3):321–347, 2002.
AMD. BIOS and Kernel Developer Guide (BKDG) For AMD Family 10h Processors. AMD (2010).
E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, and G. Zhang. The design of OpenMP tasks. IEEE TPDS, 20(3):404–418, 2009.
R. D. Blumofe. Executing Multithreaded Programs Efficiently. Ph.D. thesis, MIT, September 1995.
R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55–69, 1996.
M. Castro, L. G. Fernandes, C. Pousa, J.-F. Méhaut, and M. S. de Aguiar. NUMA-ICTM: A parallel version of ICTM exploiting memory placement strategies for NUMA machines. In IPDPS, pp. 1–8 (2009).
Q. Chen and M. Guo. Adaptive workload aware task scheduling for single-ISA multi-core architectures. ACM Transactions on Architecture and Code Optimization, 11(1) (2014).
Q. Chen, Y. Chen, Z. Huang, and M. Guo. WATS: Workload-aware task scheduling in asymmetric multi-core architectures. In IPDPS, pp. 249–260 (2012).
Q. Chen, M. Guo, and Z. Huang. CATS: Cache aware task-stealing based on online profiling in multi-socket multi-core architectures. In ICS, pp. 163–172 (2012).
Q. Chen, Z. Huang, M. Guo, and J. Zhou. CAB: Cache-aware bi-tier task-stealing in multi-socket multi-core architecture. In ICPP, pp. 722–732 (2011).
Q. Chen and M. Guo. Locality-aware work stealing based on online profiling and auto-tuning for multisocket multicore architectures. ACM Transactions on Architecture and Code Optimization, 12(2):22, 2015.
R. Cole and V. Ramachandran. Analysis of randomized work stealing with false sharing. In IPDPS, pp. 985–989 (2013).
M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In PLDI, pp. 212–223 (1998).
T. Gautier, J. V. Lima, N. Maillard, and B. Raffin. XKaapi: A runtime system for data-flow task programming on heterogeneous architectures. In IPDPS, pp. 1299–1308 (2013).
T. Gautier, J. V. F. Lima, N. Maillard, B. Raffin, et al. Locality-aware work stealing on Multi-CPU and Multi-GPU architectures. In MULTIPROG (2013).
A. Gerasoulis and T. Yang. A comparison of clustering heuristics for scheduling directed acyclic graphs on multiprocessors. Journal of Parallel and Distributed Computing, 16(4):276–291, 1992.
Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-first and help-first scheduling policies for async-finish task parallelism. In IPDPS, pp. 1–12 (2009).
Y. Guo, J. Zhao, V. Cave, and V. Sarkar. SLAW: a scalable locality-aware adaptive work-stealing scheduler. In IPDPS, pp. 1–12 (2010).
L. V. Kale and S. Krishnan. CHARM++: a portable concurrent object oriented system based on C++. ACM (1993).
G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1998.
T. Kielmann, R. F. Hofman, H. E. Bal, A. Plaat, and R. A. Bhoedjang. MagPIe: MPI's collective communication operations for clustered wide area systems. In Proceedings of the 7th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Atlanta, GA. Citeseer (1999).
J. Lee and J. Palsberg. Featherweight X10: a core calculus for async-finish parallelism. In PPoPP, pp. 25–36 (2010).
C. Leiserson. The Cilk++ concurrency platform. In DAC, pp. 522–527 (2009).
A. Muddukrishna, P. A. Jonsson, V. Vlassov, and M. Brorsson. Locality-aware task scheduling and data distribution on NUMA systems. In OpenMP in the Era of Low Power Devices and Accelerators, pp. 156–170. Springer (2013).
L. L. Pilla, C. P. Ribeiro, D. Cordeiro, A. Bhatele, P. O. Navaux, J.-F. Méhaut, L. V. Kalé, et al. Improving parallel system performance with a NUMA-aware load balancer. TR-JLPC-11-02 (2011).
J.-N. Quintin and F. Wagner. Hierarchical work-stealing. In EuroPar, pp. 217–229 (2010).
J. Reinders. Intel threading building blocks. Intel (2007).
M. Shaheen and R. Strzodka. NUMA aware iterative stencil computations on many-core systems. In IPDPS, pp. 461–473 (2012).
S. Sridharan, G. Gupta, and G. S. Sohi. Holistic run-time parallelism management for time and energy efficiency. In ICS, pp. 337–348 (2013).
B. Vikranth, R. Wankar, and C. R. Rao. Topology aware task stealing for on-chip NUMA multi-core processors. Procedia Computer Science, 18:379–388, 2013.
R. Yang, J. Antony, A. Rendell, D. Robson, and P. Strazdins. Profiling directed NUMA optimization on Linux systems: A case study of the Gaussian computational chemistry code. In Proceedings of the International Parallel and Distributed Processing Symposium, pp. 1046–1057, Anchorage, Alaska, USA. IEEE (2011).
R. M. Yoo, C. J. Hughes, C. Kim, Y.-K. Chen, and C. Kozyrakis. Locality-aware task management for unstructured parallelism: a quantitative limit study. In SPAA, pp. 315–325 (2013).
R. Van Nieuwpoort, T. Kielmann, and H. E. Bal. Efficient load balancing for wide-area divide-and-conquer applications. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Citeseer (2001).
Copyright information
© 2017 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Chen, Q., Guo, M. (2017). Work-Stealing for NUMA-enabled Architecture. In: Task Scheduling for Multi-core and Parallel Architectures. Springer, Singapore. https://doi.org/10.1007/978-981-10-6238-4_4
DOI: https://doi.org/10.1007/978-981-10-6238-4_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6237-7
Online ISBN: 978-981-10-6238-4
eBook Packages: Computer Science, Computer Science (R0)