Work-Stealing for NUMA-enabled Architecture

Abstract

Modern mainstream high-performance computers adopt not only a multi-socket multi-core CPU architecture but also a Non-Uniform Memory Access (NUMA)-based memory architecture. Although the CAB scheduler introduced in Chap. 3 effectively improves shared cache utilization, it still incurs many remote memory accesses on these machines, which significantly degrades the performance of memory-bound applications. To solve this problem, this chapter introduces scheduling techniques that make better use of both the shared caches in CPUs and the NUMA-based memory system.

Part of the content of this chapter has been published in ACM Transactions on Architecture and Code Optimization. Reprinted from Ref. [31], with permission from ACM. Figures 4.1, 4.5, 4.7, 4.8 and 4.9 in this chapter have been published in ACM Transactions on Architecture and Code Optimization. Reprinted from Ref. [31], with permission from ACM.

References

  1. U. Acar, G. Blelloch, and R. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3):321–347 (2002).

  2. AMD. BIOS and Kernel Developer Guide (BKDG) For AMD Family 10h Processors. AMD (2010).

  3. E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, and G. Zhang. The design of OpenMP tasks. IEEE TPDS, 20(3):404–418 (2009).

  4. R. D. Blumofe. Executing Multithreaded Programs Efficiently. Ph.D. thesis, MIT (September 1995).

  5. R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55–69 (1996).

  6. M. Castro, L. G. Fernandes, C. Pousa, J.-F. Méhaut, and M. S. de Aguiar. NUMA-ICTM: A parallel version of ICTM exploiting memory placement strategies for NUMA machines. In IPDPS, pp. 1–8 (2009).

  7. Q. Chen and M. Guo. Adaptive workload aware task scheduling for single-ISA multi-core architectures. ACM Transactions on Architecture and Code Optimization, 11(1) (2014).

  8. Q. Chen, Y. Chen, Z. Huang, and M. Guo. WATS: Workload-aware task scheduling in asymmetric multi-core architectures. In IPDPS, pp. 249–260 (2012).

  9. Q. Chen, M. Guo, and Z. Huang. CATS: Cache aware task-stealing based on online profiling in multi-socket multi-core architectures. In ICS, pp. 163–172 (2012).

  10. Q. Chen, Z. Huang, M. Guo, and J. Zhou. CAB: Cache-aware bi-tier task-stealing in multi-socket multi-core architecture. In ICPP, pp. 722–732 (2011).

  11. Q. Chen and M. Guo. Locality-aware work stealing based on online profiling and auto-tuning for multisocket multicore architectures. ACM Transactions on Architecture and Code Optimization, 12(2):22 (2015).

  12. R. Cole and V. Ramachandran. Analysis of randomized work stealing with false sharing. In IPDPS, pp. 985–989 (2013).

  13. M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In PLDI, pp. 212–223 (1998).

  14. T. Gautier, J. V. Lima, N. Maillard, and B. Raffin. XKaapi: A runtime system for data-flow task programming on heterogeneous architectures. In IPDPS, pp. 1299–1308 (2013).

  15. T. Gautier, J. V. F. Lima, N. Maillard, B. Raffin, et al. Locality-aware work stealing on Multi-CPU and Multi-GPU architectures. In MULTIPROG (2013).

  16. A. Gerasoulis and T. Yang. A comparison of clustering heuristics for scheduling directed acyclic graphs on multiprocessors. Journal of Parallel and Distributed Computing, 16(4):276–291 (1992).

  17. Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-first and help-first scheduling policies for async-finish task parallelism. In IPDPS, pp. 1–12 (2009).

  18. Y. Guo, J. Zhao, V. Cave, and V. Sarkar. SLAW: a scalable locality-aware adaptive work-stealing scheduler. In IPDPS, pp. 1–12 (2010).

  19. L. V. Kale and S. Krishnan. CHARM++: a portable concurrent object oriented system based on C++. ACM (1993).

  20. G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392 (1998).

  21. T. Kielmann, R. F. Hofman, H. E. Bal, A. Plaat, and R. A. Bhoedjang. MagPIe: MPI's collective communication operations for clustered wide area systems. In Proceedings of the 7th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Atlanta, GA. Citeseer (1999).

  22. J. Lee and J. Palsberg. Featherweight X10: a core calculus for async-finish parallelism. In PPoPP, pp. 25–36 (2010).

  23. C. Leiserson. The Cilk++ concurrency platform. In DAC, pp. 522–527 (2009).

  24. A. Muddukrishna, P. A. Jonsson, V. Vlassov, and M. Brorsson. Locality-aware task scheduling and data distribution on NUMA systems. In OpenMP in the Era of Low Power Devices and Accelerators, pp. 156–170. Springer (2013).

  25. L. L. Pilla, C. P. Ribeiro, D. Cordeiro, A. Bhatele, P. O. Navaux, J.-F. Méhaut, L. V. Kalé, et al. Improving parallel system performance with a NUMA-aware load balancer. TR-JLPC-11-02 (2011).

  26. J.-N. Quintin and F. Wagner. Hierarchical work-stealing. In EuroPar, pp. 217–229 (2010).

  27. J. Reinders. Intel threading building blocks. Intel (2007).

  28. M. Shaheen and R. Strzodka. NUMA aware iterative stencil computations on many-core systems. In IPDPS, pp. 461–473 (2012).

  29. S. Sridharan, G. Gupta, and G. S. Sohi. Holistic run-time parallelism management for time and energy efficiency. In ICS, pp. 337–348 (2013).

  30. B. Vikranth, R. Wankar, and C. R. Rao. Topology aware task stealing for on-chip NUMA multi-core processors. Procedia Computer Science, 18:379–388 (2013).

  31. R. Yang, J. Antony, A. Rendell, D. Robson, and P. Strazdins. Profiling directed NUMA optimization on Linux systems: A case study of the Gaussian computational chemistry code. In Proceedings of the International Parallel and Distributed Processing Symposium, pp. 1046–1057, Anchorage, Alaska, USA. IEEE (2011).

  32. R. M. Yoo, C. J. Hughes, C. Kim, Y.-K. Chen, and C. Kozyrakis. Locality-aware task management for unstructured parallelism: a quantitative limit study. In SPAA, pp. 315–325 (2013).

  33. R. Van Nieuwpoort, T. Kielmann, and H. E. Bal. Efficient load balancing for wide-area divide-and-conquer applications. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Citeseer (2001).

Author information

Corresponding author

Correspondence to Quan Chen.

Copyright information

© 2017 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Chen, Q., Guo, M. (2017). Work-Stealing for NUMA-enabled Architecture. In: Task Scheduling for Multi-core and Parallel Architectures. Springer, Singapore. https://doi.org/10.1007/978-981-10-6238-4_4

  • DOI: https://doi.org/10.1007/978-981-10-6238-4_4

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-6237-7

  • Online ISBN: 978-981-10-6238-4

  • eBook Packages: Computer Science (R0)
