Work-Stealing for Multi-socket Architecture

Chapter in: Task Scheduling for Multi-core and Parallel Architectures

Abstract

In this chapter, we discuss emerging dynamic task scheduling policies that can improve the performance of parallel applications on multi-socket architectures. Current multi-core computers often adopt a multi-socket multi-core architecture with shared caches in each socket. On such systems, traditional task scheduling policies such as work-stealing tend to pollute the shared caches and incur more cache misses. To relieve this problem, we present a Cache-Aware Bi-tier work-stealing (CAB) policy. CAB improves the performance of memory-bound applications by reducing the memory footprint and cache misses of tasks running inside the same CPU socket. CAB adaptively uses a task graph partitioner to divide an execution task graph into an inter-socket tier and an intra-socket tier. Tasks in the inter-socket tier are scheduled across sockets, while tasks in the intra-socket tier are scheduled within the same socket. Because work-stealing generally performs well, we use the traditional random work-stealing policy as the baseline in this chapter. Experimental results show that CAB can significantly improve the performance of memory-bound applications compared with this baseline.
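The bi-tier idea described above can be illustrated with a minimal sketch. It is not the chapter's implementation: the `Task`, `Socket`, `tier_of`, `push` and `next_task` names are hypothetical, the partition rule (a fixed `boundary_level` on task depth) is an assumption standing in for CAB's adaptive task graph partitioner, and victims are scanned in order rather than chosen at random to keep the example deterministic.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    level: int  # depth in the task graph, root = 0


@dataclass
class Socket:
    sid: int
    # Tasks in the inter-socket tier: other sockets may steal these.
    inter_pool: deque = field(default_factory=deque)
    # Tasks in the intra-socket tier: these never leave the socket.
    intra_pool: deque = field(default_factory=deque)


def tier_of(task, boundary_level):
    """Assumed partition rule: shallow tasks form the inter-socket tier."""
    return "inter" if task.level <= boundary_level else "intra"


def push(socket, task, boundary_level):
    """Place a newly spawned task in the pool that matches its tier."""
    if tier_of(task, boundary_level) == "inter":
        socket.inter_pool.append(task)
    else:
        socket.intra_pool.append(task)


def next_task(sockets, my_sid):
    """Pick work for a core on socket my_sid (sockets is indexed by sid)."""
    me = sockets[my_sid]
    if me.intra_pool:
        return me.intra_pool.pop()  # newest local intra-socket task first
    if me.inter_pool:
        return me.inter_pool.pop()
    # Only inter-socket-tier tasks may cross a socket boundary.
    for other in sockets:
        if other.sid != my_sid and other.inter_pool:
            return other.inter_pool.popleft()  # steal the oldest task
    return None


sockets = [Socket(0), Socket(1)]
for t in (Task("root", 0), Task("child", 1), Task("leaf", 3)):
    push(sockets[0], t, boundary_level=1)

stolen = [next_task(sockets, 1) for _ in range(3)]
stolen_names = [t.name if t else None for t in stolen]
local = next_task(sockets, 0)
```

In this sketch, the idle socket 1 can take `root` and `child` from socket 0 but never `leaf`, which stays with the cores that share socket 0's cache. Keeping the deep, data-sharing tasks inside one socket is the property the abstract attributes to the intra-socket tier.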

Parts of this chapter, including Figures 3.1, 3.2, 3.6 and 3.8, have been published in IEEE Transactions on Parallel and Distributed Systems. Reprinted from Ref. [15], with permission from IEEE.

Notes

  1. All the programs mentioned below are memory-bound divide-and-conquer parallel programs.

References

  1. U. Acar, G. Blelloch, and R. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3):321–347, 2002.

  2. E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, and G. Zhang. The design of OpenMP tasks. IEEE Transactions on Parallel and Distributed Systems, 20(3):404–418, 2009.

  3. R. Azimi, M. Stumm, and R. Wisniewski. Online performance analysis by statistical sampling of microprocessor performance counters. In Proceedings of the 19th annual international conference on Supercomputing, pages 101–110. ACM, 2005.

  4. M. Berger and J. Oliger. Adaptive mesh refinement for hyperbolic partial differential equations. Journal of Computational Physics, 53(3):484–512, 1984.

  5. G. Blelloch, R. Chowdhury, P. Gibbons, V. Ramachandran, S. Chen, and M. Kozuch. Provably good multicore cache performance for divide-and-conquer algorithms. In Proceedings of the 19th annual ACM-SIAM symposium on Discrete algorithms, pages 501–510. Society for Industrial and Applied Mathematics, 2008.

  6. G. Blelloch, J. Fineman, P. Gibbons, and H. V. Simhadri. Scheduling irregular parallel computations on hierarchical caches. In Proceedings of the 20th ACM Symposium on Parallel Algorithms and Architectures, San Jose, California, June 2011.

  7. G. Blelloch, P. Gibbons, and H. Simhadri. Low depth cache-oblivious algorithms. In Proceedings of the 22nd ACM symposium on Parallelism in algorithms and architectures, pages 189–199. ACM, 2010.

  8. R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55–69, Aug. 1996.

  9. R. D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Sept. 1995. MIT Laboratory for Computer Science Technical Report MIT/LCS/TR-677.

  10. D. Butenhof. Programming with POSIX threads. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1997.

  11. D. Chase and Y. Lev. Dynamic circular work-stealing deque. In Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures, page 28. ACM, 2005.

  12. S. Chen, P. Gibbons, M. Kozuch, V. Liaskovitis, A. Ailamaki, G. Blelloch, B. Falsafi, L. Fix, N. Hardavellas, T. Mowry, et al. Scheduling threads for constructive cache sharing on CMPs. In Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, pages 105–115. ACM, 2007.

  13. Q. Chen, M. Guo, and Z. Huang. Cats: Cache aware task-stealing based on online profiling in multi-socket multi-core architectures. In the 26th International Conference on Supercomputing, pages 163–172. IEEE, 2012.

  14. Q. Chen, Z. Huang, M. Guo, and J. Zhou. CAB: Cache-aware Bi-tier task-stealing in Multi-socket Multi-core architecture. In the 40th International Conference on Parallel Processing, pages 722–732, 2011.

  15. Q. Chen, M. Guo, and Z. Huang. Adaptive cache aware bi-tier work-stealing in multi-socket multi-core architectures. IEEE Transactions on Parallel and Distributed Systems, 24(12):2334–2343, 2013.

  16. R. Cole and V. Ramachandran. Analysis of Randomized Work Stealing with False Sharing. ArXiv e-prints, Mar. 2011.

  17. X. Ding, K. Wang, and X. Zhang. ULCC: a user-level facility for optimizing shared cache performance on multicores. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 103–112, 2011.

  18. M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In the 40th Annual Symposium on Foundations of Computer Science, pages 285–297, New York, USA, 1999. IEEE.

  19. A. Gerasoulis and T. Yang. A comparison of clustering heuristics for scheduling directed acyclic graphs on multiprocessors. Journal of Parallel and Distributed Computing, 16(4):276–291, 1992.

  20. W. Gropp, E. Lusk, and A. Skjellum. Using MPI: portable parallel programming with the message passing interface. MIT Press, 1999.

  21. Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-first and help-first scheduling policies for async-finish task parallelism. In the 23rd IEEE International Parallel and Distributed Processing Symposium, pages 1–12. IEEE, 2009.

  22. Y. Guo, J. Zhao, V. Cave, and V. Sarkar. Slaw: a scalable locality-aware adaptive work-stealing scheduler. In the 24th IEEE International Parallel and Distributed Processing Symposium, pages 1–12. IEEE, 2010.

  23. D. Hendler and N. Shavit. Non-blocking steal-half work queues. In Proceedings of the 21st annual symposium on Principles of distributed computing, pages 280–289. ACM, 2002.

  24. D. Hendler, Y. Lev, M. Moir, and N. Shavit. A dynamic-sized nonblocking work stealing deque. Sun Microsystems, Inc. Technical Reports; Vol. SERIES13103, page 69, 2005.

  25. D. Lea. A Java fork/join framework. In Proceedings of the ACM 2000 conference on Java Grande, pages 36–43. ACM, 2000.

  26. J. Lee and J. Palsberg. Featherweight X10: a core calculus for async-finish parallelism. In Proceedings of the 15th ACM SIGPLAN symposium on Principles and practice of parallel computing, pages 25–36. ACM, 2010.

  27. D. Leijen, W. Schulte, and S. Burckhardt. The design of a task parallel library. ACM SIGPLAN Notices, 44(10):227–242, 2009.

  28. C. Leiserson. The Cilk++ concurrency platform. In Proceedings of the 46th Annual Design Automation Conference, pages 522–527. ACM, 2009.

  29. M. M. Michael, M. T. Vechev, and V. A. Saraswat. Idempotent work stealing. In Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 45–54. ACM, 2009.

  30. S. L. Olivier, A. K. Porterfield, K. B. Wheeler, and J. F. Prins. Scheduling task parallelism on multi-socket multicore systems. In Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers, pages 49–56, Tucson, Arizona, 2011. ACM.

  31. J.-N. Quintin and F. Wagner. Hierarchical work-stealing. In Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I, pages 217–229. Springer-Verlag, 2010.

  32. J. Reinders. Intel threading building blocks. O’Reilly, 2007.

  33. R. Van Nieuwpoort, T. Kielmann, and H. E. Bal. Efficient load balancing for wide-area divide-and-conquer applications. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Citeseer, 2001.

  34. L. Wang, H. Cui, Y. Duan, F. Lu, X. Feng, and P. Yew. An adaptive task creation strategy for work-stealing scheduling. In Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, pages 266–277. ACM, 2010.

  35. J. Zhang, Z. Huang, W. Chen, Q. Huang, and W. Zheng. Maotai: View-Oriented Parallel Programming on CMT processors. In 37th International Conference on Parallel Processing, pages 636–643. IEEE, 2008.

Author information

Corresponding author

Correspondence to Quan Chen.

Copyright information

© 2017 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Chen, Q., Guo, M. (2017). Work-Stealing for Multi-socket Architecture. In: Task Scheduling for Multi-core and Parallel Architectures. Springer, Singapore. https://doi.org/10.1007/978-981-10-6238-4_3

  • DOI: https://doi.org/10.1007/978-981-10-6238-4_3

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-6237-7

  • Online ISBN: 978-981-10-6238-4

  • eBook Packages: Computer Science, Computer Science (R0)
