Work-Stealing for Multi-socket Architecture

Chen, Quan; Guo, Minyi

doi:10.1007/978-981-10-6238-4_3

Abstract

In this chapter, we discuss emerging dynamic task scheduling policies that improve the performance of parallel applications on multi-socket architectures. Current multi-core computers often adopt a multi-socket multi-core architecture in which the cores of each socket share a cache. Traditional task scheduling policies such as work-stealing tend to pollute these shared caches and incur additional cache misses. Because work-stealing otherwise performs well, we use the traditional random work-stealing policy as the baseline in this chapter. To relieve the cache pollution problem, we present a Cache-Aware Bi-tier work-stealing (CAB) policy. CAB improves the performance of memory-bound applications by reducing the memory footprint and the cache misses of tasks running inside the same CPU socket. CAB adaptively uses a task graph partitioner to divide an execution task graph into an inter-socket tier and an intra-socket tier. Tasks in the inter-socket tier are scheduled across sockets, while tasks in the intra-socket tier are scheduled within the same socket. Experimental results show that CAB can significantly improve the performance of memory-bound applications compared with the traditional random work-stealing policy.
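
The bi-tier idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the partitioner cuts the task graph, so the depth-based rule below (`boundary`), the names `Task`, `Worker`, `tier_of`, `push` and `next_task`, and the order in which a worker looks for work are all assumptions made for illustration. Victims are also pre-filtered to non-empty pools, which a real random work-stealing runtime would not do.

```python
import random
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    depth: int  # level in the task graph; the root has depth 0


@dataclass
class Worker:
    wid: int
    socket: int
    inter: deque = field(default_factory=deque)  # inter-socket tier tasks
    intra: deque = field(default_factory=deque)  # intra-socket tier tasks


def tier_of(task, boundary):
    """Hypothetical partition rule: shallow tasks form the inter-socket tier."""
    return "inter" if task.depth <= boundary else "intra"


def push(worker, task, boundary):
    """Place a new task in the pool that matches its tier."""
    pool = worker.inter if tier_of(task, boundary) == "inter" else worker.intra
    pool.append(task)


def next_task(worker, workers, rng):
    """Pick the next task for an idle worker, preferring socket-local work."""
    # 1. Own intra-socket tasks, newest first.
    if worker.intra:
        return worker.intra.pop()
    # 2. Steal an intra-socket task, but only from a worker in the same socket.
    peers = [w for w in workers
             if w is not worker and w.socket == worker.socket and w.intra]
    if peers:
        return rng.choice(peers).intra.popleft()
    # 3. Own inter-socket tasks.
    if worker.inter:
        return worker.inter.pop()
    # 4. Steal an inter-socket task from any worker, possibly in another socket.
    victims = [w for w in workers if w is not worker and w.inter]
    if victims:
        return rng.choice(victims).inter.popleft()
    return None
```

With two sockets of two workers each, a deep task pushed by a worker in socket 0 can be stolen only by its socket-0 peer, whereas a shallow task in the inter-socket tier can migrate to socket 1. Keeping the deep tasks inside one socket is what lets them reuse data already in that socket's shared cache.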

Parts of this chapter, including Figures 3.1, 3.2, 3.6 and 3.8, have been published in IEEE Transactions on Parallel and Distributed Systems. Reprinted from Ref. [15], with permission from IEEE.

Notes

1. All the programs mentioned below are memory-bound divide-and-conquer parallel programs.

References

1. U. Acar, G. Blelloch, and R. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3):321–347, 2002.
2. E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, and G. Zhang. The design of OpenMP tasks. IEEE Transactions on Parallel and Distributed Systems, 20(3):404–418, 2009.
3. R. Azimi, M. Stumm, and R. Wisniewski. Online performance analysis by statistical sampling of microprocessor performance counters. In Proceedings of the 19th annual international conference on Supercomputing, pages 101–110. ACM, 2005.
4. M. Berger and J. Oliger. Adaptive mesh refinement for hyperbolic partial differential equations. Journal of Computational Physics, 53(3):484–512, 1984.
5. G. Blelloch, R. Chowdhury, P. Gibbons, V. Ramachandran, S. Chen, and M. Kozuch. Provably good multicore cache performance for divide-and-conquer algorithms. In Proceedings of the 19th annual ACM-SIAM symposium on Discrete algorithms, pages 501–510. Society for Industrial and Applied Mathematics, 2008.
6. G. Blelloch, J. Fineman, P. Gibbons, and H. V. Simhadri. Scheduling irregular parallel computations on hierarchical caches. In Proceedings of the 20th ACM Symposium on Parallel Algorithms and Architectures, San Jose, California, June 2011.
7. G. Blelloch, P. Gibbons, and H. Simhadri. Low depth cache-oblivious algorithms. In Proceedings of the 22nd ACM symposium on Parallelism in algorithms and architectures, pages 189–199. ACM, 2010.
8. R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55–69, Aug. 1996.
9. R. D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Sept. 1995. MIT Laboratory for Computer Science Technical Report MIT/LCS/TR-677.
10. D. Butenhof. Programming with POSIX threads. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1997.
11. D. Chase and Y. Lev. Dynamic circular work-stealing deque. In Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures, page 28. ACM, 2005.
12. S. Chen, P. Gibbons, M. Kozuch, V. Liaskovitis, A. Ailamaki, G. Blelloch, B. Falsafi, L. Fix, N. Hardavellas, T. Mowry, et al. Scheduling threads for constructive cache sharing on CMPs. In Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, pages 105–115. ACM, 2007.
13. Q. Chen, M. Guo, and Z. Huang. Cats: Cache aware task-stealing based on online profiling in multi-socket multi-core architectures. In the 26th International Conference on Supercomputing, pages 163–172. IEEE, 2012.
14. Q. Chen, Z. Huang, M. Guo, and J. Zhou. CAB: Cache-aware Bi-tier task-stealing in Multi-socket Multi-core architecture. In the 40th International Conference on Parallel Processing, pages 722–732, 2011.
15. Q. Chen, M. Guo, and Z. Huang. Adaptive cache aware bi-tier work-stealing in multi-socket multi-core architectures. IEEE Transactions on Parallel and Distributed Systems, 24(12):2334–2343, 2013.
16. R. Cole and V. Ramachandran. Analysis of Randomized Work Stealing with False Sharing. ArXiv e-prints, Mar. 2011.
17. X. Ding, K. Wang, and X. Zhang. ULCC: a user-level facility for optimizing shared cache performance on multicores. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 103–112, 2011.
18. M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In the 40th Annual Symposium on Foundations of Computer Science, pages 285–297, New York, USA, 1999. IEEE.
19. A. Gerasoulis and T. Yang. A comparison of clustering heuristics for scheduling directed acyclic graphs on multiprocessors. Journal of Parallel and Distributed Computing, 16(4):276–291, 1992.
20. W. Gropp, E. Lusk, and A. Skjellum. Using MPI: portable parallel programming with the message passing interface. MIT Press, 1999.
21. Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-first and help-first scheduling policies for async-finish task parallelism. In the 23rd IEEE International Parallel and Distributed Processing Symposium, pages 1–12. IEEE, 2009.
22. Y. Guo, J. Zhao, V. Cave, and V. Sarkar. Slaw: a scalable locality-aware adaptive work-stealing scheduler. In the 24th IEEE International Parallel and Distributed Processing Symposium, pages 1–12. IEEE, 2010.
23. D. Hendler and N. Shavit. Non-blocking steal-half work queues. In Proceedings of the 21st annual symposium on Principles of distributed computing, pages 280–289. ACM, 2002.
24. D. Hendler, Y. Lev, M. Moir, and N. Shavit. A dynamic-sized nonblocking work stealing deque. Sun Microsystems, Inc. Technical Reports; Vol. SERIES13103, page 69, 2005.
25. D. Lea. A Java fork/join framework. In Proceedings of the ACM 2000 conference on Java Grande, pages 36–43. ACM, 2000.
26. J. Lee and J. Palsberg. Featherweight X10: a core calculus for async-finish parallelism. In Proceedings of the 15th ACM SIGPLAN symposium on Principles and practice of parallel computing, pages 25–36. ACM, 2010.
27. D. Leijen, W. Schulte, and S. Burckhardt. The design of a task parallel library. ACM SIGPLAN Notices, 44(10):227–242, 2009.
28. C. Leiserson. The Cilk++ concurrency platform. In Proceedings of the 46th Annual Design Automation Conference, pages 522–527. ACM, 2009.
29. M. M. Michael, M. T. Vechev, and V. A. Saraswat. Idempotent work stealing. In Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 45–54. ACM, 2009.
30. S. L. Olivier, A. K. Porterfield, K. B. Wheeler, and J. F. Prins. Scheduling task parallelism on multi-socket multicore systems. In Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers, pages 49–56, Tucson, Arizona, 2011. ACM.
31. J.-N. Quintin and F. Wagner. Hierarchical work-stealing. In Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I, pages 217–229. Springer-Verlag, 2010.
32. J. Reinders. Intel threading building blocks. O’Reilly, 2007.
33. R. Van Nieuwpoort, T. Kielmann, and H. E. Bal. Efficient load balancing for wide-area divide-and-conquer applications. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Citeseer, 2001.
34. L. Wang, H. Cui, Y. Duan, F. Lu, X. Feng, and P. Yew. An adaptive task creation strategy for work-stealing scheduling. In Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, pages 266–277. ACM, 2010.
35. J. Zhang, Z. Huang, W. Chen, Q. Huang, and W. Zheng. Maotai: View-Oriented Parallel Programming on CMT processors. In 37th International Conference on Parallel Processing, pages 636–643. IEEE, 2008.

Author information

Authors and Affiliations

Shanghai Jiao Tong University, Shanghai, China
Quan Chen & Minyi Guo

Corresponding author

Correspondence to Quan Chen.

About this chapter

Cite this chapter

Chen, Q., Guo, M. (2017). Work-Stealing for Multi-socket Architecture. In: Task Scheduling for Multi-core and Parallel Architectures. Springer, Singapore. https://doi.org/10.1007/978-981-10-6238-4_3

DOI: https://doi.org/10.1007/978-981-10-6238-4_3
Published: 25 November 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6237-7
Online ISBN: 978-981-10-6238-4
eBook Packages: Computer Science (R0)
