Abstract
In this chapter, we discuss emerging dynamic task scheduling policies that can improve the performance of parallel applications on multi-socket architectures. Current multi-core computers often adopt a multi-socket multi-core architecture with a shared cache in each socket. However, traditional task scheduling policies (for example, work-stealing) tend to pollute the shared cache and incur more cache misses. Because work-stealing otherwise performs well, we use the traditional random work-stealing policy as the baseline in this chapter. To relieve the cache pollution problem, we present a Cache-Aware Bi-tier work-stealing (CAB) policy. CAB improves the performance of memory-bound applications by reducing the memory footprint and cache misses of tasks running inside the same CPU socket. CAB adaptively uses a task graph partitioner to divide an execution task graph into an inter-socket tier and an intra-socket tier. Tasks in the inter-socket tier are scheduled across sockets, while tasks in the intra-socket tier are scheduled within the same socket. Experimental results show that CAB can significantly improve the performance of memory-bound applications compared with the traditional random work-stealing policy.
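The bi-tier idea can be illustrated with a minimal sketch. It is not the CAB implementation: it assumes a fixed depth threshold (`boundary_depth`) in place of the adaptive task graph partitioner, and the class names, the per-socket pool, and the order in which a worker looks for work are all hypothetical choices made for illustration. The property it demonstrates is the one stated in the abstract: intra-socket tier tasks never leave their socket, while inter-socket tier tasks may be taken by another socket.

```python
import random
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    depth: int  # distance from the root of the execution task graph


@dataclass
class Worker:
    wid: int
    socket: int
    intra: deque = field(default_factory=deque)  # intra-socket tier tasks


class BiTierScheduler:
    """Hypothetical two-tier work-stealing sketch (not the authors' code)."""

    def __init__(self, sockets, cores_per_socket, boundary_depth, seed=0):
        # Assumption: a fixed depth threshold stands in for the partitioner.
        self.boundary_depth = boundary_depth
        self.rng = random.Random(seed)
        self.workers = [
            Worker(s * cores_per_socket + c, s)
            for s in range(sockets)
            for c in range(cores_per_socket)
        ]
        # Assumption: one pool of inter-socket tier tasks per socket.
        self.inter = [deque() for _ in range(sockets)]

    def tier(self, task):
        return "inter" if task.depth < self.boundary_depth else "intra"

    def push(self, worker, task):
        if self.tier(task) == "inter":
            self.inter[worker.socket].append(task)
        else:
            worker.intra.append(task)

    def next_task(self, worker):
        # 1. Newest task from the worker's own deque.
        if worker.intra:
            return worker.intra.pop()
        # 2. Steal the oldest intra-tier task from a random same-socket peer.
        peers = [w for w in self.workers
                 if w.socket == worker.socket and w is not worker and w.intra]
        if peers:
            return self.rng.choice(peers).intra.popleft()
        # 3. Take an inter-tier task from the worker's own socket.
        if self.inter[worker.socket]:
            return self.inter[worker.socket].pop()
        # 4. Steal an inter-tier task from another socket.
        victims = [s for s in range(len(self.inter))
                   if s != worker.socket and self.inter[s]]
        if victims:
            return self.inter[self.rng.choice(victims)].popleft()
        return None  # intra-tier tasks of other sockets are never considered
```

In this sketch a worker on a remote socket can obtain only inter-socket tier tasks, so the data touched by the intra-socket tier tasks of one subtree stays in a single shared cache.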
Parts of this chapter, including Figures 3.1, 3.2, 3.6 and 3.8, have been published in IEEE Transactions on Parallel and Distributed Systems. Reprinted from Ref. [15], with permission from IEEE.
Notes
1. All the programs mentioned below are memory-bound divide-and-conquer parallel programs.
References
U. Acar, G. Blelloch, and R. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3):321–347, 2002.
E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, and G. Zhang. The design of OpenMP tasks. IEEE Transactions on Parallel and Distributed Systems, 20(3):404–418, 2009.
R. Azimi, M. Stumm, and R. Wisniewski. Online performance analysis by statistical sampling of microprocessor performance counters. In Proceedings of the 19th annual international conference on Supercomputing, pages 101–110. ACM, 2005.
M. Berger and J. Oliger. Adaptive mesh refinement for hyperbolic partial differential equations. Journal of Computational Physics, 53(3):484–512, 1984.
G. Blelloch, R. Chowdhury, P. Gibbons, V. Ramachandran, S. Chen, and M. Kozuch. Provably good multicore cache performance for divide-and-conquer algorithms. In Proceedings of the 19th annual ACM-SIAM symposium on Discrete algorithms, pages 501–510. Society for Industrial and Applied Mathematics, 2008.
G. Blelloch, J. Fineman, P. Gibbons, and H. V. Simhadri. Scheduling irregular parallel computations on hierarchical caches. In Proceedings of the 20th ACM Symposium on Parallel Algorithms and Architectures, San Jose, California, June 2011.
G. Blelloch, P. Gibbons, and H. Simhadri. Low depth cache-oblivious algorithms. In Proceedings of the 22nd ACM symposium on Parallelism in algorithms and architectures, pages 189–199. ACM, 2010.
R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55–69, Aug. 1996.
R. D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Sept. 1995. MIT Laboratory for Computer Science Technical Report MIT/LCS/TR-677.
D. Butenhof. Programming with POSIX threads. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1997.
D. Chase and Y. Lev. Dynamic circular work-stealing deque. In Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures, page 28. ACM, 2005.
S. Chen, P. Gibbons, M. Kozuch, V. Liaskovitis, A. Ailamaki, G. Blelloch, B. Falsafi, L. Fix, N. Hardavellas, T. Mowry, et al. Scheduling threads for constructive cache sharing on CMPs. In Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, pages 105–115. ACM, 2007.
Q. Chen, M. Guo, and Z. Huang. CATS: Cache aware task-stealing based on online profiling in multi-socket multi-core architectures. In the 26th International Conference on Supercomputing, pages 163–172. IEEE, 2012.
Q. Chen, Z. Huang, M. Guo, and J. Zhou. CAB: Cache-aware Bi-tier task-stealing in Multi-socket Multi-core architecture. In the 40th International Conference on Parallel Processing, pages 722–732, 2011.
Q. Chen, M. Guo, and Z. Huang. Adaptive cache aware bi-tier work-stealing in multi-socket multi-core architectures. IEEE Transactions on Parallel and Distributed Systems, 24(12):2334–2343, 2013.
R. Cole and V. Ramachandran. Analysis of Randomized Work Stealing with False Sharing. ArXiv e-prints, Mar. 2011.
X. Ding, K. Wang, and X. Zhang. ULCC: a user-level facility for optimizing shared cache performance on multicores. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 103–112, 2011.
M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In the 40th Annual Symposium on Foundations of Computer Science, pages 285–297, New York, USA, 1999. IEEE.
A. Gerasoulis and T. Yang. A comparison of clustering heuristics for scheduling directed acyclic graphs on multiprocessors. Journal of Parallel and Distributed Computing, 16(4):276–291, 1992.
W. Gropp, E. Lusk, and A. Skjellum. Using MPI: portable parallel programming with the message passing interface. MIT Press, 1999.
Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-first and help-first scheduling policies for async-finish task parallelism. In the 23rd IEEE International Parallel and Distributed Processing Symposium, pages 1–12. IEEE, 2009.
Y. Guo, J. Zhao, V. Cave, and V. Sarkar. SLAW: a scalable locality-aware adaptive work-stealing scheduler. In the 24th IEEE International Parallel and Distributed Processing Symposium, pages 1–12. IEEE, 2010.
D. Hendler and N. Shavit. Non-blocking steal-half work queues. In Proceedings of the 21st annual symposium on Principles of distributed computing, pages 280–289. ACM, 2002.
D. Hendler, Y. Lev, M. Moir, and N. Shavit. A dynamic-sized nonblocking work stealing deque. Sun Microsystems, Inc. Technical Reports; Vol. SERIES13103, page 69, 2005.
D. Lea. A Java fork/join framework. In Proceedings of the ACM 2000 conference on Java Grande, pages 36–43. ACM, 2000.
J. Lee and J. Palsberg. Featherweight X10: a core calculus for async-finish parallelism. In Proceedings of the 15th ACM SIGPLAN symposium on Principles and practice of parallel computing, pages 25–36. ACM, 2010.
D. Leijen, W. Schulte, and S. Burckhardt. The design of a task parallel library. ACM SIGPLAN Notices, 44(10):227–242, 2009.
C. Leiserson. The Cilk++ concurrency platform. In Proceedings of the 46th Annual Design Automation Conference, pages 522–527. ACM, 2009.
M. M. Michael, M. T. Vechev, and V. A. Saraswat. Idempotent work stealing. In Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 45–54. ACM, 2009.
S. L. Olivier, A. K. Porterfield, K. B. Wheeler, and J. F. Prins. Scheduling task parallelism on multi-socket multicore systems. In Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers, pages 49–56, Tucson, Arizona, 2011. ACM.
J.-N. Quintin and F. Wagner. Hierarchical work-stealing. In Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I, pages 217–229. Springer-Verlag, 2010.
J. Reinders. Intel threading building blocks. O’Reilly, 2007.
R. Van Nieuwpoort, T. Kielmann, and H. E. Bal. Efficient load balancing for wide-area divide-and-conquer applications. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Citeseer, 2001.
L. Wang, H. Cui, Y. Duan, F. Lu, X. Feng, and P. Yew. An adaptive task creation strategy for work-stealing scheduling. In Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, pages 266–277. ACM, 2010.
J. Zhang, Z. Huang, W. Chen, Q. Huang, and W. Zheng. Maotai: View-Oriented Parallel Programming on CMT processors. In 37th International Conference on Parallel Processing, pages 636–643. IEEE, 2008.
Copyright information
© 2017 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Chen, Q., Guo, M. (2017). Work-Stealing for Multi-socket Architecture. In: Task Scheduling for Multi-core and Parallel Architectures. Springer, Singapore. https://doi.org/10.1007/978-981-10-6238-4_3
DOI: https://doi.org/10.1007/978-981-10-6238-4_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6237-7
Online ISBN: 978-981-10-6238-4
eBook Packages: Computer Science, Computer Science (R0)