A Data-Centric Tool to Improve the Performance of Multithreaded Program on NUMA

Zeng, Dan; Zhu, Liang; Liao, Xiaofei; Jin, Hai

doi:10.1007/978-3-319-27140-8_6


Conference paper
First Online: 16 December 2015

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9531)

Abstract

Non-uniform memory access (NUMA) is one of the main architectures of today’s high-performance servers. The key feature of NUMA is the non-uniformity of access latency. Access from a processor to its locally attached memory is faster than access to remote memory, and it also reduces the likelihood of contention on interconnect links and memory controllers. Without careful placement of data and threads, multithreaded programs may experience high memory latency. Thus, a tool is needed to identify and help ameliorate NUMA problems. In this paper, we present a data-centric tool that analyzes the performance of multithreaded programs on NUMA architectures and provides advice on how to improve it. This paper describes the design and implementation of the tool. The tool is evaluated on Linux using three benchmark applications, and the evaluation shows how it helps to identify costly variables and choose optimization methods. The results show a performance improvement of up to 51.92%.
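To make the idea of data-centric analysis concrete, the following is a minimal sketch of attributing sampled memory accesses to variables and ranking them by cost. It is not the tool described in the paper, whose design is not shown in this preview. The `Sample` record, its fields, the `rank_variables` helper, and the sample values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One sampled memory access (hypothetical record layout)."""
    variable: str   # data object the accessed address belongs to
    cpu_node: int   # NUMA node of the accessing thread
    mem_node: int   # NUMA node that holds the accessed page
    latency: int    # observed access latency in cycles


def rank_variables(samples):
    """Aggregate samples per variable and sort by total latency, highest first.

    Returns a list of (variable, total_latency, remote_fraction) tuples.
    """
    totals = defaultdict(lambda: {"latency": 0, "remote": 0, "count": 0})
    for s in samples:
        t = totals[s.variable]
        t["latency"] += s.latency
        t["count"] += 1
        if s.cpu_node != s.mem_node:  # access crossed a NUMA node boundary
            t["remote"] += 1
    report = [
        (name, t["latency"], t["remote"] / t["count"])
        for name, t in totals.items()
    ]
    report.sort(key=lambda row: row[1], reverse=True)
    return report


# Made-up samples: "matrix" is mostly accessed remotely, "counter" only locally.
samples = [
    Sample("matrix", 0, 0, 100),
    Sample("matrix", 1, 0, 300),
    Sample("matrix", 1, 0, 300),
    Sample("counter", 0, 0, 90),
    Sample("counter", 1, 1, 110),
]
report = rank_variables(samples)
for name, total_latency, remote_fraction in report:
    print(f"{name}: total latency {total_latency} cycles, "
          f"{remote_fraction:.0%} remote")
```

In a sketch like this, a variable with both a high total latency and a high remote fraction would be a natural candidate for a different placement of its data or of the threads that use it.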

Acknowledgments

This paper is supported by the National High-tech Research and Development Program of China (863 Program) under grant No. 2012AA010905, the National Natural Science Foundation of China under grants No. 61322210, 61272408, and 61433019, and the Doctoral Fund of the Ministry of Education of China under grant No. 20130142110048.

Author information

Authors and Affiliations

Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, 430074, China
Dan Zeng, Liang Zhu, Xiaofei Liao & Hai Jin

Corresponding author

Correspondence to Xiaofei Liao.

Editor information

Editors and Affiliations

Central South University, Changsha, China
Guojun Wang
The University of Sydney, Sydney, New South Wales, Australia
Albert Zomaya
University of Murcia, Murcia, Murcia, Spain
Gregorio Martinez
Hunan University, Changsha, China
Kenli Li

About this paper

Cite this paper

Zeng, D., Zhu, L., Liao, X., Jin, H. (2015). A Data-Centric Tool to Improve the Performance of Multithreaded Program on NUMA. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9531. Springer, Cham. https://doi.org/10.1007/978-3-319-27140-8_6

DOI: https://doi.org/10.1007/978-3-319-27140-8_6
Published: 16 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27139-2
Online ISBN: 978-3-319-27140-8
eBook Packages: Computer Science, Computer Science (R0)
