Abstract
Non-uniform memory access (NUMA) is one of the main architectures of today's high-performance servers. The key feature of NUMA is the non-uniformity of access latency. A processor accesses its locally attached memory faster than remote memory, and local access also reduces the likelihood of contention on the interconnect links and memory controllers. Without careful placement of data and threads, multithreaded programs may experience high memory latency. A tool is therefore needed to identify and help ameliorate NUMA problems. In this paper, we present a data-centric tool that analyzes the performance of multithreaded programs on NUMA architectures and provides advice on how to improve it. This paper describes the design and implementation of the tool. The tool is evaluated on Linux using three benchmark applications, and the evaluation shows how it helps to identify costly variables and choose optimization methods. The results show a performance improvement of up to 51.92%.
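As a rough illustration of what "data-centric" analysis means here, the following sketch attributes sampled memory accesses to program variables and ranks the variables by the latency of their remote accesses. It is not the tool described in the paper: the `Sample` and `Variable` records, the `attribute` and `rank_by_remote_cost` functions, and the idea that each sample already carries the accessing node, the node holding the data, and a latency in cycles are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One sampled memory access (hypothetical record layout)."""
    address: int   # sampled data address
    cpu_node: int  # NUMA node of the accessing thread
    mem_node: int  # NUMA node that holds the accessed page
    latency: int   # access latency in cycles


@dataclass(frozen=True)
class Variable:
    """Address range of a program variable (hypothetical record layout)."""
    name: str
    start: int
    size: int


def attribute(samples, variables):
    """Count local and remote accesses per variable.

    Samples that fall outside every known variable are ignored.
    """
    stats = defaultdict(lambda: {"local": 0, "remote": 0, "remote_latency": 0})
    for s in samples:
        for v in variables:
            if v.start <= s.address < v.start + v.size:
                entry = stats[v.name]
                if s.cpu_node == s.mem_node:
                    entry["local"] += 1
                else:
                    entry["remote"] += 1
                    entry["remote_latency"] += s.latency
                break
    return dict(stats)


def rank_by_remote_cost(stats):
    """Return variable names, most costly (by remote latency) first."""
    return sorted(stats, key=lambda n: stats[n]["remote_latency"], reverse=True)


if __name__ == "__main__":
    variables = [Variable("grid", 0x1000, 0x1000), Variable("index", 0x3000, 0x100)]
    samples = [
        Sample(0x1010, cpu_node=0, mem_node=1, latency=300),
        Sample(0x1020, cpu_node=0, mem_node=0, latency=100),
        Sample(0x3008, cpu_node=1, mem_node=0, latency=250),
    ]
    print(rank_by_remote_cost(attribute(samples, variables)))
```

A variable that ranks high in such a list is a candidate for a placement change (for example, allocating it on the node of the threads that use it most), which is the kind of decision the abstract says the tool is meant to support.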
Acknowledgments
This work was supported by the National High-tech Research and Development Program of China (863 Program) under grant No. 2012AA010905, the National Natural Science Foundation of China under grants No. 61322210, 61272408, and 61433019, and the Doctoral Fund of the Ministry of Education of China under grant No. 20130142110048.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Zeng, D., Zhu, L., Liao, X., Jin, H. (2015). A Data-Centric Tool to Improve the Performance of Multithreaded Program on NUMA. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9531. Springer, Cham. https://doi.org/10.1007/978-3-319-27140-8_6
DOI: https://doi.org/10.1007/978-3-319-27140-8_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27139-2
Online ISBN: 978-3-319-27140-8
eBook Packages: Computer Science (R0)