A Data-Centric Tool to Improve the Performance of Multithreaded Program on NUMA

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9531)

Abstract

Non-uniform memory access (NUMA) is one of the main architectures of today’s high-performance servers. The key feature of NUMA is the non-uniformity of access latency: access from a processor to its locally attached memory is faster than access to remote memory, and it also reduces the likelihood of contention on interconnect links and memory controllers. Without careful placement of data and threads, multithreaded programs may experience high memory latency. Thus, a tool is needed to identify and help ameliorate NUMA problems. In this paper, we present a data-centric tool that analyzes the performance of multithreaded programs on NUMA architectures and provides advice on how to improve it. We describe the design and implementation of the tool and evaluate it on Linux using three benchmark applications. The evaluation shows how the tool helps to identify costly variables and choose optimization methods, with performance improvements of up to 51.92%.
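To make the idea of data-centric analysis concrete, the following minimal sketch attributes sampled memory accesses to program variables by address range and ranks the variables by accumulated latency. It is an illustration only and not the tool described in the paper: the `Variable` and `Sample` record layouts, the function names, and the addresses and latencies in the usage example are all hypothetical assumptions.

```python
from bisect import bisect_right
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    """Hypothetical record describing one tracked variable."""
    name: str
    start: int  # first byte address of the variable
    size: int   # size in bytes


@dataclass(frozen=True)
class Sample:
    """Hypothetical record for one sampled memory access."""
    address: int   # sampled data address
    cpu_node: int  # NUMA node of the accessing thread
    mem_node: int  # NUMA node that holds the accessed page
    latency: int   # observed access latency in cycles


def attribute(samples, variables):
    """Aggregate samples per variable: access count, remote count, total latency."""
    ordered = sorted(variables, key=lambda v: v.start)
    starts = [v.start for v in ordered]
    stats = defaultdict(lambda: {"accesses": 0, "remote": 0, "latency": 0})
    for s in samples:
        # Find the last variable whose start address is <= the sampled address.
        i = bisect_right(starts, s.address) - 1
        if i < 0:
            continue  # address lies below every tracked variable
        var = ordered[i]
        if s.address >= var.start + var.size:
            continue  # address falls in a gap between tracked variables
        entry = stats[var.name]
        entry["accesses"] += 1
        entry["latency"] += s.latency
        if s.cpu_node != s.mem_node:
            entry["remote"] += 1  # access crossed a NUMA node boundary
    return dict(stats)


def rank_by_cost(stats):
    """Return variable names ordered from highest to lowest total latency."""
    return sorted(stats, key=lambda name: stats[name]["latency"], reverse=True)


if __name__ == "__main__":
    # Made-up addresses and latencies, for illustration only.
    variables = [Variable("grid", 0x1000, 0x1000), Variable("index", 0x3000, 0x100)]
    samples = [
        Sample(0x1010, cpu_node=0, mem_node=1, latency=300),
        Sample(0x1800, cpu_node=0, mem_node=1, latency=320),
        Sample(0x3008, cpu_node=1, mem_node=1, latency=90),
    ]
    stats = attribute(samples, variables)
    for name in rank_by_cost(stats):
        print(name, stats[name])
```

In this sketch, a variable with both a high total latency and a high share of remote accesses would be a natural candidate for a placement change; which optimization to apply is left open here, as the abstract does not specify the methods.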


Acknowledgments

This paper is supported by the National High-tech Research and Development Program of China (863 Program) under grant No. 2012AA010905, the National Natural Science Foundation of China under grants No. 61322210, 61272408, and 61433019, and the Doctoral Fund of the Ministry of Education of China under grant No. 20130142110048.

Corresponding author

Correspondence to Xiaofei Liao.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Zeng, D., Zhu, L., Liao, X., Jin, H. (2015). A Data-Centric Tool to Improve the Performance of Multithreaded Program on NUMA. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9531. Springer, Cham. https://doi.org/10.1007/978-3-319-27140-8_6

  • DOI: https://doi.org/10.1007/978-3-319-27140-8_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27139-2

  • Online ISBN: 978-3-319-27140-8

  • eBook Packages: Computer Science, Computer Science (R0)
