A Scalable Pthreads-Compatible Thread Model for VM-Intensive Programs

  • Yu Zhang
  • Jiankang Chen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11337)


With the widespread adoption of multicore chips, many multithreaded applications based on a shared address space have been developed. Widely used operating systems such as Linux use a per-process lock to synchronize page faults and memory-mapping operations (e.g., mmap and munmap) on the address space shared between threads, restricting the scalability and performance of these applications. We propose a novel Pthreads-compatible multithreaded model, PAthreads, which provides isolated address spaces between threads to avoid contention on the address space while preserving shared-variable semantics. We prototype PAthreads on Linux using a custom character device driver and a shared heap allocator, IAmalloc. Pthreads applications run on PAthreads without any modification. Experimental results show that PAthreads runs 2.17× and 3.19× faster than Pthreads for the workloads hist and dedup on 32 CPU cores, and 8.15× faster for the workload lr on 16 cores. Moreover, using Linux Perf, we further analyze the critical bottlenecks that limit the scalability of workloads programmed with Pthreads. This paper also examines the performance impact of the optimizations in the recent Linux 4.10 kernel on PAthreads and Pthreads; the results show that PAthreads retains its advantage for dedup and lr.



This work was partly supported by grants from the National Natural Science Foundation of China (No. 61772487) and the Anhui Provincial Natural Science Foundation (No. 1808085MF198).



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. University of Science and Technology of China, Hefei, China
