
Journal of Computer Science and Technology, Volume 34, Issue 1, pp. 94–112

Scaling out NUMA-Aware Applications with RDMA-Based Distributed Shared Memory

  • Yang Hong
  • Yang Zheng
  • Fan Yang
  • Bin-Yu Zang
  • Hai-Bing Guan
  • Hai-Bo Chen (corresponding author)
Regular Paper

Abstract

The multicore evolution has stimulated renewed interest in scaling up applications on shared-memory multiprocessors and has significantly improved the scalability of many applications. But such scalability is confined to a single node; to scale out over multiple nodes, programmers still have to redesign their applications. This paper revisits the design and implementation of distributed shared memory (DSM) as a way to scale out applications optimized for the non-uniform memory access (NUMA) architecture over a well-connected cluster. It presents MAGI, an efficient DSM system that provides a transparent shared address space with scalable performance on a cluster with fast network interfaces. MAGI is unique in that it presents a NUMA abstraction to fully harness the multicore resources in each node through hierarchical synchronization and memory management. MAGI also exploits the memory access patterns of big-data applications and leverages a set of optimizations for remote direct memory access (RDMA) to reduce the number of page faults and the cost of the coherence protocol. MAGI is implemented as a user-space library with pthread-compatible interfaces and can run existing multithreaded applications with minimal modifications. We deployed MAGI over an 8-node RDMA-enabled cluster. Experimental evaluation shows that MAGI achieves up to a 9.25x speedup over an unoptimized implementation, delivering scalable performance for large-scale data-intensive applications.
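To illustrate the pthread-compatible interface mentioned above, the sketch below is an ordinary POSIX-threads program of the kind MAGI targets. It is a minimal, hypothetical example: it uses only standard pthread calls and does not reproduce MAGI's actual API; under a pthread-compatible DSM runtime, such code could in principle be relinked so that its threads spread across nodes while the shared array stays in the transparent shared address space.

/*
 * Hypothetical example: an unmodified POSIX-threads program of the kind a
 * pthread-compatible DSM library could scale out across nodes. No
 * MAGI-specific calls appear here; the paper's actual API is not shown.
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITEMS   (1 << 16)

/* partial sums; under a DSM these would live in the shared address space */
static long counts[NTHREADS];

static void *worker(void *arg)
{
    long id = (long)arg;
    long sum = 0;
    /* Each thread scans its own contiguous chunk, a NUMA-friendly access
     * pattern that also limits remote page faults under a page-based DSM. */
    for (long i = id * (NITEMS / NTHREADS);
         i < (id + 1) * (NITEMS / NTHREADS); i++)
        sum += i;
    counts[id] = sum;
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);

    long total = 0;
    for (long i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        total += counts[i];
    }
    printf("total = %ld\n", total);   /* expected: 2147450880 */
    return 0;
}

Because the interfaces are pthread-compatible, scaling such a program out would require relinking against the DSM runtime rather than restructuring the code, which is the "minimal modifications" property the abstract claims.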

Keywords

distributed shared memory (DSM); scalability; multicore evolution; non-uniform memory access (NUMA); remote direct memory access (RDMA)


Supplementary material

ESM 1: 11390_2019_1901_MOESM1_ESM.pdf (PDF, 313 kB)

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Yang Hong 1
  • Yang Zheng 1
  • Fan Yang 1
  • Bin-Yu Zang 1
  • Hai-Bing Guan 1
  • Hai-Bo Chen 1 (corresponding author)

  1. Shanghai Key Laboratory for Scalable Computing Systems, Shanghai Jiao Tong University, Shanghai, China
