
Designing High-Performance In-Memory Key-Value Operations with Persistent GPU Kernels and OpenSHMEM

  • Conference paper

OpenSHMEM and Related Technologies. OpenSHMEM in the Era of Extreme Heterogeneity (OpenSHMEM 2018)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 11283)


Abstract

Graphics Processing Units (GPUs) are popular for their massive parallelism and high-bandwidth memory and are increasingly used in data-intensive applications. In this context, GPU-based In-Memory Key-Value (G-IMKV) stores have been proposed to take advantage of GPUs’ capability to achieve high-throughput indexing operations. The state-of-the-art implementations batch requests on the CPU at the server before launching a compute kernel to process the operations on the GPU, and they also require explicit data movement between the CPU and GPU. However, the startup overhead of compute kernel launches and the cost of memory copies limit the throughput of these frameworks unless operations are batched into large groups.
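To make the launch-per-batch pattern described above concrete, the following is a minimal, hypothetical CUDA sketch; it is not taken from the paper, and the names process_batch, lookup, and serve_batch are invented for illustration. Every batch pays for a host-to-device copy, a kernel launch, and a device-to-host copy, which is exactly the overhead the paper targets.

```cpp
// Hypothetical sketch of the conventional launch-per-batch flow: keys are
// staged on the host, copied to the GPU, processed by a freshly launched
// kernel, and the results are copied back. The launch and copy costs recur
// for every batch. lookup() is a placeholder for a real hash-table probe.
#include <cuda_runtime.h>
#include <cstdint>

__device__ uint64_t lookup(uint64_t key) {
    return key * 2654435761ULL;              // placeholder for a hash probe
}

__global__ void process_batch(const uint64_t *keys, uint64_t *vals, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) vals[i] = lookup(keys[i]);    // one thread per request
}

void serve_batch(const uint64_t *h_keys, uint64_t *h_vals, int n,
                 uint64_t *d_keys, uint64_t *d_vals) {
    // Explicit host-to-device copy of the batched keys
    cudaMemcpy(d_keys, h_keys, n * sizeof(uint64_t), cudaMemcpyHostToDevice);
    // A fresh kernel launch per batch; its startup cost limits throughput
    process_batch<<<(n + 255) / 256, 256>>>(d_keys, d_vals, n);
    // Explicit device-to-host copy of the results
    cudaMemcpy(h_vals, d_vals, n * sizeof(uint64_t), cudaMemcpyDeviceToHost);
}
```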

In this paper, we propose using persistent GPU compute kernels and OpenSHMEM to maximize GPU and network utilization with smaller batch sizes. This also improves the response time observed by clients while still achieving high throughput at the server. Specifically, clients and servers use OpenSHMEM primitives to move data between the CPU and GPU without intermediate copies, and the server interacts with a persistently running compute kernel on the GPU to efficiently delegate key-value store operations to streaming multiprocessors. The experimental results show up to a 4.8x speedup over the existing G-IMKV framework for a small batch of 1000 keys.
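As a rough illustration of the persistent-kernel idea, the sketch below shows a compute kernel that is launched once and then spins on a doorbell flag to pick up new batches, so no per-batch launch is needed. This is not the authors' implementation: the WorkQueue layout, the names kv_server, kv_lookup, and doorbell, and the use of pinned host memory for the control structure are all assumptions made to keep the example self-contained. In the paper's design the request buffers would instead live in GPU memory, with clients and servers moving data there directly via GPU-aware OpenSHMEM primitives (e.g., shmem_putmem into a symmetric buffer).

```cpp
// Minimal persistent-kernel sketch (assumed names and layout, not the paper's
// code). The kernel is launched once; the CPU fills the queue and rings the
// doorbell for each batch, then reads results once the doorbell returns to 0.
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

#define MAX_BATCH 1024

struct WorkQueue {
    volatile int doorbell;          // 0 = idle, 1 = batch ready, -1 = shut down
    int          count;             // number of requests in the current batch
    uint64_t     keys[MAX_BATCH];   // request keys (written before the doorbell)
    uint64_t     vals[MAX_BATCH];   // results read by the CPU after completion
};

__device__ uint64_t kv_lookup(uint64_t key) {
    return key * 2654435761ULL;     // placeholder for a real hash-table probe
}

// Persistent kernel: launched once, services batches until told to stop.
__global__ void kv_server(WorkQueue *q) {
    for (;;) {
        if (threadIdx.x == 0)                      // one thread spins; the rest
            while (q->doorbell == 0) { }           // wait at the barrier below
        __syncthreads();
        if (q->doorbell < 0) return;               // shutdown signal
        for (int i = threadIdx.x; i < q->count; i += blockDim.x)
            q->vals[i] = kv_lookup(q->keys[i]);    // threads share the batch
        __syncthreads();                           // all results written
        if (threadIdx.x == 0) {
            __threadfence_system();                // flush results to the CPU
            q->doorbell = 0;                       // re-arm for the next batch
        }
        __syncthreads();
    }
}

int main() {
    WorkQueue *q;
    cudaHostAlloc(&q, sizeof(WorkQueue), cudaHostAllocMapped);  // CPU/GPU visible
    q->doorbell = 0;
    kv_server<<<1, 256>>>(q);                      // launched once, not per batch
    q->count = 4;                                  // stage a small batch
    for (int i = 0; i < q->count; ++i) q->keys[i] = i;
    __sync_synchronize();                          // publish keys before doorbell
    q->doorbell = 1;                               // ring the doorbell
    while (q->doorbell != 0) { }                   // wait for completion
    printf("val[0] = %llu\n", (unsigned long long)q->vals[0]);
    q->doorbell = -1;                              // tell the kernel to exit
    cudaDeviceSynchronize();
    cudaFreeHost(q);
    return 0;
}
```

In the paper's setting, placing the batch buffers in the GPU-resident part of the OpenSHMEM symmetric heap would let clients deposit requests straight into device memory, leaving the CPU server only to ring the doorbell, which is what removes both the copies and the per-batch launch overhead for small batches.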

This work was sponsored by the U.S. Department of Energy’s Office of Advanced Scientific Computing Research. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). This research was supported by the United States Department of Defense (DoD) and Computational Research and Development Programs at Oak Ridge National Laboratory.



Author information

Corresponding author

Correspondence to Ching-Hsiang Chu.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Chu, CH., Potluri, S., Goswami, A., Gorentla Venkata, M., Imam, N., Newburn, C.J. (2019). Designing High-Performance In-Memory Key-Value Operations with Persistent GPU Kernels and OpenSHMEM. In: Pophale, S., Imam, N., Aderholdt, F., Gorentla Venkata, M. (eds) OpenSHMEM and Related Technologies. OpenSHMEM in the Era of Extreme Heterogeneity. OpenSHMEM 2018. Lecture Notes in Computer Science, vol. 11283. Springer, Cham. https://doi.org/10.1007/978-3-030-04918-8_10


  • DOI: https://doi.org/10.1007/978-3-030-04918-8_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-04917-1

  • Online ISBN: 978-3-030-04918-8

  • eBook Packages: Computer Science, Computer Science (R0)
