Abstract
Graphics Processing Units (GPUs) are popular for their massive parallelism and high-bandwidth memory and are increasingly used in data-intensive applications. In this context, GPU-based In-Memory Key-Value (G-IMKV) stores have been proposed to exploit GPUs' capability for high-throughput indexing operations. State-of-the-art implementations batch requests on the server's CPU before launching a compute kernel to process the operations on the GPU, and they require explicit data movement between the CPU and GPU. However, the startup overhead of kernel launches and memory copies limits the throughput of these frameworks unless operations are batched into large groups.
In this paper, we propose using persistent GPU compute kernels together with OpenSHMEM to maximize GPU and network utilization at smaller batch sizes. This also improves the response time observed by clients while still achieving high throughput at the server. Specifically, clients and servers use OpenSHMEM primitives to move data between the CPU and GPU without intermediate copies, and the server delegates key-value store operations to a persistently running compute kernel on the GPU, which distributes them efficiently across streaming multiprocessors. Experimental results show up to a 4.8x speedup over an existing G-IMKV framework for a small batch of 1000 keys.
This work was sponsored by the U.S. Department of Energy’s Office of Advanced Scientific Computing Research. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). This research was supported by the United States Department of Defense (DoD) and Computational Research and Development Programs at Oak Ridge National Laboratory.
© 2019 Springer Nature Switzerland AG
Cite this paper
Chu, CH., Potluri, S., Goswami, A., Gorentla Venkata, M., Imam, N., Newburn, C.J. (2019). Designing High-Performance In-Memory Key-Value Operations with Persistent GPU Kernels and OpenSHMEM. In: Pophale, S., Imam, N., Aderholdt, F., Gorentla Venkata, M. (eds) OpenSHMEM and Related Technologies. OpenSHMEM in the Era of Extreme Heterogeneity. OpenSHMEM 2018. Lecture Notes in Computer Science(), vol 11283. Springer, Cham. https://doi.org/10.1007/978-3-030-04918-8_10
Print ISBN: 978-3-030-04917-1
Online ISBN: 978-3-030-04918-8