Abstract
Application development for modern high-performance systems with many cores, i.e., systems comprising multiple Graphics Processing Units (GPUs) and multi-core CPUs, currently relies on low-level programming approaches like CUDA and OpenCL, which leads to complex, lengthy and error-prone programs. In this paper, we advocate a high-level programming approach for such systems, which relies on two main principles: (a) the model is based on the current OpenCL standard, so that programs remain portable across many-core systems from different vendors and all low-level code optimizations can still be applied; (b) the model extends OpenCL with three high-level features which simplify many-core programming and are automatically translated by the system into OpenCL code. The high-level features of our programming model are as follows: (1) memory management is simplified and automated using parallel container data types (vectors and matrices); (2) a data (re)distribution mechanism supports data partitioning and automatically generates the data movements between multiple GPUs; (3) computations are expressed precisely and concisely using parallel algorithmic patterns (skeletons). Because the skeletons are well-defined, they allow for semantics-preserving transformations of SkelCL programs, which can be applied during program development as well as in the compilation and optimization phase. We demonstrate how our programming model and its implementation are used to express several parallel applications, and we report first experimental results evaluating our approach in terms of program size and target performance.
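To make the three features more concrete, the following is a minimal sketch in plain Python, not the SkelCL API and not OpenCL code. All names (`map_skel`, `zip_skel`, `reduce_skel`, `block_distribute`, `dot`, `fuse_maps`) are hypothetical, ordinary lists stand in for the parallel containers, and the "devices" are simulated sequentially. It assumes a block-wise distribution and an associative reduction operator.

```python
from functools import reduce as fold
from operator import add, mul


def map_skel(f):
    """Map skeleton: apply f to every element independently."""
    return lambda xs: [f(x) for x in xs]


def zip_skel(f):
    """Zip skeleton: combine two equally long containers element-wise."""
    return lambda xs, ys: [f(x, y) for x, y in zip(xs, ys)]


def reduce_skel(op, identity):
    """Reduce skeleton: fold a container with op, starting from identity.
    op is assumed to be associative so partial results can be combined."""
    return lambda xs: fold(op, xs, identity)


def block_distribute(xs, devices):
    """Split xs into contiguous blocks, one per (simulated) device.
    Block sizes differ by at most one element."""
    base, extra = divmod(len(xs), devices)
    blocks, start = [], 0
    for d in range(devices):
        size = base + (1 if d < extra else 0)
        blocks.append(xs[start:start + size])
        start += size
    return blocks


def dot(xs, ys, devices=2):
    """Dot product as zip(mul) followed by reduce(add), run block-wise."""
    mult = zip_skel(mul)
    total = reduce_skel(add, 0)
    # Each simulated device reduces its own block to a partial sum.
    partials = [
        total(mult(bx, by))
        for bx, by in zip(block_distribute(xs, devices),
                          block_distribute(ys, devices))
    ]
    # The partial sums are then combined into the final result.
    return total(partials)


def fuse_maps(f, g):
    """Semantics-preserving rewrite: map(g) after map(f) equals map(g after f).
    The fused form traverses the data once instead of twice."""
    return map_skel(lambda x: g(f(x)))
```

In this sketch the caller never allocates or copies per-device buffers by hand; the distribution function decides the partitioning, and the skeletons fix the computation structure. The map-fusion rule is one simple example of a transformation that is valid for every `f` and `g` because the meaning of the map skeleton is fixed.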
Acknowledgments
This work is partially supported by the OFERTIE (FP7) and MONICA projects. We would like to thank the anonymous reviewers for their valuable comments, as well as NVIDIA for their generous hardware donation used in our experiments.
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gorlatch, S., Steuwer, M. (2015). Towards High-Level Programming for Systems with Many Cores. In: Voronkov, A., Virbitskaite, I. (eds) Perspectives of System Informatics. PSI 2014. Lecture Notes in Computer Science, vol 8974. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46823-4_10
DOI: https://doi.org/10.1007/978-3-662-46823-4_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-46822-7
Online ISBN: 978-3-662-46823-4
eBook Packages: Computer Science (R0)