Improving Performance of OpenCL on CPUs

Karrenberg, Ralf; Hack, Sebastian

doi:10.1007/978-3-642-28652-0_1

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7210)

Included in the following conference series:

International Conference on Compiler Construction

Abstract

Data-parallel languages like OpenCL and CUDA are an important means to exploit the computational power of today’s computing devices. In this paper, we deal with two aspects of implementing such languages on CPUs: First, we present a static analysis and an accompanying optimization to exclude code regions from control-flow to data-flow conversion, which is the commonly used technique to leverage vector instruction sets. Second, we present a novel technique to implement barrier synchronization. We evaluate our techniques in a custom OpenCL CPU driver which is compared to itself in different configurations and to proprietary implementations by AMD and Intel. We achieve an average speedup factor of 1.21 compared to naïve vectorization and additional factors of 1.15–2.09 for suited kernels due to the optimizations enabled by our analysis. Our best configuration achieves an average speedup factor of 2.5 against the Intel driver.
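To make the term concrete, the following is a minimal sketch of control-flow to data-flow conversion (if-conversion) on a toy kernel body. It is not taken from the paper: the function names, the example computation, and the vector width `LANES` are assumptions chosen for illustration. The idea is that both sides of a branch are evaluated for every lane and a per-lane mask selects the result, so a group of work-items can run in lockstep on a vector unit.

```c
#include <assert.h>

#define LANES 4 /* hypothetical vector width, e.g. four floats per register */

/* Scalar kernel body for one work-item, with divergent control flow. */
float scalar_body(float x) {
    if (x < 0.0f) {
        return -x * 2.0f;
    } else {
        return x + 1.0f;
    }
}

/* If-converted body for LANES work-items at once.
 * Both branch sides are computed for every lane; the mask then selects
 * the value each lane keeps (a "blend" on real vector hardware).
 * The loop stands in for a single vector instruction sequence. */
void converted_body(const float in[LANES], float out[LANES]) {
    for (int i = 0; i < LANES; ++i) {
        int mask = in[i] < 0.0f;        /* 1 where the 'then' side applies */
        float then_val = -in[i] * 2.0f; /* computed for all lanes */
        float else_val = in[i] + 1.0f;  /* computed for all lanes */
        out[i] = mask ? then_val : else_val;
    }
}

/* Checks that the converted form agrees with the scalar form lane by lane. */
void check_lanes(const float in[LANES]) {
    float out[LANES];
    converted_body(in, out);
    for (int i = 0; i < LANES; ++i) {
        assert(out[i] == scalar_body(in[i]));
    }
}
```

The cost of this transformation is visible in the sketch: every lane pays for both branch sides. That is why an analysis that identifies regions which can keep their original control flow, as the abstract describes, can pay off.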

Author information

Authors and Affiliations

Saarland University, Germany
Ralf Karrenberg & Sebastian Hack

Editor information

Editors and Affiliations

School for Informatics, University of Edinburgh, 10 Crichton Street, EH8 9AB, Edinburgh, UK
Michael O’Boyle

About this paper

Cite this paper

Karrenberg, R., Hack, S. (2012). Improving Performance of OpenCL on CPUs. In: O’Boyle, M. (eds) Compiler Construction. CC 2012. Lecture Notes in Computer Science, vol 7210. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28652-0_1

DOI: https://doi.org/10.1007/978-3-642-28652-0_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28651-3
Online ISBN: 978-3-642-28652-0
eBook Packages: Computer Science, Computer Science (R0)
