Impact of Compiler Phase Ordering When Targeting GPUs
Research on compiler pass phase ordering (i.e., the selection of compiler analysis/transformation passes and their order of execution) has mostly been carried out in the context of CPUs and, in a small number of cases, FPGAs. In this paper we present experiments on specializing compiler pass phase orders for OpenCL kernels targeting NVIDIA GPUs, using Clang/LLVM 3.9 and the libclc OpenCL library. More specifically, we analyze the impact of specialized compiler phase orders on the performance of 15 PolyBench/GPU OpenCL benchmarks. In addition, we analyze the final NVIDIA PTX assembly code generated by the different compilation flows to identify the main reasons behind the cases with significant performance improvements. Using specialized compiler phase orders, we achieved performance improvements over both the CUDA versions and the OpenCL versions compiled with the NVIDIA driver. Compared to CUDA, we achieved a geometric mean improvement of \(1.54\times\) (up to \(5.48\times\)). Compared to the OpenCL driver version, we achieved a geometric mean improvement of \(1.65\times\) (up to \(5.70\times\)).
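The headline figures above are geometric means of per-benchmark speedups together with the best single case. As a minimal sketch of how such a summary could be computed, assuming a speedup is defined as baseline time divided by specialized time, the following uses made-up timings purely for illustration; the function names and all numbers are hypothetical and are not taken from the paper's measurements.

```python
import math


def speedups(baseline_times, tuned_times):
    """Per-benchmark speedup, assumed here to be baseline time / tuned time."""
    return [b / t for b, t in zip(baseline_times, tuned_times)]


def geometric_mean(values):
    """Geometric mean of a non-empty sequence of positive numbers."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("values must be positive")
    # Averaging in log space avoids overflow for long sequences.
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical execution times in seconds for three kernels
# (illustrative only, not data from the paper).
baseline = [2.0, 8.0, 1.0]   # e.g. a reference compilation flow
tuned = [1.0, 2.0, 1.0]      # e.g. a specialized phase order

per_kernel = speedups(baseline, tuned)
overall = geometric_mean(per_kernel)
best = max(per_kernel)

print(f"geometric mean speedup: {overall:.2f}x (up to {best:.2f}x)")
```

The geometric mean is the usual choice for summarizing ratios such as speedups, since a single very large improvement influences it less than it would an arithmetic mean.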
Keywords: GPU, Phase ordering, Optimization
This work was partially supported by the TEC4Growth project, “NORTE-01-0145-FEDER-000020”, financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF). Reis acknowledges the support by FCT through PD/BD/105804/2014.