Block-Size Independence for GPU Programs

  • Rajeev Alur
  • Joseph Devietti
  • Nimit Singhania
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11002)

Abstract

Optimizing GPU programs by tuning execution parameters is essential to realizing the full performance potential of GPU hardware. However, many of these optimizations do not ensure correctness, and subtle errors can enter while optimizing a GPU program. Further, the lack of formal models and the presence of non-trivial transformations prevent verification of optimizations.

In this work, we verify transformations involved in tuning the execution parameter, block-size. First, we present a formal programming and execution model for GPUs, and then formalize block-size independence of GPU programs, which ensures tuning block-size preserves program semantics. Next, we present an inter-procedural analysis to verify block-size independence for synchronization-free GPU programs. Finally, we evaluate the analysis on the Nvidia CUDA SDK samples, where 35 global kernels are verified to be block-size independent.
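To make the notion of block-size independence concrete, the following is a minimal sketch in Python rather than CUDA. It is not the paper's static analysis: it only runs a toy kernel under several block sizes and compares the results, which can expose a dependence but cannot prove independence. The names `launch`, `scale_by_global_id`, `write_local_id`, and `agrees_across_block_sizes` are hypothetical, and the sketch assumes a one-dimensional grid, a fixed total number of threads, and synchronization-free kernels in which each thread writes a distinct cell, so a sequential simulation of the threads is adequate.

```python
from typing import Callable, List

# A toy "kernel" receives the block index, the thread index within the block,
# the block size, and the output array (all names here are illustrative).
Kernel = Callable[[int, int, int, List[int]], None]


def launch(kernel: Kernel, total_threads: int, block_size: int, n: int) -> List[int]:
    """Run the kernel once per thread, keeping the total thread count fixed."""
    assert total_threads % block_size == 0, "block size must divide thread count"
    out = [0] * n
    for block_idx in range(total_threads // block_size):
        for thread_idx in range(block_size):
            kernel(block_idx, thread_idx, block_size, out)
    return out


def scale_by_global_id(block_idx: int, thread_idx: int, block_dim: int, out: List[int]) -> None:
    # Uses only the global thread id, so the result does not change
    # when the same threads are regrouped into blocks of another size.
    gid = block_idx * block_dim + thread_idx
    if gid < len(out):
        out[gid] = 2 * gid


def write_local_id(block_idx: int, thread_idx: int, block_dim: int, out: List[int]) -> None:
    # Stores the thread's index within its block, which depends on block size.
    gid = block_idx * block_dim + thread_idx
    if gid < len(out):
        out[gid] = thread_idx


def agrees_across_block_sizes(kernel: Kernel, total_threads: int,
                              block_sizes: List[int], n: int) -> bool:
    """Dynamic check on a few block sizes; agreement is evidence, not proof."""
    results = [launch(kernel, total_threads, b, n) for b in block_sizes]
    return all(r == results[0] for r in results)


if __name__ == "__main__":
    sizes = [1, 2, 4, 8]
    print(agrees_across_block_sizes(scale_by_global_id, 8, sizes, 8))
    print(agrees_across_block_sizes(write_local_id, 8, sizes, 8))
```

The first kernel produces the same output for every tested block size, while the second does not. A static analysis such as the one described in the abstract aims to establish the first kind of behavior for all block sizes without running the program.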

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. University of Pennsylvania, Philadelphia, USA
