A Similarity Measure for GPU Kernel Subgraph Matching

  • Robert Lim
  • Boyana Norris
  • Allen Malony
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11882)


Accelerator architectures specialize in executing SIMD (single instruction, multiple data) instructions in lockstep. Because the majority of CUDA applications are parallelized loops, control flow information can provide an in-depth characterization of a kernel. CUDAflow is a tool that statically separates CUDA binaries into basic block regions and dynamically measures instruction and basic block frequencies. CUDAflow captures this information in a control flow graph (CFG) and performs subgraph matching across the CFGs of different kernels to gain insight into an application's resource requirements, based on the shape and traversal of the graph, the instruction operations executed, and the registers allocated, among other information. The utility of CUDAflow is demonstrated with SHOC and Rodinia application case studies on a variety of GPU architectures, revealing novel control flow characteristics that help end users, autotuners, and compilers generate high-performing code.
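To make the idea of comparing kernels through their CFGs concrete, the following is a minimal sketch and not CUDAflow's actual implementation. It assumes each basic block carries a coarse instruction-class label and each edge carries a traversal count. The block names, labels, counts, and the helper functions `edge_profile` and `cosine_similarity` are all hypothetical and chosen only for illustration. Cosine similarity over label-pair edge weights is a stand-in here, not necessarily the measure used in the paper.

```python
from collections import Counter
from math import sqrt


def edge_profile(labels, edge_counts):
    """Summarize a CFG as weights over (source label, target label) pairs.

    labels:      dict mapping a basic block name to an instruction-class label
    edge_counts: dict mapping (source block, target block) to a traversal count
    Both inputs are hypothetical stand-ins for statically and dynamically
    collected data.
    """
    profile = Counter()
    for (src, dst), count in edge_counts.items():
        profile[(labels[src], labels[dst])] += count
    return profile


def cosine_similarity(p, q):
    """Cosine similarity between two edge profiles; 0.0 if either is empty."""
    dot = sum(p[key] * q[key] for key in p.keys() & q.keys())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


# Hypothetical kernel A: load, then a floating-point loop, then store.
labels_a = {"B0": "load", "B1": "fp", "B2": "store"}
edges_a = {("B0", "B1"): 1, ("B1", "B1"): 99, ("B1", "B2"): 1}

# Hypothetical kernel B: same shape, but the loop body is integer work.
labels_b = {"B0": "load", "B1": "int", "B2": "store"}
edges_b = {("B0", "B1"): 1, ("B1", "B1"): 49, ("B1", "B2"): 1}

profile_a = edge_profile(labels_a, edges_a)
profile_b = edge_profile(labels_b, edges_b)
sim = cosine_similarity(profile_a, profile_b)
print(f"similarity(A, B) = {sim:.3f}")
```

Because the profiles are keyed by labels rather than block names, two kernels can be compared without first solving a node correspondence problem. The cost is that distinct blocks sharing a label are merged, so this sketch captures only a coarse view of graph shape and traversal.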



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. University of Oregon, Eugene, USA
