A Unified Approach to Variable Renaming for Enhanced Vectorization

  • Prasanth Chatarasi (corresponding author)
  • Jun Shirako
  • Albert Cohen
  • Vivek Sarkar
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11882)


Although compiler technologies for automatic vectorization have been under development for over four decades, considerable gaps remain in the ability of modern compilers to perform automatic vectorization for SIMD units. One such gap lies in the handling of loops with dependence cycles that involve memory-based anti (write-after-read) and output (write-after-write) dependences. Past approaches, such as variable renaming and variable expansion, break such dependence cycles by either eliminating or repositioning the problematic memory-based dependences. However, past work suffers from three key limitations: (1) the lack of a unified framework that synergistically integrates multiple storage transformations, (2) the lack of support for bounding the additional space required to break memory-based dependences, and (3) the lack of support for integrating these storage transformations with other code transformations (e.g., statement reordering) to enable vectorization.

In this paper, we address the three limitations above by integrating both Source Variable Renaming (SoVR) and Sink Variable Renaming (SiVR) transformations into a unified formulation, and by formalizing the “cycle-breaking” problem as a minimum weighted set cover optimization problem. To the best of our knowledge, our work is the first to formalize an optimal solution for cycle breaking that simultaneously considers both SoVR and SiVR transformations, thereby enhancing vectorization and reducing storage expansion relative to performing the transformations independently. We implemented our approach in PPCG, a state-of-the-art optimization framework for loop transformations, and evaluated it on eleven kernels from the TSVC benchmark suite. Our experimental results show a geometric mean performance improvement of 4.61× on an Intel Xeon Phi (KNL) machine relative to the optimized performance obtained by Intel’s ICC v17.0 product compiler. Further, our results demonstrate a geometric mean performance improvement of 1.08× and 1.14× on the Intel Xeon Phi (KNL) and Nvidia Tesla V100 (Volta) platforms relative to past work that only performs the SiVR transformation [5], and of 1.57× and 1.22× on the same platforms relative to past work on using both SiVR and SoVR transformations [8].
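The effect of sink renaming on a dependence cycle can be illustrated with a minimal sketch (the loop, array names, and the copy-back placement are invented for illustration and are not taken from the paper): statement S1 writes `a[i]` and S2 reads both `a[i]` (a flow dependence from S1) and the not-yet-overwritten `a[i+1]` (an anti dependence back to S1), forming a cycle that blocks vectorization. Writing S1's results to a fresh array redirects the anti dependence to a copy-back statement outside the cycle:

```python
def original(a, b):
    # Cycle: S1 ->(flow on a[i]) S2 and S2 ->(anti on a[i+1]) S1.
    # S2 must read the *old* a[i+1] before iteration i+1 overwrites it,
    # so the loop cannot be vectorized as-is.
    n = len(b)
    c = [0.0] * n
    for i in range(n):
        a[i] = b[i] + 1.0            # S1: writes a[i]
        c[i] = a[i] + a[i + 1]       # S2: reads new a[i] and old a[i+1]
    return c

def renamed(a, b):
    # SiVR-style sketch: S1 writes a fresh array a_new, so S2's read of
    # the old a[i+1] no longer conflicts with S1's write. The anti
    # dependence now targets only the copy-back statement, which lies
    # outside the cycle; each statement can then be vectorized.
    n = len(b)
    a_new = [0.0] * (n + 1)
    c = [0.0] * n
    for i in range(n):
        a_new[i] = b[i] + 1.0        # S1': renamed sink
        c[i] = a_new[i] + a[i + 1]   # S2: old a[i+1] is still intact here
        a[i] = a_new[i]              # copy-back restores a's final values
    return c
```

Both versions produce identical results; the renamed version pays for the extra array `a_new`, which is exactly the storage cost the paper's formulation seeks to bound.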


Keywords: Vectorization · Renaming · Storage transformations · Polyhedral compilers · Intel KNL · Nvidia Volta · TSVC Suite · SIMD
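The set-cover view of cycle breaking described in the abstract can be made concrete with a small sketch: each candidate renaming transformation covers the set of dependence cycles it breaks, at a storage cost, and the goal is to cover all cycles at minimum total cost. The paper formalizes an optimal solution; the greedy cost-effectiveness heuristic below is only the textbook approximation for weighted set cover, shown here to illustrate the problem shape, and the transformation names, costs, and cycle sets are invented:

```python
def greedy_weighted_set_cover(universe, candidates):
    """Approximate minimum weighted set cover.

    candidates: dict mapping a name to (cost, frozenset of covered elements).
    Assumes every element of `universe` is coverable by some candidate.
    Returns the chosen names (classic greedy ln(n)-approximation).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the candidate with the best cost per newly covered element.
        name, (cost, elems) = min(
            ((nm, cs) for nm, cs in candidates.items() if cs[1] & uncovered),
            key=lambda kv: kv[1][0] / len(kv[1][1] & uncovered),
        )
        chosen.append(name)
        uncovered -= elems
    return chosen

# Hypothetical instance: three dependence cycles, four candidate renamings.
cycles = {"C1", "C2", "C3"}
candidates = {
    "SoVR@S1": (1, frozenset({"C1"})),          # cheap, breaks one cycle
    "SiVR@S2": (4, frozenset({"C1", "C2", "C3"})),  # expensive, breaks all
    "SoVR@S3": (1, frozenset({"C2"})),
    "SoVR@S4": (1, frozenset({"C3"})),
}
plan = greedy_weighted_set_cover(cycles, candidates)
```

In this toy instance the three cheap SoVR transformations (total cost 3) beat the single expensive SiVR (cost 4), mirroring the abstract's point that choosing among SoVR and SiVR jointly can reduce storage expansion.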


References

  1. Baghdadi, R., et al.: PENCIL: a platform-neutral compute intermediate language for accelerator programming. In: Proceedings of the 2015 International Conference on Parallel Architecture and Compilation, PACT 2015, pp. 138–149. IEEE Computer Society, Washington, DC (2015)
  2. Bhaskaracharya, S.G., Bondhugula, U., Cohen, A.: SMO: an integrated approach to intra-array and inter-array storage optimization. In: Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, pp. 526–538. ACM, New York (2016)
  3. Bondhugula, U., Acharya, A., Cohen, A.: The Pluto+ algorithm: a practical approach for parallelization and locality optimization of affine loop nests. ACM Trans. Program. Lang. Syst. 38(3), 12:1–12:32 (2016)
  4. Callahan, D., Dongarra, J., Levine, D.: Vectorizing compilers: a test suite and results. In: Proceedings of the 1988 ACM/IEEE Conference on Supercomputing, Supercomputing 1988, pp. 98–105. IEEE Computer Society Press, Los Alamitos (1988)
  5. Calland, P., Darte, A., Robert, Y., Vivien, F.: On the removal of anti- and output-dependences. Int. J. Parallel Program. 26(2), 285–312 (1998)
  6. Chang, W.L., Chu, C.P., Ho, M.S.H.: Exploitation of parallelism to nested loops with dependence cycles. J. Syst. Arch. 50(12), 729–742 (2004)
  7. Chu, C.P.: A theoretical approach involving recurrence resolution, dependence cycle statement ordering and subroutine transformation for the exploitation of parallelism in sequential code. Ph.D. thesis, Louisiana State University, Baton Rouge, LA, USA (1992). UMI Order No. GAX92-07498
  8. Chu, C.P., Carver, D.L.: An analysis of recurrence relations in Fortran Do-loops for vector processing. In: Proceedings of the Fifth International Parallel Processing Symposium, pp. 619–625, April 1991
  9. Evans, G.C., Abraham, S., Kuhn, B., Padua, D.A.: Vector seeker: a tool for finding vector potential. In: Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing, WPMVP 2014, pp. 41–48. ACM, New York (2014)
  10. Feautrier, P.: Array expansion. In: Proceedings of the 2nd International Conference on Supercomputing, ICS 1988, pp. 429–441. ACM, New York (1988)
  11. Hopcroft, J., Tarjan, R.: Algorithm 447: efficient algorithms for graph manipulation. Commun. ACM 16(6), 372–378 (1973)
  12. Johnson, D.B.: Finding all the elementary circuits of a directed graph. SIAM J. Comput. 4(1), 77–84 (1975)
  13. Kennedy, K., Allen, J.R.: Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. Morgan Kaufmann Publishers Inc., San Francisco (2002)
  14. Knobe, K., Sarkar, V.: Array SSA form and its use in parallelization. In: Proceedings of the 25th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 1998, pp. 107–120. ACM, New York (1998)
  15. Kuck, D.J., Kuhn, R.H., Padua, D.A., Leasure, B., Wolfe, M.: Dependence graphs and compiler optimizations. In: Proceedings of the 8th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 1981, pp. 207–218. ACM, New York (1981)
  16. Maleki, S., Gao, Y., Garzarán, M.J., Wong, T., Padua, D.A.: An evaluation of vectorizing compilers. In: Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, PACT 2011, pp. 372–382. IEEE Computer Society, Washington, DC (2011)
  17. Rus, S., He, G., Alias, C., Rauchwerger, L.: Region array SSA. In: Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, PACT 2006, pp. 43–52. ACM, New York (2006)
  18. Stephens, N., et al.: The ARM scalable vector extension. IEEE Micro 37(2), 26–39 (2017)
  19. Verdoolaege, S.: isl: an integer set library for the polyhedral model. In: Fukuda, K., Hoeven, J., Joswig, M., Takayama, N. (eds.) ICMS 2010. LNCS, vol. 6327, pp. 299–302. Springer, Heidelberg (2010)
  20. Verdoolaege, S., Juega, J.C., Cohen, A., Gómez, J.I., Tenllado, C., Catthoor, F.: Polyhedral parallel code generation for CUDA. ACM Trans. Archit. Code Optim. 9(4), 54:1–54:23 (2013)
  21. Weiss, M.: Strip mining on SIMD architectures. In: Proceedings of the 5th International Conference on Supercomputing, ICS 1991, pp. 234–243. ACM, New York (1991)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Prasanth Chatarasi (1) (corresponding author)
  • Jun Shirako (1)
  • Albert Cohen (2)
  • Vivek Sarkar (1)
  1. Georgia Institute of Technology, Atlanta, USA
  2. INRIA & DI ENS, Paris, France