Refactoring Loops with Nested IFs for SIMD Extensions Without Masked Instructions

  • Huihui Sun
  • Sergei Gorlatch
  • Rongcai Zhao
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11339)


Most CPUs in heterogeneous systems are now equipped with SIMD (Single Instruction Multiple Data) extensions that operate on short vectors in parallel to enable high performance. Refactoring programs for such systems relies on vectorization, i.e., transforming them into a form that uses SIMD instructions. We improve the state of the art in refactoring loops with nested IF-statements, which are notoriously difficult to vectorize. For IF-statements whose conditions are independent of the loop variable, we extend the classical loop unswitching method so that it can handle nested IFs. For IF-statements whose conditions change with loop iterations, we develop a novel IF-select transformation method with two properties: (1) it works with arbitrarily nested IFs, and (2) whereas previous methods rely on either masked instructions or hardware support for predicated execution, our method works for SIMD extensions without such operations (as found, e.g., in IBM Power8 and ARM Cortex-A8). Our experimental evaluation on the SPEC CPU2006 benchmark suite is conducted on an SW26010 processor, as used in the Sunway TaihuLight supercomputer (#2 in the TOP500 list); it demonstrates the performance advantages of our implemented approach over the vectorizer of the Open64 compiler.
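The two situations distinguished in the abstract can be illustrated with a small scalar C sketch. It is not the authors' algorithm: the function names, the sample loop body, and the mask-based select are assumptions chosen for illustration only. The outer condition (flag) does not depend on the loop variable, so it is hoisted out of the loop in the style of classical loop unswitching. The inner condition (a[i] > 0) changes with each iteration, so both branch values are computed and one is chosen with a branch-free select built from plain AND, OR, and NOT, which is the kind of operation available even without masked or predicated instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference version with nested IFs: the outer condition (flag) is
   loop-invariant, the inner condition (a[i] > 0) varies per iteration. */
void nested_if_reference(const int32_t *a, const int32_t *b,
                         int32_t *out, size_t n, int flag)
{
    assert(a && b && out);
    for (size_t i = 0; i < n; i++) {
        if (flag) {
            if (a[i] > 0)
                out[i] = a[i] + b[i];
            else
                out[i] = a[i] - b[i];
        } else {
            out[i] = b[i];
        }
    }
}

/* Hypothetical helper: returns x where mask is all ones and y where mask
   is zero, using only bitwise operations. The cast back to int32_t is
   implementation-defined in C but wraps on common two's-complement
   compilers. */
static inline int32_t select_i32(uint32_t mask, int32_t x, int32_t y)
{
    return (int32_t)((mask & (uint32_t)x) | (~mask & (uint32_t)y));
}

/* Refactored version (illustrative assumption, not the paper's code). */
void nested_if_refactored(const int32_t *a, const int32_t *b,
                          int32_t *out, size_t n, int flag)
{
    assert(a && b && out);
    if (flag) {
        /* Unswitched copy for flag != 0: the loop body has no branch. */
        for (size_t i = 0; i < n; i++) {
            int32_t then_val = a[i] + b[i];   /* value of the THEN branch */
            int32_t else_val = a[i] - b[i];   /* value of the ELSE branch */
            /* All ones if a[i] > 0, otherwise zero. */
            uint32_t mask = (uint32_t)0 - (uint32_t)(a[i] > 0);
            out[i] = select_i32(mask, then_val, else_val);
        }
    } else {
        /* Unswitched copy for flag == 0: a plain copy loop. */
        for (size_t i = 0; i < n; i++)
            out[i] = b[i];
    }
}
```

Both unswitched loops are straight-line code, which is the form a vectorizer can map onto SIMD lanes. Note that the select form evaluates both branches for every element, so it is only valid when neither branch has side effects or can fault.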


Keywords: SIMD extensions, Nested IF-statements, Loop vectorization, Loop unswitching, IF-select transformation



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. University of Münster, Münster, Germany
  2. National Digital Switching System Engineering and Technological Research Center, Zhengzhou, China
