Vectorization of Flat Loops of Arbitrary Structure Using Instructions AVX-512


The widespread use of supercomputer technologies in many areas, together with the growing demand for high-performance computing, makes improving the performance of computational codes on modern supercomputer architectures a pressing problem. Vectorization of program code is a low-level optimization that, applied locally and compactly, can speed up computational codes severalfold. Modern Intel microprocessors support the AVX-512 instruction set, whose distinctive features make it possible to vectorize almost any code written in predicated form. Following a set of simple restrictions during program development, together with vectorization tools that enable use of the AVX-512 instruction set, can significantly speed up the resulting program. The article discusses approaches to vectorizing flat loops, a special kind of program context whose successful vectorization can increase the performance of supercomputer applications even for code that optimizing compilers fail to vectorize.
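As an illustration of what "predicated form" can mean in practice, the following minimal sketch rewrites a loop with a branch so that every statement is guarded by a lane mask. It is not the authors' code: the function names, the example computation, and the assumption that iterations are independent are all hypothetical, and the 16-bit masks are emulated in portable C rather than held in AVX-512 mask registers. On AVX-512 hardware, each inner lane loop would be expected to map to a single masked vector instruction.

```c
#include <stddef.h>
#include <stdint.h>

#define VL 16 /* 32-bit lanes in one 512-bit vector */

/* Hypothetical example: original loop body with a branch. */
void scale_branchy(const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (x[i] < 0.0f) {
            y[i] = 0.0f;
        } else {
            y[i] = 2.0f * x[i];
        }
    }
}

/* Same computation in predicated form: the branch becomes two masks,
 * and each statement runs only in the lanes selected by its mask.
 * The masks are emulated here with plain integers (an assumption made
 * so that the sketch runs on any CPU). */
void scale_predicated(const float *x, float *y, size_t n) {
    for (size_t base = 0; base < n; base += VL) {
        uint16_t k_tail = 0; /* lanes that lie inside the array */
        uint16_t k_neg = 0;  /* lanes where the condition x < 0 holds */
        for (int l = 0; l < VL; l++) {
            if (base + (size_t)l < n) {
                k_tail |= (uint16_t)(1u << l);
                if (x[base + l] < 0.0f) {
                    k_neg |= (uint16_t)(1u << l);
                }
            }
        }
        uint16_t k_then = k_neg;
        uint16_t k_else = (uint16_t)(k_tail & ~k_neg);

        /* "then" statement, executed under mask k_then */
        for (int l = 0; l < VL; l++) {
            if ((k_then >> l) & 1u) {
                y[base + l] = 0.0f;
            }
        }
        /* "else" statement, executed under mask k_else */
        for (int l = 0; l < VL; l++) {
            if ((k_else >> l) & 1u) {
                y[base + l] = 2.0f * x[base + l];
            }
        }
    }
}
```

The tail mask also covers array lengths that are not a multiple of the vector width, so no separate scalar remainder loop is needed in this sketch.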





The supercomputer MVS-10P, located at the JSCC RAS, was used for calculations during the research.


The work has been done at the JSCC RAS as part of the state assignment for the topic 0065-2019-0016 (reg. no. AAAA-A19-119011590098-8).

Author information



Corresponding authors

Correspondence to G. I. Savin or B. M. Shabanov or A. A. Rybakov or S. S. Shumilin.

Additional information

(Submitted by A. M. Elizarov)


About this article


Cite this article

Savin, G.I., Shabanov, B.M., Rybakov, A.A. et al. Vectorization of Flat Loops of Arbitrary Structure Using Instructions AVX-512. Lobachevskii J Math 41, 2575–2592 (2020).



  • supercomputers
  • vectorization
  • AVX-512
  • flat loop
  • predicated execution
  • intrinsic function