Performance and energy consumption of the SIMD Gram–Schmidt process for vector orthogonalization

  • Thomas Jakobs
  • Billy Naumann
  • Gudula Rünger


In linear algebra and numerical computing, the orthogonalization of a set of vectors is an important subtask. An efficient implementation on recent architectures is therefore needed to provide a useful kernel for high-performance applications. In this article, we consider orthogonalizing a set of vectors with the Gram–Schmidt method and develop SIMD implementations for processors that provide the Advanced Vector Extensions (AVX), a set of instructions for SIMD execution on recent Intel and AMD CPUs. Several SIMD implementations of the Gram–Schmidt process are built, and their behavior with respect to performance and energy consumption is investigated. In particular, different ways to implement the SIMD programs are proposed, and several optimizations are studied. The hardware platforms used are Intel Core, Xeon and Xeon Phi processors with the AVX versions AVX, AVX2 and AVX512.
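For readers unfamiliar with the method, the following is a minimal scalar sketch of Gram–Schmidt orthogonalization in Python. It is not the authors' AVX code: the abstract does not say which variant is vectorized, so the modified variant is assumed here, and the names `dot`, `gram_schmidt` and `tol` are hypothetical. The inner product and the projection update are the two inner loops that a SIMD implementation would typically map to vector instructions.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors (a reduction loop)."""
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(vectors, tol=1e-12):
    """Orthonormalize a list of equal-length vectors.

    Assumption: modified Gram-Schmidt, where each projection uses the
    already updated working vector. Vectors that are numerically
    dependent on earlier ones (norm <= tol) are skipped.
    """
    basis = []
    for v in vectors:
        w = list(v)  # work on a copy so the input stays unchanged
        for q in basis:
            r = dot(q, w)  # projection coefficient onto q
            w = [wi - r * qi for wi, qi in zip(w, q)]  # subtract projection
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis
```

Calling `gram_schmidt([[3.0, 1.0], [2.0, 2.0]])` returns two unit vectors whose inner product is zero up to rounding error.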


Keywords: SIMD, AVX, Linear algebra, Gram–Schmidt method, Energy consumption, Frequency scaling



This work was supported by the German Ministry of Science and Education (BMBF) project SeASiTe, Grant No. 01IH16012B.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Faculty of Computer Science, Chemnitz University of Technology, Chemnitz, Germany
