
1 Introduction

The LU factorization with partial pivoting is the most commonly used method for solving general dense linear systems. The pivoting step improves the numerical stability of the method. Although it requires no extra floating-point operations, selecting the pivots involves \(\mathcal {O}(n^2)\) comparisons, and swapping the rows of the matrix involves extra data movement. These aspects can degrade the performance of the LU factorization because of the cache invalidations they induce.

As a motivation for this work, let us evaluate the overhead of the pivoting step of the LU factorization with partial pivoting on both GPU and Intel Xeon Phi accelerators. To use accelerators in dense linear algebra computations, we base our work on the MAGMA library [4, 8, 16], which provides LAPACK interface functions for GPUs and the Intel Xeon Phi. Figure 1a shows the results obtained by running the corresponding MAGMA routine on an NVIDIA Tesla K20 GPU accelerator. We observe that pivoting takes more than 20% of the total computational time for matrices of size smaller than \(10^4\). However, for larger matrices, the pivoting overhead is reduced and most of the computational time is spent performing the matrix-matrix products (DGEMM) on the GPU. Figure 1b displays the pivoting overhead using an Intel Xeon Phi 7120 coprocessor. Experiments show that the Intel Xeon Phi version of the factorization needs larger matrices than the GPU version to be efficient. Indeed, for a matrix of order 6000, the LU-based solver reaches around 200 Gflop/s on the Xeon Phi whereas it reaches around 500 Gflop/s on the GPU. As the problem size increases, the performance of both versions tends towards 800 Gflop/s (in double precision). We note that for small matrices, the pivoting overhead on the Xeon Phi is proportionally smaller than on the GPU.

Fig. 1. Time breakdown for pivoting in the LU factorization.

The previous experiments show that pivoting is a bottleneck in terms of communication cost and parallelism for hybrid CPU/accelerator architectures. To reduce the communication cost of classical pivoting strategies such as partial pivoting, alternative pivoting techniques have been proposed in the context of communication-avoiding LU algorithms (CALU, then CALU_PRRP) [6, 12]. These techniques are based on tournament pivoting, which was shown to be as stable as partial pivoting in practice.

Another approach consists in avoiding pivoting altogether, thereby improving the performance of the factorization. This approach is based on the use of Random Butterfly Transformations (RBT). It was first described in [14, 15], and recently revisited for general systems in [3] and for symmetric indefinite systems in [1, 2]. The main difference between the RBT-based methods and the classical factorization methods is a randomization step, which recursively applies a sequence of butterfly matrices to the input matrix. The main advantage of randomizing is that it allows us to avoid the communication overhead due to pivoting. Tests performed on a collection of matrices [3] show that in practice two recursion levels are sufficient to obtain satisfactory accuracy.

The RBT solvers are particularly suitable for accelerators. On the one hand, avoiding pivoting on accelerators has a significant impact on performance. On the other hand, the structure of the butterfly matrices can be exploited to perform the randomization at a very low cost. In this work we present the implementation details of a randomized LU-based solver using GPU and Intel Xeon Phi accelerators and discuss its performance on both accelerators.

The remainder of this paper is organized as follows. Section 2 recalls the main principles of the RBT algorithm and how it can be used in a hybrid CPU/accelerator factorization. Sections 3 and 4 describe the implementation and performance of the RBT solver for GPU and Intel Xeon Phi, respectively. Section 5 has concluding remarks.

2 Hybrid RBT Solver

To solve a general linear system \(Ax=b\) using a solver based on RBT, we perform the following steps:

  • Compute \(A_r = U^TAV\), where U and V are random recursive butterfly matrices,

  • Factorize \(A_r\) using Gaussian Elimination with No Pivoting (GENP),

  • Solve \(A_ry = U^Tb\), then \(x=Vy\).

We recall that an n-by-n butterfly matrix B has the following structure,

$$ B = \frac{1}{\sqrt{2}}\begin{pmatrix} R & S \\ R & -S \end{pmatrix}, $$

where R and S are two random nonsingular n/2-by-n/2 diagonal matrices. The matrix B can thus be stored in a vector of n elements. A recursive butterfly matrix of depth d is defined as

$$\begin{aligned} W^{<n,d>} = \begin{pmatrix} B_1^{<n/2^{d-1}>} & & 0 \\ & \ddots & \\ 0 & & B_{2^{d-1}}^{<n/2^{d-1}>} \end{pmatrix} \times \dots \times \begin{pmatrix} B_1^{<n/2>} & 0 \\ 0 & B_2^{<n/2>} \end{pmatrix} \times B^{<n>}, \end{aligned}$$
(1)

where all \(B_i^{<n>}\) blocks are size n butterfly matrices. When n is not a multiple of \(2^d\), we “augment” the matrix A with additional 1’s on the diagonal.
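For example, for \(n=4\) and \(d=2\), the recursive butterfly combines three depth-one butterflies,

$$ W^{<4,2>} = \begin{pmatrix} B_1^{<2>} & 0 \\ 0 & B_2^{<2>} \end{pmatrix} \times B^{<4>}, \qquad B_i^{<2>} = \frac{1}{\sqrt{2}} \begin{pmatrix} r_i & s_i \\ r_i & -s_i \end{pmatrix}, $$

so a recursive butterfly of depth d involves only dn random values (here \(2 \times 4 = 8\)) and can be stored in packed form.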

Note that the GENP algorithm can be unstable due to a potentially large growth factor. This is why we systematically perform iterative refinement on the computed solution of the randomized system. In this work, we use two recursion levels for the randomization (\(d=2\)). The randomization then costs \(8n^2\) flops, thanks to the block-diagonal structure of the butterfly matrices, as demonstrated in [3]. The RBT algorithm adapted to hybrid architectures (CPU with an accelerator) performs the following steps:

  1. Random generation and packed storage of the butterflies U and V on the host (CPU), while sending A to the device (accelerator) memory (padding is added if the size of the matrix A is not a multiple of 4).

  2. The packed U and V are sent from the host memory to the device memory.

  3. The randomization of A is performed on the device, in place (no additional memory is needed).

  4. The randomized matrix is factorized with GENP: the panel is factored on the host and the update of the trailing submatrix is performed on the device.

  5. We compute \(U^Tb\), and then solve \(A_ry = U^Tb\) on the device.

  6. If necessary, iterative refinement is applied to y on the device.

  7. We compute \(x = Vy\) on the device, and then send x to the host memory.

Let us now describe the randomization phase (step 3) using two n-by-n recursive butterfly matrices U and V of depth two. We consider that the input matrix A can be split into four blocks of the same size, \(A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}\). We consider the matrices \(U = U_2 \times U_1\) and \(V = V_2 \times V_1\), where \(U_1\), \(V_1\) are two butterfly matrices, and \(U_2\), \(V_2\) are two matrices of the form \(\begin{pmatrix} B_1 & 0 \\ 0 & B_2 \end{pmatrix}\), with \(B_1\) and \(B_2\) two n/2-by-n/2 butterfly matrices, as illustrated in Eq. 1. We have \(A_r = U^T A V = U^T_1 \times U^T_2 \times A \times V_2 \times V_1\). Thus we first apply \(U^T_2\) and \(V_2\) to A. We denote by \(A_r^1 = U^T_2 \times A \times V_2\) the matrix resulting from the first recursion level. Then,

\( A_r^1 = \begin{pmatrix} B^T_1 & 0 \\ 0 & B^T_2 \end{pmatrix} \times \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \times \begin{pmatrix} B_1 & 0 \\ 0 & B_2 \end{pmatrix} = \begin{pmatrix} B^T_1 A_{11} B_1 & B^T_1 A_{12} B_2 \\ B^T_2 A_{21} B_1 & B^T_2 A_{22} B_2 \end{pmatrix}. \)

This step consists of four independent products with depth-one butterfly matrices of size n/2-by-n/2. We call Elementary multiplication the kernel used for each product of the form \(U^T \times A \times V\). We then compute \(A_r\) by applying \(U^T_1\) and \(V_1\) to \(A_r^1\), again using the Elementary multiplication kernel. Implementation details of this kernel are given in the next sections for both the GPU and the Intel Xeon Phi accelerators.
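Written elementwise, with the depth-one butterflies U and V stored as vectors u and v of length n (the first n/2 entries holding the diagonal of R, the last n/2 entries the diagonal of S) and with \(a_{i,j}\) denoting the entries of A, the definition of B gives, for \(0 \le i,j < n/2\),

$$\begin{aligned} A_r(i,j) &= \tfrac{1}{2}\, u_i v_j \left[ (a_{i,j} + a_{i+n/2,j}) + (a_{i,j+n/2} + a_{i+n/2,j+n/2}) \right],\\ A_r(i,j+n/2) &= \tfrac{1}{2}\, u_i v_{j+n/2} \left[ (a_{i,j} + a_{i+n/2,j}) - (a_{i,j+n/2} + a_{i+n/2,j+n/2}) \right],\\ A_r(i+n/2,j) &= \tfrac{1}{2}\, u_{i+n/2} v_j \left[ (a_{i,j} - a_{i+n/2,j}) + (a_{i,j+n/2} - a_{i+n/2,j+n/2}) \right],\\ A_r(i+n/2,j+n/2) &= \tfrac{1}{2}\, u_{i+n/2} v_{j+n/2} \left[ (a_{i,j} - a_{i+n/2,j}) - (a_{i,j+n/2} - a_{i+n/2,j+n/2}) \right]. \end{aligned}$$

Each entry of the result thus depends on only four entries of A, which is what makes the Elementary multiplication inexpensive and easy to parallelize.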

3 RBT for Graphics Processing Units

Here we present our randomized LU-based solver using GPU. In particular, we give implementation details of the randomization step, which are specific to the targeted accelerator. We note that our RBT solver is available in the four precisions used in LAPACK (single, double, single complex and double complex) and has been part of the MAGMA library since version 1.6.0.

3.1 Implementation

On hybrid CPU/GPU architectures, the RBT solver proceeds as described in Sect. 2. Algorithm 1 describes the randomization steps performed on a given matrix A. It applies the depth-two RBT to the matrix A by first processing each n/2-by-n/2 quarter block of A (lines \(5 \dots 8\) in Algorithm 1), and then applying the level-one recursion to the whole n-by-n matrix (line 10 in Algorithm 1), as described in Sect. 2. Applying the level-two randomization consists in calling a specific GPU kernel, the Elementary Multiplication GPU kernel, on each quarter of the matrix; this is possible thanks to the block-diagonal structure of the butterfly matrices. Each call to the Elementary Multiplication GPU kernel uses one GPU thread per element.

[Algorithm 1: randomization of the matrix A with depth-two recursive butterflies on the GPU]
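To illustrate the structure of Algorithm 1, a minimal host-side sketch is given below. It is not the actual MAGMA routine: the function names, the packed layout assumed for the butterfly vectors (first n values for the level-one butterfly, next n values for the two level-two butterflies), and the column-major offsets are illustrative assumptions.

// Host-side sketch of Algorithm 1 (illustrative; not the actual MAGMA routine).
// d_A is the n-by-n matrix on the device, stored column-major with leading
// dimension ld. d_u and d_v are assumed to hold the packed recursive
// butterflies (2*n values each): the first n values for the level-one
// butterfly, the next n values for the two level-two butterflies of size n/2.
void elementary_mult(int m, double *d_A, int ld,
                     const double *d_u, const double *d_v); // see next sketch

void rbt_randomize(int n, double *d_A, int ld,
                   const double *d_u, const double *d_v)
{
    int h = n / 2;
    // Level two: one depth-one butterfly applied on each side of every
    // n/2-by-n/2 quarter (lines 5..8 of Algorithm 1).
    elementary_mult(h, d_A,              ld, d_u + n,     d_v + n);      // A11
    elementary_mult(h, d_A + h * ld,     ld, d_u + n,     d_v + n + h);  // A12
    elementary_mult(h, d_A + h,          ld, d_u + n + h, d_v + n);      // A21
    elementary_mult(h, d_A + h + h * ld, ld, d_u + n + h, d_v + n + h);  // A22
    // Level one: apply the size-n butterfly to the whole matrix (line 10).
    elementary_mult(n, d_A, ld, d_u, d_v);
}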

The Elementary Multiplication GPU kernel performs \(A \leftarrow U^TAV\), where U and V are vectors of size n containing the entries of the depth-one random butterfly matrices. Algorithm 2 shows the implementation details of the Elementary Multiplication GPU kernel. For each block of threads, we use shared memory arrays to store the elements of U and V needed by that block, thereby improving the efficiency of the accesses to these elements.

[Algorithm 2: the Elementary Multiplication GPU kernel]
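The sketch below gives an illustrative CUDA version of this kernel (again, not the actual MAGMA code). To keep the in-place update race free, each thread here handles the four related elements of a 2-by-2 group rather than a single element, and the shared-memory staging of U and V described above is omitted for brevity.

#include <cuda_runtime.h>

// Illustrative CUDA sketch of the elementary multiplication A <- U^T A V for a
// depth-one butterfly of size m = 2*h (not the actual MAGMA kernel).
__global__ void elementary_mult_kernel(int h, double *A, int ld,
                                        const double *u, const double *v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // row in the top-left quarter
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // column in the top-left quarter
    if (i >= h || j >= h) return;

    double a00 = A[i     +  j      * ld], a01 = A[i     + (j + h) * ld];
    double a10 = A[i + h +  j      * ld], a11 = A[i + h + (j + h) * ld];
    double u0 = u[i], u1 = u[i + h];   // diagonals of R and S for the rows
    double v0 = v[j], v1 = v[j + h];   // diagonals of R and S for the columns

    // In-place update of the four related elements (see Sect. 2).
    A[i     +  j      * ld] = 0.5 * u0 * v0 * ((a00 + a10) + (a01 + a11));
    A[i     + (j + h) * ld] = 0.5 * u0 * v1 * ((a00 + a10) - (a01 + a11));
    A[i + h +  j      * ld] = 0.5 * u1 * v0 * ((a00 - a10) + (a01 - a11));
    A[i + h + (j + h) * ld] = 0.5 * u1 * v1 * ((a00 - a10) - (a01 - a11));
}

// Host wrapper used in the previous sketch.
void elementary_mult(int m, double *d_A, int ld,
                     const double *d_u, const double *d_v)
{
    int h = m / 2;
    dim3 threads(16, 16);
    dim3 grid((h + threads.x - 1) / threads.x, (h + threads.y - 1) / threads.y);
    elementary_mult_kernel<<<grid, threads>>>(h, d_A, ld, d_u, d_v);
}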

3.2 Performance Results

In this section, we present performance results of our randomized LU-based solver on GPU. The experiments were carried out on a system composed of an NVIDIA Kepler K20 GPU, with 2496 CUDA cores running at 706 MHz and 4800 MB of memory, and a multicore host composed of two Intel Xeon X5680 processors, each with 6 physical cores running at 3.33 GHz and 12 MB of Level 3 cache. The CPU parts of our code use the multithreaded Intel MKL library [9].

Figure 2a shows that the CUDA [13] implementation of our RBT solver (with or without iterative refinement) outperforms the classical LU factorization with partial pivoting from MAGMA. For large enough matrices (from size 6000), it is about 20 to 30% faster than the solver based on Gaussian elimination with partial pivoting. In our experiments, when iterative refinement is enabled, one iteration is generally enough to bring the computed solution to an accuracy similar to the one obtained with the LU factorization with partial pivoting. The iterative refinement is performed on the GPU and requires \(\mathcal {O}(n^2)\) extra floating-point operations, which is a lower-order term in our case and has no significant impact on the performance of our RBT solver.

In Fig. 2b, we can see that the time required to perform the randomization is less than 4% of the computational time for small matrices and becomes less than 2% for larger matrices. This is due to the low computational cost of the randomization (\(8n^2\) flops) and to our optimized implementation, which exploits the capabilities of the GPU accelerator.

Fig. 2. Randomized LU-based solver on GPU.

4 RBT for Intel Xeon Phi

Similarly to the previous section, we present our implementation of the RBT solver on an Intel Xeon Phi coprocessor and discuss its performance. This solver and all the required routines (randomization, LU factorization with no pivoting, iterative refinement) are part of the MAGMA MIC library (version 1.3).

4.1 Implementation

Algorithm 3 presents the randomization routine, using depth-two butterfly matrices. It is similar to its GPU counterpart, except that there are no blocks or threads to deal with inside this routine.

[Algorithm 3: randomization routine for the Intel Xeon Phi]

The Elementary multiplication Phi kernel, described in Algorithm 4, uses SIMD instructions [5] to improve the performance of each core and OpenMP to handle thread parallelism across cores. The algorithm is well suited to the SIMD programming model because the data dependencies are separated by a large number of values. In Algorithm 4, we use double precision floating-point numbers, each occupying 64 bits, which is why 8 values are stored in each 512-bit SIMD vector. With 32-bit reals, 16 values are stored in each vector; for complex numbers, 8 values fit in single precision and 4 in double precision. We take advantage of the SIMD capabilities of the Intel Xeon Phi coprocessor by using the low-level Knights Corner intrinsics [10, 11], which expose the SIMD assembly instructions through C-style functions.

[Algorithm 4: the Elementary multiplication Phi kernel]
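As an illustration, a portable C sketch of the same computation is shown below. It is not the MAGMA MIC kernel: it relies on OpenMP threading across columns and on compiler auto-vectorization (#pragma omp simd) of the contiguous column accesses, instead of the explicit Knights Corner intrinsics used in Algorithm 4. The matrix is assumed to be stored in column-major order with leading dimension ld, and the function name is illustrative.

// Illustrative C sketch of the elementary multiplication on the coprocessor
// (not the MAGMA MIC kernel). OpenMP distributes the columns over the cores;
// the inner loop runs down contiguous columns and is left to the compiler's
// SIMD vectorizer.
void elementary_mult_phi(int n, double *A, int ld,
                         const double *u, const double *v)
{
    const int h = n / 2;
    #pragma omp parallel for
    for (int j = 0; j < h; ++j) {
        const double v0 = 0.5 * v[j];       // column scaling, R part (1/2 folded in)
        const double v1 = 0.5 * v[j + h];   // column scaling, S part
        #pragma omp simd
        for (int i = 0; i < h; ++i) {
            const double u0 = u[i], u1 = u[i + h];  // row scalings
            const double a00 = A[i     +  j      * ld], a01 = A[i     + (j + h) * ld];
            const double a10 = A[i + h +  j      * ld], a11 = A[i + h + (j + h) * ld];
            // In-place update of the four related elements (A <- U^T A V).
            A[i     +  j      * ld] = u0 * v0 * ((a00 + a10) + (a01 + a11));
            A[i     + (j + h) * ld] = u0 * v1 * ((a00 + a10) - (a01 + a11));
            A[i + h +  j      * ld] = u1 * v0 * ((a00 - a10) + (a01 - a11));
            A[i + h + (j + h) * ld] = u1 * v1 * ((a00 - a10) - (a01 - a11));
        }
    }
}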

4.2 Performance Results

Here we present the performance results of our solver. The experiments were carried out using the same multicore host as in Sect. 3.2 (two Intel Xeon X5680) and an Intel Xeon Phi 7120 coprocessor with 61 cores running at 1.238 GHz and 16 GB of memory. The cores share 30.5 MB of combined L2 cache. Each core can run 4 hardware threads; for the experiments, we use 240 threads in total. Note that we were able to run tests on larger matrices than with the GPU version, owing to the larger memory of the Intel Xeon Phi.

In Fig. 3a, we notice that on the Intel Xeon Phi our solver is up to 50% faster than the solver using partial pivoting when no iterative refinement is performed, and only 25% faster with iterative refinement, which is not yet optimized for the Intel Xeon Phi.

In Fig. 3b, we observe that the randomization requires less than 3% of the total time, and even less than 1% for larger matrices. We recall that the randomization on the Intel Xeon Phi has been optimized using SIMD instructions and OpenMP.

Fig. 3. Randomized LU-based solver on Intel Xeon Phi.

5 Conclusion

In this paper, we have presented two implementations of the RBT solver using accelerators, based respectively on GPU and Intel Xeon Phi, resulting in routines that are significantly faster than the reference solver based on the LU factorization with partial pivoting. Thanks to an efficient implementation of the randomization, the overhead of randomizing the original system is negligible compared to the computational cost of the whole solver. Ongoing work includes optimizing the iterative refinement on the Intel Xeon Phi and solving multiple small systems simultaneously using batched solvers [7].