1 Introduction

The boundary element method (BEM) has several scientific applications. Because it requires only the surface of the target objects for analysis, it requires fewer unknowns and has a lower meshing cost than volume discretization methods. However, the computational cost and memory footprint of BEM analysis are high because a dense coefficient matrix is generated during the analysis. To overcome these problems, parallel computing and approximation techniques, such as hierarchical matrices (\(\mathcal {H}\)-matrices) [1,2,3], \(\mathcal {H}^2\)-matrices [4], and the fast multipole method (FMM) [5], are often used in BEM analysis. Although these techniques carry substantial programming costs, BEM-BB [6], an open-source software framework for parallel BEM analysis, is useful for reducing them. The framework employs \(\mathcal {H}\)-matrices to approximate the dense coefficient matrix, and it is parallelized using the MPI and OpenMP models. The BEM-BB framework allows for faster BEM analysis on parallel computers: users simply prepare programs to calculate the integrals of boundary elements, set the boundary conditions, and output the analysis results, while the parallelization and approximation programs are encapsulated in the framework. Thus, users can concentrate on the most important aspect of BEM analysis, namely, a user-defined function that calculates the element in the i-th row and j-th column of the coefficient matrix. This user-defined function varies depending on the targeted physical phenomena.

However, this framework does not consider single instruction, multiple data (SIMD) vectorization, which is important for achieving high performance on current processors. For example, recent Intel processors, such as Skylake EP/EX and Xeon Phi Knights Landing (KNL), support AVX-512, a 512-bit SIMD instruction set. Unlike MPI and OpenMP parallelization, SIMD vectorization cannot be separated from the user-defined functions, because it is instruction-level parallelization and the user-defined functions vary. Moreover, SIMD vectorization is difficult for application programmers because it requires knowledge of the compiler and the target processor architecture.

In this paper, we present a framework design based on BEM-BB for SIMD vectorization. A design that encapsulates the SIMD-related aspects is proposed. In addition, we evaluate the performance of the proposed framework by solving two problems with different user-defined functions, namely, static electric field analysis with a perfect conductor and static electric field analysis with a dielectric, on an Intel Broadwell (BDW) processor and an Intel Xeon Phi Knights Landing (KNL) processor. We compare the performance of the proposed framework with that of the original framework and of hand-tuned user-defined functions. The results show that the proposed framework offers performance improvements of 2.22x and 4.34x over the original framework on the BDW processor and the KNL processor, respectively. Furthermore, the experimental results demonstrate that the performance of the framework is comparable to that achieved using the hand-tuned programs.

The remainder of this paper is organized as follows. In Sect. 2, we provide an overview of the BEM-BB framework. The proposed framework is described in Sect. 3, and numerical experiments involving electric field analysis are described in Sect. 4. Related work is reviewed in Sect. 5, and conclusions and suggestions for future work are presented in Sect. 6.

2 BEM-BB Framework

In this section, the BEM-BB framework, which is the baseline implementation in this study, is introduced. The BEM-BB software framework is used for parallel BEM analysis. It is implemented in Fortran90 and parallelized using the OpenMP + MPI hybrid programming model. To reduce the cost of parallel programming, the framework supports model data input, assembly of the coefficient matrix, and solution of the linear system, steps that are generally required in BEM analysis. When employing this framework, users are required to provide user-defined functions that calculate each element of the coefficient matrix. In other words, users are required to implement a program that calculates the integrals of boundary elements, which depend on the governing equation targeted by the BEM analysis. The target integral equation of the BEM-BB framework is described as follows. For \(f \in H'\), \(u \in H\) and a kernel function of a convolution operator \(g:\mathbb {R}^d \times \varOmega \rightarrow \mathbb {R}\),

$$\begin{aligned} \int _{\varOmega } g(x,y)u(y)\mathrm{d}y&= f \end{aligned}$$
(1)

where \(\varOmega \subset \mathbb {R}^d\) denotes a (\(d-1\))-dimensional domain, H the Hilbert space of functions on \(\varOmega \), and \(H'\) the dual space of H. To numerically calculate Eq. (1), we divide the domain \(\varOmega \) into the elements \(\varOmega _h = \{\omega _j : j \in J\}\), where J is an index set. In weighted residual methods, such as the Ritz-Galerkin method and the collocation method, the function u is approximated in an N-dimensional subspace \(H^{h} \subset H\). Given a basis \((\varphi _{i})_{i \in \beth }\) of \(H^h\) for the index set \(\beth := \{1,\dots ,N\}\), the approximant \(u^h \in H^h\) of u can be expressed using a coefficient vector \(\phi = (\phi _i)_{i \in \beth }\) that satisfies \(u^h = \sum _{i \in \beth } \phi _i\varphi _i\). Note that the supports of the basis functions, \(\varOmega ^{h}_{\varphi _i} := \mathrm{supp}\,\varphi _i\), are assembled from the sets \(\omega _j\). Equation (1) is then reduced to the following system of linear equations.

$$\begin{aligned} A\phi&= b \end{aligned}$$
(2)
$$\begin{aligned} A_{ij}&= \int _{\varOmega } \varphi _i (x) \int _{\varOmega } g(x,y) \varphi _j (y) \mathrm{d}y \mathrm{d}x \end{aligned}$$
(3)
$$\begin{aligned} b_{i}&= \int _{\varOmega } \varphi _i (x) f \mathrm{d}x \end{aligned}$$
(4)

Here, \(i, j \in \beth \). The user-defined function calculates the element in the i-th row and j-th column of the coefficient matrix, as expressed by Eq. (3).

There are two versions of the implementation: one based on dense matrix computations and the other based on \(\mathcal {H}\)-matrix computations. Although the \(\mathcal {H}\)-matrix version depends on the distributed parallel \(\mathcal {H}\)-matrix library \(\mathcal {H}\)ACApK [7], the vectorization problems are similar. As shown in Fig. 1, the framework consists of three components: model data input, coefficient matrix generation, and linear solver. In this study, the objective is to interface the coefficient matrix generation with the user-defined function. Therefore, we focus on the coefficient matrix generation component.

Fig. 1.
figure 1

The design of the BEM-BB framework

Fig. 2.
figure 2

Parallel generation of coefficient dense matrix and \(\mathcal {H}\)-matrix.

Figure 2 shows the coefficient matrix generation part. The target coefficient matrix is distributed to multiple threads, and each thread sequentially calculates the element in the i-th row and j-th column by using the user-defined function. The coefficient matrices generated by the dense matrix version and the \(\mathcal {H}\)-matrix version are a dense matrix and an \(\mathcal {H}\)-matrix, respectively. An \(\mathcal {H}\)-matrix, also called a hierarchical matrix, is one of the techniques used to approximate dense matrices: it is a set of low-rank approximated sub-matrices and small dense sub-matrices, as shown in Fig. 2. \(\mathcal {H}\)ACApK generates the coefficient \(\mathcal {H}\)-matrix by exploiting the user-defined function according to the Adaptive Cross Approximation (ACA) algorithm [9]. The ACA algorithm generates a low-rank approximation of a dense matrix without generating the target dense matrix itself.
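The idea behind ACA can be sketched in a few lines of C (a hypothetical scalar kernel and simplified partial pivoting; the actual \(\mathcal {H}\)ACApK implementation differs): rows and columns of the target block are sampled on demand through the entry function, so the dense block is never formed.

```c
#include <math.h>

#define M 6          /* rows of the target block    */
#define N 6          /* columns of the target block */
#define MAXRANK 6

/* Hypothetical smooth kernel standing in for one BEM matrix entry. */
static double entry(int i, int j) { return 1.0 / (1.0 + fabs((double)(i - j))); }

/* Partial-pivoting ACA: builds A(i,j) ~= sum_l U[l][i] * V[l][j],
 * sampling only single rows and columns of A. Returns the achieved rank. */
static int aca(double U[MAXRANK][M], double V[MAXRANK][N], double tol) {
    int used[M] = {0};
    int k = 0, istar = 0;
    while (k < MAXRANK) {
        used[istar] = 1;
        /* Residual row istar: r_j = A(istar,j) minus the current approximation. */
        double r[N];
        int jstar = 0;
        for (int j = 0; j < N; j++) {
            r[j] = entry(istar, j);
            for (int l = 0; l < k; l++) r[j] -= U[l][istar] * V[l][j];
            if (fabs(r[j]) > fabs(r[jstar])) jstar = j;
        }
        if (fabs(r[jstar]) < tol) break;        /* converged */
        for (int j = 0; j < N; j++) V[k][j] = r[j] / r[jstar];
        /* Residual column jstar becomes the next column of U. */
        for (int i = 0; i < M; i++) {
            U[k][i] = entry(i, jstar);
            for (int l = 0; l < k; l++) U[k][i] -= U[l][i] * V[l][jstar];
        }
        k++;
        /* Next row pivot: largest residual among rows not yet used. */
        istar = -1;
        for (int i = 0; i < M; i++)
            if (!used[i] && (istar < 0 || fabs(U[k - 1][i]) > fabs(U[k - 1][istar])))
                istar = i;
        if (istar < 0) break;                   /* every row pivoted: block exact */
    }
    return k;
}
```

Because only individual rows and columns of the block are evaluated, a rank-k approximation costs on the order of k(M+N) kernel calls instead of MN.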

The interface of the user-defined function is shown in Fig. 3. In both versions, the function is called from each thread concurrently. To vectorize the user-defined function, the caller of the function is also important. Figures 4 and 5 show the callers of the user-defined function in the dense matrix version and the \(\mathcal {H}\)-matrix version, respectively. Both programs call the user-defined function inside loop structures, and these loops are the targets of SIMD vectorization. In the following sections, we treat the implementation shown in Fig. 4 as the baseline.

Fig. 3.
figure 3

An interface of a user-defined function that calculates the element in the i-th row and j-th column of the coefficient matrix. The function arguments after i and j are used as input variables of the calculation.

Fig. 4.
figure 4

User-defined function caller for the dense matrix. Here, a(j,i) is the coefficient dense matrix. The ranges of i and j are assigned to each thread appropriately.

3 Framework Design for SIMD Vectorization with OpenMP SIMD Directives

In general, three methods are used to perform SIMD vectorization: (1) relying on compiler auto-vectorization, (2) using compiler directives, and (3) using intrinsic functions. However, vectorization using intrinsic functions is a cumbersome job, and the required intrinsic functions depend completely on the user-defined function. In this study, we therefore employ compiler auto-vectorization and the directive method. To use SIMD instructions efficiently, the SIMD target vectors must satisfy two constraints.

  • There should be no data dependency among the elements of the target vector.

  • Vector elements should be stored contiguously.

In addition, for a compiler to generate efficient code, the code should be obviously vectorizable from the compiler's viewpoint. Any new framework design should consider the above points. Furthermore, the design should be user-friendly: efficiently vectorized SIMD code should be generated even if users are unaware of the compiler requirements.
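The two constraints can be made concrete with a small C analog (the framework itself is Fortran; the function and array names are illustrative): the first loop satisfies both constraints and vectorizes cleanly, while each of the other two breaks one constraint.

```c
#include <stddef.h>

/* Vectorizable: no cross-iteration dependency, unit-stride (contiguous) access. */
void scale(double *restrict out, const double *restrict in, double c, size_t n) {
    #pragma omp simd
    for (size_t k = 0; k < n; k++)
        out[k] = c * in[k];
}

/* NOT vectorizable as-is: out[k] depends on out[k-1] (loop-carried dependency). */
void prefix_sum(double *out, const double *in, size_t n) {
    out[0] = in[0];
    for (size_t k = 1; k < n; k++)
        out[k] = out[k - 1] + in[k];
}

/* Independent but non-contiguous: stride-8 reads force gather loads
 * instead of packed vector loads, so SIMD efficiency is poor. */
void strided(double *out, const double *in, size_t n) {
    for (size_t k = 0; k < n; k++)
        out[k] = in[8 * k];
}
```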

Fig. 5.
figure 5

User-defined function caller for sub-matrix of \(\mathcal {H}\)-matrix. Here, HACApK_entry_ij is a wrapper function of ppohBEM_matrix_element_ij. The structure st_bemv contains the variables required as arguments of the user-defined function.

3.1 New Interface Definition for Compiler Vectorization

According to the two compiler requirements, the main problem associated with vectorization pertains to data access. Even if the computations in a user-defined function can be executed independently, a compiler that detects a possibility of data dependency conservatively generates instructions that are not fully vectorized. Therefore, we propose to handle data access and computation separately in the framework design. We introduce two new interfaces, set_args (Fig. 6) and vectorize_func (Fig. 7), for data access and computation, respectively. Figure 8 shows the function caller based on Fig. 4. The variable SIMDLENGTH, which appears in Figs. 7 and 8 and is defined by users, represents the SIMD length of the target processor. For example, the recommended SIMDLENGTH for KNL, which has a 512-bit (\(=\) sizeof(double) \(\times 8\)) wide SIMD unit, is 8. From the compiler's viewpoint, the !$omp simd loop (Fig. 8, line 14) has no data dependency because the arguments and the return values of vectorize_func have no aliases and are accessed independently in each iteration of the loop. In addition, the arguments and return values are stored contiguously. Consequently, if the SIMD interface of vectorize_func corresponds to the SIMD length, the loop (Fig. 8, lines 13–17) is vectorized similarly to a vector function.
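The separated design can be sketched in C (the framework is Fortran; set_args and vectorize_func follow the text, while the geometry and kernel here are hypothetical). The data-access phase gathers scalar arguments into contiguous per-lane arrays; the computation phase is then a pure function that the simd loop can call like a vector function.

```c
#include <math.h>

#ifndef SIMDLENGTH
#define SIMDLENGTH 8   /* e.g. 512 bits / sizeof(double) on KNL */
#endif

/* Data access: gather the scalar arguments of one (i, j) pair into
 * per-lane slots. Loads may be indirect; this part need not vectorize. */
void set_args(int i, int j, const double *coord, double *xi, double *xj) {
    *xi = coord[i];
    *xj = coord[j];
}

/* Computation: pure function of scalar arguments with no aliasing and no
 * global state, so the caller's simd loop vectorizes it cleanly. */
#pragma omp declare simd simdlen(SIMDLENGTH)
double vectorize_func(double xi, double xj) {
    return 1.0 / (1.0 + fabs(xi - xj));   /* hypothetical kernel */
}

/* Caller: fills one SIMDLENGTH-wide chunk of row i of the matrix. */
void fill_chunk(int i, int j0, const double *coord, double *a_row) {
    double xi[SIMDLENGTH], xj[SIMDLENGTH];
    for (int l = 0; l < SIMDLENGTH; l++)
        set_args(i, j0 + l, coord, &xi[l], &xj[l]);
    #pragma omp simd
    for (int l = 0; l < SIMDLENGTH; l++)
        a_row[j0 + l] = vectorize_func(xi[l], xj[l]);
}
```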

To safely vectorize vectorize_func, we constrain the function such that it cannot contain globally accessible variables, allocatable arrays, or save variables. In addition, the SIMD interfaces of all functions or subroutines called from vectorize_func should correspond to the SIMD length. This parallelization method is similar to the Single Program Multiple Data (SPMD) programming model because each SIMD lane executes a single program simultaneously.

To reduce the data access cost, we introduce the pair of interfaces set_args_i and set_args_j. In BEM analysis, the required data, such as the coordinates of the i-th and j-th elements, usually depend only on the variables i and j, respectively. Therefore, the subroutines set_args_i and set_args_j set the arguments that depend only on i and only on j, respectively. This pair of interfaces is particularly effective in the \(\mathcal {H}\)-matrix version: as shown in Fig. 5, i and j are constant in the loop in lines 4–9 and the loop in lines 13–18, respectively.
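In effect, the split pair hoists the i-dependent data access out of the inner loop; a C sketch with illustrative names:

```c
/* Hypothetical i-/j-dependent argument setters. */
void set_args_i(int i, const double *coord, double *xi) { *xi = coord[i]; }
void set_args_j(int j, const double *coord, double *xj) { *xj = coord[j]; }

void fill_row(int i, int n, const double *coord, double *a_row) {
    double xi;
    set_args_i(i, coord, &xi);          /* loaded once per row, not per element */
    for (int j = 0; j < n; j++) {
        double xj;
        set_args_j(j, coord, &xj);
        a_row[j] = 1.0 / (1.0 + (xi - xj) * (xi - xj));   /* hypothetical kernel */
    }
}
```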

Fig. 6.
figure 6

New interface for data access. The leading arguments are the same as those of ppohBEM_matrix_element_ij; the remaining arguments are the scalar variables used in vectorize_func. The number of arguments depends on the target application.

3.2 Using the Framework

The new interfaces are easy for compilers to vectorize, but they are not user-friendly. Specifically, the numbers of arguments of the set_args subroutine and the vectorize_func function depend on the target application, which means that users are required to modify the framework program to add variable declarations matching the interface. In addition, users must vectorize the user-defined functions by using the !$omp declare simd directive. Furthermore, if users insert a wrong directive, the compiler generates a correct but unvectorized, slow executable, which is often more cumbersome to track down than a bug.

To minimize these difficulties, we require users to prepare the following.

  • Implement include files.

  • Implement set_args, set_args_i, set_args_j, and vectorize_func without SIMD directives in the file “user_func.f90”.

  • Correctly implement the dummy function ppohBEM_matrix_element_ij_dummy (Fig. 9) without modifying the dummy function itself.

  • Provide the SIMDLENGTH of the target processor by using the -D compiler flag.

The include files that appear in the dummy function are used in the subroutine call interface. First, users of the framework must implement the include files as a fill-in-the-blank puzzle that makes the dummy function correct; that is, the return value of the dummy function should be equal to that of ppohBEM_matrix_element_ij. At this point, users need not consider SIMD vectorization. Notably, users cannot modify the dummy function itself. If users do not need the set_args subroutine, they must create an empty “call_set_args.inc” file. Second, users must implement the user-defined functions in “user_func.f90”; again, users need not consider SIMD vectorization. Finally, users must define the variable SIMDLENGTH by using a compiler option. During compilation, the compile script automatically inserts SIMD directives into the user-defined functions implemented in user_func.f90 and automatically transforms the include files to fit the framework, as shown in Fig. 10. Owing to this auto-transformation, we succeeded in separating almost all aspects related to SIMD vectorization from the user-defined function; users are required to set only the SIMDLENGTH of the target processor.
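The -D mechanism can be sketched in C (fpp handles the Fortran side the same way; the fallback value below exists only so the file compiles standalone): the constant defined on the compile line, e.g. -DSIMDLENGTH=8, sizes the SIMD lanes and rounds loop bounds.

```c
/* SIMDLENGTH is expected from the compiler command line, e.g. -DSIMDLENGTH=8.
 * A fallback keeps the file compilable on its own. */
#ifndef SIMDLENGTH
#define SIMDLENGTH 4
#endif

/* Round n up to a multiple of the SIMD length so the vectorized
 * loop runs in full-width chunks with no scalar remainder. */
int padded_length(int n) {
    return (n + SIMDLENGTH - 1) / SIMDLENGTH * SIMDLENGTH;
}
```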

Fig. 7.
figure 7

New calculation interface. This function should be called after the set_args subroutine and is the target of vectorization. All arguments of this function should have the intent(in) attribute.

4 Numerical Evaluations

4.1 Test Model and Processors

In this section, we evaluate the proposed framework by performing BEM analysis of two electrostatic field problems: a perfectly conductive sphere and a dielectric sphere. The electric potentials of the perfect conductor and the dielectric are given by the following functionals \(\mathcal {P}\) and \(\mathcal {D}\), respectively:

$$\begin{aligned} \mathcal {P}[u](x)&:= \int _{\varOmega } \frac{1}{4\pi ||x-y||}u(y)\mathrm{d}y, x \in \varOmega \end{aligned}$$
(5)
$$\begin{aligned} \mathcal {D}[u](x)&:= \int _{\varOmega } \frac{\langle x-y,n(y) \rangle }{4\pi ||x-y||^3}u(y)\mathrm{d}y, x \in \varOmega \end{aligned}$$
(6)

where \(\varOmega \) is the domain surface. Equations (5) and (6) correspond to Eq. (1), and their details are described in [3]. The spheres were set at a distance of 0.25 m from the ground, which is at zero electric potential. The radius of the spheres was 0.25 m, and the electric potential of the spheres was 1 V.

For the numerical evaluations, we used the BDW and KNL processors, which have a 256-bit SIMD unit and a 512-bit SIMD unit, respectively. The processor specifications are summarized in Table 1. For both processors, the Intel Fortran compiler ver. 18.0.1 was used. The compiler options for BDW were -align array64byte -xAVX2 -qopenmp -O3 -fpp -ipo -lm -qopt-report=5 -DSIMDLENGTH=4, and those for KNL were -align array64byte -xMIC-AVX512 -qopenmp -O3 -fpp -ipo -lm -qopt-report=5 -DSIMDLENGTH=8.

Fig. 8.
figure 8

User-defined function using new interface caller for dense matrix.

4.2 Hand Tuning Using OpenMP SIMD Directives

To test the compiler vectorizations, we refactored and evaluated two user-defined functions. Vectorization with compiler directives often requires users to work iteratively with the compiler's feedback. We vectorized the user-defined functions by preparing the following series of implementations.

  • H1: Original implementation without compiler directives.

  • H2: !$omp simd directives are inserted above the SIMD target loops of H1.

  • H3: !$omp declare simd directives are inserted in the function of H2 shown in Fig. 3 and in all user-defined functions called from it.

  • H4: A simdlen(SIMDLENGTH) clause is attached to each !$omp simd and !$omp declare simd directive of H3.

  • H5: The user-defined functions of H4 are replaced with the set_args and vectorize_func interfaces.

  • H6: The interfaces set_args_i and set_args_j are used as alternatives to set_args of H5.

  • H7: linear clauses are attached to the !$omp declare simd directive of vectorize_func of H6.

  • H8: uniform clauses are used for constant variables instead of the linear clauses of H7.

Implementations H1–H4 are based on the original framework; they differ only in their OpenMP directives. Therefore, users familiar with SIMD can implement H1–H4 with relative ease. Implementations H5–H8 are based on the proposed framework. Specifically, implementation H7 corresponds to the automatically generated program. Note that implementation H8 is more optimized than H7; however, automatically generating H8 requires syntactic analysis, which will be realized in future work.
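The effect of the H4, H7, and H8 clauses can be illustrated with a C-style declare simd directive (the framework uses the equivalent Fortran directives; the kernel below is a hypothetical placeholder):

```c
#ifndef SIMDLENGTH
#define SIMDLENGTH 4
#endif

/* H4 adds simdlen(SIMDLENGTH): fixes the width of the generated vector variant.
 * H7 adds linear(j:1): j advances by 1 across the lanes of one vector call,
 *   i.e. the accesses it indexes are unit-stride.
 * H8 adds uniform(xi, xs): these arguments are identical in every lane
 *   (xi depends only on i, which is fixed inside the j loop), so the
 *   compiler can broadcast them once instead of passing a vector. */
#pragma omp declare simd simdlen(SIMDLENGTH) uniform(xi, xs) linear(j:1)
double entry_ij(int j, double xi, const double *xs) {
    double d = xi - xs[j];
    return 1.0 / (1.0 + d * d);        /* hypothetical kernel */
}
```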

Fig. 9.
figure 9

Dummy function of the user-defined function. Although the function is not used by the framework at run time, users are required to implement it correctly.

Table 1. Processor specifications
Fig. 10.
figure 10

The user's program is automatically transformed at compile time.

Figures 11, 12, 13 and 14 show the speedup relative to implementation H1, and Table 2 summarizes the elapsed times of implementations H1 and H7. The results discussed in this section are averages of 10 measurements. As summarized in Table 2, although we recommend the \(\mathcal {H}\)-matrix version of BEM-BB, we also evaluated the dense matrix version, whose performance depends more strongly on the user-defined function. The main difference between the two user-defined functions from the viewpoint of SIMD vectorization is whether the function contains a branch. Although the speedup for the dielectric problem shows a trend similar to that for the perfect conductor problem, it is slightly worse owing to the branch divergence caused by the dielectric function. The results obtained by solving the perfect conductor problem on the KNL processor (Fig. 11) show that the proposed implementation (H7) achieved performance improvements of 4.34x and 6.62x compared to implementation H1 for the \(\mathcal {H}\)-matrix and the dense matrix versions, respectively. The theoretical speedup with SIMD vectorization equals SIMDLENGTH, and the results of the dense matrix version demonstrate that the framework improves SIMD vectorization performance considerably. On the BDW processor (Fig. 13), implementation H7 achieved performance improvements of 2.22x and 2.44x compared to implementation H1 for the \(\mathcal {H}\)-matrix and the dense matrix versions, respectively.

Fig. 11.
figure 11

Solving perfect conductor problem using KNL processor

Fig. 12.
figure 12

Solving dielectric problem using KNL processor

Fig. 13.
figure 13

Solving perfect conductor problem using BDW processor

Fig. 14.
figure 14

Solving dielectric problem using BDW processor

5 Related Work

The literature contains many studies on software frameworks for parallel PDE solvers based on the finite element method, such as GeoFEM [10] and FreeFem++ [11]. Moreover, \(\mathcal {H}\)-matrices have been used and parallelized in several BEM applications [8, 9, 12]. Although many frameworks support MPI + OpenMP hybrid parallelization, few support SIMD vectorization, which depends heavily on the user-defined functions. The main contribution of this study is an SPMD-like SIMD vectorization method that handles data access and computation separately and hides the SIMD-related aspects in the framework. The method exploits two characteristics of BEM analysis: the kernel function is relatively computationally intensive, and there is no data dependency among the calculations of the elements of the coefficient matrix.

Table 2. The elapsed times of coefficient generation component of original implementation (H1) and implementation of proposed framework (H7)

6 Conclusion

We refined the open-source framework for parallel BEM analysis to enhance SIMD vectorization, which is important for realizing high-performance computing. With the refined framework design, we successfully separated the SIMD-related aspects from the user-defined function, which depends on the target application. We evaluated the proposed framework by solving two static electric field analysis problems with different user-defined functions on a BDW processor and a KNL processor. The numerical results demonstrated the improved performance of the framework. Specifically, in solving the perfect conductor problem on the KNL processor, we achieved performance improvements of 4.34x and 6.62x in the \(\mathcal {H}\)-matrix and the dense matrix cases, respectively.

The main contribution of this paper is separating the SIMD-related aspects from the user-defined function and hiding them to minimize the difficulties associated with SIMD. This SPMD-like SIMD vectorization technique can be applied to other applications. In the proposed framework, the arguments of vectorize_func must be scalar variables. This specification is compiler-friendly but not user-friendly. For example, to adapt the user-defined functions to the proposed framework, we separated the vector argument coordinate(3) into the scalars x, y, and z. This is a typical Array of Structures (AoS) to Structure of Arrays (SoA) transformation. To improve this aspect of the specification, we plan to support automatic AoS-to-SoA transformation in future work.
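The coordinate(3)-to-scalars split is an instance of that AoS-to-SoA transformation, sketched here in C with illustrative types:

```c
#define NNODES 4

/* AoS: one struct per node; the x components of consecutive nodes are
 * 24 bytes apart, so a SIMD loop over nodes needs gather loads. */
struct node { double x, y, z; };

/* SoA: one contiguous array per component; lane k reads xs[k],
 * a packed (unit-stride) vector load. */
struct nodes_soa { double xs[NNODES], ys[NNODES], zs[NNODES]; };

void aos_to_soa(const struct node *aos, struct nodes_soa *soa, int n) {
    for (int k = 0; k < n; k++) {
        soa->xs[k] = aos[k].x;
        soa->ys[k] = aos[k].y;
        soa->zs[k] = aos[k].z;
    }
}
```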