
1 Introduction

Data Assimilation (DA) is a prediction-correction method for combining a physical model with observations. The Data Assimilation and Machine Learning (ML) fields are closely related. A Machine Learning process performs a specific task without explicit instructions, and it can be seen as a subset of the Artificial Intelligence (AI) field, since it provides new methods and applications for analyzing and classifying many natural phenomena (see for example [1]). In general, this process consists of two main phases: the analysis phase, in which collected data are analyzed to detect patterns that help to define explicit features or parameters, and the training phase, in which the parameters generated in the previous phase are used to build Machine Learning models.

However, the learning part of the training phase relies on a representative training dataset, containing samples of spatio-temporally dependent structures. In many fields, direct observations of the random variables are unavailable, and therefore learning techniques cannot be readily deployed.

A correct training dataset is therefore a prerequisite for proper learning. Indeed, an ML approach often employs a classifier in the analysis phase. Each classifier is built around a kernel, which aims to predict the classes correctly by mapping the data into a higher-dimensional feature space where they become almost linearly separable. A suitable method must be chosen in order to obtain a fair classification and an accurate prediction [23].

The variational approach to Data Assimilation, characterized by the minimization of a cost function, is a good choice for classification. Numerically, this means applying an iterative procedure based on a covariance matrix defined by measuring the error between predictions and observed data. Here, we are interested in the related numerical issues. In particular, since the error covariance matrix presents a Gaussian correlation structure, the Gaussian convolution plays a key role in this problem. Furthermore, beyond its fundamental role in the Data Assimilation field, the convolution operation is a significant computational step in most big-data analysis problems; hence, a Machine Learning process can use it as a basic step of the analysis phase. Moreover, because of the need to process large amounts of data, parallel approaches and High Performance Computing (HPC) architectures, such as multicore systems or Graphics Processing Units (GPUs), are mandatory [2,3,4]. In this direction, some recent papers deal with parallel data assimilation [5,6,7], but here we limit our attention to the basic step represented by a parallel implementation of the Gaussian convolution. In particular, we propose an accelerated procedure to approximate the Gaussian convolution based on Recursive Filters (RFs). Gaussian RFs have been designed to provide an accurate and very efficient approximation of the Gaussian convolution [8,9,10,11]. Since RFs are mainly used to reduce the execution time when there is a large amount of data to analyze, many parallel implementations have been presented (see the survey in [12]). Here, we propose a novel implementation that exploits the computational power of GPUs, which are very useful for solving numerical problems in several application fields [13,14].

More precisely, to manage large input data, the parallelization strategy is based on a domain decomposition approach with overlapping, so that all possible interactions between forecasts and observations are included. In this way, the computational step becomes a very fast kernel specifically designed to exploit the dynamic parallelism [15] available in the Compute Unified Device Architecture (CUDA) [16].

The paper is organized as follows. Section 2 recalls the variational Data Assimilation problem and the use of Recursive Filters to approximate the discrete Gaussian convolution. In Sect. 3, the underlying domain decomposition strategy and the GPU-CUDA parallel algorithm are presented. The experiments in Sect. 4 confirm the efficiency of the proposed implementation in terms of performance. Finally, conclusions are drawn in Sect. 5.

2 Gaussian Convolutions in Data Assimilation

In this section, we show how the Gaussian convolution is involved in a Data Assimilation scenario. In particular, let us consider a three-dimensional variational data assimilation problem [17]: the objective is to provide the best estimate of x, called the analysis or state vector, given a prior estimate vector \(x^b\) (the background), usually provided by a numerical forecasting model, and a vector \(y= {\mathcal H}(x)+\delta y\) of observations, related to the nonlinear model \({\mathcal H}\). The unknown x solves the regularized constrained least-squares problem:

$$\begin{aligned} \min _{x }J(x) = \min _{x } \Big [ \Vert y- {\mathcal H}(x)\Vert ^{2} + \Vert x-x^b\Vert ^{2} \Big ], \end{aligned}$$
(1)

where J denotes the objective function to minimize. Here, \(\Vert x-x^b\Vert ^{2}\) is a penalty term and \(\Vert y-{\mathcal H}(x)\Vert ^{2}\) is a quadratic data-fidelity term which compares the measured data with the solution obtained by the nonlinear model \({\mathcal H}\) [10]. In this scheme, the background error \(\delta x = x^b - x\) and the observational error \(\delta y=y-{\mathcal H}(x)\) are assumed to be random variables with zero mean and covariance matrices

$$\mathbf{B}=\langle \delta x , \delta x^T\rangle \qquad \mathrm {and} \qquad \mathbf{R}=\langle \delta y , \delta y^T\rangle ,$$

respectively. Following the description in [9], let the matrix \(\mathbf{H}\) be a first-order approximation of the Jacobian of \({\mathcal H}\) at \(x^b\) and denote by

$$d= y-{\mathcal H}(x^b)$$

the so-called misfit. Denoting by \(\mathbf{V}\) the unique symmetric Gaussian matrix such that \(\mathbf{V}^2=\mathbf{B}\), and introducing the variable \(v= \mathbf{V}^{-1} \delta x\), problem (1) can be proven to be equivalent to [9, 18, 19]:

$$\begin{aligned} \min _{v} \widetilde{J}(v) = \min _{v} \frac{1}{2} (d - \mathbf{H}\mathbf{V} v)^T \mathbf{R}^{-1} (d - \mathbf{H}\mathbf{V} v) + \frac{1}{2} v^T v. \end{aligned}$$
(2)

The minimization of the cost function \(\widetilde{J}(v)\) leads to the linear system:

$$\begin{aligned} (I+\mathbf{V}\varPsi \mathbf{V}) v = \mathbf{V}\mathbf{H}^T\mathbf{R}^{-1}d. \end{aligned}$$
(3)

Since \(I+\mathbf{V}\varPsi \mathbf{V}\) is symmetric, the linear system (3) can be handled by means of the Conjugate Gradient (CG) method, whose basic operation is the matrix-vector multiplication:

$$({I}+\mathbf{V}\varPsi \mathbf{V})\rho = \rho +\mathbf{V}\varPsi \mathbf{V}\rho .$$

Here, \({\varPsi } = \mathbf{H}^T\mathbf{R}^{-1}\mathbf{H}\) is a diagonal matrix and \(\rho \) denotes the residual at the current step of the CG algorithm. More precisely, it turns out that such an operation involves three discrete Gaussian convolutions:

$$\begin{aligned} \mathbf{V}{\rho },\qquad \mathbf{V}(\varPsi \mathbf{V} {\rho }), \qquad \mathbf{V}(\mathbf{H}^T\mathbf{R}^{-1}d). \end{aligned}$$
(4)

In conclusion, the previous analysis shows that the Gaussian convolution is a main computational kernel of Data Assimilation; hence the need for accurate and fast methods to perform it. In fact, in the described context, the matrix \(\mathbf{V}\) is never effectively used, nor even assembled, and the matrix-vector multiplications in (4) are computed by introducing the so-called Gaussian Recursive Filters. It has been proved that these tools offer good accuracy while reducing the computational cost in both time and space [20, 21].
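To make the role of the convolution concrete, the following is a minimal matrix-free sketch (our illustration, not the authors' code) of the operator application in (3): each product by \(\mathbf{V}\) is replaced by a call to a hypothetical routine gaussian_rf that performs the approximated convolution, and \(\varPsi \) is stored through its diagonal.

/* Hypothetical routine: applies the K-iterated Gaussian RF,
   i.e. an approximation of the convolution by V, to s in place. */
void gaussian_rf(double *s, int n, double sigma, int K);

/* Matrix-free application of A = I + V Psi V: V is never assembled,
   Psi is represented by the vector psi_diag of its diagonal entries. */
void apply_A(const double *rho, const double *psi_diag,
             double *w, int n, double sigma, int K) {
    for (int i = 0; i < n; i++) w[i] = rho[i];
    gaussian_rf(w, n, sigma, K);                       /* w = V rho */
    for (int i = 0; i < n; i++) w[i] *= psi_diag[i];   /* w = Psi V rho */
    gaussian_rf(w, n, sigma, K);                       /* w = V Psi V rho */
    for (int i = 0; i < n; i++) w[i] += rho[i];        /* w = rho + V Psi V rho */
}

A standard CG loop applies this operator once per iteration; together with the right-hand side \(\mathbf{V}\mathbf{H}^T\mathbf{R}^{-1}d\), this accounts exactly for the three convolutions in (4).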

In particular, in this work we just consider K-iterated first-order Gaussian RFs and follow the approach and notation used in [8]. Let:

$$s^{(0)}=\big \{s^{(0)}_j\big \}_{j\in \mathbf {Z}}= \big (\ldots ,s^{(0)}_{-1},s^{(0)}_0,s^{(0)}_1,\ldots \big )$$

be an input signal and let g denote the Gaussian function with zero mean and standard deviation \(\sigma \). The Gaussian filter is a filter whose response to the input \(s^{(0)}\) is given by the discrete Gaussian convolution:

$$\begin{aligned} s^{(g)}_j =\big (g*s^{(0)}\big )_j= \sum _{t \in \mathbf {Z}} g_{j-t}s^{(0)}_t, \qquad \forall \, j \in \mathbf {Z}, \end{aligned}$$
(5)

where \(g_{t}\equiv g(t)\). A K-iterated first-order Gaussian recursive filter generates an output signal \(s^{(K)}\), the so-called K-iterate approximation of \(s^{(g)}\), whose entries solve the 2K recurrence relations:

$$\begin{aligned} p_j^{(k)}=\beta s_j^{(k-1)}+ \alpha p_{j-1}^{(k)}, \qquad \forall \, j\in \mathbf {Z}, \end{aligned}$$
(6)
$$\begin{aligned} s_j^{(k)}=\beta p_j^{(k)}+ \alpha s_{j+1}^{(k)}, \qquad \forall \,j\in \mathbf {Z}. \end{aligned}$$
(7)

for \(k=1,\ldots ,K\), where the values \(\alpha \) and \(\beta =1-\alpha \) are called smoothing coefficients and satisfy:

$$\begin{aligned} \alpha =1+E_{\sigma } - \sqrt{ E_{\sigma }(E_{\sigma }+2)}, \qquad \beta =\sqrt{ E_{\sigma }(E_{\sigma }+2)}-E_{\sigma }, \end{aligned}$$
(8)

with \(E_{\sigma }=K\sigma ^{-2}\). It has been proved that, as \(K\rightarrow \infty \), the filter converges to the Gaussian filter [22]. If we consider a finite-size input signal \(s^{(0)}\) (i.e. with support in the grid \(\{0,1,2,\ldots ,N-1\}\)), then the index j has to be taken in increasing order in (6) and in decreasing order in (7). Hence, relations (6) and (7) are suitably called advancing and backing filters, respectively [8]. We highlight that, to prime the algorithm, these filters require setting the values \(p_0^{(k)}\) and \(s_{N-1}^{(k)}\). This can be done using the boundary conditions [24]:

$${p}_{0}^{(k)}=\frac{1}{1+\alpha } {s}_{0}^{(k-1)}, \qquad {s}_{N-1}^{(k)}=\frac{1}{1+\alpha } {p}_{N-1}^{(k)} $$

which are derived to simulate the effect of the neglected entries when using finite-size input signals. Typically, a well-known edge effect, i.e. a large perturbation error, can be observed on the boundary entries of the output. As shown in [8], provided that the input support is in \([0,N-1]\), this effect can be mitigated by enlarging the input, padding it with artificial zero entries at its left and right boundaries. Algorithm 1 describes a straightforward implementation of the K-iterated first-order Gaussian RF.

Algorithm 1. K-iterated first-order Gaussian RF
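A minimal C-style sketch of Algorithm 1, directly implementing the recurrences (6)-(7), the coefficients (8), and the boundary conditions above; the work buffer p holds the advancing-filter output, and double precision is assumed.

#include <math.h>

/* K-iterated first-order Gaussian RF on a signal s of length N.
   On exit, s holds the K-iterate approximation s^(K). */
__host__ __device__
void gaussian_rf_1st_order(double *s, double *p, int N, double sigma, int K) {
    double E     = K / (sigma * sigma);             /* E_sigma = K sigma^{-2}    */
    double alpha = 1.0 + E - sqrt(E * (E + 2.0));   /* smoothing coefficients (8) */
    double beta  = 1.0 - alpha;
    for (int k = 0; k < K; k++) {
        p[0] = s[0] / (1.0 + alpha);                /* prime the advancing filter */
        for (int j = 1; j < N; j++)                 /* (6): j in increasing order */
            p[j] = beta * s[j] + alpha * p[j - 1];
        s[N - 1] = p[N - 1] / (1.0 + alpha);        /* prime the backing filter   */
        for (int j = N - 2; j >= 0; j--)            /* (7): j in decreasing order */
            s[j] = beta * p[j] + alpha * s[j + 1];
    }
}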

3 Parallel Approach and GPU Algorithm

In this section we describe our parallel algorithm, and the related strategy, for a fast and accurate version of the K-iterated first-order Gaussian RF. The approach exploits the main features of the GPU environment and is organized in three macro steps, in order to obtain a reliable and well-performing computation. In the first phase, step 1, in order to achieve a fair workload distribution, we use a Domain Decomposition (DD) approach with overlapping. More specifically, the strategy consists of splitting the input signal \(s^{(0)}\) into t local blocks, one for each thread:

$$\begin{aligned} s^{(0),m}_\mathbf{0}, \ s^{(0),m}_\mathbf{1}, \ldots , \ s^{(0),m}_{\mathbf{t}-1}. \end{aligned}$$
(9)

Here, N denotes the problem size, while:

$$\begin{aligned} d=\left\lfloor \frac{N}{\mathbf{t}}\right\rfloor \quad \mathrm {and} \quad r= mod(N,\mathbf{t}) \end{aligned}$$
(10)

are the quotient and the remainder of the division of N by t, respectively. Moreover, the parameter m denotes the overlapping size. To be specific, each thread \(\mathbf{j}\) loads into its own local memory the block \(s^{(0),m}_{\mathbf{j}}\), whose size is \(d+2m\) or \(d+1+2m\) (depending on \(\mathbf{j}\)). The entries of the \(\mathbf{j}\)-th local block are formally defined by the subdivision:

$$\begin{aligned} \big ( s^{(0),m}_\mathbf{j}\big )_i= \left\{ \begin{array}{lll} s^{(0)}_{\mathbf{j}d+\mathbf{j}+i-m}, & i=0,\ldots ,d+2m & \quad \mathrm {if \ } \mathbf{j}<r \\ s^{(0)}_{\mathbf{j}d+r+i-m}, & i=0,\ldots ,d+2m-1 & \quad \mathrm {if \ } \mathbf{j}\ge r \\ \end{array} \right. \end{aligned}$$
(11)

where the input signal entries are set to zero when not available (\(s^{(0)}_i=0\) for \(i<0\) and \(i\ge N\)).

In other words, this partitioning assigns to each thread a part of the signal, so that two consecutive threads hold consecutive signal blocks which overlap at the edges, sharing exactly 2m entries. The overlapping is needed because, to obtain a good approximation of the convolution, the entries near the block edges require close values that lie in the neighboring blocks. We notice that setting \(m=0\), i.e. excluding the overlapping areas, could introduce perturbation errors and degrade the accuracy close to the boundaries of the local output signals. A device-side sketch of this loading step is shown below.
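For illustration, this is a minimal device-side sketch of the overlapped loading defined by (10)-(11); all names are ours, and the input signal is assumed to reside in global memory.

/* Loads the overlapped block (11) of thread j into a local buffer,
   padding with zeros outside the support [0, N-1]. */
__device__
void load_local_block(const double *s0, int N, int t, int m,
                      int j, double *local, int *n_loc) {
    int d = N / t;                                 /* quotient in (10)          */
    int r = N % t;                                 /* remainder in (10)         */
    *n_loc = (j < r) ? d + 1 : d;                  /* local size without halos  */
    int start = (j < r) ? j * d + j : j * d + r;   /* first entry of the block  */
    for (int i = 0; i < *n_loc + 2 * m; i++) {
        int g = start + i - m;                     /* global index, shifted by m */
        local[i] = (g >= 0 && g < N) ? s0[g] : 0.0;
    }
}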

Step 2 performs the approximated local Gaussian convolution on each block. More precisely, each thread \(\mathbf{j}\) applies the K-iterated first-order Gaussian RF of Algorithm 1 to \(s_\mathbf{j}^{(0),m}\) and computes \(s_\mathbf{j}^{(K),m}\).

The last phase, step 3, collects the local approximated results into a global output signal. First, a resizing operation is performed on each local output to remove its first and last m entries: each thread \(\mathbf{j}\) resizes the computed signal \(s_\mathbf{j}^{(K),m}\) into the local output \(s_\mathbf{j}^{(K)}\). Finally, the local resized outputs are gathered into the global output signal.

A very important consequence of our strategy is that all the previous steps, summarized in the following parallel algorithm, can be executed by all threads in a fully parallel way.

Algorithm 2. Parallel strategy (steps 1-3)

Now, we discuss how Algorithm 2 is implemented in the CUDA environment. First, input data are transferred to the device global memory. Then, in order to guarantee a reliable workload distribution, the domain decomposition described in step 1 is performed. More in detail, for each thread we set the local size \(n_{loc}=d\) or \(n_{loc}=d+1\), depending on the thread index \(\mathbf{j}\) and the value r in (10). This ensures that, even when the input size N is not divisible by t, a suitable workload distribution is achieved according to (11). By also considering the overlapping entries, the block length becomes \(n_{loc}+2m\), and each thread can retrieve from the global array the amount of data needed for its local computation.

Moreover, access to the global memory is organized through a suitable indexing: every thread loads data from the global memory and stores them in its own local memory, so as to perform each operation independently. Thanks to this, any overhead due to contention and synchronization on the global memory is avoided. In the following, the overall GPU parallel algorithm is shown.

Algorithm 3. GPU-CUDA parallel algorithm

In short, starting from the input size N, the iteration number K and the input signal vector input_data, which are loaded in the global device memory, Algorithm 3 returns the approximated Gaussian convolution in the signal vector results, computed in a GPU-parallel way. More in detail, Algorithm 3 highlights several memory and computation strategies.

To be specific, the first operations (lines 1-8) set up the local stacks of each thread, including the padding related to the overlapping value m. According to step 1, each thread performs a preliminary check on its local chunk by means of the local index chunk_idx: if the entries on the left and right sides of the chunk are available in the input data, they are loaded; otherwise, m zero values are inserted in the overlapped positions. In lines 9-18 the computation phase is performed, applying dynamic parallelism [15] where possible. Dynamic parallelism is an extension of the CUDA programming model that enables a kernel to configure new thread grids and launch new kernels, so as to reduce the computational time. In our implementation, each thread in charge of an input portion launches, within every CUDA kernel, K threads, scheduled so as to perform the forward and backward filter operations of Algorithm 1 in a synchronous way. More in detail, the K threads operate on the signal entries in a pipeline fashion. The use of dynamic parallelism achieves very low execution times, despite the predictable start-up and shut-down overheads. Lines 19-22 gather the local results of each thread into the global output. The copy operation is designed to avoid memory contention: it is memory-safe because each thread writes back only the n_loc central elements of its local result, discarding the 2m boundary values. This property guarantees a strong memory consistency.
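The following is a hedged sketch of this structure, not the authors' exact code: it reuses the routines sketched earlier, and for simplicity a single child thread runs the whole filter, whereas the paper describes a pipelined child grid of K threads. Dynamic parallelism requires compiling with -rdc=true; device-side cudaDeviceSynchronize is the pre-CUDA-12 idiom for waiting on a child grid.

__global__ void rf_child(double *local, double *work,
                         int n, double sigma, int K) {
    /* child grid: apply the K-iterated RF to the padded block */
    gaussian_rf_1st_order(local, work, n, sigma, K);
}

__global__ void rf_parent(const double *s0, double *out, int N,
                          int t, int m, double sigma, int K) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= t) return;
    int n_max = N / t + 1 + 2 * m;               /* largest possible block size */
    /* device-heap buffers: this is why cudaLimitMallocHeapSize is enlarged */
    double *local = (double *)malloc(2 * n_max * sizeof(double));
    double *work  = local + n_max;
    int n_loc;
    load_local_block(s0, N, t, m, j, local, &n_loc);          /* step 1, lines 1-8  */
    rf_child<<<1, 1>>>(local, work, n_loc + 2 * m, sigma, K); /* step 2, lines 9-18 */
    cudaDeviceSynchronize();                     /* wait for the child grid */
    int start = (j < N % t) ? j * (N / t) + j    /* step 3, lines 19-22:    */
                            : j * (N / t) + N % t;
    for (int i = 0; i < n_loc; i++)              /* keep the n_loc central  */
        out[start + i] = local[m + i];           /* entries, drop the halos */
    free(local);
}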

4 Experimental Results

In this section, several experimental results highlight and confirm the reliability and efficiency of the proposed software. The technical specifications of the system on which the GPU-parallel algorithm has been implemented are the following:

  • two Intel Xeon E5-2609v3 CPUs (6 cores, 1.9 GHz), 32 GB of RAM, 4-channel memory with 51 GB/s bandwidth;

  • two NVIDIA GeForce GTX TITAN X GPUs, 3072 CUDA cores, 1 GHz core clock, 12 GB of GDDR5 memory, 336 GB/s bandwidth.

Thanks to the GPUs' computational power, our algorithm exploits the CUDA framework to take full advantage of the parallel environment. Our approach relies on an ad-hoc memory strategy that increases the size of the local stack and heap memory available to each thread and each thread block. With this technique, when a large amount of input data is loaded, the memory access time is reduced. These settings are applied through the CUDA routine cudaDeviceSetLimit, called with the parameters cudaLimitMallocHeapSize and cudaLimitStackSize; according to the hardware architecture, the size is fixed to the value \(1024\times 1024\times 1024\). This trick allows us to allocate dynamic memory, by means of the malloc system call, directly on the device.

A further memory-based optimization has been applied to increase the performance. More precisely, it relies on the L2 cache: a performance gain is obtained by dynamically varying the fetch granularity. More in detail, after the computation of each thread block is completed, the fetch granularity is changed through the CUDA routine cudaDeviceSetLimit, with parameters cudaLimitMaxL2FetchGranularity and 128*sizeof(int). The value 128 is related to the hardware architecture, which can support this range of data loading. With this approach, an appreciable performance improvement has been obtained by exploiting the cache's ability to retain the most frequently used data and instructions during the execution. Accordingly, thanks to the regular access pattern of Recursive Filters during their execution, increasing the fetch granularity reduces the memory access time.
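A minimal host-side sketch of these settings, assuming a CUDA version that exposes cudaLimitMaxL2FetchGranularity; the stack-size value here is an illustrative choice of ours, and the fetch granularity is expressed in bytes, with a documented maximum of 128.

#include <cuda_runtime.h>

void configure_device(void) {
    /* enlarge the per-device heap so that device-side malloc succeeds */
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1024ULL * 1024 * 1024);
    /* enlarge the per-thread stack for the local buffers (illustrative size) */
    cudaDeviceSetLimit(cudaLimitStackSize, 64 * 1024);
    /* raise the maximum L2 fetch granularity, as described above */
    cudaDeviceSetLimit(cudaLimitMaxL2FetchGranularity, 128);
}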

In other words, in a classical execution each thread accesses the global memory to retrieve the data required for the computation. In our case, according to the memory hierarchy and the L2 strategy, each thread looks first in the cache, then in the local stack, and finally in the heap/global memory. This procedure yields a considerable performance gain. In the following tests we set \(\sigma =2\) and use randomly distributed input signals (Gaussian or uniform). The choice \(m=2.5 \sigma =5\) guarantees a good accuracy level, as shown in [8, 20].

Test 1. Here, in order to highlight the performance gain, we set as input: \(N=10^5\), \(m=5\), \(K=10\) and the thread number \(\mathbf{t}=100\). The average times over 10 executions are:

  • 7.24 s, without increasing fetch granularity,

  • 6.93 s, with dynamically varied fetch granularity.

This first test shows a small time difference; however, for a large input dataset, which requires a long execution time even on the GPU, the dynamic fetch granularity can yield a significant performance gain. Indeed, the dynamic operations are closely related to the input size, i.e. to the cache granularity chosen as parameter of cudaDeviceSetLimit, whose maximum value is here fixed to 128 bytes. The next experiment compares serial and GPU-parallel execution times. More precisely, Table 1 shows the execution times of both the serial (CPU) and the parallel (GPU) versions for different input sizes and iteration numbers. The input parameters are set as: Blocks \(\times \) Threads = \(10\,\times \,100\) and \(m=5\).

Table 1. Execution times (in seconds), Blocks \(\times \) Threads = \(10\,\times \,100\), \(m=5\).
Table 2. Execution times (in seconds), iteration number \(K=500\), \(m=5\).

Test 2. This experiment assesses the effects of choosing different CUDA thread configurations. Here we vary the input size, while the iteration number and the overlapping value are set to \(K=500\) and \(m=5\), respectively. A reduction of the execution times is achieved by decreasing the number of blocks, and this holds for all the considered input sizes. This phenomenon is related to a good synchronization of the accesses to the global memory by each thread, which reduces the access time and consequently the overall execution time. These results are confirmed and verified for every CUDA configuration in the range 1000-3072 threads (3072 being the maximum number of threads available on our hardware). Table 2 confirms the reliability of the parallelization strategy by highlighting the access time to the global memory. In particular, the results identify the best CUDA thread configuration, Blocks \(\times \) Threads = \(3\,\times \,1024\), which corresponds to the best execution times.

Test 3. This experiment uses the optimal CUDA configuration and investigates the behaviour of the algorithm when varying both the iteration number K and the input size N. Figure 1 shows an appreciable performance gain and, in particular, a sub-linear growth of the execution time with respect to the problem size (which is linear in \(N\times K\)), typical of GPU architectures.

Fig. 1. Execution times by varying K and N (Blocks \(\times \) Threads = \(3\,\times \,1024\), \(m=5\))

Test 4. Here we show a further performance improvement due to the use of dynamic parallelism. Table 3 reports the best execution times achieved by using dynamic parallelism with an ad-hoc CUDA configuration, i.e. one limited by the resources available on our machine. The comparison with Table 2 (first 4 lines) confirms the improvement for all data sizes. However, we underline that, because of the hardware limits, if the number of threads is set too large, a large portion of them cannot work and, from a numerical point of view, the output becomes completely unreliable. In other words, a fair CUDA configuration avoids a failed computation. For this reason, no results are reported for larger numbers of threads. Finally, the behaviour of the results in Table 3 suggests that a further performance improvement could be obtained on a machine with more computational resources.

Table 3. Execution times (in seconds) with dynamic parallelism, iteration number \(K=500\), \(m=5\).

5 Conclusions

In this paper, we proposed a GPU-parallel algorithm that provides a fast and accurate Gaussian convolution, a fundamental step in both the Data Assimilation and Machine Learning fields. The algorithm relies on the K-iterated first-order Gaussian Recursive Filter and is designed to exploit the dynamic parallelism available in the CUDA environment. The experimental results confirm the reliability and efficiency of the proposed algorithm.