1 Introduction

Data Assimilation is the process by which imperfect numerical forecasts are adjusted according to real observations [1]. In sequential methods, a numerical forecast \(\mathbf{x}^b \in \mathbb {R}^{n\times 1}\) is adjusted according to an array of observations \(\mathbf{y}\in \mathbb {R}^{m\times 1}\), where \(n\) and \(m\) are the number of model components and the number of observations, respectively. When Gaussian assumptions are made on the prior and observational errors, the posterior mode \(\mathbf{x}^a \in \mathbb {R}^{n\times 1}\) can be estimated via the minimization of the Three Dimensional Variational (3D-Var) cost function:

$$\begin{aligned} \mathcal {J}(\mathbf{x}) = \frac{1}{2} \cdot \left\| \mathbf{x}-\mathbf{x}^b \right\| _{\mathbf{B}^{-1}}^2 + \frac{1}{2} \cdot \left\| \mathbf{y}-\mathcal {H}\left( \mathbf{x}\right) \right\| _{\mathbf{R}^{-1}}^2, \end{aligned}$$
(1)

where \(\mathbf{B}\in \mathbb {R}^{n\times n}\) and \(\mathbf{R}\in \mathbb {R}^{m\times m}\) are the background error and the data error covariance matrices, respectively. Likewise, \(\mathcal {H}(\mathbf{x}):\mathbb {R}^{n\times 1} \rightarrow \mathbb {R}^{m\times 1}\) is a (non-) linear observation operator which maps model states onto the observation space. The solution to the optimization problem

$$\begin{aligned} \mathbf{x}^a = \arg \,\underset{\mathbf{x}}{\min } \,\mathcal {J}(\mathbf{x}), \end{aligned}$$
(2)

is immediate when \(\mathcal {H}(\mathbf{x})\) is linear (i.e., closed-form expressions can be obtained to compute \(\mathbf{x}^a\)), but for non-linear observation operators, numerical optimization methods such as Newton's method must be employed [2]. However, since Newton's step is derived from a second-order Taylor polynomial, it can overshoot the actual minimizer. Thus, line-search methods can be employed to estimate suitable step lengths across Newton iterations. A DA method based on this idea is the Maximum-Likelihood-Ensemble-Filter (MLEF), which performs the assimilation step onto the ensemble space. However, the convergence of this method is not guaranteed (the mismatch between the full and the ensemble-space gradients cannot be bounded), and, even more, analysis increments can be impacted by sampling noise. We think that there is an opportunity to enhance line-search methods in the non-linear DA context by employing random descent directions onto which analysis increments can be estimated. Moreover, the analysis increments can be computed onto the model space to ensure global convergence.

This paper is organized as follows: in Sect. 2, we discuss topics related to linear and non-linear data assimilation as well as line-search optimization methods. Section 3 proposes an ensemble Kalman filter implementation via random descent directions. In Sect. 4, experimental tests are performed to assess the accuracy of our proposed filter implementation by using the Lorenz 96 model. Conclusions of this research are stated in Sect. 5.

2 Preliminaries

2.1 The Ensemble Kalman Filter

The Ensemble Kalman Filter (EnKF) is a sequential Monte-Carlo method for parameter and state estimation in highly non-linear models [3]. The popularity of the EnKF owes to its simple formulation and relatively easy implementation. In the EnKF, an ensemble of model realizations is employed to estimate moments of the background error distribution [4]:

$$\begin{aligned} \mathbf{X}_k^b = \left[ \mathbf{x}^{b[1]},\, \mathbf{x}^{b[2]},\, \ldots ,\, \mathbf{x}^{b[N]} \right] \in \mathbb {R}^{n\times N} \end{aligned}$$
(3)

where \(\mathbf{x}^{b[e]} \in \mathbb {R}^{n\times 1}\) stands for the e-th ensemble member, for \(1 \le e \le N\), at time k, for \(0 \le k \le M\). Then, the ensemble mean:

$$\begin{aligned} \overline{\mathbf{x}}^b = \frac{1}{N} \cdot \sum _{e=1}^{N} \mathbf{x}^{b[e]} \in \mathbb {R}^{n\times 1}, \end{aligned}$$
(4)

and the ensemble covariance matrix:

$$\begin{aligned} \mathbf{P}^b = \frac{1}{N-1} \cdot \varvec{\varDelta }{} \mathbf{X}^b \cdot \left[ \varvec{\varDelta }{} \mathbf{X}^b \right] ^T \in \mathbb {R}^{n\times n}, \end{aligned}$$
(5)

act as estimates of the background state \(\mathbf{x}^b\) and the background error covariance matrix \(\mathbf{B}\), respectively, where the matrix of member deviations reads:

$$\begin{aligned} \varvec{\varDelta }{} \mathbf{X}^b = \mathbf{X}^b-\overline{\mathbf{x}}^b \cdot \mathbf{1}^T \in \mathbb {R}^{n\times N}. \end{aligned}$$
(6)

Posterior members can be computed via the use of synthetic observations:

$$\begin{aligned} \mathbf{X}^a = \mathbf{X}^b + \varvec{\varDelta }{} \mathbf{X}^a, \end{aligned}$$
(7)

where the analysis increments can be obtained via the solution of the following linear system:

$$\begin{aligned} \left[ \left[ \mathbf{P}^b \right] ^{-1} + \mathbf{H}^T \cdot \mathbf{R}^{-1} \cdot \mathbf{H}\right] \cdot \varvec{\varDelta }{} \mathbf{X}^a = \mathbf{H}^T \cdot \mathbf{R}^{-1} \cdot \mathbf{D}^s \in \mathbb {R}^{n\times N}, \end{aligned}$$
(8)

and \(\mathbf{D}^s \in \mathbb {R}^{m\times N}\) is the innovation matrix on the synthetic observations whose e-th column reads \(\mathbf{y}-\mathbf{H}\cdot \mathbf{x}^{b[e]} + \varvec{\varepsilon }^{[e]} \in \mathbb {R}^{m\times 1}\) with \(\varvec{\varepsilon }^{[e]} \sim \mathcal {N}\left( \mathbf{0}_{m},\, \mathbf{R}\right) \). In practice, model dimensions range in the order of millions while ensemble sizes are constrained to the hundreds; as a direct consequence, sampling errors impact the quality of analysis increments. To counteract the effects of sampling noise, localization methods are commonly employed [5]. In the EnKF based on a modified Cholesky decomposition (EnKF-MC) [6], the following estimator is employed to approximate the precision covariance matrix of the background error distribution [7]:

$$\begin{aligned} \widehat{\mathbf{B}}^{-1} = \widehat{\mathbf{L}}^T \cdot \widehat{\mathbf{D}}^{-1} \cdot \widehat{\mathbf{L}}\in \mathbb {R}^{n\times n}, \end{aligned}$$
(9)

where the Cholesky factor \(\widehat{\mathbf{L}}\in \mathbb {R}^{n\times n}\) is a lower triangular matrix,

$$\begin{aligned} \left\{ \widehat{\mathbf{L}}\right\} _{i,v} = {\left\{ \begin{array}{ll} -\beta _{i,v} &{} \,,\, v \in P(i,r) \\ 1 &{} \,,\, i=v \\ 0 &{}\,,\, otherwise \end{array}\right. } \,, \end{aligned}$$
(10)

whose non-zero sub-diagonal elements \(\beta _{i,v}\) are obtained by fitting models of the form,

$$\begin{aligned} {\mathbf{x}_{[i]}^T} = \sum _{v \in P(i,\,r)} \beta _{i,v} \cdot {\mathbf{x}_{[v]}^T} + {\varvec{\gamma }_i} \in \mathbb {R}^{N\times 1}, 1 \le i \le n, \end{aligned}$$
(11)

where \({\mathbf{x}_{[i]}^T} \in \mathbb {R}^{N\times 1}\) denotes the i-th row (model component) of the ensemble (3), components of vector \({\varvec{\gamma }_i} \in \mathbb {R}^{N\times 1}\) are samples from a zero-mean Normal distribution with unknown variance \(\sigma ^2\), and \(\mathbf{D}\in \mathbb {R}^{n\times n}\) is a diagonal matrix whose diagonal elements read,

$$\begin{aligned} \left\{ \mathbf{D}\right\} _{i,i}= & {} \widehat{\mathbf{var}}\left( {\mathbf{x}_{[i]}^T} -\sum _{v \in P(i,\,r)} \beta _{i,v} \cdot {\mathbf{x}_{[v]}^T} \right) \end{aligned}$$
(12)
$$\begin{aligned}\approx & {} \mathbf{var}\left( {\varvec{\gamma }_i} \right) = \sigma ^2 >0, \text { with } \left\{ \mathbf{D}\right\} _{1,1} = \widehat{\mathbf{var}}\left( {\mathbf{x}_{[1]}^T} \right) , \end{aligned}$$
(13)
(13)

where \(\mathbf{var}(\bullet )\) and \(\widehat{\mathbf{var}}(\bullet )\) denote the actual and the empirical variances, respectively. The analysis equations can then be written as follows:

$$\begin{aligned} \mathbf{X}^a = \mathbf{X}^b + \left[ \widetilde{\mathbf{D}}^{-1/2} \cdot \widetilde{\mathbf{L}}\right] ^{-1} \cdot \mathbf{E}\in \mathbb {R}^{n\times N}, \end{aligned}$$
(14)

where

$$\begin{aligned} \widehat{\mathbf{A}}^{-1}= & {} \widetilde{\mathbf{L}}^T \cdot \widetilde{\mathbf{D}}^{-1} \cdot \widetilde{\mathbf{L}}= \widehat{\mathbf{B}}^{-1} + \mathbf{H}^T \cdot \mathbf{R}^{-1} \cdot \mathbf{H}\\= & {} \widehat{\mathbf{L}}^T \cdot \widehat{\mathbf{D}}^{-1} \cdot \widehat{\mathbf{L}}+ \mathbf{H}^T \cdot \mathbf{R}^{-1} \cdot \mathbf{H}\in \mathbb {R}^{n\times n},\nonumber \end{aligned}$$
(15)

is an estimate of the posterior precision covariance matrix, the columns of matrix \(\mathbf{E}\in \mathbb {R}^{n\times N}\) are formed by samples from a standard Normal distribution, \(\widetilde{\mathbf{L}}\in \mathbb {R}^{n\times n}\) is a lower triangular matrix (with the same structure as \(\widehat{\mathbf{L}}\)), and \(\widetilde{\mathbf{D}}^{-1} \in \mathbb {R}^{n\times n}\) is a diagonal matrix. Note that \(\left[ \widetilde{\mathbf{D}}^{-1/2} \cdot \widetilde{\mathbf{L}}\right] ^{-1} \cdot \left[ \widetilde{\mathbf{D}}^{-1/2} \cdot \widetilde{\mathbf{L}}\right] ^{-T} = \widehat{\mathbf{A}}\), so the second term in (14) carries the posterior covariance. Given the triangular structure of \(\widetilde{\mathbf{D}}^{-1/2} \cdot \widetilde{\mathbf{L}}\in \mathbb {R}^{n\times n}\), its direct inversion can be avoided [8, Algorithm 1].
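For concreteness, the following is a minimal sketch (our own illustration, not the implementation of [6, 8]) of the estimator (9)–(13) on a one-dimensional grid, where \(P(i,\,r)\) collects the up-to-r predecessors of component i. The function name, the plain least-squares fit, and the assumption that \(N\) exceeds the number of predecessors are ours; in practice a regularized regression may be preferred.

```python
import numpy as np

def modified_cholesky_precision(X_b, r):
    """Estimate B^{-1} ~ L^T D^{-1} L as in Eqs. (9)-(13) via local regressions."""
    n, N = X_b.shape
    dX = X_b - X_b.mean(axis=1, keepdims=True)           # member deviations, Eq. (6)
    L = np.eye(n)                                         # unit lower-triangular factor, Eq. (10)
    d_inv = np.zeros(n)                                   # diagonal of D^{-1}, Eqs. (12)-(13)
    d_inv[0] = 1.0 / np.var(dX[0], ddof=1)
    for i in range(1, n):
        pred = np.arange(max(0, i - r), i)                # predecessors P(i, r) on a 1D grid
        Z = dX[pred].T                                    # (N x |pred|) regressors
        beta, *_ = np.linalg.lstsq(Z, dX[i], rcond=None)  # fit the model (11)
        resid = dX[i] - Z @ beta                          # empirical residuals gamma_i
        L[i, pred] = -beta
        d_inv[i] = 1.0 / np.var(resid, ddof=1)
    return L.T @ np.diag(d_inv) @ L                       # precision estimate, Eq. (9)
```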

2.2 Maximum Likelihood Ensemble Filter (MLEF)

To handle non-linear observation operators during assimilation steps, optimization-based methods can be employed to estimate analysis increments. A well-known method in this context is the Maximum-Likelihood-Ensemble-Filter (MLEF) [9, 10]. This square-root filter employs the ensemble space to compute analysis increments, that is:

$$\begin{aligned} \overline{\mathbf{x}}^a-\overline{\mathbf{x}}^b \in \mathbf{range} \left\{ \varvec{\varDelta }{} \mathbf{X}\right\} , \end{aligned}$$

where \(\varvec{\varDelta }{} \mathbf{X}\) acts as a pseudo square-root approximation of \(\mathbf{B}^{1/2}\). Thus, vector states can be written as follows:

$$\begin{aligned} \mathbf{x}= \overline{\mathbf{x}}^b+\varvec{\varDelta }{} \mathbf{X}\cdot \mathbf{w}, \end{aligned}$$
(16)

where \(\mathbf{w}\in \mathbb {R}^{N\times 1}\) is a vector of weights in the (redundant) ensemble coordinates to be determined. By replacing (16) in (1) one obtains:

$$\begin{aligned} \mathcal {J}(\mathbf{x}) = \mathcal {J}\left( \overline{\mathbf{x}}^b+\varvec{\varDelta }{} \mathbf{X}\cdot \mathbf{w}\right) = \frac{N-1}{2} \cdot \left\| \mathbf{w} \right\| ^2 + \frac{1}{2} \cdot \left\| \mathbf{y}-\mathcal {H}\left( \overline{\mathbf{x}}^b+\varvec{\varDelta }{} \mathbf{X}\cdot \mathbf{w}\right) \right\| ^2_{\mathbf{R}^{-1}}. \end{aligned}$$
(17)

The optimization problem to solve reads:

$$\begin{aligned} \mathbf{w}^{*} = \arg \, \underset{\mathbf{w}}{\min } \, \mathcal {J}\left( \overline{\mathbf{x}}^b+\varvec{\varDelta }{} \mathbf{X}\cdot \mathbf{w}\right) . \end{aligned}$$
(18)

This problem can be numerically solved via Line-Search (LS) and/or Trust-Region methods. However, convergence is not ensured since gradient approximations are performed onto a reduced space whose dimension is much smaller than that of the model.
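As an illustration of how the ensemble-space problem (18) can be attacked, the sketch below (our simplification, not the MLEF reference implementation of [9, 10]) minimizes the cost (17) with steepest descent and a crude halving line search; the finite-difference gradient and every name in the snippet are our assumptions, made so that a non-linear operator \(\mathcal {H}\) can be passed as a black box.

```python
import numpy as np

def ensemble_space_analysis(x_b_mean, dX, y, R_inv, H, iters=20, fd_eps=1e-6):
    """Minimize the ensemble-space cost (17) for the weights w of Eq. (16)."""
    N = dX.shape[1]

    def cost(w):
        innov = y - H(x_b_mean + dX @ w)
        return 0.5 * (N - 1) * w @ w + 0.5 * innov @ (R_inv @ innov)

    def grad(w):                                   # forward-difference gradient
        g, f0 = np.zeros(N), cost(w)
        for e in range(N):
            w_p = w.copy()
            w_p[e] += fd_eps
            g[e] = (cost(w_p) - f0) / fd_eps
        return g

    w = np.zeros(N)
    for _ in range(iters):
        g, f0, step = grad(w), cost(w), 1.0
        while cost(w - step * g) > f0 and step > 1e-12:   # halving rule
            step *= 0.5
        w = w - step * g
    return x_b_mean + dX @ w                       # analysis mean via Eq. (16)
```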

2.3 Line Search Optimization Methods

The solution of optimization problems of the form (2) can be approximated via Numerical Optimization. In this context, solutions are obtained via iterations:

$$\begin{aligned} \mathbf{x}_{k+1} = \mathbf{x}_k + \varvec{\varDelta }{} \mathbf{s}_k , \end{aligned}$$
(19)

where k denotes the iteration index and \(\varvec{\varDelta }{} \mathbf{s}_k \in \mathbb {R}^{n\times 1}\) is a descent direction, for instance, the gradient descent direction [11]

$$\begin{aligned} \varvec{\varDelta }{} \mathbf{s}_k = -\nabla \mathcal {J}\left( \mathbf{x}_k \right) , \end{aligned}$$
(20a)

Newton's step [12],

$$\begin{aligned} \nabla ^2 \mathcal {J}\left( \mathbf{x}_k \right) \cdot \varvec{\varDelta }{} \mathbf{s}_k = -\nabla \mathcal {J}\left( \mathbf{x}_k \right) , \end{aligned}$$
(20b)

or a quasi-Newton based method [13],

$$\begin{aligned} \mathbf{P}_k \cdot \varvec{\varDelta }{} \mathbf{s}_k = -\nabla \mathcal {J}\left( \mathbf{x}_k \right) , \end{aligned}$$
(20c)

where \(\mathbf{P}_k \in \mathbb {R}^{n\times n}\) is a positive definite matrix. A concise survey of Newton-based methods can be found in [14]. Since the directions in (20) are derived from first- or second-order Taylor polynomials, the step size can be chosen via line-search [15] and/or trust-region [16] methods. In this manner, global convergence of the optimization method to stationary points of the cost function (1) can be ensured, as long as some assumptions on the function, its gradient, and (potentially) its Hessian hold [17]. In the context of line search, the following assumptions are commonly made:

  1. C1

    A lower bound of \(\mathcal {J}(\mathbf{x})\) exists on \(\varOmega _0 = \{\mathbf{x}\in \mathbb {R}^{n \times 1},\, \mathcal {J}(\mathbf{x}) \le \mathcal {J}\left( \mathbf{x}^{\dag } \right) \}\), where \(\mathbf{x}^{\dag } \in \mathbb {R}^{n \times 1}\) is available.

  2. C2

There is a constant \(L > 0\) such that:

    $$\begin{aligned} \left\| \nabla \mathcal {J}(\mathbf{x})-\nabla \mathcal {J}(\mathbf{z}) \right\| \le L \cdot \left\| \mathbf{x}-\mathbf{z} \right\| ,\, \text { for } \mathbf{x},\,\mathbf{z}\in B, \text { and } L > 0, \end{aligned}$$

    where B is an open convex set which contains \(\varOmega _0\). These conditions together with iterates of the form,

    $$\begin{aligned} \mathbf{x}_{k+1} = \mathbf{x}_k + \alpha \cdot \varvec{\varDelta }{} \mathbf{s}_k , \end{aligned}$$
    (21)

    ensure global convergence [18] as long as \(\alpha \) is chosen as an (approximated) minimizer of

    $$\begin{aligned} \alpha ^{*} = \arg \,\underset{\alpha \ge 0}{\min } \,\mathcal {J}\left( \mathbf{x}_k + \alpha \cdot \varvec{\varDelta }{} \mathbf{s}_k\right) . \end{aligned}$$
    (22)

In practice, step-size rules such as the Goldstein rule [19], the strong Wolfe rule [20], and the halving method [21] are employed to approximately solve (22). Moreover, soft computing methods can be employed for solving (22) [22].
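The snippet below is a small sketch of the iteration (21) with an Armijo-type backtracking rule as an inexpensive surrogate for the exact minimization (22); `cost`, `grad`, and `direction` are hypothetical callables supplied by the user (e.g., one of the choices in (20a)–(20c)), and the constants are conventional defaults rather than values taken from the references.

```python
import numpy as np

def line_search_update(x, cost, grad, direction, c1=1e-4, alpha=1.0, shrink=0.5):
    """One update of the form (21) with backtracking on the step size."""
    d = direction(x)                       # assumed to be a descent direction
    g, f0 = grad(x), cost(x)
    # Backtrack until a sufficient-decrease (Armijo) condition holds.
    while cost(x + alpha * d) > f0 + c1 * alpha * (g @ d):
        alpha *= shrink
        if alpha < 1e-12:                  # give up on pathological directions
            break
    return x + alpha * d
```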

3 Proposed Method: An Ensemble Kalman Filter Implementation via Line-Search Optimization and Random Descent Directions

In this section, we propose an iterative method to estimate the solution of the optimization problem (2). We detail our filter derivation, and subsequently, we theoretically prove the convergence of our method.

3.1 Filter Derivation

Starting with the forecast ensemble (3), we compute an estimate \(\widehat{\mathbf{B}}^{-1}\) of the precision covariance \(\mathbf{B}^{-1}\) via modified Cholesky decomposition. Then, we perform an iterative process as follows: let \(\mathbf{x}_0 = \overline{\mathbf{x}}^b\), at iteration k, for \(0 \le k \le K\), where K is the maximum number of iterations, we build a quadratic approximation of \(\mathcal {J}(\mathbf{x})\) about \(\mathbf{x}_k\)

$$\begin{aligned} \mathcal {J}_k(\mathbf{x}) = \frac{1}{2} \cdot \left\| \mathbf{x}-\mathbf{x}_k \right\| ^2_{\widehat{\mathbf{B}}^{-1}} + \frac{1}{2} \cdot \left\| \mathbf{y}-\widehat{\mathcal {H}}_k(\mathbf{x}) \right\| ^2_{\mathbf{R}^{-1}} , \end{aligned}$$
(23a)

where

$$\begin{aligned} \widehat{\mathcal {H}}_k(\mathbf{x}) = \mathcal {H}\left( \mathbf{x}_k \right) + \mathbf{H}_k \cdot \left[ \mathbf{x}-\mathbf{x}_k \right] , \end{aligned}$$

and \(\mathbf{H}_k\) is the Jacobian of \(\mathcal {H}(\mathbf{x})\) at \(\mathbf{x}_k\). The gradient of (23a) reads:

$$\begin{aligned} \nabla \mathcal {J}_k(\mathbf{x})= & {} \widehat{\mathbf{B}}^{-1} \cdot \left[ \mathbf{x}-\mathbf{x}_k \right] - \mathbf{H}_k^T \cdot \mathbf{R}^{-1} \cdot \left[ \mathbf{d}_k - \mathbf{H}_k \cdot \mathbf{x}\right] \, \\= & {} \left[ \widehat{\mathbf{B}}^{-1} + \mathbf{H}_k^T \cdot \mathbf{R}^{-1} \cdot \mathbf{H}_k \right] \cdot \mathbf{x}- \widehat{\mathbf{B}}^{-1} \cdot \mathbf{x}_k - \mathbf{H}_k^T \cdot \mathbf{R}^{-1} \cdot \mathbf{d}_k \in \mathbb {R}^{n\times 1} , \end{aligned}$$

where \(\mathbf{d}_k = \mathbf{y}-\mathcal {H}(\mathbf{x}_k) +\mathbf{H}_k \cdot \mathbf{x}_k \in \mathbb {R}^{m\times 1}\). Readily, the Hessian of (23a) is

$$\begin{aligned} \nabla ^2 \mathcal {J}_k(\mathbf{x}) = \widehat{\mathbf{B}}^{-1} + \mathbf{H}_k^T \cdot \mathbf{R}^{-1} \cdot \mathbf{H}_k \in \mathbb {R}^{n\times n} , \end{aligned}$$
(23b)

and therefore, Newton's step can be written as follows:

$$\begin{aligned} \mathbf{p}_k \left( \mathbf{x}_k \right) = - \left[ \nabla ^2 \mathcal {J}_k\left( \mathbf{x}_k \right) \right] ^{-1} \cdot \nabla \mathcal {J}_k\left( \mathbf{x}_k \right) = \left[ \widehat{\mathbf{B}}^{-1} + \mathbf{H}_k^T \cdot \mathbf{R}^{-1} \cdot \mathbf{H}_k \right] ^{-1} \cdot \mathbf{H}_k^T \cdot \mathbf{R}^{-1} \cdot \left[ \mathbf{y}-\mathcal {H}\left( \mathbf{x}_k \right) \right] \in \mathbb {R}^{n\times 1} . \end{aligned}$$
(23c)

As we mentioned before, the direction (23c) is based on a quadratic approximation of \(\mathcal {J}(\mathbf{x})\) and, depending on how non-linear \(\mathcal {H}(\mathbf{x})\) is, (23c) can poorly estimate the analysis increments. Thus, we compute U random directions based on the Newton one as follows:

$$\begin{aligned} \mathbf{q}_{u,k} = \varPi _u \cdot \mathbf{p}_k \left( \mathbf{x}_k \right) \in \mathbb {R}^{n\times 1} , \text { for } 1 \le u \le U , \end{aligned}$$
(23d)

where the matrices \(\varPi _u \in \mathbb {R}^{n\times n}\) are randomly formed symmetric positive definite matrices with \(\left\| \varPi _u \right\| = 1\). We constrain the analysis increments to the space spanned by the vectors (23d), that is,

$$\begin{aligned} \mathbf{x}_{k+1} - \mathbf{x}_{k} \in \mathbf{range}\left\{ \mathbf{Q}_k \right\} , \end{aligned}$$

where the u-th column of \(\mathbf{Q}_k \in \mathbb {R}^{n\times U}\) reads \(\mathbf{q}_{u,k}\). Thus,

$$\begin{aligned} \mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{Q}_k \cdot \varvec{\gamma }^{*} , \end{aligned}$$
(23e)

where \(\varvec{\gamma }^{*} \in \mathbb {R}^{U \times 1}\) is estimated by solving the following optimization problem

$$\begin{aligned} \varvec{\gamma }^{*} = \arg \, \underset{\varvec{\gamma }}{\min } \, \mathcal {J}\left( \mathbf{x}_k+\mathbf{Q}_k \cdot \varvec{\gamma }\right) . \end{aligned}$$
(23f)

To solve (23f), we proceed as follows: we generate Z random vectors \(\varvec{\gamma }_z \in \mathbb {R}^{U \times 1}\), for \(1 \le z \le Z\), with \(\left\| \varvec{\gamma }_z \right\| = 1\). Then, for each direction \(\mathbf{Q}_k \cdot \varvec{\gamma }_z \in \mathbb {R}^{n\times 1}\), we solve the following one-dimensional optimization problem

$$\begin{aligned} \alpha ^{*}_z = \arg \, \underset{\alpha _z}{\min } \, \mathcal {J}\left( \mathbf{x}_k+\alpha _z \cdot \left[ \mathbf{Q}_k \cdot \varvec{\gamma }_z \right] \right) , \end{aligned}$$
(23g)

and therefore, an estimate of the next iterate (23e) reads:

$$\begin{aligned} \mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{Q}_k \cdot \left[ \alpha _k^{*} \cdot \varvec{\gamma }_k \right] , \end{aligned}$$
(23h)

where the pair \((\alpha ^{*}_k,\, \varvec{\gamma }_k )\) is chosen as the duple \((\alpha ^{*}_z,\, \varvec{\gamma }_z )\) which provides the smallest value in (23g), for \(1 \le z \le Z\). The overall process detailed in Eqs. (23) is repeated until some stopping criterion is satisfied (e.g., a maximum number of iterations K is reached).
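The following sketch summarizes our reading of one outer iteration (23a)–(23h); it is illustrative rather than the authors' code. The SPD construction of the matrices \(\varPi _u\), the use of `scipy.optimize.minimize_scalar` as the one-dimensional "exact" line search in (23g), and all function and parameter names are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def ran_enkf_iteration(x_k, x_b, B_inv, R_inv, y, H, H_jac, U=10, Z=30, rng=None):
    """One outer iteration of Eqs. (23a)-(23h); H and H_jac are callables."""
    rng = np.random.default_rng() if rng is None else rng
    n = x_k.size
    Hk = H_jac(x_k)                                   # Jacobian of H at x_k
    d_k = y - H(x_k) + Hk @ x_k
    hess = B_inv + Hk.T @ R_inv @ Hk                  # Hessian, Eq. (23b)
    grad_k = hess @ x_k - B_inv @ x_k - Hk.T @ R_inv @ d_k
    p_k = -np.linalg.solve(hess, grad_k)              # Newton step, Eq. (23c)

    Q = np.empty((n, U))                              # random directions, Eq. (23d)
    for u in range(U):
        A = rng.standard_normal((n, n))
        Pi = A @ A.T + n * np.eye(n)                  # symmetric positive definite
        Q[:, u] = (Pi / np.linalg.norm(Pi)) @ p_k

    def J(x):                                         # 3D-Var cost, Eq. (1)
        r1, r2 = x - x_b, y - H(x)
        return 0.5 * r1 @ (B_inv @ r1) + 0.5 * r2 @ (R_inv @ r2)

    best_val, best_inc = J(x_k), np.zeros(n)
    for _ in range(Z):                                # sampled directions, Eq. (23g)
        gamma = rng.standard_normal(U)
        gamma /= np.linalg.norm(gamma)
        direction = Q @ gamma
        res = minimize_scalar(lambda a: J(x_k + a * direction))
        if res.fun < best_val:
            best_val, best_inc = res.fun, res.x * direction
    return x_k + best_inc                             # update (23h)
```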

Based on the iterations (23h), we estimate the analysis state as follows:

$$\begin{aligned} \overline{\mathbf{x}}^a = \overline{\mathbf{x}}^b + \sum _{k=1}^K \mathbf{Q}_k \cdot \left[ \alpha ^{*}_k \cdot \varvec{\gamma }_k\right] = \mathbf{x}_{K}. \end{aligned}$$
(24)

The inverse of the Hessian (23b) provides an estimate of the posterior error covariance matrix. Thus, posterior members (analysis ensemble) can be sampled as follows:

$$\begin{aligned} \mathbf{x}^{a[e]} \sim \mathcal {N}\left( \overline{\mathbf{x}}^a,\, \left[ \nabla ^2 \mathcal {J}_K \left( \overline{\mathbf{x}}^a \right) \right] ^{-1}\right) . \end{aligned}$$
(25)

For an efficient implementation of the sampling process (25), the reader can consult [23]. Afterwards, the analysis members are propagated in time until a new observation is available. We name this formulation the Random Ensemble Kalman Filter (RAN-EnKF).
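One standard way to realize the sampling step (25) without explicitly inverting the Hessian is to factorize the precision matrix and solve triangular systems, as in the sketch below (our illustration; [23] discusses more efficient alternatives).

```python
import numpy as np

def sample_posterior_members(x_a_mean, hess, N, rng=None):
    """Draw N samples from N(x_a_mean, hess^{-1}) as in Eq. (25)."""
    rng = np.random.default_rng() if rng is None else rng
    n = x_a_mean.size
    C = np.linalg.cholesky(hess)            # hess = C @ C.T with C lower triangular
    E = rng.standard_normal((n, N))
    Z = np.linalg.solve(C.T, E)             # Cov(Z) = (C @ C.T)^{-1} = hess^{-1}
    return x_a_mean[:, None] + Z            # columns are the analysis members
```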

3.2 Convergence of the Analysis Step in the RAN-EnKF

For proving the convergence of our method, we consider the assumptions C1, C2, and

$$\begin{aligned} \nabla \mathcal {J}\left( \mathbf{x}_k\right) ^T \cdot \mathbf{q}_{u,k} < 0,\, \text { for } 1 \le u \le U . \end{aligned}$$
(26)

The next theorem states conditions that ensure global convergence of the analysis step in the RAN-EnKF.

Theorem 1

If conditions C1, C2, and (26) hold, and the analysis step of the RAN-EnKF with exact line search generates an infinite sequence \(\left\{ \mathbf{x}_k\right\} _{k=0}^{\infty }\), then

$$\begin{aligned} \underset{k \rightarrow \infty }{\lim } \left[ \frac{ - \nabla \mathcal {J}\left( \mathbf{x}_k \right) ^T \cdot \mathbf{Q}_k \cdot \varvec{\gamma }^{*} }{ \left\| \mathbf{Q}_k \cdot \varvec{\gamma }^{*} \right\| } \right] ^2 = 0 \end{aligned}$$
(27)

holds.

Proof

By Taylor's theorem with integral remainder we know that,

$$\begin{aligned} \mathcal {J}\left( \mathbf{x}_{k+1} \right) = \mathcal {J}\left( \mathbf{x}_k \right) + \nabla \mathcal {J}\left( \mathbf{x}_k \right) ^T \cdot \left[ \mathbf{x}_{k+1}-\mathbf{x}_k \right] + \int _0^1 \left[ \nabla \mathcal {J}\left( \mathbf{x}_k + t \cdot \left[ \mathbf{x}_{k+1}-\mathbf{x}_k \right] \right) - \nabla \mathcal {J}\left( \mathbf{x}_k \right) \right] ^T \cdot \left[ \mathbf{x}_{k+1}-\mathbf{x}_k \right] \, dt , \end{aligned}$$

and therefore, for any \(\mathbf{x}_{k+1}\) on the ray \(\mathbf{x}_k+ \alpha \cdot \mathbf{Q}_k \cdot \varvec{\gamma }^{*}\), with \(\alpha \ge 0\), we have

$$\begin{aligned} \mathcal {J}\left( \mathbf{x}_{k+1} \right) - \mathcal {J}\left( \mathbf{x}_k \right) = \alpha \cdot \nabla \mathcal {J}\left( \mathbf{x}_k \right) ^T \cdot \mathbf{Q}_k \cdot \varvec{\gamma }^{*} + \alpha \cdot \int _0^1 \left[ \nabla \mathcal {J}\left( \mathbf{x}_k + t \cdot \alpha \cdot \mathbf{Q}_k \cdot \varvec{\gamma }^{*} \right) - \nabla \mathcal {J}\left( \mathbf{x}_k \right) \right] ^T \cdot \mathbf{Q}_k \cdot \varvec{\gamma }^{*} \, dt ; \end{aligned}$$

hence, by the Cauchy-Schwarz inequality and the Lipschitz condition C2, we have

$$\begin{aligned} \mathcal {J}\left( \mathbf{x}_{k+1} \right) - \mathcal {J}\left( \mathbf{x}_k \right) \le \alpha \cdot \nabla \mathcal {J}\left( \mathbf{x}_k \right) ^T \cdot \mathbf{Q}_k \cdot \varvec{\gamma }^{*} + \frac{L}{2} \cdot \alpha ^2 \cdot \left\| \mathbf{Q}_k \cdot \varvec{\gamma }^{*} \right\| ^2 ;
\end{aligned}$$

choose

$$\begin{aligned} \alpha ^{*} = -\frac{\nabla \mathcal {J}\left( \mathbf{x}_k \right) ^T \cdot \mathbf{Q}_k \cdot \varvec{\gamma }^{*}}{L \cdot \left\| \mathbf{Q}_k \cdot \varvec{\gamma }^{*} \right\| ^2} , \end{aligned}$$

therefore,

$$\begin{aligned} \mathcal {J}\left( \mathbf{x}_k \right)- & {} \mathcal {J}\left( \mathbf{x}_{k+1} \right) \ge \frac{\left[ \nabla \mathcal {J}\left( \mathbf{x}_k \right) ^T \cdot \mathbf{Q}_k \cdot \varvec{\gamma }^{*} \right] ^2}{L \cdot \left\| \mathbf{Q}_k \cdot \varvec{\gamma }^{*} \right\| ^2} \\- & {} \frac{1}{2} \cdot \frac{\left[ - \nabla \mathcal {J}\left( \mathbf{x}_k \right) ^T \cdot \mathbf{Q}_k \cdot \varvec{\gamma }^{*} \right] ^2}{L \cdot \left\| \mathbf{Q}_k \cdot \varvec{\gamma }^{*} \right\| ^2} \\= & {} \frac{1}{2 \cdot L} \cdot \left[ - \frac{ \nabla \mathcal {J}\left( \mathbf{x}_k \right) ^T \cdot \mathbf{Q}_k \cdot \varvec{\gamma }^{*} }{ \left\| \mathbf{Q}_k \cdot \varvec{\gamma }^{*} \right\| } \right] ^2 . \end{aligned}$$

By conditions C1, C2, and (26), it follows that \(\left\{ \mathcal {J}\left( \mathbf{x}_k\right) \right\} _{k = 0}^{\infty }\) is a monotonically decreasing sequence which is bounded below; therefore, \(\left\{ \mathcal {J}\left( \mathbf{x}_k\right) \right\} _{k = 0}^{\infty }\) has a limit, and consequently (27) holds.

We are now ready to test our proposed method numerically.

4 Experimental Results

For the experiments, we consider non-linear observation operators, a current challenge in the context of DA [6, 24]. We make use of the Lorenz-96 model [25] as our surrogate model during the experiments. The Lorenz-96 model is described by the following set of ordinary differential equations [26]:

$$\begin{aligned} \frac{dx_j}{dt} = {\left\{ \begin{array}{ll} \left( x_2-x_{n-1}\right) \cdot x_{n} -x_1 +F &{} \text { for j=1}, \\ \left( x_{j+1}-x_{j-2}\right) \cdot x_{j-1} -x_j +F &{} \text { for } 2 \le j \le n-1, \\ \left( x_1-x_{n-2}\right) \cdot x_{n-1} -x_{n} +F &{} \text { for } j=n, \end{array}\right. } \end{aligned}$$
(28)

where F is the external forcing and \(n=40\) is the number of model components. Periodic boundary conditions are assumed. When \(F=8\) units the model exhibits chaotic behavior, which makes it a relevant surrogate problem for atmospheric dynamics [27, 28]. A time unit in the Lorenz-96 represents 7 days in the atmosphere. We create an initial pool \(\widehat{\mathbf{X}^b}_0\) of \(\widehat{N}=10^4\) members. The error statistics of observations are as follows:

$$\begin{aligned} \mathbf{y}_k \sim \mathcal {N}\left( \mathcal {H}_k \left( \mathbf{x}^{*}_k \right) , \left[ {\epsilon ^{o}} \right] ^2 \cdot \mathbf{I}\right) , \text { for } 0 \le k \le M, \end{aligned}$$

where the standard deviation of the observational errors is \(\epsilon ^{o} = 10^{-2}\). The observed components are randomly chosen at the different assimilation cycles. We use the non-smooth and non-linear observation operator [29]:

$$\begin{aligned} \left\{ \mathcal {H}\left( \mathbf{x}\right) \right\} _j = \frac{\left\{ \mathbf{x}\right\} _j }{2} \cdot \left[ \left( \frac{\left| \left\{ \mathbf{x}\right\} _j \right| }{2} \right) ^{\beta -1} +1 \right] , \end{aligned}$$
(29)

where j denotes the j-th observed component of the model state. Likewise, \(\beta \in \{1,\,3,\,5,\,7,\,9 \}\). Since the observation operator (29) is non-smooth, gradients of (1) are approximated by using the \(\ell _2\)-norm. A full observational network is available at assimilation steps. The ensemble size for the benchmarks is \(N= 20\). These members are randomly chosen from the pool \(\widehat{\mathbf{X}^b}_0\) for the different experiments in order to form the initial ensemble \(\mathbf{X}^b_0\) for the assimilation window; evidently, \(\mathbf{X}^b_0 \subset \widehat{\mathbf{X}^b}_0\). The \(\ell _2\)-norm of errors is utilized as a measure of accuracy at the assimilation step k,

$$\begin{aligned} \mathcal {E} \left( \mathbf{x}_k,\,\mathbf{x}^{*} \right) = \sqrt{\left[ \mathbf{x}^{*} - \mathbf{x}_k \right] ^T \cdot \left[ \mathbf{x}^{*} - \mathbf{x}_k \right] } , \end{aligned}$$
(30)

where \(\mathbf{x}^{*}\) and \(\mathbf{x}_k\) are the reference and the current solution at iteration k, respectively. The initial background error, on average, reads \(\epsilon ^b \approx 31.73\); for convenience, this value is expressed in the log scale: \(\log (\epsilon ^b) = 3.45\). We consider a single assimilation cycle for the experiments. We try sub-spaces of dimensions \(U \in \{10,\,20,\,30\}\) and numbers of samples from those spaces of \(Z \in \{10,\,30,\,50\}\). We set a maximum number of 40 iterations. We compare our results with those obtained by the MLEF with \(N= 40\); note that the ensemble size employed by the MLEF doubles that of our method.
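For reference, the following sketch collects the two experimental ingredients, the Lorenz-96 right-hand side (28) and the non-smooth observation operator (29); the fourth-order Runge-Kutta integrator and the time step are our own choices and are not specified in the text.

```python
import numpy as np

def lorenz96_rhs(x, F=8.0):
    """Right-hand side of Eq. (28) with periodic boundary conditions."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def observation_operator(x, beta):
    """Component-wise non-smooth operator of Eq. (29)."""
    return 0.5 * x * ((np.abs(x) / 2.0) ** (beta - 1) + 1.0)

def rk4_step(x, dt=0.01, F=8.0):
    """One classical fourth-order Runge-Kutta step (our choice of integrator)."""
    k1 = lorenz96_rhs(x, F)
    k2 = lorenz96_rhs(x + 0.5 * dt * k1, F)
    k3 = lorenz96_rhs(x + 0.5 * dt * k2, F)
    k4 = lorenz96_rhs(x + dt * k3, F)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
```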

Fig. 1. \(\ell _2\)-norm of errors in the log-scale for the 3D-Var optimization problem with different degrees \(\beta \) of the observation operator and dimensions of sub-spaces U. For the largest \(\beta \) value, the MLEF diverges and therefore, its results are not reported.

Fig. 2. \(\ell _2\)-norm of errors in the log-scale for the 3D-Var optimization problem with different degrees \(\beta \) of the observation operator and numbers of samples Z. For the largest \(\beta \) value, the MLEF diverges and therefore, its results are not reported.

We group the results in Figs. 1 and 2 by sub-space dimension and by sample size, respectively. As can be seen, the RAN-EnKF outperforms the MLEF in terms of the \(\ell _2\)-norm of errors in all cases; note that the errors of the compared filter implementations differ by orders of magnitude. This can be explained as follows: the MLEF method performs the assimilation step onto a space given by the ensemble size, which is equivalent to performing the assimilation by using the sample covariance matrix (5), whose quality is impacted by sampling errors. By contrast, in our formulation, we employ sub-spaces whose basis vectors rely on the precision covariance (9) and, therefore, the impact of sampling errors is mitigated during optimization steps. As the degree \(\beta \) of the observation operator increases, the accuracy of the MLEF degrades, and consequently, this method diverges for the largest \(\beta \) value. On the other hand, convergence is always achieved by the RAN-EnKF method; this should be expected based on the theoretical results of Theorem 1. It should be noted that, as the \(\beta \) value increases, the 3D-Var cost function becomes highly non-linear, and as a consequence, more iterations are needed to decrease the errors (as in any iterative optimization method). In general, it can be seen that as the number of samples Z increases, the results improve regardless of the sub-space dimension U. However, it is clear that, for highly non-linear observation operators, it is better to have small sub-spaces and a large number of samples.

5 Conclusions

In this paper, we propose an ensemble Kalman filter implementation via line-search optimization, which we name the Random Ensemble Kalman Filter (RAN-EnKF). The proposed method proceeds as follows: an ensemble of model realizations is employed to estimate background moments, and then quadratic approximations of the 3D-Var cost function are built across iterations via the linearization of the observation operator about the current solution. These approximations serve to estimate descent directions of the 3D-Var cost function, which are perturbed to obtain additional directions along which analysis increments can be computed. We theoretically prove the global convergence of our optimization method. Experimental tests are performed by using the Lorenz-96 model, and our method is compared against the Maximum-Likelihood-Ensemble-Filter. The results reveal that the RAN-EnKF outperforms the MLEF in terms of the \(\ell _2\)-norm of errors and, even more, it achieves convergence in cases wherein the MLEF diverges.