
1 Motivation

Characterizing high-impact rare and extreme events such as hurricanes, tornadoes, and cascading power failures is of great social and economic importance. Many of these natural phenomena and engineering systems can be modeled by dynamical systems. The models representing these complex phenomena are approximate and have many sources of uncertainty. For example, the exact initial and boundary conditions or the external forcings that are necessary to fully define the underlying model might be unknown. Other parameters that are set based on experimental data may also be uncertain or only partially known. A probabilistic framework is generally used to formulate the problem of quantifying the various uncertainties in these complex systems. By definition, the outcomes of interest that correspond to high-impact rare and extreme events reside in the tails of the probability distribution of the associated event space. Fully characterizing the tails requires resolving high-dimensional integrals over irregular domains. The most commonly used method to determine the probability of rare and extreme events is Monte Carlo simulation (MCS). Computing rare-event probabilities via MCS involves generating many samples of the random variable and calculating the fraction of the samples that produce the outcome of interest. For small probabilities, however, this process is expensive. For example, consider an event whose probability is around \(10^{-3}\) and whose underlying numerical model requires ten minutes per simulation. With MCS, estimating the probability of such an event to an accuracy of \(10\%\) would require roughly two years of serial computation. Hence, alternative methods are needed that are computationally efficient.

Important examples of extreme events are rogue waves in the ocean [14], hurricanes, tornadoes [29], and power outages [3]. The motivation for this work comes from the rising concern surrounding transient security in the presence of uncertain initial conditions, identified by the North American Electric Reliability Corporation in connection with its long-term reliability assessment [32]. The problem can be mathematically formulated as a dynamical system with uncertain initial conditions. In this paper, the aim is to compute the extreme excursion probability: the probability that the transient due to a sudden malfunction exceeds preset safety limits. Typically, the target safe-limit exceedance probabilities are in the range \(10^{-4}\)–\(10^{-5}\). We note that the same formulation is applicable in other applications such as data assimilation, which is used extensively for medium- to long-term weather forecasting. For example, one can potentially use the formulation in this paper to determine the likelihood of temperature levels at a location exceeding certain thresholds or the likelihood of precipitation levels exceeding safe levels in a certain area.

In [25], we presented an algorithm that uses ideas from excursion probability theory to evaluate the probability of extreme events [1]. In particular, we used Rice’s formula [27], which was developed to estimate the average number of upcrossings of a generic stochastic process. Rice’s formula is given by

$$\begin{aligned} \mathbb {E} \left\{ N^{+}_u(0,T) \right\} = \displaystyle \int _{0}^{T} \, \int _{0}^{\infty }\, y \varphi _t(u,y)\, \mathrm {d}y\, \mathrm {d}t\, , \end{aligned}$$
(1)

where the left-hand side denotes the expected number of upcrossings of level u, y is the derivative of the stochastic process in the mean-square sense, and \(\varphi _t(u,y)\) represents the joint probability distribution of the process and its derivative. In this paper, we build on our recent algorithm [25], in which we constructed an importance biasing distribution (IBD) to accelerate the computation of extreme event probabilities. A key step in the algorithm presented in [25] involves solving multiple Bayesian inverse problems, which can be expensive in high dimensions. Here, we propose to use machine-learning-based surrogates to obtain the inverse maps and hence alleviate the computational costs.

1.1 Mathematical Setup and Overview of the Method

The mathematical setup used in this paper consists of a nonlinear dynamical system that is excited by a Gaussian initial state and that results in a non-Gaussian stochastic process. We are interested in estimating the probability of the stochastic process exceeding a preset threshold, particularly when such an exceedance is a rare event. Rare events typically lie in the tails of the underlying event distribution. To characterize the tail of the resulting stochastic process, we use ideas from the theory of excursion probabilities [1]. Specifically, we use Rice’s formula (1) to estimate the expected number of upcrossings of a stochastic process. For a description of the mathematically rigorous settings used for the rare event problem, we refer interested readers to [25, §2] and references therein.

Evaluating \(\varphi _t(u,y)\), the joint probability distribution of the stochastic process and its derivative, is central to evaluating the integral in Rice’s formula. However, \(\varphi _t(u,y)\) is analytically computable only for Gaussian processes. Since our setup results in a non-Gaussian stochastic process, we linearize the nonlinear dynamics around the trajectory starting at the mean of the initial state. We thus obtain a Gaussian approximation to the distribution of the system trajectories. In [25], we solve a sequence of Bayesian inverse problems to determine a biasing distribution that accelerates the convergence of the probability estimates. For high-dimensional problems, however, solving multiple Bayesian inverse problems can be expensive. In this work, we propose to replace the solution of multiple Bayesian inverse problems with machine-learning-based surrogates to alleviate the computational burden.

1.2 Organization

The rest of the paper is organized as follows. In Sect. 2 we review the existing literature for estimating rare event probabilities. In Sect. 3 we reformulate the problem of determining the IBD as a Bayesian inference problem, and in Sect. 4 we develop a machine-learning-based surrogate to approximate the solution to the Bayesian inference problem. In Sect. 5 we demonstrate this methodology on a simple nonlinear dynamical system excited by a Gaussian distribution. In Sect. 6 we present our conclusions and potential future research directions.

2 Existing Literature

2.1 Monte Carlo and Importance Sampling

Most of the existing methods to compute the probabilities of rare events use MCS directly or indirectly. The MCS approach was developed by Metropolis and his collaborators to solve problems in mathematical physics [22]. Since then, it has been used in a variety of applications [21, 28]. When evaluating rare event probabilities, the MCS method counts the fraction of the random samples that cause the rare event. For a small probability P of the underlying event, the number of samples required to obtain an accuracy of \(\epsilon \ll 1\) is \(\mathcal {O}(\epsilon ^{-2} P^{-1})\). Hence MCS becomes impractical for estimating rare event probabilities.
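
To make this cost concrete, the following minimal Python sketch illustrates the MCS estimator and the \(\mathcal {O}(\epsilon ^{-2} P^{-1})\) sample requirement. The toy event (a standard Gaussian tail) is a hypothetical stand-in for an expensive model, not an example from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mcs_estimate(sample_event, n):
    """Plain Monte Carlo: fraction of n i.i.d. samples that trigger the event."""
    hits = sum(sample_event(rng) for _ in range(n))
    return hits / n

# Toy stand-in for an expensive model: the event {Z > 3.1} for Z ~ N(0, 1)
# has probability ~1e-3, mimicking the rare-event regime discussed above.
event = lambda rng: rng.standard_normal() > 3.1

p_hat = mcs_estimate(event, n=100_000)
print(f"MCS estimate: {p_hat:.2e}")

# Samples needed for relative accuracy eps scale as O(eps^-2 * P^-1):
eps, P = 0.1, 1e-3
print(f"Samples for {eps:.0%} accuracy: ~{eps**-2 / P:.0e}")
```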

A popular sampling technique employed to compute rare event probabilities is importance sampling (IS). IS is a variance reduction technique developed in the 1950s [17] that estimates the quantity of interest by constructing estimators with smaller variance than MCS. In MCS, most of the samples do not result in the rare event and hence do not contribute to the probability calculation. IS, instead, uses problem-specific information to construct an IBD; computing the rare event probability by sampling from the IBD requires far fewer samples. Based on this idea, several techniques for constructing IBDs have been developed [8]. For a more detailed treatment of IS, we direct interested readers to [2, 13]. One of the major challenges of importance sampling is constructing an IBD that results in a low-variance estimator, and the approach may be inefficient for high-dimensional problems [19]. A more detailed description of MCS and IS in the context of rare events can be found in [25, §2] and references therein.
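
As a toy illustration of the variance-reduction idea (not the IBD construction developed in this paper), the sketch below estimates the same Gaussian-tail probability with a hand-picked shifted-Gaussian biasing density; the shift m = u is an illustrative choice, and an optimized shift would do better.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
u = 3.1                     # excursion level; P(Z > u) ~ 1e-3 for Z ~ N(0, 1)

# Biasing density: shift the nominal N(0, 1) so samples land near the event.
m = u                       # illustrative choice of biasing mean
z = rng.normal(loc=m, size=10_000)

# Importance weights: nominal density over biasing density.
w = norm.pdf(z) / norm.pdf(z, loc=m)
p_is = np.mean((z > u) * w)

print(f"IS estimate: {p_is:.2e}  (exact: {norm.sf(u):.2e})")
```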

2.2 Nested Subset Methods

Other methods use the notion of conditional probability over a sequence of nested subsets of the probability space of interest. For example, one can start with the entire probability space and progressively shrink to the region that corresponds to the rare event, using conditional probability to factorize the event of interest as a product of conditional events. Subset simulation (SS) [4] and splitting methods [16] build on this idea. Several modifications and improvements have been proposed to both SS [6, 9, 10, 18, 33] and splitting methods [5, 7]. Evaluating the conditional probabilities forms a major portion of the computational load, and computing the conditional probabilities for different nested subsets concurrently is nontrivial.

2.3 Other Approaches

Large deviation theory (LDT) is an efficient approach for estimating rare events in cases where the event space of interest is dominated by a few elements, such as rogue waves of a certain height. LDT has also been used to estimate the probabilities of extreme events in dynamical systems with random components [11, 12]. A sequential sampling strategy has been used to compute extreme event statistics [23, 24].

3 A Bayesian Inference Formulation to Construct IBD

Most of the work in this section is a review of our approach described in [25, §3]. Here we reformulate the problem of constructing an IBD as a sequence of Bayesian inverse problems. Consider the following dynamical system,

$$\begin{aligned} \mathbf {x}'&= f(t,\mathbf {x})\,, \quad t \in [0, T] \\ \mathbf {x}(0)&= \mathbf {x}_0\,, \quad \mathbf {x}_0 \sim p\,, \quad \mathbf {x}\in \varOmega \,, \nonumber \end{aligned}$$
(2)

where \(\mathbf {x}_0\), the initial state of the system, is uncertain and has probability distribution p. The problem of interest is to estimate the probability that \(\mathbf {c}^{\top } \mathbf {x}(t)\), where \(\mathbf {c}\) is a canonical basis vector, exceeds the level u for \(t\in [0, T]\). That is, we seek to estimate the excursion probability

$$\begin{aligned} P_{T}(u) = \mathbb {P}\left\{ \displaystyle \sup _{t \in [0, T]} \mathbf {c}^{\top } \mathbf {x}(t, \mathbf {x}_0) \ge u \right\} \,, \end{aligned}$$
(3)

where \(\mathbf {x}(t, \mathbf {x}_0)\) represents the solution of the dynamical system (2) for a given initial condition \(\mathbf {x}_0\). We note that

$$\begin{aligned} P_{T}(u) = \mu (\varOmega (u))\,, \end{aligned}$$
(4)

where \(\mu \) is the probability measure associated with the initial distribution p and \(\varOmega (u) \subset \varOmega \) represents the excursion set

$$\begin{aligned} \varOmega (u) = \left\{ \mathbf {x}_0 \in \varOmega \, : \, \displaystyle \sup _{t \in [0, T]} \mathbf {c}^{\top } \mathbf {x}(t, \mathbf {x}_0) \ge u \right\} \,. \end{aligned}$$
(5)

Hence, estimating \(\varOmega (u)\) will help us in estimating the excursion probability \(P_T(u)\). In general, however, estimating the excursion set \(\varOmega (u)\) analytically is difficult. Rice’s formula (1) gives us insight into the excursion set and can be used to construct an approximation to it.

Recall that in Rice’s formula (1), \(\varphi _t(u,y)\,\) represents the joint probability density of \(\mathbf {c}^{\top }\mathbf {x}\) and its derivative \(\mathbf {c}^{\top }\mathbf {x}'\) for an excursion level u. The right-hand side of (1) can be interpreted as the summation over all times and slopes at which an excursion occurs. One can sample from \(y \varphi _t(u,y)\) to obtain a slope-time pair \((y_i, t_i)\) at which the sample paths of the stochastic process cause an excursion. Now consider the map \(\mathcal {G}: \mathbb {R}^{d} \times [0,T] \rightarrow \mathbb {R}^2\) that evaluates the vector \(\displaystyle \begin{bmatrix}\mathbf {c}^{\top }\mathbf {x}(t) \\ \mathbf {c}^{\top }\mathbf {x}'(t) \end{bmatrix}\) based on the dynamical system (2), given an initial state \(\mathbf {x}_0\) and a time t. By definition of the excursion set \(\varOmega (u)\), there exists an element \(\mathbf {x}_i \in \varOmega (u)\) that satisfies the following relationship,

$$\begin{aligned} \mathcal {G}(\mathbf {x}_i, t_i) = \displaystyle \begin{bmatrix} u + \varepsilon _i\\ y_i \end{bmatrix} \,, \end{aligned}$$
(6)

where \(\varepsilon _i > 0\). We can use this insight to construct an approximation of \(\varOmega (u)\) by constructing the preimages of multiple slope-time pairs. Observe that the problem of finding the preimage of a sample \((y_i,t_i)\) is ill-posed, since there could be multiple \(\mathbf {x}_i\)’s that map to \(\begin{bmatrix} u + \varepsilon _i\\ y_i \end{bmatrix}\) at \(t_i\) via the operator \(\mathcal {G}\). We define the set

$$\begin{aligned} X_i = \left\{ \mathbf {x}_0 \in \varOmega \, : \, \mathcal {G}(\mathbf {x}_0, t_i) = \begin{bmatrix} u + \varepsilon _i \\ y_i \end{bmatrix} \right\} \,, \end{aligned}$$
(7)

and an approximation \(\widehat{\varOmega }(u)\) to \(\varOmega (u)\) can be written as

$$\begin{aligned} \widehat{\varOmega }(u) = \bigcup _{i=1}^{N} X_i \,. \end{aligned}$$
(8)

Note that the approximation (8) improves as we increase N. For a discussion on the choice of \(\varepsilon _i\), we refer interested readers to [25, §3.3].
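
For concreteness, the forward map \(\mathcal {G}\) of Eq. (6) can be sketched as below; the ODE solver, tolerances, and function names are our illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.integrate import solve_ivp

def forward_map(f, c, x0, t):
    """Evaluate G(x0, t) = [c^T x(t), c^T x'(t)] by integrating x' = f(t, x) in (2).

    f  : callable f(t, x) defining the dynamics
    c  : canonical basis vector selecting the monitored state
    x0 : initial condition; t : time at which state and slope are read off
    """
    sol = solve_ivp(f, (0.0, t), np.asarray(x0, dtype=float),
                    dense_output=True, rtol=1e-8, atol=1e-10)
    x_t = sol.sol(t)
    return np.array([c @ x_t, c @ np.asarray(f(t, x_t))])
```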

The underlying computational framework to approximate \(\widehat{\varOmega }(u)\) consists of the following stages:

  • Draw samples from the unnormalized density \(y \varphi _t(u,y)\).

  • Find the preimages of these samples to approximate \(\varOmega (u)\).

We use MCMC to draw samples from the unnormalized density \(y \varphi _t(u,y)\,\). We note that irrespective of the size of the dynamical system, \(y \varphi _t(u,y)\,\) represents an unnormalized density in two dimensions; hence, MCMC is an effective means to draw samples from it. A generic sampler is sketched below. Drawing samples from \(y \varphi _t(u,y)\,\) requires evaluating it repeatedly; in the following section we discuss the means to do so.
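
A plain random-walk Metropolis sampler suffices to illustrate this step (the experiments in Sect. 5 use the more sophisticated DRAM sampler); the step size, burn-in, and names below are illustrative assumptions.

```python
import numpy as np

def metropolis_2d(log_density, x_init, n_samples, step=0.2, burn_in=1000, seed=0):
    """Random-walk Metropolis for an unnormalized 2-D density over (t, y).

    log_density should return -inf outside the support (e.g., y <= 0 or
    t outside [0, T]); x_init must be a feasible starting point.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x_init, dtype=float)
    lp = log_density(x)
    chain = []
    for i in range(n_samples + burn_in):
        proposal = x + step * rng.standard_normal(2)
        lp_prop = log_density(proposal)
        if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis accept/reject
            x, lp = proposal, lp_prop
        if i >= burn_in:
            chain.append(x.copy())
    return np.array(chain)
```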

3.1 Evaluating \(y \varphi _t(u,y)\,\)

We note that \(y \varphi _t(u,y)\,\) can be evaluated analytically only in special cases. Specifically, when the underlying process is Gaussian, the joint density function \(\varphi _t(u,y)\) is analytically computable. Consider the dynamical system described by (2). When p is Gaussian and f is linear, we have

$$\begin{aligned} \mathbf {x}' = A\, \mathbf {x}(t) + b\,, \quad \mathbf {x}(t_0) = \mathbf {x}_0\,, \quad \mathbf {x}_0 \sim \mathcal {N}(\overline{\mathbf {x}}_0, \varSigma )\,. \end{aligned}$$
(9)

Assuming A is invertible, \(\mathbf {x}(t)\) can be written as

$$\begin{aligned} \mathbf {x}(t) = \exp (A(t-t_0)) \, \mathbf {x}_0 - \left( I - \exp (A(t-t_0))\right) A^{-1} b\,, \end{aligned}$$
(10)

where I represents an identity matrix of the appropriate size. Given that \(\mathbf {x}_0\) is normally distributed, it follows that \(\mathbf {x}(t)\) is a Gaussian process:

$$\begin{aligned}&\mathbf {x}(t) \sim \mathcal {GP} \left( \overline{\mathbf {x}}, \mathrm{cov}_{\mathbf {x}} \right) \,, \text { where }\\&\nonumber \overline{\mathbf {x}} = \exp (A(t-t_0)) \overline{\mathbf {x}}_0 - \left( I - \exp (A(t-t_0))\right) A^{-1} b \, \text { and } \\&\nonumber \mathrm{cov}_{\mathbf {x}} = \exp (A(t-t_0)) \varSigma \left( \exp (A(t-t_0))\right) ^\top \,. \end{aligned}$$
(11)

The joint probability density function (PDF) of \(\mathbf {c}^{\top }\mathbf {x}(t)\) and \(\mathbf {c}^{\top }\mathbf {x}'(t)\) is given by [26, equation 9.1]

$$\begin{aligned} \begin{bmatrix} \mathbf {c}^{\top }\mathbf {x}\\ \mathbf {c}^{\top }\mathbf {x}' \end{bmatrix} \sim \mathcal {GP}\left( \overline{\mathbf {x}}^{\varphi }, \begin{bmatrix} \mathbf {c}^{\top }\varPhi \mathbf {c} & \mathbf {c}^{\top }\varPhi A^{\top }\mathbf {c} \\ \mathbf {c}^{\top }A \varPhi ^{\top }\mathbf {c} & \mathbf {c}^{\top }A\varPhi A^{\top }\mathbf {c} \end{bmatrix} \right) \,, \end{aligned}$$
(12)

where \(\varPhi = \mathrm {cov}_{\mathbf {x}}\) as in (11) and

$$\begin{aligned} \overline{\mathbf {x}}^{\varphi } = \begin{bmatrix} \mathbf {c}^{\top }\overline{\mathbf {x}} \\ \mathbf {c}^{\top }\left( A\,\overline{\mathbf {x}} + b\right) \end{bmatrix} \,. \end{aligned}$$

We can now evaluate \(y \varphi _t(u,y)\,\) for arbitrary values of \(u_i\), \(y_i\), and \(t_i\) as

$$\begin{aligned} y_i \varphi _{t_i}(u_i,y_i) = \frac{y_i}{2 \pi \mid \varUpsilon \mid ^{1/2}}\exp \left( -\frac{1}{2} \left\| \begin{bmatrix} u_i \\ y_i \end{bmatrix} - \overline{\mathbf {x}}^{\varphi }\right\| ^2_{\varUpsilon ^{-1}}\right) \,, \end{aligned}$$
(13)

where \(\varUpsilon \) denotes the covariance matrix in (12) and \(\mid \varUpsilon \mid \) denotes its determinant. Note that the right-hand side of (13) depends on \(t_i\) through \(\varUpsilon \) and \(\overline{\mathbf {x}}^{\varphi }\).
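
The sketch below transcribes Eqs. (10)-(13) directly, assuming \(t_0 = 0\) and an invertible A as in the text; the function name and argument layout are our own.

```python
import numpy as np
from scipy.linalg import expm

def y_phi(t, u, y, A, b, c, x0_bar, Sigma):
    """Evaluate y * phi_t(u, y) for the linear system (9), following (10)-(13)."""
    E = expm(A * t)                                   # exp(A(t - t0)), t0 = 0
    x_bar = E @ x0_bar - (np.eye(A.shape[0]) - E) @ np.linalg.solve(A, b)
    Phi = E @ Sigma @ E.T                             # cov_x from Eq. (11)
    # Mean and covariance of [c^T x, c^T x'] as in Eq. (12):
    mean = np.array([c @ x_bar, c @ (A @ x_bar + b)])
    Ups = np.array([[c @ Phi @ c,     c @ Phi @ A.T @ c],
                    [c @ A @ Phi @ c, c @ A @ Phi @ A.T @ c]])
    r = np.array([u, y]) - mean
    quad = r @ np.linalg.solve(Ups, r)                # Mahalanobis quadratic form
    return y * np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(Ups)))
```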

3.2 Notes for Nonlinear f

When f is nonlinear, \(y \varphi _t(u,y)\,\), a key ingredient of our computational procedure, cannot be computed analytically. We therefore approximate the nonlinear dynamics by linearizing f around the mean of the initial distribution. Assuming that the initial state of the system is normally distributed as described by Eq. (9), linearizing around the mean of the initial state gives

$$\begin{aligned} \mathbf {x}' \approx \mathbf {F} \cdot (\mathbf {x}- \overline{\mathbf {x}}_0) + f(0, \overline{\mathbf {x}}_0) \,, \end{aligned}$$
(14)

where \(\mathbf {F}\) represents the Jacobian of f at \(t=0\) and \(\mathbf {x}=\overline{\mathbf {x}}_0\); this reduces the nonlinear dynamical system to a form that is similar to Eq. (9). Thus, we can now use Eqs. (11), (12), and (13) to approximate \(y \varphi _t(u,y)\,\) for nonlinear f.
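
When an analytic Jacobian is unavailable, a finite-difference approximation such as the hypothetical helper below can stand in; an analytic Jacobian is preferable when it can be derived.

```python
import numpy as np

def linearize(f, x0_bar, eps=1e-6):
    """Hypothetical helper: finite-difference Jacobian of f at (t=0, x=x0_bar).

    Returns F and b such that the linearized dynamics read x' ~ F x + b,
    matching the form of Eq. (9), with b = f(0, x0_bar) - F @ x0_bar.
    """
    d = len(x0_bar)
    f0 = np.asarray(f(0.0, x0_bar))
    F = np.zeros((d, d))
    for j in range(d):
        dx = np.zeros(d)
        dx[j] = eps
        F[:, j] = (np.asarray(f(0.0, x0_bar + dx)) - f0) / eps
    return F, f0 - F @ x0_bar
```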

4 Machine-Learned Inverse Maps

In [25] we formulated the problem of determining preimages (7) as a Bayesian inverse problem. However, solving multiple Bayesian inverse problems can be expensive. Hence we approximated our IBD by using the solutions of a small number of Bayesian inverse problems. In this section we build a simple data-driven surrogate for approximating the preimages \(X_i\) described in Eq. (7). Using the surrogate, we can approximate the preimages of several \(\mathbf {y}_i\)’s obtained by sampling from \(y\varphi _t(u,y)\). The surrogate developed here approximates the inverse of the map defined in Eq. (6). To that end, we wish to approximate the map

$$\begin{aligned} \mathcal {G}^{-1} : \mathbb {R}^{2} \rightarrow \mathbb {R}^d, \end{aligned}$$
(15)

where the input space corresponds to \((u+\varepsilon _i, y)\mid _{t_i}\) and the output lives in the domain of the state space (\(\varOmega \) here). This is equivalent to augmenting \(t_i\) as an additional input variable and building a surrogate that maps from \(\mathbb {R}^{3}\) to \(\mathbb {R}^{d}\). We use a fully connected deep neural network to approximate this map. A one-layer neural network can be expressed as

$$\begin{aligned} \xi _{j}= F\left( \sum _{\ell =1}^{L} c_{j}^{\ell } x_{\ell }+\epsilon _{j}\right) , \end{aligned}$$
(16)

where F is a differentiable activation function that imparts nonlinearity to this transformation; L is the input dimension of the incoming signal; J is the number of hidden-layer neurons (in machine learning terminology); \(c_j^{\ell }\) are the entries of the weight matrix in \(\mathbb {R}^{J \times L}\); \(\epsilon _j \in \mathbb {R}^J\) are the biases; and \(\xi _j\) is the postactivation value of neuron j in the hidden layer, which may be matched to targets available from data or “fed forward” into subsequent maps. In practice, multiple compositions of this map are used to obtain very expressive nonlinear function approximators, called deep neural networks. For the nonlinear activation, we use

$$\begin{aligned} F(\xi ) = \text {max}(\xi ,0), \end{aligned}$$
(17)

for all hidden-layer activations. We compose three maps of the form of Eq. (16) to obtain the approximation of \(\mathcal {G}^{-1}\). Two of these submaps have J fixed at 256, and the final transformation utilizes \(J=3\). We note that the function F for the final transformation is the identity, as is common in machine learning algorithms. A schematic of this network architecture is shown in Fig. 1, and a code sketch follows this paragraph. The trainable parameters (\(c_j^{\ell }\) and \(\epsilon _j\) for each transformation) are optimized with backpropagation [30], an adjoint calculation technique that obtains gradients of the loss function with respect to these parameters. A stochastic gradient optimization technique, ADAM, is used to update these parameters [20] with a learning rate of 0.001. Our loss function is the \(L_2\)-distance between the network prediction and the targets (i.e., the mean-squared error). Our network also incorporates a regularization strategy called dropout [31], which randomly switches off units \(\xi _j\) during forward propagation (here we use a dropout probability of 0.1). This avoids memorization of the data while allowing effective exploration of a complex nonconvex loss surface.
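
A minimal sketch of this architecture in modern tf.keras follows (the experiments in Sect. 5.1 used TensorFlow 1.14); the dropout placement, the validation split, and the output width d are our assumptions, since the text reports \(J=3\) for the final layer.

```python
import tensorflow as tf

d = 2  # state dimension; d = 2 for the predator-prey example in Sect. 5

# Inputs (u + eps_i, y_i, t_i) in R^3; outputs: approximate preimages in R^d.
# Dropout placement after each hidden layer is our assumption.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(3,)),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(d),   # identity activation; the text reports J = 3 here
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")

# Training as described below (500 epochs, batch size 256; split fraction assumed):
# model.fit(inputs, targets, epochs=500, batch_size=256, validation_split=0.1)
```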

Fig. 1.

Schematic of our neural network architecture. Note that the number of hidden-layer units shown is not representative; this study uses 256 such units.

Fig. 2.

Convergence of training for our network. Note how both training and validation losses diminish in magnitude concurrently.

Fig. 3.

Scatter plots between true and predicted quantities of the inverse map for dimension 1 (left) and dimension 2 (right). These results are from unseen data.

Our map is trained for 500 epochs with a batch size of 256; in other words, a weight update is performed after the loss is computed for 256 samples, and an epoch is completed when the entire data set has been used for gradient updates. During the network training, we set aside a random subset of the data for validation; losses calculated from this subset are used only to monitor how well the framework generalizes to unseen data. These losses are plotted in Fig. 2, where one can see that the training and validation losses are reduced to comparable magnitudes. Figure 3 shows scatter plots for the validation data set, where good agreement between the true and predicted quantities can be seen. We may now use this map to approximate the IBD.

4.1 Using the Machine-Learned Inverse Map to Construct IBD

The following procedure is used to construct the IBD (a code sketch of steps 4-6 follows the list).

  1. Obtain different realizations of the initial conditions of the dynamical system by sampling from the initial PDF p.

  2. Use \(\mathcal {G}\) to obtain the forward maps of these realizations.

  3. Use the forward maps and the corresponding random realizations of the initial conditions to train the inverse map \(\mathcal {G}^{-1}\).

  4. Apply the trained inverse map to samples generated from \(y \varphi _t(u,y)\) to obtain the approximate preimages of the samples \(\mathbf {y}_i\).

  5. Use a Gaussian approximation of these preimages as the IBD. Assume that this Gaussian approximation has PDF \(p^\mathrm{IBD}\).

  6. Sample from the IBD, and use importance sampling to estimate the probabilities.
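
A sketch of steps 4-6, assuming a trained surrogate such as the tf.keras model above and an array of MCMC samples from \(y\varphi _t(u,y)\); names are placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

def build_ibd(inverse_map, mcmc_samples):
    """Steps 4-5: push MCMC samples of (u + eps_i, y_i, t_i) through the
    trained inverse-map surrogate and fit a moment-matched Gaussian IBD.

    inverse_map  : trained surrogate, e.g. model.predict from the sketch above
    mcmc_samples : array of shape (N, 3) drawn from y * phi_t(u, y)
    """
    preimages = inverse_map(mcmc_samples)          # approximate preimages in R^d
    mu = preimages.mean(axis=0)
    cov = np.cov(preimages, rowvar=False)
    return multivariate_normal(mean=mu, cov=cov)   # p_IBD

# Step 6: draw initial conditions from the IBD for importance sampling.
# x0_samples = build_ibd(model.predict, samples).rvs(size=10_000)
```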

4.2 Using IBD to Estimate Rare Event Probability

We can now estimate \(P_T(u)\) using the IBD as follows:

$$\begin{aligned} P_T^\mathrm{IS}(u)(\widehat{\mathbf {x}}_0^1, \ldots , \widehat{\mathbf {x}}_0^M) = \frac{1}{M} \sum _{i=1}^M\,\mathbb {I}(\widehat{\mathbf {x}}_0^i)\psi (\widehat{\mathbf {x}}_0^i)\,, \end{aligned}$$
(18)

where \(\widehat{\mathbf {x}}_0^1, \ldots , \widehat{\mathbf {x}}_0^M\) are sampled from the biasing distribution \(p^\mathrm{IBD}\) and \(\mathbb {I}(\widehat{\mathbf {x}}_0^i)\) represents the indicator function given by

$$\begin{aligned} \mathbb {I}(\widehat{\mathbf {x}}_0^i) = {\left\{ \begin{array}{ll} 1\,, \displaystyle \qquad \sup _{0 \le t \le T} \mathbf {c}^{\top }\mathbf {x}(t, \widehat{\mathbf {x}}_0^i) \ge u\,,\\ 0\,, \displaystyle \qquad \sup _{0 \le t \le T} \mathbf {c}^{\top }\mathbf {x}(t, \widehat{\mathbf {x}}_0^i) < u\,. \end{array}\right. } \end{aligned}$$
(19)

Also, \(\psi (\widehat{\mathbf {x}}_0^i)\) represents the importance weights. The importance weight for an arbitrary \(\widehat{\mathbf {x}}_0^i\) is given by

$$\begin{aligned} \displaystyle \psi (\widehat{\mathbf {x}}_0^i) = \frac{p(\widehat{\mathbf {x}}_0^i)}{p^\mathrm{IBD}(\widehat{\mathbf {x}}_0^i)}\,. \end{aligned}$$
(20)
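
A direct transcription of the estimator (18) with indicator (19) and weights (20) is sketched below; the function and argument names are placeholders. The densities p_pdf and ibd_pdf can be, for instance, the .pdf methods of scipy.stats.multivariate_normal objects.

```python
import numpy as np

def is_estimate(simulate_sup, p_pdf, ibd_pdf, x0_samples, u):
    """Importance-sampling estimator (18), with indicator (19) and weights (20).

    simulate_sup : maps an initial condition x0 to sup_{0<=t<=T} c^T x(t, x0)
    p_pdf        : nominal initial-condition density p (vectorized over rows)
    ibd_pdf      : biasing density p_IBD (vectorized over rows)
    x0_samples   : (M, d) array of initial conditions drawn from p_IBD
    """
    sups = np.array([simulate_sup(x0) for x0 in x0_samples])
    indicator = (sups >= u).astype(float)                 # Eq. (19)
    weights = p_pdf(x0_samples) / ibd_pdf(x0_samples)     # Eq. (20)
    return float(np.mean(indicator * weights))            # Eq. (18)
```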

5 Numerical Experiments

We demonstrate the application of the procedure described in Sect. 3 and Sect. 4 on a nonlinear dynamical system excited by a Gaussian distribution. We use the Lotka-Volterra equations as a test problem. These equations, also known as the predator-prey equations, are a pair of first-order nonlinear differential equations used to describe the dynamics of biological systems in which two species interact, one as predator and the other as prey. The populations change through time according to the following pair of equations,

$$\begin{aligned} \displaystyle \frac{dx_1}{dt} = \alpha x_1 - \beta x_1x_2 \,, \\ \nonumber \displaystyle \frac{dx_2}{dt} = \delta x_1x_2 - \gamma x_2\,, \end{aligned}$$
(21)

where \(x_1\) is the number of prey, \(x_2\) is the number of predators, and \(\displaystyle \frac{dx_1}{dt}\) and \(\displaystyle \frac{dx_2}{dt}\) represent the instantaneous growth rates of the two populations. We assume that the initial state of the system at time \(t=0\) is a random variable that is normally distributed:

$$\mathbf {x}(0) \sim \mathcal {N}\left( \begin{bmatrix}10 \\ 10 \end{bmatrix}, 0.8\times I_2\right) ,$$

and we are interested in estimating the probability of the event \(P(\mathbf {c}^{\top }\mathbf {x}\ge u)\), where \(\mathbf {c} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}\), \(t \in [0,10]\), and \(u = 17\). The first step of our solution procedure involves sampling from \(y \varphi _t(u,y)\) to generate observations \(\mathbf {y}_i\). We linearize the dynamical system about the mean of the distribution of \(\mathbf {x}_0\) (Eq. (14); see the code sketch following this discussion) and express \(\varphi _t(u,y)\) as a function of t and y as described by Eq. (12). We compute \(y \varphi _t(u,y)\) as shown in Eq. (13). We use the delayed rejection adaptive Metropolis (DRAM) Markov chain Monte Carlo (MCMC) method to generate samples from \(y \varphi _t(u,y)\). (For more details about DRAM, see [15].) To minimize the effect of the initial guess on the posterior inference, we use a burn-in of 1,000 samples. Figure 4 shows the contours of \(y \varphi _t(u,y)\) and samples drawn from it by using DRAM MCMC. In [25] we then solved the Bayesian inverse problem, using both MCMC and a Laplace approximation at the MAP point, to construct a distribution that approximately maps to the likelihood constructed around \(\mathbf {y}_i\). Here, we replace the solution of the Bayesian inverse problem with the machine-learned inverse map described in Sect. 4. Multiple samples generated from \(y\varphi _t(u,y)\) are used to construct the IBD, as described in Sect. 4.1, and the IBD is used to estimate \(P_T(u)\), as explained in Sect. 4.2.

Figure 5 compares the results of conventional MCS and machine-learning-based importance sampling (ML-based IS). Note that we use an MCS estimate with 10 million samples as a proxy for the true probability; the "true" probability is \(3.28\times 10^{-5}\). ML-based IS gives fairly good estimates even with a small number of model evaluations, and when the training data set is large enough, the improvements are dramatic. Notice that for a true probability on the order of \(10^{-5}\) we obtain an estimate with a relative error of less than 1%, and that our method achieves the same (or better) accuracy as MCS at roughly one-hundredth of the computational cost. The convergence with just 5,000 training samples is acceptable, and the results improve dramatically for 10,000 and 20,000 training samples. We believe the results could be improved further by using a Gaussian mixture to represent the IBD instead of a simple Gaussian approximation.
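
The problem-specific ingredients for this example are the Lotka-Volterra right-hand side (21) and its analytic Jacobian, used for the linearization in Eq. (14). The rate constants below are illustrative placeholders, since the values used in the experiments are not restated here.

```python
import numpy as np

# Lotka-Volterra right-hand side (21); rate constants are illustrative.
alpha, beta, delta, gamma = 1.0, 0.1, 0.075, 1.5

def f(t, x):
    x1, x2 = x
    return np.array([alpha * x1 - beta * x1 * x2,
                     delta * x1 * x2 - gamma * x2])

def jacobian(x):
    """Analytic Jacobian of (21), used for the linearization in Eq. (14)."""
    x1, x2 = x
    return np.array([[alpha - beta * x2, -beta * x1],
                     [delta * x2, delta * x1 - gamma]])

x0_bar = np.array([10.0, 10.0])   # mean of the initial state x(0)
F = jacobian(x0_bar)              # linearize about the mean, as in Eq. (14)
b = f(0.0, x0_bar) - F @ x0_bar   # affine offset for the linearized system (9)
```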

Fig. 4.

Contours of \(y\varphi _t(u,y)\) and samples drawn from it using DRAM MCMC.

Fig. 5.

Comparison between conventional MCS and ML-based IS. We observe that even with a small amount of training data we obtain fairly accurate estimates; as the training data increases, the accuracy improves dramatically.

5.1 Computational Cost

In Fig. 5 we have not included the costs associated with generating the training data, training the network, or evaluating the inverse map, since these costs are almost negligible compared with the overall cost. Note that generating 20,000 training samples is approximately equivalent to 400 model evaluations (a single model evaluation yields the state and slope at 50 different times, each of which can be used as a training sample). The training of the ML framework for this problem required very little compute time. Training was executed on an 8th-generation Intel Core i7 machine with Python 3.6.8 and TensorFlow 1.14 and took less than 180 s for 20,000 samples (equivalent in cost to fewer than 50 model evaluations). Inference for 20,000 prediction points took less than 2 s on average.

6 Conclusions and Future Work

In this work we developed an ML-based IS method to estimate rare event probabilities, and we demonstrated the algorithm on the predator-prey system. The method builds on the approach in [25] and replaces the expensive Bayesian inference with a machine-learning-based surrogate. This approach yields fairly accurate estimates of the probabilities and, for a given accuracy, requires at least three orders of magnitude less computational effort than traditional MCS. In the future, we aim to test this algorithm on larger problems and to use an active-learning-based approach to pick the training samples. Scaling this algorithm to high dimensions (say, \(\mathcal {O}(1000)\)) could be challenging; to address this, we will draw on state-of-the-art techniques developed by the machine learning and deep learning communities.