# A Machine-Learning-Based Importance Sampling Method to Compute Rare Event Probabilities


## Abstract

We develop a novel computational method for evaluating the extreme excursion probabilities arising from random initialization of nonlinear dynamical systems. The method uses excursion probability theory to formulate a sequence of Bayesian inverse problems that, when solved, yields the biasing distribution. Solving multiple Bayesian inverse problems can be expensive; more so in higher dimensions. To alleviate the computational cost, we build machine-learning-based surrogates to solve the Bayesian inverse problems that give rise to the biasing distribution. This biasing distribution can then be used in an importance sampling procedure to estimate the extreme excursion probabilities.

## Keywords

Machine learning · Rice's formula · Gaussian processes

## 1 Motivation

Characterizing high-impact rare and extreme events such as hurricanes, tornadoes, and cascading power failures is of great social and economic importance. Many of these natural phenomena and engineering systems can be modeled by using dynamical systems. The models representing these complex phenomena are approximate and have many sources of uncertainty. For example, the exact initial and boundary conditions or the external forcings that are necessary to fully define the underlying model might be unknown. Other parameters that are set based on experimental data may also be uncertain or only partially known. A probabilistic framework is generally used to formulate the problem of quantifying the various uncertainties in these complex systems. By definition, the outcomes of interest that correspond to high-impact rare and extreme events reside in the tails of the probability distribution of the associated event space. Fully characterizing the tails requires resolving high-dimensional integrals over irregular domains. The most commonly used method to determine the probability of rare and extreme events is Monte Carlo simulation (MCS). Computing rare-event probabilities via MCS involves generating many samples of the random variable and calculating the fraction of the samples that produce the outcome of interest. For small probabilities, however, this process is expensive. For example, consider an event whose probability is around \(10^{-3}\) and whose underlying numerical model requires ten minutes per simulation. With MCS, estimating the probability of such an event to an accuracy of \(10\%\) will require about two years of serial computation. Hence, alternative methods are needed that are computationally efficient.
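The cost estimate above can be checked with a quick back-of-the-envelope calculation; the numbers below are the example's assumptions, and the sample-count rule \(\mathcal {O}(\epsilon ^{-2} P^{-1})\) is made precise in Sect. 2.1:

```python
# Back-of-envelope serial cost of MCS for a rare event, using the numbers
# from the example in the text.
P = 1e-3                 # probability of the rare event
eps = 0.1                # target relative accuracy (10%)
minutes_per_run = 10.0   # cost of one simulation

# MCS needs on the order of eps^-2 * P^-1 samples for relative accuracy eps.
n_samples = (1.0 / eps) ** 2 / P
serial_minutes = n_samples * minutes_per_run
serial_years = serial_minutes / (60 * 24 * 365)
print(n_samples, serial_years)   # ~1e5 samples, ~1.9 years
```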

Important examples of extreme events are rogue waves in the ocean [14], hurricanes, tornadoes [29], and power outages [3]. The motivation for this work comes from the rising concern surrounding transient security in the presence of uncertain initial conditions identified by the North American Electric Reliability Corporation in connection with its long-term reliability assessment [32]. The problem can be mathematically formulated as a dynamical system with uncertain initial conditions. In this paper, the aim is to compute the extreme excursion probability: the probability that the transient due to a sudden malfunction exceeds preset safety limits. Typically, the target safety-limit exceedance probabilities are in the range \(10^{-4}\)–\(10^{-5}\). We note that the same formulation is applicable in other applications such as data assimilation, which is used extensively for medium- to long-term weather forecasting. For example, one can potentially use the formulation in this paper to determine the likelihood of temperature levels at a location exceeding certain thresholds or the likelihood of precipitation levels exceeding safe levels in a certain area.

### 1.1 Mathematical Setup and Overview of the Method

The mathematical setup used in this paper consists of a nonlinear dynamical system that is excited by a Gaussian initial state and that results in a non-Gaussian stochastic process. We are interested in estimating the probability of the stochastic process exceeding a preset threshold. Moreover, we wish to estimate the probabilities when the event of the process exceeding the threshold is a *rare event*. Rare events typically lie in the tails of the underlying event distribution. To characterize the tail of the resulting stochastic process, we use ideas from the theory of excursion probabilities [1]. Specifically, we use Rice's formula (1) to estimate the expected number of upcrossings of a stochastic process. For a description of the mathematically rigorous settings used for the rare event problem, we refer interested readers to [25, §2] and references therein.

Evaluating \(\varphi _t(u,y)\), the joint probability distribution of the stochastic process and its derivative, is central to evaluating the integral in Rice's formula. However, \(\varphi _t(u,y)\) is analytically computable only for Gaussian processes. Since our setup results in a non-Gaussian stochastic process, we linearize the nonlinear dynamical system around the trajectory starting at the mean of the initial state. We thus obtain a Gaussian approximation to the system trajectory distribution. In [25], we solve a sequence of Bayesian inverse problems to determine a biasing distribution that accelerates the convergence of the probability estimates. For high-dimensional problems, however, solving multiple Bayesian inverse problems can be expensive. In this work, we propose to replace the multiple solutions of Bayesian inverse problems with machine-learning-based surrogates to alleviate the computational burden.

### 1.2 Organization

The rest of the paper is organized as follows. In Sect. 2 we review the existing literature for estimating rare event probabilities. In Sect. 3 we reformulate the problem of determining the importance biasing distribution (IBD) as a Bayesian inference problem, and in Sect. 4 we develop a machine-learning-based surrogate to approximate the solution of the Bayesian inference problem. In Sect. 5 we demonstrate this methodology on a simple nonlinear dynamical system excited by a Gaussian distribution. In Sect. 6 we present our conclusions and potential future research directions.

## 2 Existing Literature

### 2.1 Monte Carlo and Importance Sampling

Most of the existing methods to compute the probabilities of rare events use MCS directly or indirectly. The MCS approach was developed by Metropolis and his collaborators to solve problems in mathematical physics [22]. Since then, it has been used in a variety of applications [21, 28]. When evaluating rare event probabilities, the MCS method counts the fraction of the random samples that cause the rare event. For a small probability *P* of the underlying event, the number of samples required to obtain an accuracy of \(\epsilon \ll 1\) is \(\mathcal {O}(\epsilon ^{-2} P^{-1})\). Hence MCS becomes impractical for estimating rare event probabilities.
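As a concrete toy illustration of the counting estimator, the sketch below estimates \(P(X > 3) \approx 1.35 \times 10^{-3}\) for a standard normal by brute-force MCS; the event and sample size are illustrative choices, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "rare" event with a known answer: P(X > 3) for X ~ N(0,1) ~ 1.35e-3.
n = 1_000_000
samples = rng.standard_normal(n)
p_hat = np.mean(samples > 3.0)   # fraction of samples causing the event
print(p_hat)
```

Even at this moderate probability, roughly a million samples are needed for a few percent of relative accuracy, in line with the \(\mathcal {O}(\epsilon ^{-2} P^{-1})\) scaling.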

A popular sampling technique that is employed to compute rare event probabilities is importance sampling (IS). IS is a variance reduction technique developed in the 1950s [17] to estimate the quantity of interest by constructing estimators that have smaller variance than MCS. In MCS, simulations from most of the samples do not result in the rare event and hence do not play a part in the probability calculation. IS, instead, uses problem-specific information to construct an IBD; computing the rare event probability by sampling from the IBD requires fewer samples. Based on this idea, several techniques for constructing IBDs have been developed [8]. For a more detailed treatment of IS, we direct interested readers to [2, 13]. One of the major challenges involved with importance sampling is the construction of an IBD that results in a low-variance estimator. We note that the approach may sometimes be inefficient for high-dimensional problems [19]. A more detailed description of MCS and IS in the context of rare events can be found in [25, §2] and references therein.
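A minimal IS sketch, again on a toy problem: to estimate \(P(X > 4) \approx 3.17 \times 10^{-5}\) for \(X \sim \mathcal {N}(0,1)\), we bias the samples with \(\mathcal {N}(4, 1)\) (a standard mean-shift choice for this toy case, not the IBD construction of this paper) and reweight by the density ratio:

```python
import numpy as np

rng = np.random.default_rng(1)
u = 4.0                          # threshold; P(X > 4) ~ 3.17e-5 for N(0,1)

# Biasing distribution: N(u, 1), centered on the rare region.
n = 20_000
z = rng.normal(loc=u, scale=1.0, size=n)

# IS weights: nominal density / biasing density (log form for stability).
log_w = -0.5 * z**2 + 0.5 * (z - u) ** 2
p_hat = np.mean((z > u) * np.exp(log_w))
print(p_hat)
```

With only 20,000 samples this is accurate to a few percent, whereas plain MCS would need on the order of \(10^{7}\) samples for comparable accuracy.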

### 2.2 Nested Subset Methods

Other methods use the notion of conditional probability over a sequence of nested subsets of the probability space of interest. For example, one can start with the entire probability space and progressively shrink to the region that corresponds to the rare event. Furthermore, one can use the notion of conditional probability to factorize the event of interest as a product of conditional events. Subset simulation (SS) [4] and splitting methods [16] build on this idea. Several modifications and improvements have been proposed to both SS [6, 9, 10, 18, 33] and splitting methods [5, 7]. Evaluating the conditional probabilities forms a major portion of the computational load, and computing the conditional probabilities for different nested subsets concurrently is nontrivial.
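The factorization idea can be illustrated on a toy problem: \(P(X>3)\) for a standard normal, written as a product of conditionals over the nested subsets \(\{X>1\} \supset \{X>2\} \supset \{X>3\}\). The sketch below estimates each conditional by simply filtering the surviving samples (so the conditional sample size shrinks at every level); real subset simulation instead regenerates samples inside each subset with MCMC to keep the budget fixed:

```python
import numpy as np

rng = np.random.default_rng(2)

# P(X > 3) for X ~ N(0,1), factored as
# P(X>1) * P(X>2 | X>1) * P(X>3 | X>2) over nested subsets.
levels = [1.0, 2.0, 3.0]
n = 200_000

p = 1.0
samples = rng.standard_normal(n)
for lo in levels:
    accepted = samples[samples > lo]
    p *= accepted.size / samples.size   # conditional probability estimate
    samples = accepted                  # condition on the new subset

print(p)   # ~1.35e-3
```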

### 2.3 Other Approaches

Large deviation theory (LDT) is an efficient approach for estimating rare events in cases when the event space of interest is dominated by few elements such as rogue waves of a certain height. LDT also has been used to estimate the probabilities of extreme events in dynamical systems with random components [11, 12]. A sequential sampling strategy has been used to compute extreme event statistics [23, 24].

## 3 A Bayesian Inference Formulation to Construct the IBD

Consider the nonlinear dynamical system

$$\mathbf {x}'(t) = f(t, \mathbf {x}), \quad \mathbf {x}(0) = \mathbf {x}_0 \sim p, \qquad (2)$$

where the initial state \(\mathbf {x}_0 \in \mathbb {R}^{d}\) is distributed according to the PDF *p*. The problem of interest is to estimate the probability that \( \mathbf {c}^{\top } \mathbf {x}(t)\) exceeds the level *u* for \(t\in [0, T ]\). That is, we seek to estimate the following *excursion probability*,

$$P_T(u) = \mathbb {P}\left( \mathbf {c}^{\top } \mathbf {x}(t, \mathbf {x}_0) \ge u \text{ for some } t \in [0, T] \right),$$

where \(\mathbf {x}(t, \mathbf {x}_0)\) represents the solution of the dynamical system (2) for a given initial condition \(\mathbf {x}_0\). We note that \(P_T(u) = \mathbb {P}\left( \mathbf {x}_0 \in \varOmega (u) \right)\), where

$$\varOmega (u) = \left\{ \mathbf {x}_0 : \mathbf {c}^{\top } \mathbf {x}(t, \mathbf {x}_0) \ge u \text{ for some } t \in [0, T] \right\}$$

is the *excursion set*. Hence, estimating \(\varOmega (u)\) will help us in estimating the excursion probability \(P_T(u)\). In general, however, estimating the excursion set \(\varOmega (u)\) analytically is difficult. Rice's formula (1) gives us insights about the excursion set and can be used to construct an approximation to it.

Rice's formula [27] expresses the expected number of upcrossings of the level *u*:

$$\mathbb {E}\left[ N_u(0, T) \right] = \int _0^T \int _0^{\infty } y \, \varphi _t(u, y) \, \mathrm {d}y \, \mathrm {d}t, \qquad (1)$$

where \(N_u(0,T)\) denotes the number of upcrossings of *u* by the process \(\mathbf {c}^{\top }\mathbf {x}(t)\) in \([0,T]\), and \(\varphi _t(u,y)\) is the joint PDF of the process and its derivative at time *t*. The right-hand side of (1) can be interpreted as the summation of all times and slopes at which an excursion occurs. One can sample from \(y \varphi _t(u,y)\) to obtain a slope-time pair \((y_i, t_i)\) at which the sample paths of the stochastic process cause an excursion. Now consider the map \(\mathcal {G}: \mathbb {R}^{d\times 1} \rightarrow \mathbb {R}^2\) that evaluates the vector \(\displaystyle \begin{bmatrix}\mathbf {c}^{\top }\mathbf {x}(t) \\ \mathbf {c}^{\top }\mathbf {x}'(t) \end{bmatrix}\) based on the dynamical system (2), given an initial state \(\mathbf {x}_0\) and a time *t*. By definition of the excursion set \(\varOmega (u)\), there exists an element \(\mathbf {x}_i \in \varOmega (u)\) that satisfies the following relationship,

$$\mathcal {G}(\mathbf {x}_i) = \mathbf {y}_i + \varepsilon _i, \quad \text{where } \mathbf {y}_i = \begin{bmatrix} u \\ y_i \end{bmatrix}, \quad i = 1, \ldots, N,$$

and \(\varepsilon _i\) accounts for a small mismatch. For a discussion on the choice of \(\varepsilon _i\), we refer interested readers to [25, §3.3].

This suggests the following two-step procedure to approximate the excursion set:

- 1.
Draw samples from the unnormalized density \(y \varphi _t(u,y)\).

- 2.
Find the preimages of these samples to approximate \(\varOmega (u)\).

We use MCMC to draw samples from the unnormalized \(y \varphi _t(u,y)\). We note that irrespective of the size of the dynamical system, \(y \varphi _t(u,y)\) represents an unnormalized density in two dimensions; hence, MCMC is an effective means to draw samples from it. Drawing samples from \(y \varphi _t(u,y)\) requires evaluating it repeatedly, and in the following section we discuss the means to do so.
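A minimal sketch of step 1: random-walk Metropolis on an unnormalized two-dimensional density over \((t, y)\). The density below is made up for illustration (the actual \(y\varphi _t(u,y)\) comes from the linearized dynamics of Sect. 3.1, and Sect. 5 uses DRAM rather than plain Metropolis):

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for the unnormalized integrand y * phi_t(u, y) on (t, y), y > 0:
# a made-up Gaussian in (t, y) times the slope y.
def log_target(t, y):
    if y <= 0.0 or not (0.0 <= t <= 10.0):
        return -np.inf
    return np.log(y) - 0.5 * ((t - 5.0) / 1.5) ** 2 - 0.5 * ((y - 1.0) / 0.5) ** 2

def metropolis(n_steps, step=0.5):
    t, y = 5.0, 1.0                      # valid starting point
    lp = log_target(t, y)
    chain = []
    for _ in range(n_steps):
        t_new = t + step * rng.standard_normal()
        y_new = y + step * rng.standard_normal()
        lp_new = log_target(t_new, y_new)
        if np.log(rng.random()) < lp_new - lp:   # accept/reject
            t, y, lp = t_new, y_new, lp_new
        chain.append((t, y))
    return np.array(chain)

chain = metropolis(20_000)[1_000:]       # discard burn-in, as in Sect. 5
print(chain.mean(axis=0))
```

Only the two-dimensional \((t, y)\) density is sampled here, regardless of the state dimension *d*, which is why this step stays cheap.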

### 3.1 Evaluating \(y \varphi _t(u,y)\)

When the initial PDF *p* is Gaussian and *f* is linear, we have

$$\mathbf {x}'(t) = A \mathbf {x}(t) + \mathbf {b}, \quad \mathbf {x}(0) = \mathbf {x}_0 \sim \mathcal {N}(\overline{\mathbf {x}}_0, \varSigma _0). \qquad (9)$$

Assuming *A* is invertible, \(\mathbf {x}(t)\) can be written as

$$\mathbf {x}(t) = e^{At} \mathbf {x}_0 + A^{-1}\left( e^{At} - I \right) \mathbf {b}, \qquad (10)$$

where *I* represents an identity matrix of the appropriate size. Given that \(\mathbf {x}_0\) is normally distributed, it follows that \(\mathbf {x}(t)\) is a Gaussian process:

$$\mathbf {x}(t) \sim \mathcal {N}\left( e^{At} \overline{\mathbf {x}}_0 + A^{-1}(e^{At} - I)\mathbf {b}, \; e^{At} \varSigma _0 \, e^{A^{\top } t} \right). \qquad (11)$$

Consequently, the pair \(\left( \mathbf {c}^{\top }\mathbf {x}(t), \mathbf {c}^{\top }\mathbf {x}'(t) \right)\) is jointly Gaussian at every *t*, with mean \(\boldsymbol{\mu }(t)\) and covariance \(\mathbf {C}(t)\) that follow from (9) and (11):

$$\begin{bmatrix} \mathbf {c}^{\top }\mathbf {x}(t) \\ \mathbf {c}^{\top }\mathbf {x}'(t) \end{bmatrix} \sim \mathcal {N}\left( \boldsymbol{\mu }(t), \mathbf {C}(t) \right), \qquad (12)$$

so that the integrand in Rice's formula is available in closed form,

$$y \, \varphi _t(u, y) = y \, \mathcal {N}\!\left( \begin{bmatrix} u \\ y \end{bmatrix}; \, \boldsymbol{\mu }(t), \mathbf {C}(t) \right). \qquad (13)$$
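The closed-form Gaussian law of \(\mathbf {x}(t)\) for linear *f* can be sanity-checked numerically. The sketch below uses a toy diagonal *A* and made-up numbers so that \(e^{At}\) is elementwise (for a general *A*, `scipy.linalg.expm` applies), and compares the closed-form mean and covariance against samples pushed through the solution formula:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy linear system x' = A x + b with a Gaussian initial state (all numbers
# made up). A is diagonal so that e^{At} is elementwise.
A = np.diag([-0.5, -1.5])
b = np.array([0.2, 0.0])
x0_mean = np.array([1.0, 0.0])
x0_cov = 0.1 * np.eye(2)

t = 2.0
eAt = np.diag(np.exp(np.diag(A) * t))

# Closed-form Gaussian law of x(t) = e^{At} x0 + A^{-1}(e^{At} - I) b.
shift = np.linalg.solve(A, (eAt - np.eye(2)) @ b)
mean_t = eAt @ x0_mean + shift
cov_t = eAt @ x0_cov @ eAt.T

# Monte Carlo check: push samples of x0 through the same solution formula.
x0 = rng.multivariate_normal(x0_mean, x0_cov, size=50_000)
xt = x0 @ eAt.T + shift
print(mean_t, xt.mean(axis=0))
```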

### 3.2 Notes for Nonlinear *f*

When *f* is nonlinear, \(y \varphi _t(u,y)\), a key ingredient of our computational procedure, cannot be computed analytically. We approximate the nonlinear dynamics by linearizing *f* around the mean of the initial distribution. Assuming that the initial state of the system is normally distributed as described by Eq. (9), linearizing around the mean of the initial state gives

$$\mathbf {x}'(t) \approx f(\overline{\mathbf {x}}_0) + \nabla f(\overline{\mathbf {x}}_0) \left( \mathbf {x}(t) - \overline{\mathbf {x}}_0 \right),$$

where \(\nabla f\) denotes the Jacobian of *f* at \(t=0\) and \(\mathbf {x}=\overline{\mathbf {x}}_0\); this reduces the nonlinear dynamical system to a form that is similar to Eq. (9). Thus, we can now use Eqs. (11), (12), and (13) to approximate \(y \varphi _t(u,y)\) for nonlinear *f*.
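A sketch of this linearization step, with a made-up prey-predator right-hand side (the coefficients are illustrative) and the Jacobian taken by forward finite differences:

```python
import numpy as np

# Linearize x' = f(x) around the mean of the initial state.
# The prey-predator right-hand side and its coefficients are illustrative.
def f(x):
    return np.array([x[0] - 0.5 * x[0] * x[1],
                     0.5 * x[0] * x[1] - x[1]])

def jacobian(f, x, h=1e-6):
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = h
        J[:, j] = (f(x + e) - fx) / h    # forward finite difference
    return J

x0_mean = np.array([1.5, 1.0])
A = jacobian(f, x0_mean)                 # df/dx at the mean
b = f(x0_mean) - A @ x0_mean             # so that f(x) ~ A x + b near the mean
print(A, b)
```

The pair `(A, b)` then defines the linear Gaussian system to which the closed-form expressions of the previous section apply.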

## 4 Machine-Learned Inverse Maps

The inverse map is approximated by a feed-forward neural network built from transformations of the form

$$\xi _j = F\left( \sum _{\ell =1}^{L} c_m^{\ell } \, s_{\ell } + \epsilon _m \right),$$

where *F* is a differentiable activation function that imparts nonlinearity to this transformation; *L* is the input dimension of an incoming signal \(\mathbf {s} \in \mathbb {R}^L\); *M* is the number of hidden-layer neurons (in machine learning terminology); \(c_m^{\ell } \in \mathbb {R}^{M \times L}\) are the weights of this map; \(\epsilon _m \in \mathbb {R}^M\) are the biases; and \(\xi _j \in \mathbb {R}^{J}\) is the nonlinear output of this map, which may be matched to targets available from data or "fed forward" into subsequent maps. Note that \(\xi _j\) is the postactivation value of each neuron in a hidden layer of *J* neurons. In practice, multiple compositions of this map may be used to obtain nonlinear function approximators, called deep neural networks, that are very expressive. Each hidden layer uses a nonlinear activation with *J* fixed at 256, and a final transformation utilizes \(J=3\). The function *F* for the final transformation is the identity, as is common in machine learning algorithms. A schematic of this network architecture is shown in Fig. 1. The trainable parameters (\(c_m^{\ell }\) and \(\epsilon _m\) for each transformation) are optimized with the use of backpropagation [30], an adjoint calculation technique that obtains gradients of the loss function with respect to these parameters. A stochastic gradient optimization technique, ADAM [20], is used to update these parameters with a learning rate of 0.001. Our loss function is the \(L_2\)-distance between the prediction of the network and the targets (i.e., the mean-squared error). Our network also incorporates a regularization strategy, called dropout [31], that randomly switches off certain units \(\xi _j\) (here we utilize a dropout probability of 0.1) in the forward propagation of the map (i.e., from \(d \rightarrow 2\)). This approach avoids memorization of the data while allowing for effective exploration of a complex nonconvex loss surface.

Our map is trained for 500 epochs with a batch size of 256; in other words, a weight update is performed after the loss is computed for 256 samples, and an epoch is completed once the entire data set has been used for gradient updates. During the network training, we set aside a random subset of the data for validation; losses calculated on this subset are used only to monitor how well the framework generalizes to unseen data. These losses are plotted in Fig. 2, where one can see that the training and validation losses decrease to comparable magnitudes. Figure 3 also shows scatter plots for the validation data set, where good agreement between the true and predicted quantities can be seen. We may now use this map for approximating the IBD.
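A stripped-down numpy version of this training loop on a toy one-dimensional target (tanh is an assumed activation here, and the width-256 layers, Adam optimizer, and dropout of the actual framework are simplified to plain full-batch gradient descent on the mean-squared error):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy training data: fit y = sin(x) on [-2, 2] by mean-squared error.
X = rng.uniform(-2.0, 2.0, size=(512, 1))
Y = np.sin(X)

M = 16                                   # hidden width (the paper uses 256)
W1 = rng.normal(0.0, 1.0, size=(1, M)); b1 = np.zeros(M)
W2 = rng.normal(0.0, 0.1, size=(M, 1)); b2 = np.zeros(1)

def forward(X):
    return np.tanh(X @ W1 + b1) @ W2 + b2

mse0 = float(np.mean((forward(X) - Y) ** 2))   # loss before training

lr = 0.05
for epoch in range(5000):
    H = np.tanh(X @ W1 + b1)             # hidden layer: xi = F(c s + eps)
    P = H @ W2 + b2                      # identity activation on the output
    G = 2.0 * (P - Y) / len(X)           # d(MSE)/dP, backpropagated below
    gW2 = H.T @ G; gb2 = G.sum(0)
    GH = (G @ W2.T) * (1.0 - H ** 2)     # backprop through tanh
    gW1 = X.T @ GH; gb1 = GH.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

mse = float(np.mean((forward(X) - Y) ** 2))
print(mse0, mse)                         # training reduces the loss
```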

### 4.1 Using the Machine-Learned Inverse Map to Construct IBD

The following procedure is used to construct the IBD.

- 1.
Obtain different realizations of the initial conditions of the dynamical system by sampling from the initial PDF *p*.

- 2.
Use \(\mathcal {G}\) to obtain the forward maps of these realizations.

- 3.
Use the forward maps and the corresponding random realizations of the initial conditions to train the inverse map \(\mathcal {G}^{-1}\).

- 4.
Apply this trained inverse map to samples generated from \(y \varphi _t(u,y)\) to obtain the approximate preimages of the samples \(\mathbf {y}_i\).

- 5.
Use a Gaussian approximation of these preimages as the IBD. Assume that this Gaussian approximation has PDF \(p^\mathrm{IBD}\).

- 6.
Sample from the IBD, and use importance sampling to estimate the probabilities.
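The steps above can be sketched end to end on a deliberately simple problem, with a linear forward map standing in for the ODE solve and linear least squares standing in for the trained network (both are hypothetical stand-ins for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)

# Stand-in forward map G (step 2): a fixed linear map instead of an ODE solve.
A = np.array([[1.2, 0.3], [0.1, 0.9]])
G = lambda x0: x0 @ A.T

# Step 1: realizations of the initial condition from the initial PDF p.
x0 = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=5_000)
y = G(x0)

# Step 3: fit the inverse map (here: linear least squares y -> x0).
Winv, *_ = np.linalg.lstsq(y, x0, rcond=None)

# Step 4: apply the inverse map to "excursion" samples y_i (made-up cloud).
y_exc = rng.multivariate_normal([3.0, 2.0], 0.05 * np.eye(2), size=1_000)
x_pre = y_exc @ Winv

# Step 5: Gaussian approximation of the preimages -> IBD (mean, covariance).
ibd_mean, ibd_cov = x_pre.mean(axis=0), np.cov(x_pre.T)
print(ibd_mean)
```

Step 6 then draws initial conditions from this Gaussian and reweights them by the density ratio, as described in the next section.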

### 4.2 Using IBD to Estimate Rare Event Probability

Given samples \(\mathbf {x}_i \sim p^\mathrm{IBD}\), \(i = 1, \ldots, N\), the excursion probability is estimated with the standard importance sampling estimator

$$\widehat{P}_T(u) = \frac{1}{N} \sum _{i=1}^{N} \frac{p(\mathbf {x}_i)}{p^\mathrm{IBD}(\mathbf {x}_i)} \, \mathbb {1}\left\{ \mathbf {x}_i \in \varOmega (u) \right\},$$

where the indicator is evaluated by simulating the dynamical system from \(\mathbf {x}_i\) and checking whether \(\mathbf {c}^{\top }\mathbf {x}(t)\) exceeds *u* on \([0, T]\).

## 5 Numerical Experiments

We demonstrate the method on a prey-predator system excited by a Gaussian initial state. We generate samples of *t* and *y* as described by Eq. (12), and we compute \(y \varphi _t(u,y)\) as shown in Eq. (13). We use the delayed rejection adaptive Metropolis (DRAM) Markov chain Monte Carlo (MCMC) method to generate samples from \(y \varphi _t(u,y)\). (For more details about DRAM, see [15].) To minimize the effect of the initial guess on the posterior inference, we use a burn-in of 1,000 samples. Figure 4 shows the contours of \(y \varphi _t(u,y)\) and samples drawn from it by using DRAM MCMC. In [25] we solved the Bayesian inverse problem, by using both MCMC and a Laplace approximation at the MAP point, to construct a distribution that approximately maps to the likelihood constructed around \(\mathbf {y}_i\). Here, we replace the solution of the Bayesian inverse problem with the machine-learned inverse map described in Sect. 4. Multiple samples generated from \(y\varphi _t(u,y)\) are used to construct the IBD, as described in Sect. 4.1, and the IBD is used to estimate \(P_T(u)\), as explained in Sect. 4.2. Figure 5 compares the results of conventional MCS and the machine-learning-based importance sampling (ML-based IS) method. Note that we use an MCS estimate with 10 million samples as a proxy for the true probability, which is \(3.28\times 10^{-5}\). ML-based IS gives a fairly good estimate even with a small number of model evaluations, and the improvements are dramatic when the training data set is large enough. For a true probability on the order of \(10^{-5}\), we obtain an estimate with a relative error of less than 1%, matching (or exceeding) the accuracy of MCS at a hundred times less computational cost. The convergence with just 5,000 training samples is acceptable, and the results improve dramatically for 10,000 and 20,000 training samples. We believe the results could be improved further by representing the IBD with a Gaussian mixture instead of a simple Gaussian approximation.

### 5.1 Computational Cost

In Fig. 5 we have not included the costs of generating training data, training, and approximating the inverse map, because these costs are almost negligible compared with the overall cost. Note that generating 20,000 training samples is approximately equivalent to 400 model evaluations (a single model evaluation yields the state and slope at 50 different times, each of which can be used as a training sample). Training the ML framework for this problem required very little compute time: each training run was executed on an 8th-generation Intel Core i7 machine with Python 3.6.8 and TensorFlow 1.14 and took less than 180 s for 20,000 training samples (equivalent to fewer than 50 model evaluations). Inference for 20,000 prediction points took less than 2 s on average.

## 6 Conclusions and Future Work

In this work we developed an ML-based IS method to estimate rare event probabilities, and we demonstrated the algorithm on a prey-predator system. The method builds on the approach in [25] and replaces the expensive Bayesian inference with a machine-learning-based surrogate. This approach yields fairly accurate estimates of the probabilities and, for a given accuracy, requires at least three orders of magnitude less computational effort than traditional MCS. In future work, we aim to test this algorithm on larger problems and to use an active-learning-based approach to pick the training samples. Scaling the algorithm to high dimensions (say, \(\mathcal {O}(1000)\)) could be challenging; to address this, we will draw on state-of-the-art techniques developed by the machine learning and deep learning communities.

## References

- 1. Adler, R.J.: The Geometry of Random Fields. SIAM (2010)
- 2. Asmussen, S., Glynn, P.W.: Stochastic Simulation: Algorithms and Analysis, vol. 57. Springer, New York (2007). https://doi.org/10.1007/978-0-387-69033-9
- 3. Atputharajah, A., Saha, T.K.: Power system blackouts - literature review. In: 2009 International Conference on Industrial and Information Systems (ICIIS), pp. 460–465. IEEE (2009)
- 4. Au, S.K., Beck, J.L.: Estimation of small failure probabilities in high dimensions by subset simulation. Probab. Eng. Mech. **16**(4), 263–277 (2001)
- 5. Beck, J.L., Zuev, K.M.: Rare-event simulation. In: Ghanem, R., Higdon, D., Owhadi, H. (eds.) Handbook of Uncertainty Quantification, pp. 1075–1100. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-12385-1_24
- 6. Bect, J., Li, L., Vazquez, E.: Bayesian subset simulation. SIAM/ASA J. Uncertain. Quantif. **5**(1), 762–786 (2017)
- 7. Botev, Z.I., Kroese, D.P.: Efficient Monte Carlo simulation via the generalized splitting method. Stat. Comput. **22**(1), 1–16 (2012)
- 8. Bucklew, J.: Introduction to Rare Event Simulation. Springer, New York (2004). https://doi.org/10.1007/978-1-4757-4078-3
- 9. Ching, J., Beck, J.L., Au, S.: Hybrid subset simulation method for reliability estimation of dynamical systems subject to stochastic excitation. Probab. Eng. Mech. **20**(3), 199–214 (2005)
- 10. Ching, J., Au, S.K., Beck, J.L.: Reliability estimation for dynamical systems subject to stochastic excitation using subset simulation with splitting. Comput. Methods Appl. Mech. Eng. **194**(12–16), 1557–1579 (2005)
- 11. Dematteis, G., Grafke, T., Vanden-Eijnden, E.: Rogue waves and large deviations in deep sea. Proc. Natl. Acad. Sci. **115**(5), 855–860 (2018)
- 12. Dematteis, G., Grafke, T., Vanden-Eijnden, E.: Extreme event quantification in dynamical systems with random components. SIAM/ASA J. Uncertain. Quantif. **7**(3), 1029–1059 (2019)
- 13. Dunn, W.L., Shultis, J.K.: Exploring Monte Carlo Methods. Elsevier, Amsterdam (2011)
- 14. Dysthe, K., Krogstad, H.E., Müller, P.: Oceanic rogue waves. Annu. Rev. Fluid Mech. **40**, 287–310 (2008)
- 15. Haario, H., Laine, M., Mira, A., Saksman, E.: DRAM: efficient adaptive MCMC. Stat. Comput. **16**(4), 339–354 (2006)
- 16. Kahn, H., Harris, T.E.: Estimation of particle transmission by random sampling. Natl. Bureau Stand. Appl. Math. Ser. **12**, 27–30 (1951)
- 17. Kahn, H., Marshall, A.W.: Methods of reducing sample size in Monte Carlo computations. J. Oper. Res. Soc. Am. **1**(5), 263–278 (1953)
- 18. Katafygiotis, L., Cheung, S.H.: A two-stage subset-simulation-based approach for calculating the reliability of inelastic structural systems subjected to Gaussian random excitations. Comput. Methods Appl. Mech. Eng. **194**(12–16), 1581–1595 (2005)
- 19. Katafygiotis, L.S., Zuev, K.M.: Geometric insight into the challenges of solving high-dimensional reliability problems. Probab. Eng. Mech. **23**(2–3), 208–218 (2008)
- 20. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
- 21. Liu, J.S.: Monte Carlo Strategies in Scientific Computing. Springer, New York (2004). https://doi.org/10.1007/978-0-387-76371-2
- 22. Metropolis, N., Ulam, S.: The Monte Carlo method. J. Am. Stat. Assoc. **44**(247), 335–341 (1949)
- 23. Mohamad, M.A., Sapsis, T.P.: A sequential sampling strategy for extreme event statistics in nonlinear dynamical systems. arXiv preprint arXiv:1804.07240 (2018)
- 24. Mohamad, M.A., Sapsis, T.P.: Sequential sampling strategy for extreme event statistics in nonlinear dynamical systems. Proc. Natl. Acad. Sci. **115**(44), 11138–11143 (2018)
- 25. Rao, V., Anitescu, M.: Efficient computation of extreme excursion probabilities for dynamical systems (2020)
- 26. Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge (2006)
- 27. Rice, S.O.: Mathematical analysis of random noise. Bell Labs Tech. J. **23**(3), 282–332 (1944)
- 28. Robert, C.P., Casella, G.: Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer, New York (2005). https://doi.org/10.1007/978-1-4757-4145-2
- 29. Ross, T., Lott, N.: A climatology of 1980–2003 extreme weather and climate events. US Department of Commerce, National Oceanic and Atmospheric Administration, National Environmental Satellite Data and Information Service, National Climatic Data Center (2003)
- 30. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature **323**(6088), 533–536 (1986)
- 31. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. **15**(1), 1929–1958 (2014)
- 32. The North American Electric Reliability Corporation: 2017 long-term reliability assessment (2017). https://www.nerc.com/pa/RAPA/ra/Reliability
- 33. Zuev, K.M., Beck, J.L., Au, S.K., Katafygiotis, L.S.: Bayesian post-processor and other enhancements of subset simulation for estimating failure probabilities in high dimensions. Comput. Struct. **92**, 283–296 (2012)