1 Introduction

Mathematical models are approximations of real-life systems, and their validity rests on how well the model outputs agree with measured data. Often the inputs or parameters of the model are uncertain because data are unavailable or inaccurate; in these cases one typically performs an uncertainty quantification (UQ) analysis to determine how much the outputs vary with the input parameters of the model. The uncertainty in the outputs can be quantified as a range of values, but also as a probability density function (pdf). Several methods, for example Monte Carlo sampling and polynomial chaos, address the problem of computing the probability distribution of the output given parameters defined as pdfs.

A field that has experienced renewed interest in these techniques is energy systems engineering. With the adoption of renewable energy, the behavior of the electrical power grid has become tied to stochastic weather fluctuations, requiring UQ techniques to predict its performance. However, these problems are at a scale at which conventional methods are not satisfactory from a computational point of view. Monte Carlo methods can be thought of as the first-line tools for UQ: with sufficient sampling they quantify the uncertainty regardless of the input distribution or the nonlinearities of the system. However, Monte Carlo methods suffer from slow convergence, which has led to the search for alternative approaches [8]. Recently, the method of moments has sparked new interest as one such alternative [5].

The method of moments is an approximation technique that works with the moments of probability distributions instead of their density functions. The main idea is to use a Taylor expansion of the function and write the moments of the output distribution as a polynomial function of the moments of the input distribution. Depending on the characteristics of the function, a few terms of the Taylor expansion may already achieve sufficient precision. As noted in [9], one of the main issues with the method of moments is that, although its accuracy increases with the degree of the Taylor polynomial, computing higher-order derivatives poses serious technical challenges, which is why sensitivities are mostly acquired by linearization [3].

In this paper we present ADUPROP, a framework developed at Argonne National Laboratory that combines automatic differentiation (Sect. 2), the method of moments (Sect. 3), uncertainty quantification, and distributed parallelism (Sect. 4) into an easy-to-use tool that quantifies uncertainty of dynamical systems using the method of moments at an unprecedented scale. We use automatic differentiation (AD) by operator overloading through a C++ template library. This flexible technique allows a straightforward augmentation of C++ codes for computing higher-order derivatives. By exploiting the structure of this approach, we implement a scheme that parallelizes both the accumulation of the derivative information and the computation of the covariance based on the derivative values.

2 Algorithmic Differentiation

Automatic differentiation [2] allows one to differentiate computer programs by applying differential calculus at the level of a program's statements. It uses compiler- or language-based approaches to transform an implementation of a multivariate vector function \({y} = g({x}), \mathbb {R}^n \mapsto \mathbb {R}^m\) into Jacobian vector products \({y}^{(1)}=J({x}) \cdot {x}^{(1)}\) (tangent-linear model) or transposed Jacobian vector products \({x}_{(1)} = J^{T} \cdot {y}_{(1)}\) (adjoint model), where \({x}^{(1)}\), \({y}^{(1)}\) denote the tangents and \({x}_{(1)}\), \({y}_{(1)}\) denote the adjoints. The tangent-linear mode plays the same role as the finite difference method, with the additional advantage of providing derivative information accurate to machine precision, free of truncation or cancellation errors.

In this paper we rely solely on the tangent-linear or forward mode, where the transformed code computes the product of the Jacobian J at point x with a directional derivative \({x}^{(1)}\), yielding the output tangent. The directional derivatives, denoted with a superscript order of differentiation, are defined as partial derivatives of y and x with respect to an auxiliary variable s. For readability we use Spivak's notation for derivatives: \({y}^{(1)} = \frac{\partial {y}}{\partial s} = \frac{\partial {y}}{\partial {{x}}} \cdot \frac{\partial {x}}{\partial s} = \mathrm {D}g({x})\cdot {x}^{(1)} \in \mathbb {R}^m\). Letting \({x}^{(1)}\) range over the Cartesian basis vectors, the implementation of \(J \cdot {x}^{(1)}\) yields, column by column, the entire Jacobian \(J=\mathrm {D}g({x}) \in \mathbb {R}^{m \times n}\). Thus, to accumulate the full Jacobian we need to rerun the tangent-linear code n (number of columns) times. For higher-order derivative models we use the inner product \(<>\) notation introduced in [6], where the tangent-linear model is written as a projection of the Jacobian onto the tangent:

$$\begin{aligned} \begin{array}{llll} {y}=g({x}),&{y}^{(1)} = < \mathrm {D}g({x}), {x}^{(1)}> = \mathrm {D}g({x}) \cdot {x}^{(1)} \,. \end{array} \end{aligned}$$
(1)
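To illustrate the tangent-linear model in code, consider the following minimal, self-contained sketch (not the ADUPROP implementation; the Dual type and the test function g are purely illustrative). It propagates a tangent alongside the value and accumulates the Jacobian column by column by seeding the Cartesian basis vectors:

#include <array>
#include <cstddef>
#include <iostream>

// Minimal forward-mode scalar: carries the value x and one tangent x^(1).
struct Dual {
  double v;  // value
  double t;  // tangent
};

Dual operator*(Dual a, Dual b) { return {a.v * b.v, a.t * b.v + a.v * b.t}; }
Dual operator+(Dual a, Dual b) { return {a.v + b.v, a.t + b.t}; }

// Illustrative test function g: R^2 -> R^2, g(x) = (x0*x1, x0 + x1*x1).
template <typename T>
std::array<T, 2> g(const std::array<T, 2>& x) {
  return {x[0] * x[1], x[0] + x[1] * x[1]};
}

int main() {
  const std::size_t n = 2, m = 2;
  std::array<Dual, 2> x = {{{3.0, 0.0}, {2.0, 0.0}}};
  double J[2][2];
  // Let x^(1) range over the Cartesian basis vectors: each rerun of the
  // tangent-linear code yields one column of the Jacobian.
  for (std::size_t j = 0; j < n; ++j) {
    for (std::size_t k = 0; k < n; ++k) x[k].t = (k == j) ? 1.0 : 0.0;
    auto y = g(x);                               // y^(1) = Dg(x) * e_j
    for (std::size_t i = 0; i < m; ++i) J[i][j] = y[i].t;
  }
  std::cout << "J[0][0] = " << J[0][0] << "\n";  // d(x0*x1)/dx0 = x1 = 2
}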

Note that in general an implementation transformed by an AD tool computes both g(x) and the Jacobian vector product. Reapplying an AD tool to an already first-order differentiated code yields a second-order forward-over-forward (FoF) code that computes (2):

$$\begin{aligned} \begin{array}{llll} {y}&{}=g({x}), &{} {y}^{(2)} &{}=< \mathrm {D}g({x}), {x}^{(2)}>\\ {y}^{(1)}&{}=< \mathrm {D}g({x}), {x}^{(1)}>, &{} {y}^{(1,2)} &{}=< \mathrm {D}^2 g({x}), {x}^{(1)}, {x}^{(2)}> + <\mathrm {D}g({x}), {x}^{(1,2)}> \,. \end{array} \end{aligned}$$
(2)

The superscript \(^{(2)}\) denotes the second order of differentiation. Rerunning this FoF model and letting \({x}^{(1)}\) and \({x}^{(2)}\) each range over the Cartesian basis vectors, we obtain all entries of the Hessian \(\mathrm {D}^2 g \in \mathbb {R}^{m \times n \times n}\) evaluated at x. Here \(< \mathrm {D}^2 g({x}), {x}^{(1)}, {x}^{(2)}>\) is the projection of the Hessian onto \({x}^{(1)}\) followed by the projection onto \({x}^{(2)}\); \({x}^{(1,2)}\) must be set to zero. (For a detailed definition of Jacobian, Hessian, and tensor projections, please refer to [6].) Following this logic, we reapply the tangent-linear model to acquire third-order derivatives using the forward-over-forward-over-forward (FoFoF) model:

$$\begin{aligned} \begin{array}{llll} {y}&{}=g({x}), &{} {y}^{(3)}&{}=< \mathrm {D}g({x}), {x}^{(3)}>\\ {y}^{(2)}&{}=< \mathrm {D}g({x}), {x}^{(2)}>, &{} {y}^{(2,3)}&{}=< \mathrm {D}^2 g({x}), {x}^{(2)}, {x}^{(3)}> +<\mathrm {D}g({x}), {x}^{(2,3)}> \\ {y}^{(1)}&{}=< \mathrm {D}g({x}), {x}^{(1)}>, &{} {y}^{(1,3)}&{}=< \mathrm {D}^2 g({x}), {x}^{(1)}, {x}^{(3)}> + <\mathrm {D}g({x}), {x}^{(1,3)}> \,, \end{array} \end{aligned}$$

and, finally, the components whose last term captures the third-order tensor \(\mathrm {D}^3\):

$$\begin{aligned} \begin{array}{ll} {y}^{(1,2)}&{}=< \mathrm {D}^2 g({x}), {x}^{(1)}, {x}^{(2)}> +<\mathrm {D}g({x}), {x}^{(1,2)}>, \\ {y}^{(1,2,3)}&{}=< \mathrm {D}^3 g({x}), {x}^{(1)}, {x}^{(2)}, {x}^{(3)}> +< \mathrm {D}^2 g({x}), {x}^{(1,3)}, {x}^{(2)}>\\ &{} +< \mathrm {D}^2 g({x}), {x}^{(1)}, {x}^{(2,3)}>+<\mathrm {D}^2 g({x}), {x}^{(1,2)}, {x}^{(3)}> + <\mathrm {D}g({x}), {x}^{(1,2,3)}> \,. \end{array} \end{aligned}$$
(3)

The original code uses one variable x; this grows to two variables for the tangent-linear model, four for the FoF model, and eight for the FoFoF model. To accumulate the third-order tensor \(\mathrm {D}^3 g({x}) \in \mathbb {R}^{m \times n \times n \times n}\) in \({y}^{(1,2,3)}\), we let \({x}^{(1)}\), \({x}^{(2)}\), and \({x}^{(3)}\) range over the Cartesian basis vectors, thus requiring \(n^3\) reruns of the model. The remaining tangents of x must be set to zero. These properties translate directly to the implementation described in Sect. 5.
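The same seeding logic, with one more level of nesting, can be sketched as follows (illustrative only; ADUPROP uses CoDiPack types rather than this hand-written template). Nesting the forward type once and seeding two tangent directions yields one Hessian projection per rerun, so the full Hessian requires the \(n^2\) loop below; the third-order tensor adds one more loop level and tangent component.

#include <array>
#include <cstddef>
#include <iostream>

// Generic forward-mode scalar over a base type B. Nesting Fwd<Fwd<double>>
// mimics the forward-over-forward (FoF) model above.
template <typename B>
struct Fwd {
  B v;  // value
  B t;  // tangent
};

template <typename B> Fwd<B> operator*(Fwd<B> a, Fwd<B> b) {
  return {a.v * b.v, a.t * b.v + a.v * b.t};
}
template <typename B> Fwd<B> operator+(Fwd<B> a, Fwd<B> b) {
  return {a.v + b.v, a.t + b.t};
}

using t1s = Fwd<double>;  // tangent-linear type
using t2s = Fwd<t1s>;     // forward-over-forward type

// Illustrative scalar test function g(x) = x0 * x0 * x1.
template <typename T> T g(const std::array<T, 2>& x) { return x[0] * x[0] * x[1]; }

int main() {
  const std::size_t n = 2;
  double H[2][2];
  // Let x^(1) (inner tangent) and x^(2) (outer tangent) range over the
  // Cartesian basis vectors: n*n reruns yield all Hessian entries.
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      std::array<t2s, 2> x{};
      x[0].v.v = 3.0; x[1].v.v = 2.0;  // primal point
      x[i].v.t = 1.0;                  // seed x^(1) = e_i
      x[j].t.v = 1.0;                  // seed x^(2) = e_j
      // x^(1,2) stays zero, as required above.
      t2s y = g(x);
      H[i][j] = y.t.t;                 // y^(1,2) = (D^2 g)_{ij}
    }
  }
  std::cout << "d2g/dx0dx1 = " << H[0][1] << "\n";  // 2*x0 = 6
}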

3 Method of Moments

What is the distribution of y if we let \(y = g(x)\) be a function of a random variable x with known properties? Computing this analytically is often difficult, and obtaining the pdf of y is in general not possible. An alternative approach is to consider the moments of the distributions. Depending on the shape of the pdf, its first few moments can be sufficient to capture the relevant behavior. More concretely, consider \(g(\varvec{x})\), where \(\varvec{x}\) is a random variable with density f(x). If g(x) is sufficiently smooth, then by the mean value theorem we can write [7]

$$\begin{aligned} \mathbb {E}\left[ g(\varvec{x}) \right] = \int _{-\infty }^{\infty } g(x)f(x)dx \approx g(\mu ) \int _{-\infty }^{\infty } f(x)dx = g(\mu ) \,, \end{aligned}$$
(4)

where \(\mu = \mathbb {E}\left[ \varvec{x} \right] \). Using a third-order Taylor expansion of g(x) around \(\mu \) and considering the expressions for the mean \(\mu ^g=\mathbb {E}\left[ g(\varvec{x}) \right] \) and covariance \(c^g_{pq}=\mathbb {E}\left[ (g_p(\varvec{x}) - \mu _p)(g_q(\varvec{x}) - \mu _q) \right] = \mathbb {E}\left[ g_p(\varvec{x})g_q(\varvec{x}) \right] - \mathbb {E}\left[ g_p(\varvec{x}) \right] \mathbb {E}\left[ g_q(\varvec{x}) \right] \,\), we obtain

$$\begin{aligned} \mu _p^{g} = g_p(\mu ) + \frac{1}{2} \sum _{i,j = 1}^n \mathrm {D}^2_{i j} g_{p} \cdot c_{ij} \end{aligned}$$
(5)

and

$$\begin{aligned} \begin{array}{lll} c^{g}_{pq} &{}= \frac{1}{2!}\sum _{i,j = 1}^n &{}\left( \mathrm {D}_{j} g_{p}\cdot \mathrm {D}_{i} g_{q} + \mathrm {D}_{j} g_{q}\cdot \mathrm {D}_{i} g_{p} \right) c_{ij} \\ &{}+\, \frac{1}{4!} \sum _{i,j,k,l = 1}^n &{}\left( \mathrm {D}_{i} g_{p}\cdot \mathrm {D}^3_{j k l} g_{q} + \mathrm {D}_{j} g_{p}\cdot \mathrm {D}^3_{i k l} g_{q} + \mathrm {D}_{k} g_{p}\cdot \mathrm {D}^3_{i j l} g_{q} \right. \\ &{} &{}\left. +\, \mathrm {D}_{l} g_{p}\cdot \mathrm {D}^3_{i j k} g_{q} + \mathrm {D}^2_{i j} g_{p} \cdot \mathrm {D}^2_{k l} g_{q} + \mathrm {D}^2_{i k} g_{p} \cdot \mathrm {D}^2_{j l} g_{q} \right. \\ &{} &{} +\left. \mathrm {D}^2_{i l} g_{p} \cdot \mathrm {D}^2_{j k} g_{q} + \mathrm {D}^2_{j k} g_{p} \cdot \mathrm {D}^2_{i l} g_{q} + \mathrm {D}^2_{j l} g_{p} \cdot \mathrm {D}^2_{i k} g_{q} \right. \\ &{} &{} +\left. \mathrm {D}^2_{k l} g_{p} \cdot \mathrm {D}^2_{i j} g_{q} + \mathrm {D}^3_{j k l} g_{p} \cdot \mathrm {D}_{i} g_{q} + \mathrm {D}^3_{i k l} g_{p} \cdot \mathrm {D}_{j} g_{q} \right. \\ &{} &{} \left. +\, \mathrm {D}^3_{i j l} g_{p} \cdot \mathrm {D}_{k} g_{q} + \mathrm {D}^3_{i j k} g_{p} \cdot \mathrm {D}_{l} g_{q} \right) c_{ijkl}\\ &{}-\, \frac{1}{2!} \sum _{i,j,k,l = 1}^n &{}\left( \mathrm {D}^2_{i j} g_{p}\cdot \mathrm {D}^2_{k l} g_{q} \right) c_{ij}c_{kl} \,. \end{array} \end{aligned}$$
(6)

In the derivation of these formulas we assume a Gaussian input distribution, which results in the cancellation of the odd terms of the expansion.
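To make the use of (5) concrete, the following minimal sketch (names hypothetical) evaluates the second-order mean correction for the p-th output, assuming the Hessian has already been accumulated with the FoF model of Sect. 2:

#include <cstddef>
#include <vector>

// Second-order mean propagation, Eq. (5):
//   mu_p^g = g_p(mu) + 1/2 * sum_{i,j} D^2_{ij} g_p * c_{ij}.
// H is the n x n Hessian of the p-th output at mu (row-major) and
// C is the n x n input covariance matrix (row-major).
double propagate_mean_p(double gp_at_mu, const std::vector<double>& H,
                        const std::vector<double>& C, std::size_t n) {
  double correction = 0.0;
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      correction += H[i * n + j] * C[i * n + j];
  return gp_at_mu + 0.5 * correction;
}

The covariance update (6) is evaluated analogously, with the quadruple sum over \(c_{ijkl}\) being the dominant cost.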

4 Tensor Decomposition

In addition to combining the aforementioned methods and tools in a novel way to compute the moments, our main contribution is the distributed parallelization described in this section. The FoF model always computes one projection \(y^{(1,2)}\) of \(D^2g \in \mathbb {R}^{m \times n \times n}\) onto \(x^{(1)},x^{(2)} \in \mathbb {R}^n\) (see Fig. 1). Hence we cannot decompose the Hessian along the entries of \(y^{(1,2)}\). By parallelizing over the entries of \(x^{(1)}\) or \(x^{(2)}\), we can restrict the Cartesian basis vectors (see Sect. 2) to the local indices and thus distribute the Hessian accumulation over all processes. The same holds for the computation of the four-dimensional tensor \(D^3g\) in (3), where we can parallelize over the entries of \(x^{(1)}\), \(x^{(2)}\), or \(x^{(3)}\) and thus distribute the tensor over all processes.

Fig. 1. Hessian \(D^2g\)

The computation of the mean in (5) is then parallelized in a straightforward way over either the index i or j. Each process thus holds a local copy of \(\mu \) that is allreduced at the end of the summation. We parallelize the computation of the covariance C over p or q and perform an allgather to share C among all processes. Solving the linear system for the time stepper has a runtime complexity of at most \(\mathcal {O}\left( n^3 \right) \), or lower in the case of sparsity in the inner Jacobian. Accumulating the tensor \(D^3g\) requires the \(\mathcal {O}\left( n^3 \right) \) reruns described in Sect. 2, that is, a cost of \(\mathcal {O}\left( n^3 \right) \cdot cost(g)\). Thus, the total runtime complexity for accumulating the tensor is \(\mathcal {O}\left( n^6/p \right) \), where p is the number of processes. The covariance computation in (6) has the same complexity of \(\mathcal {O}\left( n^6/p \right) \), which is therefore our global runtime complexity.
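A schematic sketch of this distribution (the MPI calls are standard; everything else is a placeholder for the ADUPROP internals): each rank seeds only its share of the Cartesian basis vectors, and the local contributions to the mean are combined with an allreduce.

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int n = 512;               // input/output dimension
  std::vector<double> mu(n, 0.0);  // local contribution to mu^g

  // Distribute the outer seeding index of x^(1) over the ranks.
  for (int i = rank; i < n; i += size) {
    // Seed x^(1) = e_i, rerun the FoF model, and add the local terms of
    // Eq. (5) to mu (omitted; see Sects. 2 and 3).
  }

  // Every rank ends up with the full mean vector.
  MPI_Allreduce(MPI_IN_PLACE, mu.data(), n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  MPI_Finalize();
  return 0;
}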

5 Implementation

ADUPROP (AD for Uncertainty Propagation) is a C++ implementation of the propagation of moments. In particular, it implements the concepts of Sect. 3 in the context of differential equations. ADUPROP is a template-based code that allows easy computation of higher-order derivatives. The library provides vector, matrix, and tensor data structures parameterized by a template scalar type instead of a hard-coded double. For our implementation of ADUPROP we chose the AD tool CoDiPack, which is based on operator overloading. To create a derivative type of order \(n+1\), we recursively apply differentiation to a type of order n.

(Listing: recursive declaration of the first-, second-, and third-order derivative types.)
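The listing presumably declares these nested types. A sketch of how such a hierarchy can be built on top of CoDiPack's forward type follows; the exact template name and signature are an assumption here and may differ between CoDiPack versions.

#include <codi.hpp>

// Assumed reconstruction: each level adds one order of differentiation by
// instantiating the forward type over the previous one.
using t1s = codi::RealForwardGen<double, double>;  // first order:  x, x^(1)
using t2s = codi::RealForwardGen<t1s, t1s>;        // second order: adds x^(2), x^(1,2)
using t3s = codi::RealForwardGen<t2s, t2s>;        // third order:  adds x^(3), x^(1,3), ...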

This maps exactly to the notation in Sect. 2. Accessing the corresponding nested tangent component of a variable of the third-order type is equivalent to accessing \(\varvec{x}^{(2,3)}\). Templating the variable type in the implementation of f allows us to instantiate the function with the passive scalar type as well as the first-, second-, and third-order derivative types.

As an example, we implement UQ of a differential equation system with ADUPROP. When the system is discretized (\(x_k = \phi (x_{k-1})\)), the procedure is identical to the one described in Sect. 3. The default integration scheme we use is backward Euler. One of the main advantages of using AD is that we can differentiate through function calls, loops, and other complex constructs for which obtaining explicit derivatives would be practically challenging and tedious. The integration loop is written as follows:

(Listing: time-integration loop.)
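Since the listing is not reproduced here, the following hypothetical sketch merely illustrates its structure: one backward Euler step solves the nonlinear system \(x_k - x_{k-1} - h\,f(x_k) = 0\) with Newton's method, and templating on the scalar type lets the same loop run with the passive and the derivative types.

#include <cstddef>
#include <vector>

// Hypothetical sketch of the integration loop; System stands in for the
// user-defined object providing residual, Jacobian, and linear solve.
template <typename T, typename System>
void integrate(System& sys, std::vector<T>& x, T h, int steps, int newton_iters) {
  std::vector<T> x_prev(x.size()), res(x.size()), dx(x.size());
  for (int k = 0; k < steps; ++k) {
    x_prev = x;                         // x_{k-1}
    for (int it = 0; it < newton_iters; ++it) {
      sys.residual(res, x, x_prev, h);  // R(x) = x - x_prev - h*f(x)
      sys.jacobian(x, h);               // J = I - h*df/dx
      sys.solve(dx, res);               // solve J*dx = R (BLAS or Eigen backend)
      for (std::size_t i = 0; i < x.size(); ++i) x[i] = x[i] - dx[i];
    }
  }
}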

A user-defined system object has to provide the residual function and its Jacobian. A solve routine provides an interface to linear solvers; currently we support BLAS and Eigen for dense and sparse linear systems, respectively. For an order of differentiation k we differentiate \(Ax=b\). Let \(A(s) \in \mathbb {R}^{n \times n}\), \(b(s) \in \mathbb {R}^n\), and \(x(s) \in \mathbb {R}^n\), with s being some input dependency, and denote \(\frac{\partial ^k A}{\partial s^k} = A^{(k)}\), \(\frac{\partial ^k b}{\partial s^k} = b^{(k)}\), and \(\frac{\partial ^k x}{\partial s^k} = x^{(k)}\). Differentiating \(Ax = b\) k times, we have

$$\begin{aligned} c_k \cdot A \cdot x^{(k)} = b^{(k)} - c_0 \cdot A^{(k)}\cdot x - c_1 \cdot A^{(k-1)} x^{(1)} - \ldots - c_{k-1} \cdot A^{(1)} \cdot x^{(k-1)}. \end{aligned}$$
(7)

In summary, we have to solve 2, 4, and 8 linear systems for first-, second-, and third-order derivatives, respectively. With these basic building blocks in place (time stepper, residual function, Jacobian, and linear solver), we can compute the Jacobian, Hessian, and third-order derivative tensor using the logic described in Sect. 2.
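For first order, for example, (7) reduces to \(A \cdot x^{(1)} = b^{(1)} - A^{(1)} \cdot x\), so the tangent is obtained from a second solve that reuses the factorization of A. A minimal Eigen-based sketch (independent of the ADUPROP interface; the data are random placeholders):

#include <Eigen/Dense>
#include <iostream>

int main() {
  const int n = 3;
  // A, b and their tangents A^(1), b^(1) with respect to some input s.
  Eigen::MatrixXd A  = Eigen::MatrixXd::Random(n, n) + 4.0 * Eigen::MatrixXd::Identity(n, n);
  Eigen::MatrixXd A1 = Eigen::MatrixXd::Random(n, n);
  Eigen::VectorXd b  = Eigen::VectorXd::Random(n);
  Eigen::VectorXd b1 = Eigen::VectorXd::Random(n);

  // One factorization of A is reused for both solves.
  Eigen::PartialPivLU<Eigen::MatrixXd> lu(A);
  Eigen::VectorXd x  = lu.solve(b);            // A x = b
  Eigen::VectorXd x1 = lu.solve(b1 - A1 * x);  // A x^(1) = b^(1) - A^(1) x
  std::cout << "x^(1) = " << x1.transpose() << std::endl;
}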

6 Scalability

A prototype sequential implementation of this method in Julia was tested on power system dynamics in [5]. To assess the scaling and computational capabilities of ADUPROP, we use a well-known damped nonlinear dynamical system from the weather simulation community and elsewhere [4, 10, 11]:

$$\begin{aligned} \dot{x}_i = x_{i-1}\left( x_{i+1} - x_{i-2}\right) - x_i + F, \quad i=1,\dots ,n, \ n > 3 \,. \end{aligned}$$
(8)

This system is known to exhibit chaotic behavior when the forcing term \(F \ge 8\); it has an equilibrium \((F,\dots ,F)\) that becomes unstable for all \(n \ge 4\). The system transitions from a damped, constant-valued solution to a traveling wave with a periodic attractor, and eventually to chaotic behavior, all adjustable through the choice of F.
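The right-hand side of (8) is straightforward to implement; a sketch, assuming the customary cyclic indices (\(x_{-1} = x_{n-1}\), \(x_0 = x_n\), \(x_{n+1} = x_1\)) and a scalar type T that can be the passive double or one of the nested forward AD types:

#include <cstddef>
#include <vector>

// Right-hand side of Eq. (8) with cyclic (periodic) indices, 0-based in code.
template <typename T>
void rhs(const std::vector<T>& x, std::vector<T>& xdot, T F) {
  const std::size_t n = x.size();
  for (std::size_t i = 0; i < n; ++i) {
    const T& xm2 = x[(i + n - 2) % n];
    const T& xm1 = x[(i + n - 1) % n];
    const T& xp1 = x[(i + 1) % n];
    xdot[i] = xm1 * (xp1 - xm2) - x[i] + F;
  }
}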

Fig. 2. Strong scaling

The scalability study was carried out on Theta at the Argonne Leadership Computing Facility. Theta is composed of 1.3 GHz Intel Xeon Phi 7230 SKU nodes with 64 cores each. Our goal was to achieve strong scaling on a single node with up to 64 MPI processes, with the focus on the covariance computation using Hessians and third-order derivative tensors. We use a forcing of \(F=4.4\) but increase the dimension to \(N=64\) for third-order derivatives and \(N=512\) for second-order derivatives. The time horizon is irrelevant to the scaling, since there is no parallelization in time. With the timestep set to 1, we obtained the strong-scaling results in Fig. 2. The black line serves as a reference for linear scaling. Our implementation scales nearly linearly up to 64 cores with both second- and third-order derivatives. As anticipated by our complexity analysis in Sect. 4, the covariance computation dominates the runtime with second-order derivatives, whereas with third-order derivatives the tensor accumulation and the covariance computation are much closer in cost.

Each KNL node has 64 cores, limiting the strong scaling to one row or projection of the derivative tensor. Our code is able to run beyond a single node, but the runtime cost of computing the derivative tensor becomes too high. In the future we will investigate low-rank approximations to compress the derivative tensor and decrease its computational cost [1].

We validate the approximation of the variance propagation in Fig. 3 with dimension \(N=10\) in a nonlinear regime with \(F=5\). The higher-order derivatives allow us to better capture the effects of nonlinearity in power systems [5]. The numerical aspects of this work will be the subject of future research.

Fig. 3. Approximation of the variance propagation from timestep 1,500 to 3,000 (h = 0.001) for \(N=10\), \(F=5\) using first-, second-, and third-order derivatives, compared with a fully converged Monte Carlo run with 1,000 samples.

7 Conclusion

This paper describes a distributed parallel framework for the method of moments backed by AD. The extension to third-order derivatives and the parallelization of the derivative and covariance accumulation are unprecedented at this scale and speed. The distribution of the Hessian and tensor is chosen such that both the AD and the covariance computation benefit from the parallelism. Nonetheless, with third-order derivatives the computational cost grows by a factor of \(\mathcal {O}\left( N^6 \right) \), N being the dimension, over the original simulation evaluation. This factor is reduced to \(\mathcal {O}\left( N^4 \right) \) when only the Hessian is used. In both cases we have shown scalability on the current Intel KNL architecture. As opposed to Monte Carlo, this algorithm provides an analytical propagation workflow that showed promising results in [5]. To scale beyond a single KNL node, future research will focus on exploiting the structure of the Hessian, tensor, and covariance matrix in order to apply sampling methods. This approach has shown promising results in machine learning [12], and we plan to integrate it into our software. While highly problem dependent, it has the potential to significantly reduce the complexity of the covariance computation. In particular, at higher dimensions, this method of propagating uncertainties may become a valuable alternative to Monte Carlo-based sampling methods.
