# MCMC for Bayesian Uncertainty Quantification from Time-Series Data


## Abstract

In computational neuroscience, Neural Population Models (NPMs) are mechanistic models that describe brain physiology in a range of different states. Within computational neuroscience there is growing interest in the inverse problem of inferring NPM parameters from recordings such as the electroencephalogram (EEG). Uncertainty quantification is essential in this application area in order to infer the mechanistic effect of interventions such as anaesthesia.

This paper presents software for Bayesian uncertainty quantification in the parameters of NPMs from approximately stationary data using Markov Chain Monte Carlo (MCMC). Modern MCMC methods require first-order (and in some cases higher-order) derivatives of the posterior density. The software presented offers two distinct methods of evaluating derivatives: finite differences, and exact derivatives obtained through Algorithmic Differentiation (AD). For AD, two different implementations are used: the open-source Stan Math Library and the commercially licensed dco/c++ tool distributed by NAG (Numerical Algorithms Group). The use of derivative information in MCMC sampling is demonstrated through a simple example, the noise-driven harmonic oscillator, and different methods for computing derivatives are compared. The software is written in a modular, object-oriented way, such that it can be extended to derivative-based MCMC for other scientific domains.

## Keywords

Uncertainty quantification · Algorithmic Differentiation · Computational neuroscience

## 1 Introduction

Bayesian uncertainty quantification is useful for calibrating physical models to observed data. As well as inferring the parameters that produce the best fit between a model and observed data, Bayesian methods can also identify the range of parameters that are consistent with observations and allow for prior beliefs to be incorporated into inferences. This means that not only can predictions be made but also their uncertainty can be quantified. In demography, Bayesian analysis is used to forecast the global population [7]. In defence systems, Bayesian analysis is used to track objects from radar signals [1]. In computational neuroscience, Bayesian analysis is used to compare different models of brain connectivity and to estimate physiological parameters in mechanistic models [13]. Many more examples can be found in the references of [2, 6, 15].

We focus here on problems which require the use of Markov Chain Monte Carlo (MCMC), a widely applicable methodology for generating samples approximately drawn from the posterior distribution of model parameters given observed data. MCMC is useful for problems where a parametric closed form solution for the posterior distribution cannot be found. MCMC became popular in the statistical community with the re-discovery of Gibbs sampling [26], and the development of the BUGS software [15]. More recently it has been found that methods which use derivatives of the posterior distribution with respect to model parameters, such as the Metropolis Adjusted Langevin Algorithm (MALA) and Hamiltonian Monte Carlo (HMC), tend to generate samples more efficiently than methods which do not require derivatives [12]. HMC is used in the popular Stan software [4]. From the perspective of a C++ programmer, the limitation of Stan is that it may take a significant investment of effort to get started: either the C++ code has to be translated into the Stan modelling language, or the C++ code has to be called from Stan, in which case it may be challenging to (efficiently) obtain the derivatives that Stan needs in order to sample efficiently.

The software that we present includes (i) our own implementation of a derivative-based MCMC sampler called simplified manifold MALA (smMALA), which can easily be used in conjunction with C++ codes for Bayesian data analysis, and (ii) Stan's MCMC sampler with derivatives computed using dco/c++, an industrial-standard tool for efficient derivative computation.

An alternative approach to the one presented in this paper would be simply to use Stan as a stand-alone tool, without the smMALA sampler and without dco/c++. Determining the most efficient MCMC sampler for a given problem is still an active area of research, but at least within computational neuroscience it has been found that smMALA performs better than HMC methods on certain problems [25]. Determining the most appropriate method for computing derivatives will depend on both the user and the problem at hand. In many applications Algorithmic Differentiation (AD) is needed for the reasons given in Sect. 2.3. The Stan Math Library includes an open-source AD tool. Commercial AD tools such as dco/c++ offer a richer set of features than open-source tools, and these features may be needed in order to optimize derivative computations. For example, computations done using the Eigen linear algebra library [10] can be differentiated either with the Stan Math Library or with dco/c++, but there are cases where dco/c++ computes derivatives more efficiently than the Stan Math Library [22]. The aim of the software we present is to offer a range of options that make it easy both to get started and to tune performance.

## 2 Methods for Spectral Time-Series Analysis

### 2.1 Whittle Likelihood

The software presented in this paper is targeted at differential equation models with a stable equilibrium point and stochastic input. We refer to such models as stable SDEs. The methods that are implemented assume that the system is operating in a regime where we can approximate the dynamics through linearization around the stable fixed point. If the time-series data is stationary this is a reasonable assumption. Note that the underlying model may be capable of operating in nonlinear regimes such as limit cycles or chaos in addition to approximately linear dynamics. However, parameter estimation using data in nonlinear regimes quickly becomes intractable; see Chapter 2 of [16]. The stability and linearity assumptions are commonly made in the computational neuroscience literature, see for example [18].

The general form of the model is a linear state-space SDE,

$$\begin{aligned} d\mathbf {X}(t) = A\, \mathbf {X}(t)\, dt + d\mathbf {P}(t). \end{aligned}$$(1)

The running example in this paper is the noise-driven harmonic oscillator,

$$\begin{aligned} \begin{pmatrix} dX_0 \\ dX_1 \end{pmatrix} = \begin{pmatrix} 0 &{} 1 \\ -\omega _0^2 &{} -2\zeta \omega _0 \end{pmatrix} \begin{pmatrix} X_0 \\ X_1 \end{pmatrix} dt + \begin{pmatrix} 0 \\ dW(t) \end{pmatrix}, \end{aligned}$$(2)

where *dW*(*t*) represents a white noise process with variance \(\sigma _{in}^2\). The observations are modelled as \(Y_k = X_0(k\cdot \varDelta t) + \epsilon _k\) with \(\epsilon _k \sim N(0, \sigma _{obs}^2)\).
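The software's own simulation code is not reproduced here; as a sketch, synthetic data of this form can be generated with an Euler–Maruyama discretization of the oscillator SDE. The scheme, initial condition, and function name below are our illustrative choices, not necessarily those used by the software.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Euler-Maruyama simulation of the noise-driven harmonic oscillator,
// followed by noisy observation of the displacement component X_0.
// The increment of W over one step has variance sigma_in^2 * dt.
std::vector<double> simulate(double omega0, double zeta, double sigma_in,
                             double sigma_obs, double dt, int n_steps,
                             unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> normal(0.0, 1.0);
    double x0 = 1.0, x1 = 0.0;  // illustrative initial displacement/velocity
    std::vector<double> y;
    y.reserve(n_steps);
    for (int k = 0; k < n_steps; ++k) {
        double dW = sigma_in * std::sqrt(dt) * normal(rng);
        double new_x0 = x0 + dt * x1;
        double new_x1 =
            x1 + dt * (-omega0 * omega0 * x0 - 2.0 * zeta * omega0 * x1) + dW;
        x0 = new_x0;
        x1 = new_x1;
        y.push_back(x0 + sigma_obs * normal(rng));  // Y_k = X_0(k*dt) + eps_k
    }
    return y;
}
```

With no noise the scheme reduces to a damped oscillator, so the trajectory should decay towards the stable fixed point at the origin (note that an explicit scheme like this requires `dt` small relative to `1/omega0` for stability).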

Our aim is to infer model parameters (\(\omega _0, \zeta , \sigma _{in}\)) from time-series data. This could be done in the time-domain using a Kalman filter, but it is often more computationally efficient to do inference in the frequency domain [3, 17]. In the case where we only have a single output (indexed by *i*) and a single input (indexed by *j*), we can compute a likelihood of the parameters \(\theta \) in the frequency domain through the following steps.

1. Compute the (left and right) eigendecomposition of *A*, such that
$$\begin{aligned} A \mathcal {R} = \varLambda \mathcal {R}, \quad \mathcal {L} A = \mathcal {L} \varLambda , \quad \mathcal {L} \mathcal {R} = \mathrm {diag}(c), \end{aligned}$$(3)
where \(\mathrm {diag}(c)\) is a diagonal matrix, such that \(c_i\) is the dot product of the *i*th left eigenvector with the *i*th right eigenvector.
2. Compute the *ij*th element of the transfer matrix for frequencies \(\omega _1,\ldots ,\omega _K\),
$$\begin{aligned} \mathcal {T}(\omega ) = \mathcal {R}\ \mathrm {diag}\bigg [ \frac{1}{c_k (i\omega - \lambda _k)} \bigg ]\ \mathcal {L}. \end{aligned}$$(4)
3. Evaluate the spectral density for component *i* of \(\mathbf {X}(t)\), \(f_{X_i}(\omega )\), and the spectral density for the observed time-series, \(f_{Y}(\omega )\),
$$\begin{aligned} f_{X_i}(\omega ) = |\mathcal {T}_{ij}(\omega )|^2\ f_{P_j}(\omega ), \end{aligned}$$(5)
$$\begin{aligned} f_{Y}(\omega ) = f_{X_i}(\omega ) + \sigma _{obs}^2 \varDelta t, \end{aligned}$$(6)
where \(f_{P_j}(\omega )\) is the spectral density for component *j* of \(\mathbf {P}(t)\).
4. Evaluate the Whittle likelihood,
$$\begin{aligned} p(y_0,\ldots ,y_{n-1}|\theta ) = p(S_0,\ldots ,S_{n-1}|\theta ) \approx \prod _{k=1}^{n/2-1} \frac{1}{f_Y(\omega _k)} \exp \left[ -\frac{S_k}{f_Y(\omega _k)}\right] , \end{aligned}$$(7)
where \(\{S_k\}\) is the Discrete Fourier Transform of \(\{y_k\}\). Note that \(\theta \) represents a parameter set (e.g. specific values of \(\omega _0, \zeta , \sigma _{in}\)) that determines the spectral density.
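As a sketch of step 4, the log of the Whittle likelihood (7) can be evaluated directly from the periodogram ordinates \(S_k\) and the model spectral density values \(f_Y(\omega _k)\). Computing and normalizing the periodogram is left to the caller, and the function name is our own.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Log of the Whittle likelihood (7): sum over k = 1, ..., n/2 - 1 of
// -log f_Y(omega_k) - S_k / f_Y(omega_k). The vectors are assumed to
// contain only those frequencies (DC and Nyquist terms excluded).
double whittle_loglik(const std::vector<double>& S,
                      const std::vector<double>& fY) {
    assert(S.size() == fY.size());
    double ll = 0.0;
    for (std::size_t k = 0; k < S.size(); ++k)
        ll += -std::log(fY[k]) - S[k] / fY[k];
    return ll;
}
```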

The matrix *A* that parameterizes a linear state-space model is typically non-symmetric, which means that eigenvectors and eigenvalues will be complex-valued. We use Eigen-AD [22], a fork of the linear algebra library Eigen [10]. Eigen is templated, which facilitates the application of AD by overloading tools, and Eigen-AD provides further optimizations for such tools. The operations above require an AD tool that supports differentiation of complex variables; AD of complex variables is considered in [24]. This support is currently available in the feature/0123-complex-var branch of the Stan Math Library and in dco/c++ from release 3.4.3.

### 2.2 Markov Chain Monte Carlo

The error in estimates obtained from MCMC is approximately \(C / \sqrt{N}\), where *N* is the number of MCMC iterations and *C* is some problem-dependent constant. In general it is not possible to demonstrate that MCMC has converged, but there are several diagnostics that can indicate non-convergence, see Section 11.4 of [6] for more detail. Briefly, there are two phases of MCMC sampling: burn-in and the stationary phase. Burn-in is finished when we are in the region of the parameter space containing the true parameters. In this paper we restrict ourselves to synthetic data examples. In this case it is straightforward to assess whether the sampler is burnt in by checking whether the true parameters used to simulate the data are contained in the credible intervals obtained from the generated samples. In real data applications, it is good practice to test MCMC sampling on a synthetic data problem that is analogous to the real data problem. During the stationary phase we assess convergence rate using the effective sample size, *N Eff*. If we were able to generate independent samples from the posterior then the constant *C* is \(\mathcal {O}(1)\). MCMC methods generate correlated samples, in which case *C* may be \(\gg 1\). A small *N Eff* (relative to *N*) indicates that this is the case.

The *N Eff* for the *i*th component of the parameter vector is equal to,

$$\begin{aligned} N_{\textit{Eff}} = \frac{S}{1 + 2 \sum _{k} \hat{\rho }_i(k)}, \end{aligned}$$(8)

where *S* is the number of samples obtained during the stationary period, and \(\hat{\rho }_i(k)\) is an estimate of the autocorrelation at lag *k* for the *i*th component of the samples. This expression can be derived from writing down the variance of the average of a correlated sequence (Chapter 11 of [6]). The key point to note is that if the autocorrelation decays slowly, *N Eff* will be relatively small.
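A minimal sketch of this computation is given below. The infinite autocorrelation sum has to be truncated in practice; here we stop at the first negative estimate, a common heuristic (production tools such as Stan's stansummary use more refined truncation rules and estimators).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Effective sample size of S stationary samples of one component:
// N_Eff = S / (1 + 2 * sum_k rho_hat(k)), with the sum truncated at
// the first negative autocorrelation estimate (illustrative heuristic).
double n_eff(const std::vector<double>& x) {
    const int S = static_cast<int>(x.size());
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= S;
    double var = 0.0;  // un-normalized; the factor cancels in rho
    for (double v : x) var += (v - mean) * (v - mean);
    double rho_sum = 0.0;
    for (int k = 1; k < S; ++k) {
        double c = 0.0;
        for (int t = 0; t + k < S; ++t)
            c += (x[t] - mean) * (x[t + k] - mean);
        double rho = c / var;
        if (rho < 0.0) break;  // truncate the autocorrelation sum
        rho_sum += rho;
    }
    return S / (1.0 + 2.0 * rho_sum);
}
```

For an anti-correlated sequence the truncated sum is zero and `n_eff` equals *S*; for a slowly varying (strongly autocorrelated) sequence it is much smaller than *S*.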

### 2.3 Derivative Computation

A simple approach to evaluating derivatives is the finite difference approximation,

$$\begin{aligned} \frac{\partial f}{\partial \theta _i} \approx \frac{f(\theta + h\, e_i) - f(\theta )}{h}, \end{aligned}$$(9)

where \(e_i\) is the *i*th Cartesian basis vector, and *h* is a user-defined step-size. For first-order derivatives the default value we use is \(\sqrt{\epsilon }|\theta _i|\), where \(\epsilon \) is machine epsilon for double types, i.e., if \(\theta _i\) is \(\mathcal {O}(1)\), \(h\approx 10^{-8}\). For second derivatives the default value is \(\epsilon ^{1/3}|\theta _i|\), so that \(h\approx 5\cdot 10^{-6}\). More details on the optimal step size in finite difference approximations can be found in [20]. First derivatives computed using finite differences require \(\mathcal {O}(N)\) function evaluations, where *N* is the number of input variables. Second derivatives require \(\mathcal {O}(N^2)\) function evaluations.
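The first-order scheme can be sketched as follows, using the step-size rule above; the floor on the step size for small \(|\theta _i|\) is our addition, to avoid a zero step when \(\theta _i = 0\).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <functional>
#include <limits>
#include <vector>

// Forward finite-difference gradient with per-coordinate step size
// h = sqrt(eps) * |theta_i| (floored at sqrt(eps), our addition).
// Costs one baseline evaluation plus N perturbed evaluations.
std::vector<double> fd_gradient(
    const std::function<double(const std::vector<double>&)>& f,
    std::vector<double> theta) {
    const double sqrt_eps =
        std::sqrt(std::numeric_limits<double>::epsilon());
    const double f0 = f(theta);  // baseline evaluation f(theta)
    std::vector<double> g(theta.size());
    for (std::size_t i = 0; i < theta.size(); ++i) {
        double h = sqrt_eps * std::max(std::abs(theta[i]), 1.0);
        double old = theta[i];
        theta[i] = old + h;             // perturb coordinate i
        g[i] = (f(theta) - f0) / h;     // forward difference
        theta[i] = old;                 // restore
    }
    return g;
}
```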

Algorithmic Differentiation, by contrast, evaluates derivatives that are exact up to floating-point error. For a function with *N* inputs and *M* outputs, tangent mode requires \(\mathcal {O}(N)\) function evaluations and adjoint mode requires \(\mathcal {O}(M)\) function evaluations. In statistical applications our output is often a scalar probability density, so adjoint AD will scale better with *N* than either tangent mode or finite differences in terms of total computation time. Adjoint AD can be implemented using the Stan Math Library [5].
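The Stan Math Library and dco/c++ internals are not reproduced here, but the tape mechanism that adjoint AD relies on can be illustrated with a minimal self-contained sketch (supporting only addition and multiplication; all names are ours, not the API of either tool). The key property is the single reverse sweep over the tape, which yields the adjoints of all inputs at once and gives adjoint mode its \(\mathcal {O}(M)\) cost.

```cpp
#include <cassert>
#include <vector>

// A tape node records its (at most two) parents and the local partial
// derivatives of the node's value with respect to each parent.
struct Tape {
    struct Node { int p[2]; double w[2]; };
    std::vector<Node> nodes;
    int add(int a, int b, double wa, double wb) {
        nodes.push_back({{a, b}, {wa, wb}});
        return static_cast<int>(nodes.size()) - 1;
    }
};

struct Var { Tape* t; int idx; double val; };

Var make_var(Tape& t, double v) { return {&t, t.add(-1, -1, 0, 0), v}; }
Var operator+(Var a, Var b) {            // d(a+b)/da = 1, d(a+b)/db = 1
    return {a.t, a.t->add(a.idx, b.idx, 1.0, 1.0), a.val + b.val};
}
Var operator*(Var a, Var b) {            // d(a*b)/da = b, d(a*b)/db = a
    return {a.t, a.t->add(a.idx, b.idx, b.val, a.val), a.val * b.val};
}

// One reverse sweep propagates the output adjoint back through the tape;
// the cost depends on tape length, not on the number of inputs.
std::vector<double> adjoints(const Tape& t, Var out) {
    std::vector<double> adj(t.nodes.size(), 0.0);
    adj[out.idx] = 1.0;
    for (int i = static_cast<int>(t.nodes.size()) - 1; i >= 0; --i)
        for (int k = 0; k < 2; ++k)
            if (t.nodes[i].p[k] >= 0)
                adj[t.nodes[i].p[k]] += t.nodes[i].w[k] * adj[i];
    return adj;
}
```

For \(f(x, y) = xy + x\) at \(x = 3, y = 4\), a single sweep returns both \(\partial f/\partial x = y + 1 = 5\) and \(\partial f/\partial y = x = 3\).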

dco/c++ provides its adjoint type through the \(\texttt {dco::ga1s<T>::type}\) typedef, where T is the corresponding primal type (e.g. double). Higher-order adjoints can be achieved by recursively nesting the adjoint type. Using this type, the program is first executed in the augmented primal run, where derivative information is stored in the global_tape data structure. The register_variable function initialises recording of the tape and facilitates dynamic activity analysis [11]. Derivative information recorded during the augmented primal run is then interpreted using the interpret_adjoint() function, and dco::derivative is used to access the propagated adjoints.

## 3 Results for Noise Driven Harmonic Oscillator

Table 1. Estimated quantiles of the posterior distribution for the noise-driven harmonic oscillator. Synthetic data was generated by simulating from the model with duration 20.0 and time-step 0.01. Two datasets (\(c_1\) and \(c_2\)) were generated, with different parameter values for \(\omega _0\) and \(\sigma _{in}\) in each dataset (values are given in the 'Actual' column). The quantiles are estimated from a sequence of 10,000 MCMC samples from the joint (posterior) distribution of all the parameters given all of the data. For example, 2.5% of the MCMC samples had \(\omega _0(c_1)\) values less than 77.3.

| Parameter | Actual | Quantile 0.025 | Quantile 0.50 | Quantile 0.975 |
|---|---|---|---|---|
| \(\omega _0(c_1)\) | 80 | 77.3 | 80.3 | 81.9 |
| \(\omega _0(c_2)\) | 40 | 36.8 | 38.8 | 41.3 |
| \(\sigma _{in}(c_1)\) | 100 | 92 | 101 | 111 |
| \(\sigma _{in}(c_2)\) | 10 | 9.81 | 10.8 | 12.4 |
| \(\zeta \) | 0.2 | 0.164 | 0.193 | 0.223 |

Table 2 uses the same model and datasets as above to compare smMALA with NUTS, and finite differences with AD. The AD implementation used in smMALA was dco/c++'s tangent-over-adjoint mode (i.e. dco::gt1s combined with dco::ga1s). In NUTS we used dco/c++'s tangent mode for computing the derivatives of the spectral density, and Stan's adjoint mode (stan::math::var) for the rest of the computation. Given the most recent developments in the feature/0123-complex-var branch of the Stan Math Library, it would likely now be possible to use Stan's adjoint mode for the whole computation; however, this was not the case when we started writing the code. In general, users may find that they need the more advanced functionality of dco/c++ for part of their computation in order to obtain good performance.

The MCMC samplers were each run for 1,000 iterations. The results were analyzed using Stan's stansummary command-line tool. To account for correlation between MCMC samples we use the *N Eff* diagnostic defined in Sect. 2.2 to measure sampling efficiency; min *N Eff* is the minimum over the 5 model parameters. Table 2 shows that, for the noise-driven harmonic oscillator, we can accelerate MCMC sampling by a factor of around 3–5 by using NUTS rather than smMALA. We also see that the *N Eff/s* is higher for finite differences than for AD, because of the small number of input variables. However, even for this simple model the NUTS min *N Eff* is higher for AD than for finite differences, suggesting that the extra accuracy in the derivatives results in more efficient sampling per MCMC iteration.

Accelerating sampling by a factor of 3–5 reduces the constant *C* in the Monte Carlo error (\(C / \sqrt{N}\)) by that factor. Another way of interpreting this is that we could reduce the number of MCMC iterations by a factor of 3–5 and expect to obtain the same level of accuracy in the estimated posterior distribution.

Table 2. Noise-driven harmonic oscillator benchmarking results.

| MCMC sampler | Derivative implementation | CPU time (s) | min *N Eff* | min *N Eff/s* |
|---|---|---|---|---|
| smMALA | Finite differences | 3.3 | 152 | 46 |
| smMALA | dco/c++ | 5.2 | 152 | 29 |
| NUTS | Finite differences | 3.2 | 485 | 153 |
| NUTS | Stan and dco/c++ | 3.9 | 506 | 130 |

## 4 Discussion

The user supplies a function that evaluates the posterior density of their model. The gradient and Hessian of the posterior density are implemented in the Computation class, which reduces the effort required to use NUTS or smMALA for other computational models. Classes derived from Stable_sde or from Computation can be instantiated with several different scalar types, including double, stan::math::var, and a number of dco/c++ types. This enables automatic evaluation of first and second derivatives. The gradient and Hessian functions need a template specialization to be defined in the base class for each different scalar type that is instantiated.
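The underlying pattern can be sketched as follows: model code is written once against a generic scalar type T, so that the same source can be instantiated with double for plain evaluation, or with an AD type (such as stan::math::var or a dco/c++ type) for derivative computation. The class and member names here are illustrative, not the actual interface of the software.

```cpp
#include <cassert>
#include <cmath>

// Scalar-generic model sketch: any type T with the usual arithmetic
// operators (double, or an overloading AD type) flows through unchanged.
template <typename T>
struct StableSdeSketch {
    T omega0, sigma_in;
    // Toy log-density in one parameter, up to an additive constant.
    T log_density(const T& theta) const {
        T r = theta - omega0;
        return -(r * r) / (sigma_in * sigma_in);
    }
};
```

Instantiating with double gives the plain posterior evaluation; swapping in an AD type records the same computation for differentiation, with no change to the model source.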

We conclude with some comments regarding the choice of MCMC sampler and the method for evaluating derivatives. Our software only implements derivative-based MCMC, as this tends to be more computationally efficient than other MCMC methods [12, 23]. Derivative-based MCMC samplers can be further subdivided into methods that use higher-order derivatives, such as smMALA (sometimes referred to as Riemannian), and methods that only require first-order derivatives (such as NUTS). Riemannian methods tend to be more robust to complexity in the posterior distribution, but first-order methods tend to be more computationally efficient for problems where the posterior geometry is relatively simple. We recommend using smMALA in the first instance, then NUTS as a method that may provide acceleration in MCMC sampling, in terms of effective samples per second, *N Eff/s*. Regarding the derivative method, finite differences often result in adequate performance for problems with moderate input dimension (e.g. 10–20), at least with smMALA. But for higher-dimensional problems (e.g. Partial Differential Equations or Monte Carlo simulation) we recommend accelerating derivative computation using adjoint AD [19, 21]. The software presented enables users to implement and benchmark all these alternatives, so that the most appropriate methods for a given problem can be chosen.

## References

- 1. Arulampalam, M.S., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Process. **50**(2), 174–188 (2002)
- 2. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006)
- 3. Bojak, I., Liley, D.: Modeling the effects of anesthesia on the electroencephalogram. Phys. Rev. E **71**(4), 041902 (2005)
- 4. Carpenter, B., et al.: Stan: a probabilistic programming language. J. Stat. Softw. **76**(1), 1–32 (2017)
- 5. Carpenter, B., Hoffman, M.D., Brubaker, M., Lee, D., Li, P., Betancourt, M.: The Stan Math Library: reverse-mode automatic differentiation in C++. arXiv preprint arXiv:1509.07164 (2015)
- 6. Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B.: Bayesian Data Analysis. CRC Press, Boca Raton (2013)
- 7. Gerland, P., et al.: World population stabilization unlikely this century. Science **346**(6206), 234–237 (2014)
- 8. Girolami, M., Calderhead, B.: Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. Roy. Stat. Soc.: Ser. B (Stat. Methodol.) **73**(2), 123–214 (2011)
- 9. Green, P.J., Łatuszyński, K., Pereyra, M., Robert, C.P.: Bayesian computation: a summary of the current state, and samples backwards and forwards. Stat. Comput. **25**(4), 835–862 (2015)
- 10. Guennebaud, G., Jacob, B., et al.: Eigen v3 (2010). http://eigen.tuxfamily.org
- 11. Hascoët, L., Naumann, U., Pascual, V.: "To be recorded" analysis in reverse-mode automatic differentiation. Future Gen. Comput. Syst. **21**(8), 1401–1417 (2005)
- 12. Hoffman, M.D., Gelman, A.: The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. **15**(1), 1593–1623 (2014)
- 13. Kiebel, S.J., Garrido, M.I., Moran, R.J., Friston, K.J.: Dynamic causal modelling for EEG and MEG. Cogn. Neurodyn. **2**(2), 121 (2008)
- 14. Leppkes, K., Lotz, J., Naumann, U.: Derivative code by overloading in C++ (dco/c++): introduction and summary of features. Technical report AIB-2016-08, RWTH Aachen University, September 2016. http://aib.informatik.rwth-aachen.de/2016/2016-08.pdf
- 15. Lunn, D., Jackson, C., Best, N., Thomas, A., Spiegelhalter, D.: The BUGS Book: A Practical Introduction to Bayesian Analysis. CRC Press, Boca Raton (2012)
- 16. Maybank, P.: Bayesian inference for stable differential equation models with applications in computational neuroscience. Ph.D. thesis, University of Reading (2019)
- 17. Maybank, P., Bojak, I., Everitt, R.G.: Fast approximate Bayesian inference for stable differential equation models. arXiv preprint arXiv:1706.00689 (2017)
- 18. Moran, R.J., Stephan, K.E., Seidenbecher, T., Pape, H.C., Dolan, R.J., Friston, K.J.: Dynamic causal models of steady-state responses. NeuroImage **44**(3), 796–811 (2009)
- 19. NAG: NAG algorithmic differentiation software. https://www.nag.com/content/algorithmic-differentiation-software. Accessed 27 Jan 2020
- 20. NAG: OptCorner: the price of derivatives - using finite differences. https://www.nag.co.uk/content/optcorner-price-derivatives-using-finite-differences. Accessed 27 Jan 2020
- 21. Naumann, U., du Toit, J.: Adjoint algorithmic differentiation tool support for typical numerical patterns in computational finance. J. Comput. Finan. **21**(4), 23–57 (2018)
- 22. Peltzer, P., Lotz, J., Naumann, U.: Eigen-AD: algorithmic differentiation of the Eigen library. arXiv preprint arXiv:1911.12604 (2019)
- 23. Penny, W., Sengupta, B.: Annealed importance sampling for neural mass models. PLoS Comput. Biol. **12**(3), e1004797 (2016)
- 24. Pusch, G., Bischof, C., Carle, A.: On automatic differentiation of codes with complex arithmetic with respect to real variables, September 1995
- 25. Sengupta, B., Friston, K.J., Penny, W.D.: Gradient-based MCMC samplers for dynamic causal modelling. NeuroImage **125**, 1107–1118 (2016)
- 26. Smith, A.F., Roberts, G.O.: Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J. Roy. Stat. Soc.: Ser. B (Methodol.) **55**(1), 3–23 (1993)