Efficient inference in state-space models through adaptive learning in online Monte Carlo expectation maximization

  • Donna Henderson
  • Gerton LunterEmail author
Open Access
Original paper


Expectation maximization (EM) is a technique for estimating maximum-likelihood parameters of a latent variable model given observed data by alternating between taking expectations of sufficient statistics, and maximizing the expected log likelihood. For situations where sufficient statistics are intractable, stochastic approximation EM (SAEM) is often used, which uses Monte Carlo techniques to approximate the expected log likelihood. Two common implementations of SAEM, Batch EM (BEM) and online EM (OEM), are parameterized by a “learning rate”, and their efficiency depend strongly on this parameter. We propose an extension to the OEM algorithm, termed Introspective Online Expectation Maximization (IOEM), which removes the need for specifying this parameter by adapting the learning rate to trends in the parameter updates. We show that our algorithm matches the efficiency of the optimal BEM and OEM algorithms in multiple models, and that the efficiency of IOEM can exceed that of BEM/OEM methods with optimal learning rates when the model has many parameters. Finally we use IOEM to fit two models to a financial time series. A Python implementation is available at


Stochastic approximation expectation maximization Sequential Monte Carlo Latent variable model Online estimation 

1 Introduction

Expectation Maximization (EM) is a general and widely used technique for estimating maximum likelihood parameters of latent variable models (Dempster et al. 1977). It involves iterating two steps: computing the expected log-likelihood marginalizing over the latent variable conditioned on parameters and data (the E step), and optimizing parameters to maximize this expected log-likelihood (the M step). In important special cases the E-step is analytically tractable; examples include linear systems with Gaussian noise (Shumway and Stoffer 1982) and finite-state hidden Markov models (Baum 1972). In general however, Monte Carlo techniques such as Stochastic EM (SEM; Celeux and Diebolt 1985; Celeux et al. 1995) and Monte Carlo EM (MCEM; Wei and Tanner 1990) are necessary to approximate the required integral. The stochastic nature of Monte Carlo techniques result in noisy parameter estimates, and to address this, methods such as Stochastic Approximation EM (SAEM; Nowlan 1991; Celeux and Diebolt 1992; Delyon et al. 1999) were developed that make smaller incremental updates parameterized by a learning rate \(\gamma \) or learning schedule \(\{\gamma _t\}\).

In this paper we focus on models where the latent variable has a longitudinal structure and follows a Markov model (see e.g. Lopes and Tsay 2010 for examples in financial econometrics). For such models, the required samples from the posterior distribution can be generated using Sequential Monte Carlo (SMC) techniques (see Doucet et al. 2001; Doucet and Johansen 2009 and references therein). In one approach, the Batch EM (BEM) algorithm processes a contiguous chunk of data to generate latent variable samples from the posterior, which are used in the M step to update parameters. An alternative approach is online EM (OEM; Mongillo and Denève 2008; Cappé 2009), in which parameters are continuously updated as data are processed. Analogous to SAEM, OEM algorithms have a parameter \(\gamma \) controlling the learning rate, an idea apparently first introduced in this context by Jordan and Jacobs (1993). Several recent papers have addressed related problems. For instance Yildirim et al. (2013) use a particle filter to implement an online EM algorithm for change point models (see also Fearnhead 2006; Fearnhead and Vasileiou 2009), which uses a pre-specified learning schedule (called “step-size sequence” in their work) to control convergence. Le Corff and Fort (2013) introduced a “block online” EM algorithm for hidden Markov models that combines online and batch ideas, controlling convergence through a block size sequence \(\tau _k\).

All these algorithms thus require choosing tuning parameters in the form of a batch size, block sequence, learning rate or a learning schedule. It turns out that this choice can strongly influences the performance of these algorithms. For instance, for BEM, very large batch sizes lead to inaccurate estimates because of slow convergence, whereas very small batch sizes lead to imprecise estimates due to the inherent stochasticity of the model within a small batch of observations. The optimal batch size in BEM or the optimal learning rate in OEM depends on the particularities of the model.

This raises the question of how to choose this tuning parameter. Several authors have proposed adaptive acceleration techniques for EM methods that obviate the need for choosing tuning parameters (Jamshidian and Jennrich 1993; Lange 1995; Varadhan and Roland 2008), but these methods require that the E-step is analytically tractable. In the context of (stochastic) gradient descent optimization (Bottou 2012), several influential adaptive algorithms have recently been proposed (Zeiler 2012; Kingma and Ba 2015; Mandt et al. 2016; Reddi et al. 2018) that have few or no tuning parameters. In principle, these methods can be used to find maximum likelihood parameters, but unless data is processed in batches, applying these methods to state-space models with a sequential structure is not straightforward. In addition, EM approaches enjoy several advantages over gradient descent methods, including automatic guarantees of parameter constraints and increased numerical stability (Xu and Jordan 1996; Cappé 2009; Kantas et al. 2009; Chitralekha et al. 2010).

Here we introduce a novel algorithm, termed Introspective Online EM (IOEM), which removes the need for setting the learning rate by estimating optimal parameter-specific learning rates from the data. This is particularly helpful when inferring parameters in a high dimensional model, since the optimal learning rate may differ between parameters. IOEM can be applied to inference in state-space models with observations \(Y_t\) and state variables \(X_t\) governed by transition probability function \(f(x_{t+1}|x_t,\theta )\) and observation probability function \(g(y_t|x_t,\theta )\), for which \(f(x_{t}|x_{t-1},\theta )g(y_t|x_t,\theta )\) belongs to an exponential family with sufficient statistic \(s(x_{t-1},x_t,y_t)\). Broadly, IOEM works by estimating both the precision and the accuracy of parameters in an online manner through weighted linear regression, and uses these estimates to tune the learning rate so as to improve both simultaneously.

The outline of this paper is as follows. Section 2 introduces BEM, OEM, and a simplified version of IOEM in the context of a one-parameter autoregressive state-space model. Section 3 introduces the complete IOEM algorithm required for inference in the full 3-parameter autoregressive model. Section 4 discusses simulation results of the algorithms for these two models. In addition we consider a 2-dimensional autoregressive model to show the benefit of the proposed algorithm when inferring many parameters, and we demonstrate desirable performance in the stochastic volatility model, an important case as it is nonlinear and hence relevant to actual applications of SAEM. In Sect. 5 we apply IOEM with the autoregressive and stochastic volatility models to a financial time series, and we end the paper with a brief discussion.

2 EM algorithms for a simplified autoregressive model

Here we review BEM (Dempster et al. 1977), OEM (Cappé 2009) and SMC (Doucet and Johansen 2009), and present the IOEM algorithm in a simplified context. This illustrates the main ideas behind IOEM before presenting the full algorithm in Sect. 3.

We consider a simple noisily-observed autoregressive model with one unknown parameter, equivalent to an ARMA(1,1) model. We observe the sequence of random variables \(Y_{1:t}:=\{Y_k\}_{k=1,\ldots ,t}\) that depend on the unobserved sequence \(X_{1:t}:=\{X_k\}_{k=1,\ldots ,t}\) as follows:
$$\begin{aligned} X_t&=a X_{t-1}+ \sigma _w W_t, \nonumber \\ Y_t&=X_t + \sigma _v V_t, \end{aligned}$$
where \(W_t\) and \(V_t\) are i.i.d. standard normal variates, \(a=0.95\) and \(\sigma _w^2=1\) are known parameters, and \(\sigma _v^2\) is unknown. Under this model, we have the following transition and emission densities:
$$\begin{aligned} f(x_t|x_{t-1})&= (2 \pi \sigma _w^2)^{-1/2} \exp \Big \{-\frac{(x_t - a x_{t-1})^2}{2 \sigma _w^2}\Big \}, \\ g(y_t|x_t)&= (2 \pi \sigma _v^2)^{-1/2} \exp \Big \{-\frac{(y_t- x_t)^2}{2 \sigma _v^2}\Big \}. \end{aligned}$$
We have chosen \(\sigma _v^2\) as the unknown parameter as it is the most straightforward to estimate, allowing us to introduce the idea of IOEM while avoiding certain complications that we address in Sect. 3. As f and g are members of the exponential family of distributions, the M step of EM can be done using sufficient statistics, and the E step amounts to calculating their expectation. In this model, the parameter \(\sigma _v^2\) has the sufficient statistic
$$\begin{aligned} S_t = \mathbb {E}_{X_{1:t} | Y_{1:t}, \theta }\left[ \frac{1}{t} \sum _{k=1}^{t}(Y_k-X_k)^2 \right] . \end{aligned}$$
The estimate of \(\sigma _v^2\) is obtained by setting \(\hat{\sigma }_{v,t}^2=\hat{S}_t\). More generally, for an unknown parameter \(\theta \), \(\hat{\theta }_t=\varLambda (\hat{S}_t)\) where \(\varLambda \) is a known function mapping sufficient statistics to parameter estimates.
To estimate \(S_t\), we use sequential Monte Carlo (SMC) to simulate particles \(X^{(i)}_{1:t}\) and their associated weights \(w(X^{(i)}_{1:t})\), \(i=1,\ldots ,N\), so that
$$\begin{aligned} \sum _{i=1}^N w(X^{(i)}_{1:t}) \delta _{X^{(i)}_{1:t}} \end{aligned}$$
approximates the distribution \(p(X_{1:t}|Y_{1:t},\theta )\). The standard MCEM approximation of \(p(X_{1:t}|Y_{1:t},\hat{\theta })\) would require storage of all observations \(Y_{1:t}\) and simulation of \(X^{(i)}_{1:t}\) each time \(\hat{\theta }\) is updated, and ideally an increasing Monte Carlo sample size as the parameter estimates near convergence. To avoid this, we employ SAEM (Celeux and Diebolt 1992) which effectively averages over previous parameter estimates as an alternative to generating a new Monte Carlo sample every time an estimate is updated, and hence is more suitable to online inference. This method as proposed in Cappé and Moulines (2009) approximates the expectation in (2) recursively.
The outline of the SMC with EM algorithm we consider in this paper is as follows (Doucet and Johansen 2009):

Here \(\mu (\cdot | \hat{ \theta }_{0})\) is the initial distribution for \(X_1\), ESS is the effective sample size defined as \([\sum _{i=1}^N w_t(X_{1:t}^{(i)})^{-2}]^{-1}\), \(w_0(\cdot )=1/N\), and \(X_{t}^{(i)}\) is shorthand for the \(t^{\mathrm{th}}\) coordinate of \(X_{1:t}^{(i)}\). In models with multiple unknown parameters, each parameter is updated in step 3 of the algorithm, however we will refer only to a single parameter \(\theta \) to keep the notation simple.

Throughout this paper we follow common practice in using the fixed-lag technique in order to reduce the mean square error between \(S_t\) and \(\hat{S}_t\) (Cappé and Moulines 2005; Cappé et al. 2007). We choose a lag \(\varDelta > 0\) and at time t, using particles \(X_{1:t}^{(i)}\) shaped by data \(Y_{1:t}\), we estimate the \(t-\varDelta ^{\text {th}}\) term of the summation in (2). We will use \(X_{1:t}^{(i)}(t-\varDelta )\) to denote the \(t-\varDelta ^{\mathrm{th}}\) coordinate of the particle \(X_{1:t}^{(i)}\), but we will continue to write \(X_{t}^{(i)}\) as a shorthand for \(X_{1:t}^{(i)}(t)\). (See Table 1 for an overview of notation used in this paper).

The fixed-lag technique involves making the approximation
$$\begin{aligned} S_t \approx \mathbb {E}_{X_{1:t}|Y_{1:t} , \theta }\left[ \frac{1}{t- \varDelta } \sum _{j=1}^{t-\varDelta } s(Y_j,X_j) \right] \approx \frac{1}{t- \varDelta } \sum _{j=1}^{t-\varDelta } \mathbb {E}_{X_{1:j+\varDelta }| Y_{1:j+\varDelta } , \hat{\theta } } \left[ s(Y_j,X_j) \right] \end{aligned}$$
where we assume that \(S_t\) can be written as
$$\begin{aligned} S_t&= \mathbb {E}_{X_{1:t}|Y_{1:t},\theta } \sum _{j=1}^t s(Y_j,X_j) \end{aligned}$$
This allows \(S_t\) to be updated in an online manner by computing the component-wise sufficient statistics
$$\begin{aligned} \tilde{s}_t&:= \mathbb {E}_{ X_{1:t}|Y_{1:t} , \theta } \left[ s( Y_{t-\varDelta }, X_{1:t}(t-\varDelta ) ) \right] \\&\approx \sum _i w_k(X_{1:t}^{(i)}) s(Y_{t-\varDelta }, X_{1:t}^{(i)}(t-\varDelta )), \end{aligned}$$
allowing \(\hat{S}_t\) to be updated as
$$\begin{aligned} \hat{S}_t = \gamma _t \cdot \tilde{s}_t + (1-\gamma _t) \cdot \hat{S}_{t-1}, \end{aligned}$$
with some learning schedule \(\gamma _t\); in (3) \(\gamma _t=1/(t-\varDelta )\). This approach is slightly different from that of Cappé and Moulines (2005); see Sect. 7.1 for a discussion.

Choosing a large value of \(\varDelta \) allows SMC to use many observations to improve the posterior distribution of \(X_{t-\varDelta }\). However the cost of a large \(\varDelta \) is an increased path degeneracy due to the resampling procedure, which increases the sample variance. The optimal choice for \(\varDelta \) balances the opposing influences of the forgetting rate of the model and the collapsing rate of the resampling process due to the divergence between the proposal distribution and the posterior distribution. For the examples in this paper we chose \(\varDelta = 20\) as recommended by Cappé and Moulines (2005), which seems to be a reasonable choice for our models.

There are various other techniques to improve on this basic SMC method, including improved resampling schemes (Douc and Cappé 2005; Olsson et al. 2008; Doucet and Johansen 2009; Cappé et al. 2007), and choosing better sampling distributions through lookahead strategies or resample-move procedures (Pitt and Shephard 1999; Lin et al. 2013; Doucet and Johansen 2009), which are not discussed further here. Instead, in the remainder of this paper, we focus on the process of updating the parameter estimates \(\hat{ \theta }_{t}\). The remainder of this section describes the options for step 3 of Algorithm 1.

2.1 Batch expectation maximization

Batch Expectation Maximization (BEM) processes the data in batches. Within a batch of size b, the parameter estimate stays constant (\(\hat{ \theta }_{t}=\hat{ \theta }_{t-1}\)) and the update to the sufficient statistic
$$\begin{aligned} \tilde{s}_{t}:=\sum _{i}w_t(X_{1:t}^{(i)}) \cdot (Y_{t-\varDelta }-X_{1:t}^{(i)}(t-\varDelta ))^2, \end{aligned}$$
is collected at each iteration t. At the end of the mth batch we have \(t=mb\), at which time
$$\begin{aligned} \hat{S}^{BEM}_{t}:=\frac{1}{b}\sum _{k=(m-1)b+1}^{mb}\tilde{s}_{k}, \end{aligned}$$
is our approximation of S, and \(\hat{\sigma }^{2}_{v,t}:=\hat{S}^{BEM}_{t}\).

The batch size determines the convergence behavior of the estimates. For a fixed computational cost, choosing b too small will result in noise-dominated estimates and low precision, whereas choosing b too large will result in precise but inaccurate estimates due to slow convergence.

2.2 Online expectation maximization

BEM only makes use of the collected evidence at the end of each batch, missing potential early opportunities for improving parameter estimates. OEM addresses this issue by updating the parameter estimate at every iteration. The approximation of S at time t is a running average of \(\{\tilde{s}_{k}\}_{k=\varDelta +1,\ldots ,t}\), weighted by a pre-specified learning schedule. The choice of learning schedule determines how quickly the algorithm “forgets” the earlier parameter estimates. In OEM at time t,
$$\begin{aligned} \hat{S}^{OEM}_{t}=\gamma _{t} \cdot \tilde{s}_{t} + (1-\gamma _{t}) \cdot \hat{S}^{OEM}_{t-1}, \end{aligned}$$
where \(\{\gamma _{t}\}_{t=1,2,\ldots }\) is the chosen learning schedule, typically of the form
$$\begin{aligned} \gamma _{t}=t^{-c} \end{aligned}$$
for a fixed choice of \(c\in (0.5,1]\) (Cappé 2009). Note that when using lag \(\varDelta \), \(\gamma _{t}=(t-\varDelta )^{-c}\) for \(t \ge \varDelta \). This update rule ensures that at time t, \(\hat{S}^{OEM}\) is a weighted sum of \(\{\tilde{s}_{k}\}_{k=\varDelta +1,\ldots ,t}\) where the term \({\tilde{s}}_k\) has weight
$$\begin{aligned} \eta _k^t := \gamma _{k} (1-\gamma _{k+1}) \cdots (1-\gamma _{t-1})(1-\gamma _{t}). \end{aligned}$$

Although this method can outperform BEM as parameters are updated continuously, its performance remains strongly dependent on the parameter c determining the learning schedule \(\gamma _t\), and a suboptimal choice can reduce performance by orders of magnitude. At one extreme, the estimates will depend strongly only on the most recent data, resulting in noisy parameter estimates and low precision. At the other extreme, the estimates will average out stochastic effects but be severely affected by false initial estimates, resulting in more precise but less accurate estimates. Again, the best choice depends on the model.

A pragmatic approach to the problem of choosing a tuning parameter in OEM takes inspiration from Polyak (1990). In this method, a learning schedule that emphasizes incoming data is used to ensure quick initial convergence, while imprecise estimates are avoided at later iterations by averaging all OEM estimates beyond a threshold \(t_0\).
$$\begin{aligned} \hat{\theta }^{AVG}_t = {\left\{ \begin{array}{ll} \hat{\theta }^{OEM}_t &{} \quad \text {for } t < t_0 \\ \frac{1}{t-t_0+1} \sum _{k=t_0}^t \hat{\theta }^{OEM}_k &{} \quad \text {for } t \ge t_0. \end{array}\right. } \end{aligned}$$
Choosing an appropriate threshold \(t_0\) can be more straightforward than choosing c for \(\gamma _t = t^{-c}\), but it still requires the user to have an intuition for how the estimates for each parameter will behave. We will refer to this method as AVG, use \(c=0.6\), and set \(t_0=50{,}000\) which is half the total iterations for our examples.

2.3 Introspective online expectation maximization

We now introduce IOEM to address the issue of having to pre-specify a learning schedule \(\{\gamma _t\}_{t=1,\ldots }\). The algorithm is similar to OEM, but instead of pre-specifying \(\gamma _{t}\), we estimate the precision and accuracy in the sufficient statistic updates \(\{\tilde{s}_{k}\}_{k=\varDelta +1,\ldots ,t}\) and use these to determine \(\gamma _{t+1}\). More precisely, we keep online estimates of a weighted regression on the dependent variables \(\{\tilde{s}_{k}\}_{k=\varDelta +1,\ldots ,t}\) where \(k-t\) serves as the (shifted) explanatory variable:
$$\begin{aligned} \tilde{s}_k = \beta _0 + \beta _1(k-t) + \epsilon _k \end{aligned}$$
where \(\epsilon _k\sim N(0,\sigma ^2)\), and data point \((k-t,\tilde{s}_{k})\) has weight (6) as before. This weighted regression results in intercept and slope estimates \(\hat{\beta }_0\), \(\hat{\beta }_1\) and variance estimates \(\hat{\sigma }_0^2\), \(\hat{\sigma }_1^2\), where at convergence \(\hat{\beta }_0\) is the sought-after estimate and \(\hat{\beta }_1\simeq 0\). We do not use standard weighted regression, in which weights are inversely proportional to the variance of the observation, as this assumption is not justified here and would lead to biased estimates of \(\hat{\sigma }^2_{0,1}\). Instead we assume that observations share an unknown variance, and we use the weights to modulate the influence of each observation to the regression estimates, to reduce the impact of the bias in earlier observations; see Sect. 7.2 for details.
We next use the regression coefficients to estimate the past iteration where the drift term \(|\hat{\beta }_1|(k-t)\) is of the same order as the uncertainty \(\hat{\sigma }_0\) in the main estimate \(\hat{\beta }_0\):
$$\begin{aligned} t - k = \alpha {\hat{\sigma }_0 \over |\hat{\beta }_1| + \hat{\sigma }_1}, \end{aligned}$$
where \(\hat{\sigma }_1\) ensures that division by zero does not occur, and \(\alpha \) tunes the algorithms’s sensitivity to model misfit due to underlying parameter changes; we use \(\alpha =1\) unless stated otherwise. We propose a learning rate \(\gamma ^{reg}_{t+1}\) that results in a characteristic forgetting time \(1/\gamma ^{reg}_{t+1}\) matching this distance:
$$\begin{aligned} \gamma _{t+1}^{reg}=\frac{ |\hat{\beta }_1| + \hat{\sigma }_1 }{ \alpha \hat{\sigma }_0 }. \end{aligned}$$
This choice ensures that a substantial slope estimate \(|\hat{\beta }_1|\) indicating that \(\hat{\beta }_0\) has low accuracy puts large weight on the incoming statistic, improving accuracy, whereas a large \(\hat{\sigma }_0\) reflecting low precision in estimate \(\hat{\beta }_0\) results in a small weight, smoothing out successive estimates and improving precision. We impose restrictions on \(\gamma _{t+1}\) which keep it between the most extreme valid learning schedules for OEM. Taken together, the update step for \(\gamma \) becomes
$$\begin{aligned} \gamma _{t+1} = \min \left( (t+1)^{-c}, \max \left( \gamma _{t+1}^{reg}, \gamma _t / (1+\gamma _t) \right) \right) \end{aligned}$$
where \(c>0.5\) is chosen to be very close to 0.5 and guarantees convergence. These restrictions ensure that our algorithm satisfies the assumptions of Theorem 1 of Cappé and Moulines (2009), namely that \(0< \gamma _t < 1\), \(\sum _{t=1}^\infty \gamma _t = \infty \), and \(\sum _{t=1}^{\infty } \gamma _t^2 < \infty \). Hence for any model for which f and g satisfy the assumptions guaranteeing convergence of the standard OEM estimator, the IOEM algorithm is also guaranteed to converge. The precise conditions are detailed in Assumption 1, Assumption 2, and Theorem 1 of Cappé and Moulines (2009).

3 The IOEM algorithm for the full autoregressive model

The adapting learning schedule \(\{\gamma _t\}_{t=1,\ldots }\) sets IOEM apart from OEM. However, the way \(\gamma _t\) is calculated in Algorithm 3 only works in the special case that a single sufficient statistic and the single parameter of interest coincide (here, \(\hat{\sigma }_{v,t}^2=\hat{S}_t\)). In general, the sufficient statistics \(\hat{S}\) are mapped to parameter estimates \(\hat{\theta }\) by a function \(\varLambda \), leading to a more involved setup that we explore here. To this end, we now consider the full noisily-observed autoregressive model AR(1) with master equations as in (1), but now with unknown parameters a, \(\sigma _w\), and \(\sigma _v\). We define four sufficient statistics,
$$\begin{aligned} S_{1,t}&=\mathbb {E}_{X_{1:t} | Y_{1:t}, \theta }\left[ \frac{1}{t-1}\sum _{k=1}^{t-1}X_k^2\right] ,\\ S_{2,t}&=\mathbb {E}_{X_{1:t} | Y_{1:t}, \theta }\left[ \frac{1}{t-1}\sum _{k=1}^{t-1}X_k \cdot X_{k+1}\right] ,\\ S_{3,t}&=\mathbb {E}_{X_{1:t} | Y_{1:t}, \theta }\left[ \frac{1}{t-1}\sum _{k=2}^{t}X_k^2\right] ,\\ S_{4,t}&=\mathbb {E}_{X_{1:t} | Y_{1:t}, \theta }\left[ \frac{1}{t}\sum _{k=1}^{t}(Y_k-X_k)^2\right] . \end{aligned}$$
Then, in BEM and OEM, we update the parameter estimates to
$$\begin{aligned} \hat{a}_{t}&=\hat{S}_{2,t}/ \hat{S}_{1,t} , \end{aligned}$$
$$\begin{aligned} \hat{\sigma }_{w,t}&=(\hat{S}_{3,t}-(\hat{S}_{2,t})^2/\hat{S}_{1,t})^{1/2} ,\end{aligned}$$
$$\begin{aligned} \hat{\sigma }_{v,t}&=(\hat{S}_{4,t})^{1/2} , \end{aligned}$$
where \(\hat{S}_t\) is an approximation of \(S_t\).

In most cases, as above, the function \(\varLambda \) mapping \(\hat{S}_{t}\) to \(\hat{\theta }_{t}\) is nonlinear, and requires multiple sufficient statistics as input. To avoid bias, we want all sufficient statistics that inform one parameter estimate to share a learning schedule \(\{\gamma _{t}\}_{t=1,2,\ldots }\). We therefore estimate an adapting learning schedule for each parameter independently, by performing the regression on the level of the parameter estimates (Algorithm 4), rather than on the level of the sufficient statistics. We will calculate \(\hat{S}_t\) as in OEM (4) using our adapting learning schedule instead of a user specified learning schedule. Because the adapting learning schedule is specific to each parameter, we will have multiple estimates of certain summary sufficient statistics. In this case \(S_{1,t}\) and \(S_{2,t}\) are estimated by \(\hat{S}_{1,t}^{a}\) and \(\hat{S}_{2,t}^{a}\) for (11) and by \(\hat{S}_{1,t}^{\sigma _w}\) and \(\hat{S}_{2,t}^{\sigma _w}\) for (12).

Simply regressing on \(\hat{\theta }_{1:t}\) with respect to t would correspond to regression on \(\hat{S}_{1:t}\), not \(\tilde{s}_{1:t}\). As \(\hat{S}\) is a running average, there is a strong correlation between \(\hat{S}_{t-1}\) and \(\hat{S}_t\) and hence also a strong dependence between \(\hat{\theta }_{t-1}\) and \(\hat{\theta }_t\). In order to perform the regression on the parameters we must “unsmooth” \(\hat{\theta }_{1:t}\) to create pseudo-independent parameter updates \(\tilde{\theta }_t\) (see Algorithm 4). This is accomplished by taking linear combinations,
$$\begin{aligned} \tilde{\theta }_{t}:=\frac{1}{\gamma _{t}} \cdot \hat{\theta }_{t} + \left( 1-\frac{1}{\gamma _{t}}\right) \cdot \hat{\theta }_{t-1}, \end{aligned}$$
where the coefficients are chosen so as to minimize the covariance between successive updates, justifying the term pseudo-independent. The resulting updates correspond with the unsmoothed sufficient statistics updates \(\tilde{s}_{t}\) used in Sect. 2.3. See Sect. 7.3 for further details on this step.

4 Simulations

We performed inference on different models using the BEM, OEM and IOEM algorithms as described above. For BEM we used batch sizes from 100 to 10, 000, and for OEM we used learning schedules \(\gamma _t=t^{-c}\) with c ranging from 0.6 to 0.9. In all cases the bootstrap filter was run with \(N=100\) particles, and the algorithm was run from \(t=1\) to \(t=100{,}000\). For all parameter choices, 100 independent replicates were generated, and we show the distribution of inferred parameter values across these replicates.

4.1 Inference with the simplified IOEM algorithm

We first applied the simplified IOEM algorithm (Algorithm 3) to the problem of inferring \(\sigma _v^2\) in model (1), with all other parameters assumed known, and compared the results with the BEM and OEM algorithms (Fig. 1). The choice of tuning parameter in BEM and OEM makes a significant difference to the precision of the estimate even after 100,000 observations. IOEM was able to recognize that behavior similar to BEM with \(b=10{,}000\) or OEM with \(c=0.9\) was optimal. The accuracy and precision of IOEM are comparable with those of the post-OEM averaging technique (AVG) with parameters \(c=0.6\) and \(t_0=50{,}000\).
Fig. 1

Comparison of EM methods on simplified AR model with known true parameters \(a=.95\), \(\sigma _w=1\), and unknown true \(\sigma _v^2=30\), and initial parameter estimate \(\sigma _{v,0}^2=20\). \(\hat{\sigma }_{v,100k}^2\) is plotted for 100 replicates, \(N=100\)

4.2 Inference with the complete IOEM algorithm

We next treated all four parameters of the AR(1) model (1) as unknown, and inferred them using the full IOEM algorithm (Algorithm 4). Estimates for the a parameter under different EM methods are presented in Fig. 2; for the other parameter inferences see Sect. 7.5, Fig. 6.
Fig. 2

Comparison of EM methods on full autoregressive model with unknown true parameters \(a=0.95\), \(\sigma _{w}=1\), \(\sigma _{v}=5.5\) and initial parameters \(a_{0}=0.8\), \(\sigma _{w,0}=3\), \(\sigma _{v,0}=1\). \(\hat{a}_t\) at \(t=100{,}000\) is plotted for 100 replicates, \(N=100\)

In the AR(1) model, IOEM outperforms most other EM methods when estimating the a parameter, while AVG for the chosen parameter settings (\(c=0.6\), \(t_0=50{,}000\)) provides slightly more precise estimates at similar accuracy. It is worth noting that in this case, OEM with \(c=0.6\) substantially outperforms OEM with \(c=0.9\), in contrast to the results shown in Fig. 1. This is a result of the bad initial estimates. OEM with \(c=0.6\) forgets the earlier simulations much faster than OEM with \(c=0.9\) and hence is able to move its estimates of a, \(\sigma _w\), and \(\sigma _v\) much more quickly. Here IOEM recognizes that it should have similar behavior to OEM with \(c=0.6\), whereas in the inference displayed in Fig. 1 IOEM chose behavior similar to OEM with \(c=0.9\). IOEM can indeed adapt to the model.

4.3 Inference of multiple parameters

Next we investigated a model with a larger number of parameters and varying accuracy of initial parameter estimates. One of the advantages of the IOEM algorithm over OEM is its ability to adapt to each parameter independently. To highlight this, we applied IOEM to a simple 2-dimensional autoregressive model. For this model we consider the sequences \(\{Y^A,Y^B\}_{1:t}\) as observed, while \(\{X^A,X^B\}_{1:t}\) are unobserved, where
$$\begin{aligned} X^{A}_{t}&=a^{A} X^{A}_{t-1}+ \sigma ^{A}_{w} W^{A}_{t},&X^{B}_{t}&=a^{B} X^{B}_{t-1}+ \sigma ^{B}_{w} W^{B}_{t}, \nonumber \\ Y^{A}_{t}&=X^{A}_{t} + \sigma _{v} V^{A}_{t},&Y^{B}_{t}&=X^{B}_{t} + \sigma _{v} V^{B}_{t}. \end{aligned}$$
Note that \(Y^A\) and \(Y^B\) are uncoupled, and that their master equation have independent parameters except for a shared parameter \(\sigma _v\). By giving component A good initial estimates and B bad initial estimates, we can see how the different EM methods cope with a combination of accurate and inaccurate initializations. IOEM is able to identify the set with good initial estimates (\(a^A,\sigma ^{A}_{w}\)) and quickly start smoothing out noise. To IOEM, the other parameters appear to not have converged (\(\sigma ^{B}_{w}\) and \(\sigma _{v}\) because they are at the wrong value, \(a^B\) because it will be changing to compensate for \(\sigma ^{B}_{w}\) and \(\sigma _{v}\)).
Figure 3 shows the inference of \(\sigma _v\), which due to its dependence on components A and B, suffers the most from a blanket choice of tuning parameter in BEM or OEM. OEM with \(c=0.6\) and OEM with \(c=0.9\) both suffer in this model as they are both well suited to parameter estimation in one of the components, but not the other. AVG provides precise but biased estimates in this case, because of its reliance on a fast-forgetting initial OEM stage which again is suited to only one of the model components. IOEM on the other hand is able to capture the best of both worlds, striving for precision in component A and initially foregoing precision in favour of accuracy in component B.
Fig. 3

Comparison of EM methods on 2-dimensional autoregressive model with true parameters \(a^A=0.95\), \(\sigma ^{A}_{w}=1\), \(\sigma _{v}=5.5\), \(a^B=0.95\), \(\sigma ^{B}_{w}=1\) and initial parameters \(a^{A}_{0}=0.95\), \(\sigma ^{A}_{w,0}=1\), \(\sigma _{v,0}=3\), \(a^{B}_{0}=0.95\), \(\sigma ^{B}_{w,0}=3\). \(\hat{\sigma }_{v,t}\) at \(t=100{,}000\) is plotted for 100 replicates, \(N=100\)

The inference of the other parameters and comparisons with a different choice of AVG threshold are shown in Sect. 7.5, Figs. 7, 8, 9, 10.

4.4 Inference of parameters of a stochastic volatility model

The previous sections have demonstrated IOEM is comparable to choosing the optimal tuning parameter in OEM or BEM in certain models. However, the models shown have all been based on the noisily observed autoregressive model, which is a linear Gaussian case where in practice analytic techniques would be preferred over SAEM. We now examine the behaviour of these algorithms when inferring the parameters of a non-linear stochastic volatility model defined by transition and emission densities
$$\begin{aligned} f(x_t|x_{t-1})&= (2 \pi \sigma ^2)^{-1/2} \exp \Big \{-\frac{(x_t - \phi x_{t-1})^2}{2 \sigma ^2}\Big \}, \end{aligned}$$
$$\begin{aligned} g(y_t|x_t)&= (2 \pi \beta ^2 e^{x_t} )^{-1/2} \exp \Big \{-\frac{ 1 }{ 2 \beta ^2 e^{x_t} } y^2_t \Big \}. \end{aligned}$$
We define four summary sufficient statistics,
$$\begin{aligned} S_{1,t}&=\mathbb {E}_{X_{1:t} | Y_{1:t}, \theta }\left[ \frac{1}{t-1}\sum _{k=1}^{t-1}X_k \cdot X_{k+1}\right] ,\\ S_{2,t}&=\mathbb {E}_{X_{1:t} | Y_{1:t}, \theta }\left[ \frac{1}{t-1}\sum _{k=1}^{t-1}X_k^2\right] ,\\ S_{3,t}&=\mathbb {E}_{X_{1:t} | Y_{1:t}, \theta }\left[ \frac{1}{t-1}\sum _{k=2}^{t}X_k^2\right] ,\\ S_{4,t}&=\mathbb {E}_{X_{1:t} | Y_{1:t}, \theta }\left[ \frac{1}{t}\sum _{k=1}^{t} e^{-X_k} \cdot Y_k^2 \right] . \end{aligned}$$
Then the set of parameters that maximises the likelihood at step t are
$$\begin{aligned} \hat{\phi }_{t}&=\hat{S}_{1,t}/ \hat{S}_{2,t} , \end{aligned}$$
$$\begin{aligned} \hat{\sigma }_{t}&=(\hat{S}_{3,t}-(\hat{S}_{1,t})^2/ \hat{S}_{2,t})^{1/2} , \end{aligned}$$
$$\begin{aligned} \hat{\beta }_{t}&=(\hat{S}_{4,t})^{1/2} , \end{aligned}$$
Again IOEM results in similar estimates to the optimal BEM/OEM and the online averaging technique with a well-chosen threshold (see Fig. 4 and Sect. 7.5, Fig. 11).
Fig. 4

Estimates of \(\phi \) in stochastic volatility model

5 Application to financial time series

We next applied our approach to daily log returns for US dollar to UK pound exchange rates, obtained from Between 18/05/2010 to 2/3/2016, roughly the period between the 2010 flash crash and the Brexit referendum, rates were fairly stable and might be described by an ARMA(1,1) model equivalent to (1). To assess confidence in estimates, we independently inferred model parameters 24 times from day-on-day log returns measured at every full hour (Fig. 5). We note that these time series are not fully comparable due to intraday seasonalities (Cornett et al. 1995), an effect that may be expected to increase the observed variation between the 24 time series, which would lead to conservative confidence estimates. Results suggest a weak negative correlation of successive daily log returns (\(a<0\)), which is supported by a direct fit of an ARMA(1,1) model to the data (Fig. 12). Although the ARMA(1,1) model assumes fixed parameters and in particular constant volatility, running inferences strongly indicate volatility variations (Figs. 5 and 13), suggesting model (16) might be appropriate. Inferred values of \(\phi \) indicate substantial day-to-day inertia in volatility. Running estimates of parameters are fairly constant in time, although those for \(\beta \) show that the model has difficulty tracking the two sudden drops in volatility that occurred in this period, indicating model misfit.
Fig. 5

Running estimates of parameters of model (1) (top) and model (16) (bottom) on time series of daily GBP/USD log returns (top left panel) based on 24 different hourly offsets (colours) using IOEM (\(N=5000\)). We used \(\alpha =2\) to reduce the impact of underlying parameter changes; for results with \(\alpha =1\) see Figs. 15 and 16 (color figure online)

6 Conclusion

Stochastic Approximation EM is a general and effective technique for estimating parameters in the context of SMC. However, convergence can be slow, and improving convergence speed is of particular interest in this setting. We have shown that IOEM produces accurate and precise parameter estimates when applied to continuous state-space models. Across models, and across varying levels of accuracy of the initial estimates, the efficiency of IOEM matches that of BEM/OEM with the optimal choice of tuning parameter. The AVG procedure also shows good behaviour, but like BEM/OEM it has tuning parameters, and when these are chosen suboptimally performance is not as good as IOEM (Figs. 9 and 10). BEM/OEM/AVG all make use of a single learning schedule \(\{\gamma _{t}\}\), and for more complex models a single learning schedule generally cannot achieve optimal convergence rates for all parameters, as we have shown for the 2-dimensional AR example. In addition, AVG works by post-hoc averaging of noisy estimates, and since the inferences depend on the noisy estimates themselves, this implicitly relies on the model being sufficiently linear around the true parameter value. We expect IOEM to be more resilient to strong nonlinearities than AVG, but we have not explored this idea further here.

IOEM finds parameter-specific learning schedules, resulting in better performance than standard methods with a single learning rate parameter are able to achieve. IOEM can be applied with minimal prior knowledge of the model’s behavior, and requires no user supervision, while retaining the convergence guarantees of BEM/OEM, therefore providing an efficient, practical approach to parameter estimation in SMC methods. While not the focus of this paper, application to a financial time series suggests that IOEM may be useful in informally assessing model fit; it would be interesting to investigate whether this could be made rigorous.



This work was supported by Wellcome Trust grants 090532/Z/09/Z and 102423/Z/13/Z, and Medical Research Council Strategic Alliance Funding MC_UU_12025.


  1. Baum LE (1972) An equality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3:1–8Google Scholar
  2. Bottou L (2012) Stochastic gradient descent tricks. In: Montavon G, Orr GB, Müller KR (eds) Neural networks: tricks of the trade. Lecture notes in computer science, , vol 7700, pp 421–436Google Scholar
  3. Cappé O (2009) Online sequential Monte Carlo EM algorithm. In: IEEE/SP 15th Workshop on statistical signal processing, 2009. SSP’09, pp 37–40. IEEEGoogle Scholar
  4. Cappé O, Godsill SJ, Moulines E (2007) An overview of existing methods and recent advances in sequential monte carlo. Proc IEEE 95(5):899–924CrossRefGoogle Scholar
  5. Cappé O, Moulines E (2005) On the use of particle filtering for maximum likelihood parameter estimation. In: 13th European signal processing conference, pp 1–4. IEEEGoogle Scholar
  6. Cappé O, Moulines E (2009) On-line expectation–maximization algorithm for latent data models. J R Stat Soc Ser B 71(3):593–613MathSciNetCrossRefGoogle Scholar
  7. Celeux G, Chaveaux D, Diebolt J (1995) On stochastic versions of the EM algorithm. Technical Report 2514, INRIAGoogle Scholar
  8. Celeux G, Diebolt J (1985) The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput Stat Q 2:73–82Google Scholar
  9. Celeux G, Diebolt J (1992) A stochastic approximation type EM algorithm for the mixture problem. Stoch Stoch Rep 2(41):119–132MathSciNetCrossRefGoogle Scholar
  10. Chitralekha SB, Prakash J, Raghavan H, Gopaluni R, Shah SL (2010) A comparison of simultaneous state and parameter estimation schemes for a continuous fermentor reactor. J Process Control 20(8):934–943CrossRefGoogle Scholar
  11. Cornett MM, Schwarz TV, Szakmary AC (1995) Seasonalities and intraday return patterns in the foreign currency futures market. J Bank Finance 19:843–869CrossRefGoogle Scholar
  12. Delyon B, Lavielle M, Moulines E (1999) Convergence of a stochastic approximation version of the EM algorithm. Ann Stat 27(1):94–128MathSciNetCrossRefGoogle Scholar
  13. Dempster AP, Laird NM, Rubin DB (1977) J R Stat Soc Ser B. Maximum likelihood from incomplete data via the EM algorithm, Soc., pp 1–38 Google Scholar
  14. Douc R, Cappé O (2005) Comparison of resampling schemes for particle filtering. In: Proceedings of the 4th international symposium on image and signal processing and analysis, pp 64–69. IEEEGoogle Scholar
  15. Doucet A, de Freitas N, Gordon N (eds) (2001) Sequential Monte Carlo methods in practice. SpringerGoogle Scholar
  16. Doucet A, Johansen AM (2009) A tutorial on particle filtering and smoothing: fifteen years later. Handb Nonlinear Filter 12:656–704zbMATHGoogle Scholar
  17. Fearnhead P (2006) Efficient and exact bayesian inference for multiple changepoint problems. Stat Comput 16:203–213MathSciNetCrossRefGoogle Scholar
  18. Fearnhead P, Vasileiou D (2009) Bayesian analysis of isochores. J Am Stat Assoc 104:132–141MathSciNetCrossRefGoogle Scholar
  19. Jamshidian M, Jennrich RI (1993) Acceleration of the EM algorithm by using Quasi-Newton methods. J Am Stat Assoc 88(421):221–228zbMATHGoogle Scholar
  20. Jordan MI, Jacobs RA (1993) Hierarchical mixtures of experts and the EM algorithm. In: Proceedings of the 1993 international joint conference on neural networks, pp 1339–1344Google Scholar
  21. Kantas N, Doucet A, Singh SS, Maciejowski JM (2009) An overview of sequential Monte Carlo methods for parameter estimation in general state-space models. In: 15th IFAC symposium on system identification (SYSID), vol 102, p 117CrossRefGoogle Scholar
  22. Kaufman P (1995) Smarter trading: improving performance in changing markets. McGraw-Hill, New YorkGoogle Scholar
  23. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: 3rd International conference on learning representations, ICLR 2015Google Scholar
  24. Lange K (1995) A quasi Newton acceleration of the EM algorithm. Stat Sin 5:1–18MathSciNetzbMATHGoogle Scholar
  25. Le Corff S, Fort G (2013) Online expectation maximization based algorithms for inference in hidden Markov models. Electron J Stat 7:763–792MathSciNetCrossRefGoogle Scholar
  26. Lin M, Chen R, Liu JS et al (2013) Lookahead strategies for sequential Monte Carlo. Stat Sci 28(1):69–94MathSciNetCrossRefGoogle Scholar
  27. Lopes HF, Tsay RS (2010) Particle filters and bayesian inference in financial econometrics. J Forecast 30(1):168–209 MathSciNetCrossRefGoogle Scholar
  28. Mandt S, Hoffman MD, Blei DM (2016) A variational analysis of stochastic gradient algorithms. In: Proceedings of the 33rd international conference on machine learning, vol 48Google Scholar
  29. Mongillo G, Denève S (2008) Online learning with hidden Markov models. Neural Comput 20(7):1706–1716MathSciNetCrossRefGoogle Scholar
  30. Nowlan S (1991) Soft competitive adaptation: neural network learning algorithms based on fitting statistical mixtures, Ph.D. thesis, School of Computer Science. Cargegie Mellon UniversityGoogle Scholar
  31. Olsson J, Cappé O, Douc R, Moulines E et al (2008) Sequential monte carlo smoothing with application to parameter estimation in nonlinear state space models. Bernoulli 14(1):155–179MathSciNetCrossRefGoogle Scholar
  32. Pitt MK, Shephard N (1999) Filtering via simulation: auxiliary particle filters. J Am Stat Assoc 94(446):590–599MathSciNetCrossRefGoogle Scholar
  33. Polyak BT (1990) A new method of stochastic approximation type. Avtomatika i telemekhanika 7:98–107MathSciNetGoogle Scholar
  34. Reddi SJ, Kale S, Kumar S (2018) On the convergence of Adam and beyond. In: Proceedings of ICLRGoogle Scholar
  35. Shumway RH, Stoffer DS (1982) An approach to time series smoothing and forecasting using the EM algorithm. J Time Ser Anal 3(4):253–264CrossRefGoogle Scholar
  36. Varadhan R, Roland C (2008) Simple and globally convergent methods for accelerating the convergence of any EM algorithm. Scand J Stat 35(2):335–353MathSciNetCrossRefGoogle Scholar
  37. Wei GCG, Tanner MA (1990) A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. J Am Stat Assoc 85(411):699–704CrossRefGoogle Scholar
  38. Xu L, Jordan M (1996) On convergence properties of the EM algorithm for gaussian mixtures. Neural Comput 8(1):129–151CrossRefGoogle Scholar
  39. Yildirim S, Singh SS, Doucet A (2013) An online expectation-maximization algorithm for changepoint models. J Comput Graph Stat 22(4):906–926MathSciNetCrossRefGoogle Scholar
  40. Zeiler MD (2012) ADADELTA: an adaptive learning rate method. arXiv:1212.5701

Copyright information

© The Author(s) 2019

Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.To view a copy of this licence, visit

Authors and Affiliations

  1. 1.Wellcome Centre of Human GeneticsUniversity of OxfordOxfordUK
  2. 2.MRC Weatherall Institute of Molecular MedicineUnversity of OxfordOxfordUK

Personalised recommendations