# Efficient inference in state-space models through adaptive learning in online Monte Carlo expectation maximization

## Abstract

Expectation maximization (EM) is a technique for estimating maximum-likelihood parameters of a latent variable model given observed data by alternating between taking expectations of sufficient statistics and maximizing the expected log likelihood. For situations where sufficient statistics are intractable, stochastic approximation EM (SAEM) is often used, which uses Monte Carlo techniques to approximate the expected log likelihood. Two common implementations of SAEM, Batch EM (BEM) and online EM (OEM), are parameterized by a “learning rate”, and their efficiency depends strongly on this parameter. We propose an extension to the OEM algorithm, termed Introspective Online Expectation Maximization (IOEM), which removes the need for specifying this parameter by adapting the learning rate to trends in the parameter updates. We show that our algorithm matches the efficiency of the optimal BEM and OEM algorithms in multiple models, and that the efficiency of IOEM can exceed that of BEM/OEM methods with optimal learning rates when the model has many parameters. Finally, we use IOEM to fit two models to a financial time series. A Python implementation is available at https://github.com/luntergroup/IOEM.git.

## Keywords

Stochastic approximation expectation maximization, Sequential Monte Carlo, Latent variable model, Online estimation

## 1 Introduction

Expectation Maximization (EM) is a general and widely used technique for estimating maximum likelihood parameters of latent variable models (Dempster et al. 1977). It involves iterating two steps: computing the expected log-likelihood marginalizing over the latent variable conditioned on parameters and data (the E step), and optimizing parameters to maximize this expected log-likelihood (the M step). In important special cases the E step is analytically tractable; examples include linear systems with Gaussian noise (Shumway and Stoffer 1982) and finite-state hidden Markov models (Baum 1972). In general, however, Monte Carlo techniques such as Stochastic EM (SEM; Celeux and Diebolt 1985; Celeux et al. 1995) and Monte Carlo EM (MCEM; Wei and Tanner 1990) are necessary to approximate the required integral. The stochastic nature of Monte Carlo techniques results in noisy parameter estimates, and to address this, methods such as Stochastic Approximation EM (SAEM; Nowlan 1991; Celeux and Diebolt 1992; Delyon et al. 1999) were developed that make smaller incremental updates parameterized by a learning rate \(\gamma \) or learning schedule \(\{\gamma _t\}\).
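In symbols, the two steps can be written as follows. The notation is an assumption made here for illustration: \(X\) is the latent variable, \(Y\) the observed data, and \(\theta ^{(k)}\) the parameter estimate after \(k\) iterations.

```latex
\begin{align*}
\text{E step:}\quad & Q(\theta \mid \theta^{(k)}) = \mathbb{E}\left[\log p(X, Y \mid \theta) \,\middle|\, Y, \theta^{(k)}\right],\\
\text{M step:}\quad & \theta^{(k+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(k)}).
\end{align*}
```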

In this paper we focus on models where the latent variable has a longitudinal structure and follows a Markov model (see e.g. Lopes and Tsay 2010 for examples in financial econometrics). For such models, the required samples from the posterior distribution can be generated using Sequential Monte Carlo (SMC) techniques (see Doucet et al. 2001; Doucet and Johansen 2009 and references therein). In one approach, the Batch EM (BEM) algorithm processes a contiguous chunk of data to generate latent variable samples from the posterior, which are used in the M step to update parameters. An alternative approach is online EM (OEM; Mongillo and Denève 2008; Cappé 2009), in which parameters are continuously updated as data are processed. Analogous to SAEM, OEM algorithms have a parameter \(\gamma \) controlling the learning rate, an idea apparently first introduced in this context by Jordan and Jacobs (1993). Several recent papers have addressed related problems. For instance Yildirim et al. (2013) use a particle filter to implement an online EM algorithm for change point models (see also Fearnhead 2006; Fearnhead and Vasileiou 2009), which uses a pre-specified learning schedule (called “step-size sequence” in their work) to control convergence. Le Corff and Fort (2013) introduced a “block online” EM algorithm for hidden Markov models that combines online and batch ideas, controlling convergence through a block size sequence \(\tau _k\).

All these algorithms thus require choosing tuning parameters in the form of a batch size, block sequence, learning rate or a learning schedule. It turns out that this choice can strongly influence the performance of these algorithms. For instance, for BEM, very large batch sizes lead to inaccurate estimates because of slow convergence, whereas very small batch sizes lead to imprecise estimates due to the inherent stochasticity of the model within a small batch of observations. The optimal batch size in BEM or the optimal learning rate in OEM depends on the particularities of the model.

This raises the question of how to choose this tuning parameter. Several authors have proposed adaptive acceleration techniques for EM methods that obviate the need for choosing tuning parameters (Jamshidian and Jennrich 1993; Lange 1995; Varadhan and Roland 2008), but these methods require that the E-step is analytically tractable. In the context of (stochastic) gradient descent optimization (Bottou 2012), several influential adaptive algorithms have recently been proposed (Zeiler 2012; Kingma and Ba 2015; Mandt et al. 2016; Reddi et al. 2018) that have few or no tuning parameters. In principle, these methods can be used to find maximum likelihood parameters, but unless data is processed in batches, applying these methods to state-space models with a sequential structure is not straightforward. In addition, EM approaches enjoy several advantages over gradient descent methods, including automatic guarantees of parameter constraints and increased numerical stability (Xu and Jordan 1996; Cappé 2009; Kantas et al. 2009; Chitralekha et al. 2010).

Here we introduce a novel algorithm, termed Introspective Online EM (IOEM), which removes the need for setting the learning rate by estimating optimal parameter-specific learning rates from the data. This is particularly helpful when inferring parameters in a high dimensional model, since the optimal learning rate may differ between parameters. IOEM can be applied to inference in state-space models with observations \(Y_t\) and state variables \(X_t\) governed by transition probability function \(f(x_{t+1}|x_t,\theta )\) and observation probability function \(g(y_t|x_t,\theta )\), for which \(f(x_{t}|x_{t-1},\theta )g(y_t|x_t,\theta )\) belongs to an exponential family with sufficient statistic \(s(x_{t-1},x_t,y_t)\). Broadly, IOEM works by estimating both the precision and the accuracy of parameters in an online manner through weighted linear regression, and uses these estimates to tune the learning rate so as to improve both simultaneously.
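The regression idea can be illustrated with a generic weighted least-squares fit of parameter updates against time. The sketch below is only a building block under assumptions made here: the helper name `weighted_trend` and the use of exponentially decaying weights are hypothetical, and the step that turns the fitted quantities into a learning rate is not shown. The fitted slope serves as a rough indicator of remaining drift (accuracy) and the weighted residual variance as a rough indicator of noise (precision).

```python
import numpy as np

def weighted_trend(times, values, weights):
    """Weighted least-squares fit of values ~ intercept + slope * time.

    Hypothetical helper: returns (intercept, slope, weighted residual variance).
    """
    t = np.asarray(times, dtype=float)
    y = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalize the regression weights
    t_bar = np.sum(w * t)
    y_bar = np.sum(w * y)
    slope = np.sum(w * (t - t_bar) * (y - y_bar)) / np.sum(w * (t - t_bar) ** 2)
    intercept = y_bar - slope * t_bar
    resid_var = np.sum(w * (y - intercept - slope * t) ** 2)
    return intercept, slope, resid_var

# Example: recent updates weighted more heavily than old ones (assumed scheme)
t = np.arange(50.0)
w = 0.95 ** t[::-1]
intercept, slope, resid_var = weighted_trend(t, 2.0 + 0.5 * t, w)
```

A slope that is large relative to the residual noise suggests the estimate is still moving, while a slope near zero suggests that noise dominates and more averaging would help.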

The outline of this paper is as follows. Section 2 introduces BEM, OEM, and a simplified version of IOEM in the context of a one-parameter autoregressive state-space model. Section 3 introduces the complete IOEM algorithm required for inference in the full 3-parameter autoregressive model. Section 4 discusses simulation results of the algorithms for these two models. In addition we consider a 2-dimensional autoregressive model to show the benefit of the proposed algorithm when inferring many parameters, and we demonstrate desirable performance in the stochastic volatility model, an important case as it is nonlinear and hence relevant to actual applications of SAEM. In Sect. 5 we apply IOEM with the autoregressive and stochastic volatility models to a financial time series, and we end the paper with a brief discussion.

## 2 EM algorithms for a simplified autoregressive model

Here we review BEM (Dempster et al. 1977), OEM (Cappé 2009) and SMC (Doucet and Johansen 2009), and present the IOEM algorithm in a simplified context. This illustrates the main ideas behind IOEM before presenting the full algorithm in Sect. 3.

When *f* and *g* are members of the exponential family of distributions, the M step of EM can be done using sufficient statistics, and the E step amounts to calculating their expectation. In this model, the parameter \(\sigma _v^2\) has the sufficient statistic

Here \(\mu (\cdot | \hat{ \theta }_{0})\) is the initial distribution for \(X_1\), *ESS* is the effective sample size defined as \([\sum _{i=1}^N w_t(X_{1:t}^{(i)})^{2}]^{-1}\), \(w_0(\cdot )=1/N\), and \(X_{t}^{(i)}\) is shorthand for the \(t^{\mathrm{th}}\) coordinate of \(X_{1:t}^{(i)}\). In models with multiple unknown parameters, each parameter is updated in step 3 of the algorithm; however, we will refer only to a single parameter \(\theta \) to keep the notation simple.
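As an illustration of one propagate-and-reweight step, the following is a minimal sketch of a bootstrap filter with ESS-triggered resampling. It assumes the model \(X_t = a X_{t-1} + \sigma _w W_t\), \(Y_t = X_t + \sigma _v V_t\) with standard normal noise, takes the ESS as the inverse of the sum of squared normalized weights, and uses a resampling threshold of \(N/2\); the function names and the threshold are assumptions made here, not details taken from the algorithm above.

```python
import numpy as np

def effective_sample_size(weights):
    """ESS = 1 / sum_i w_i^2 for normalized weights."""
    w = np.asarray(weights, dtype=float)
    return 1.0 / np.sum(w ** 2)

def bootstrap_step(particles, weights, y, a, sigma_w, sigma_v, rng, ess_frac=0.5):
    """One bootstrap-filter step for the assumed AR(1)-plus-noise model."""
    n = particles.shape[0]
    # Resample only when the weights have degenerated (threshold is an assumption)
    if effective_sample_size(weights) < ess_frac * n:
        idx = rng.choice(n, size=n, p=weights)
        particles = particles[idx]
        weights = np.full(n, 1.0 / n)
    # Propagate through the transition density f
    particles = a * particles + sigma_w * rng.standard_normal(n)
    # Reweight by the observation density g, on the log scale for stability
    log_w = np.log(weights) - 0.5 * ((y - particles) / sigma_v) ** 2
    log_w -= log_w.max()
    weights = np.exp(log_w)
    weights /= weights.sum()
    return particles, weights

rng = np.random.default_rng(0)
particles, weights = bootstrap_step(
    rng.standard_normal(100), np.full(100, 0.01), y=0.3,
    a=0.9, sigma_w=1.0, sigma_v=1.0, rng=rng,
)
```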

Throughout this paper we follow common practice in using the fixed-lag technique in order to reduce the mean square error between \(S_t\) and \(\hat{S}_t\) (Cappé and Moulines 2005; Cappé et al. 2007). We choose a lag \(\varDelta > 0\) and at time *t*, using particles \(X_{1:t}^{(i)}\) shaped by data \(Y_{1:t}\), we estimate the \(t-\varDelta ^{\text {th}}\) term of the summation in (2). We will use \(X_{1:t}^{(i)}(t-\varDelta )\) to denote the \(t-\varDelta ^{\mathrm{th}}\) coordinate of the particle \(X_{1:t}^{(i)}\), but we will continue to write \(X_{t}^{(i)}\) as a shorthand for \(X_{1:t}^{(i)}(t)\). (See Table 1 for an overview of notation used in this paper).

Choosing a large value of \(\varDelta \) allows SMC to use many observations to improve the posterior distribution of \(X_{t-\varDelta }\). However the cost of a large \(\varDelta \) is an increased path degeneracy due to the resampling procedure, which increases the sample variance. The optimal choice for \(\varDelta \) balances the opposing influences of the forgetting rate of the model and the collapsing rate of the resampling process due to the divergence between the proposal distribution and the posterior distribution. For the examples in this paper we chose \(\varDelta = 20\) as recommended by Cappé and Moulines (2005), which seems to be a reasonable choice for our models.
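The bookkeeping behind the fixed-lag technique can be sketched as follows. Each particle keeps only the last \(\varDelta + 1\) coordinates of its path, resampling copies whole stored paths (which is where path degeneracy enters), and the statistic is evaluated at the oldest stored coordinate. The per-step statistic \((y - x)^2\) and the helper names are assumptions made here for illustration.

```python
import numpy as np

def fixed_lag_estimate(history, weights, y_lagged):
    """Weighted estimate of E[(Y_{t-lag} - X_{t-lag})^2 | Y_{1:t}].

    history has shape (N, lag + 1); column 0 holds each particle's
    coordinate at time t - lag.
    """
    x_lagged = history[:, 0]
    return float(np.sum(weights * (y_lagged - x_lagged) ** 2))

def push_history(history, new_coords, ancestor_idx=None):
    """Advance the lag window by one step, after optional resampling."""
    if ancestor_idx is not None:
        history = history[ancestor_idx]  # resampling copies entire stored paths
    # Drop the oldest coordinate and append the newly propagated one
    return np.column_stack([history[:, 1:], new_coords])

history = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])   # N = 2, lag = 2
weights = np.array([0.25, 0.75])
s_tilde = fixed_lag_estimate(history, weights, y_lagged=2.0)
history = push_history(history, np.array([7.0, 8.0]), ancestor_idx=np.array([1, 1]))
```

After the resampling step in this example both particles share the same stored path up to the newest coordinate, which shows in miniature how a long lag reduces the number of distinct values at time \(t-\varDelta \).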

There are various other techniques to improve on this basic SMC method, including improved resampling schemes (Douc and Cappé 2005; Olsson et al. 2008; Doucet and Johansen 2009; Cappé et al. 2007), and choosing better sampling distributions through lookahead strategies or resample-move procedures (Pitt and Shephard 1999; Lin et al. 2013; Doucet and Johansen 2009), which are not discussed further here. Instead, in the remainder of this paper, we focus on the process of updating the parameter estimates \(\hat{ \theta }_{t}\). The remainder of this section describes the options for step 3 of Algorithm 1.

### 2.1 Batch expectation maximization

*b*, the parameter estimate stays constant (\(\hat{ \theta }_{t}=\hat{ \theta }_{t-1}\)) and the update to the sufficient statistic

*t*. At the end of the *m*th batch we have \(t=mb\), at which time

*S*, and \(\hat{\sigma }^{2}_{v,t}:=\hat{S}^{BEM}_{t}\).

The batch size determines the convergence behavior of the estimates. For a fixed computational cost, choosing *b* too small will result in noise-dominated estimates and low precision, whereas choosing *b* too large will result in precise but inaccurate estimates due to slow convergence.
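The batch schedule can be sketched as below. This is a simplification under assumptions made here: the per-step statistics are passed in as a fixed sequence `s_tilde` (in practice they depend on the current parameter estimate through the particle filter), and the M step is taken to be the identity, as for \(\hat{\sigma }^{2}_{v,t}\) above. The function name is hypothetical.

```python
def batch_em_estimates(s_tilde, b, theta0):
    """Return the estimate in force at each step, held fixed within a batch."""
    estimates = []
    theta = theta0
    batch_sum = 0.0
    for t, s in enumerate(s_tilde, start=1):
        batch_sum += s
        if t % b == 0:
            # End of a batch (t = m * b): average the statistic and update
            theta = batch_sum / b
            batch_sum = 0.0
        estimates.append(theta)
    return estimates

estimates = batch_em_estimates([1.0, 3.0, 5.0, 7.0, 9.0], b=2, theta0=0.0)
```

With a small `b` each update averages few terms and is noisy; with a large `b` the estimate changes rarely, which is the trade-off described above.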

### 2.2 Online expectation maximization

*S* at time *t* is a running average of \(\{\tilde{s}_{k}\}_{k=\varDelta +1,\ldots ,t}\), weighted by a pre-specified learning schedule. The choice of learning schedule determines how quickly the algorithm “forgets” the earlier parameter estimates. In OEM at time *t*,

*t*, \(\hat{S}^{OEM}\) is a weighted sum of \(\{\tilde{s}_{k}\}_{k=\varDelta +1,\ldots ,t}\) where the term \({\tilde{s}}_k\) has weight

Although this method can outperform BEM as parameters are updated continuously, its performance remains strongly dependent on the parameter *c* determining the learning schedule \(\gamma _t\), and a suboptimal choice can reduce performance by orders of magnitude. At one extreme, the estimates will depend strongly only on the most recent data, resulting in noisy parameter estimates and low precision. At the other extreme, the estimates will average out stochastic effects but be severely affected by false initial estimates, resulting in more precise but less accurate estimates. Again, the best choice depends on the model.
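A running average of this kind can be sketched with the standard stochastic-approximation recursion \(\hat{S}_t = \gamma _t \tilde{s}_t + (1-\gamma _t)\hat{S}_{t-1}\) and \(\gamma _t = t^{-c}\). That recursion, the indexing from \(t=1\) rather than \(t=\varDelta +1\), and the function names are assumptions made here. Under this recursion the term \(\tilde{s}_k\) carries weight \(\gamma _k \prod _{j=k+1}^{t}(1-\gamma _j)\), which the second function computes directly.

```python
def oem_running_average(s_tilde, c):
    """Apply S_t = gamma_t * s_t + (1 - gamma_t) * S_{t-1}, gamma_t = t**(-c)."""
    s_hat = 0.0          # irrelevant starting value, since gamma_1 = 1
    out = []
    for t, s in enumerate(s_tilde, start=1):
        gamma = t ** (-c)
        s_hat = gamma * s + (1.0 - gamma) * s_hat
        out.append(s_hat)
    return out

def oem_weights(t_final, c):
    """Weight of s_k in S_{t_final}: gamma_k * prod_{j=k+1}^{t_final} (1 - gamma_j)."""
    weights = []
    for k in range(1, t_final + 1):
        w = k ** (-c)
        for j in range(k + 1, t_final + 1):
            w *= 1.0 - j ** (-c)
        weights.append(w)
    return weights

values = [2.0, 4.0, 6.0, 1.0, 3.0]
s_hat = oem_running_average(values, c=0.6)
weights = oem_weights(len(values), c=0.6)
```

Smaller values of `c` put more weight on recent terms, so the estimate forgets early values faster at the cost of more noise; `c = 1` reduces to the plain sample mean.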

*c* for \(\gamma _t = t^{-c}\), but it still requires the user to have an intuition for how the estimates for each parameter will behave. We will refer to this method as AVG, use \(c=0.6\), and set \(t_0=50{,}000\), which is half the total number of iterations for our examples.
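A post-hoc averaging scheme with a threshold \(t_0\) can be sketched as follows. The sketch assumes that the raw estimate is reported up to time \(t_0\) and that afterwards the reported value is the running mean of the raw estimates obtained since \(t_0\); the function name and this exact convention are assumptions made here.

```python
def averaged_estimates(theta_hat, t0):
    """Report theta_hat[t] up to t0, then the mean of the estimates after t0."""
    out = []
    total = 0.0
    for t, theta in enumerate(theta_hat, start=1):
        if t <= t0:
            out.append(theta)
        else:
            total += theta
            out.append(total / (t - t0))
    return out

smoothed = averaged_estimates([5.0, 1.0, 2.0, 4.0], t0=1)
```

Choosing \(t_0\) too early averages in estimates that are still far from the target, which is why this threshold is itself a tuning parameter.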

### 2.3 Introspective online expectation maximization

When *f* and *g* satisfy the assumptions guaranteeing convergence of the standard OEM estimator, the IOEM algorithm is also guaranteed to converge. The precise conditions are detailed in Assumption 1, Assumption 2, and Theorem 1 of Cappé and Moulines (2009).

## 3 The IOEM algorithm for the full autoregressive model

*a*, \(\sigma _w\), and \(\sigma _v\). We define four sufficient statistics,

In most cases, as above, the function \(\varLambda \) mapping \(\hat{S}_{t}\) to \(\hat{\theta }_{t}\) is nonlinear, and requires multiple sufficient statistics as input. To avoid bias, we want all sufficient statistics that inform one parameter estimate to share a learning schedule \(\{\gamma _{t}\}_{t=1,2,\ldots }\). We therefore estimate an adapting learning schedule for each parameter independently, by performing the regression on the level of the parameter estimates (Algorithm 4), rather than on the level of the sufficient statistics. We will calculate \(\hat{S}_t\) as in OEM (4) using our adapting learning schedule instead of a user specified learning schedule. Because the adapting learning schedule is specific to each parameter, we will have multiple estimates of certain summary sufficient statistics. In this case \(S_{1,t}\) and \(S_{2,t}\) are estimated by \(\hat{S}_{1,t}^{a}\) and \(\hat{S}_{2,t}^{a}\) for (11) and by \(\hat{S}_{1,t}^{\sigma _w}\) and \(\hat{S}_{2,t}^{\sigma _w}\) for (12).

*t* would correspond to regression on \(\hat{S}_{1:t}\), not \(\tilde{s}_{1:t}\). As \(\hat{S}\) is a running average, there is a strong correlation between \(\hat{S}_{t-1}\) and \(\hat{S}_t\) and hence also a strong dependence between \(\hat{\theta }_{t-1}\) and \(\hat{\theta }_t\). In order to perform the regression on the parameters we must “unsmooth” \(\hat{\theta }_{1:t}\) to create pseudo-independent parameter updates \(\tilde{\theta }_t\) (see Algorithm 4). This is accomplished by taking linear combinations,
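One linear combination with this property, assumed here by inverting a running average of the form \(\hat{\theta }_t = \gamma _t \tilde{\theta }_t + (1-\gamma _t)\hat{\theta }_{t-1}\), is \(\tilde{\theta }_t = (\hat{\theta }_t - (1-\gamma _t)\hat{\theta }_{t-1})/\gamma _t\). The sketch below applies it to a sequence of smoothed values; the function name is hypothetical, and the inversion is exact only when the smoothed sequence really was produced by that recursion.

```python
def unsmooth(theta_hat, gammas):
    """Invert theta_t = gamma_t * x_t + (1 - gamma_t) * theta_{t-1} for x_t.

    Returns the recovered updates for every step after the first.
    """
    return [
        (theta_hat[t] - (1.0 - gammas[t]) * theta_hat[t - 1]) / gammas[t]
        for t in range(1, len(theta_hat))
    ]

# Smooth a known sequence, then recover it
raw = [2.0, 4.0, 6.0, 1.0]
gammas = [1.0, 0.5, 0.4, 0.3]
smoothed = []
prev = 0.0
for x, g in zip(raw, gammas):
    prev = g * x + (1.0 - g) * prev
    smoothed.append(prev)
recovered = unsmooth(smoothed, gammas)
```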

## 4 Simulations

We performed inference on different models using the BEM, OEM and IOEM algorithms as described above. For BEM we used batch sizes from 100 to 10,000, and for OEM we used learning schedules \(\gamma _t=t^{-c}\) with *c* ranging from 0.6 to 0.9. In all cases the bootstrap filter was run with \(N=100\) particles, and the algorithm was run from \(t=1\) to \(t=100{,}000\). For all parameter choices, 100 independent replicates were generated, and we show the distribution of inferred parameter values across these replicates.

### 4.1 Inference with the simplified IOEM algorithm

### 4.2 Inference with the complete IOEM algorithm

*a* parameter under different EM methods are presented in Fig. 2; for the other parameter inferences see Sect. 7.5, Fig. 6.

In the AR(1) model, IOEM outperforms most other EM methods when estimating the *a* parameter, while AVG for the chosen parameter settings (\(c=0.6\), \(t_0=50{,}000\)) provides slightly more precise estimates at similar accuracy. It is worth noting that in this case, OEM with \(c=0.6\) substantially outperforms OEM with \(c=0.9\), in contrast to the results shown in Fig. 1. This is a result of the bad initial estimates. OEM with \(c=0.6\) forgets the earlier simulations much faster than OEM with \(c=0.9\) and hence is able to move its estimates of *a*, \(\sigma _w\), and \(\sigma _v\) much more quickly. Here IOEM recognizes that it should have similar behavior to OEM with \(c=0.6\), whereas in the inference displayed in Fig. 1 IOEM chose behavior similar to OEM with \(c=0.9\). IOEM can indeed adapt to the model.

### 4.3 Inference of multiple parameters

*A* good initial estimates and *B* bad initial estimates, we can see how the different EM methods cope with a combination of accurate and inaccurate initializations. IOEM is able to identify the set with good initial estimates (\(a^A,\sigma ^{A}_{w}\)) and quickly start smoothing out noise. To IOEM, the other parameters appear to not have converged (\(\sigma ^{B}_{w}\) and \(\sigma _{v}\) because they are at the wrong value, \(a^B\) because it will be changing to compensate for \(\sigma ^{B}_{w}\) and \(\sigma _{v}\)).

The inference of the other parameters and comparisons with a different choice of AVG threshold are shown in Sect. 7.5, Figs. 7, 8, 9, 10.

### 4.4 Inference of parameters of a stochastic volatility model

*t* are

## 5 Application to financial time series

## 6 Conclusion

Stochastic Approximation EM is a general and effective technique for estimating parameters in the context of SMC. However, convergence can be slow, and improving convergence speed is of particular interest in this setting. We have shown that IOEM produces accurate and precise parameter estimates when applied to continuous state-space models. Across models, and across varying levels of accuracy of the initial estimates, the efficiency of IOEM matches that of BEM/OEM with the optimal choice of tuning parameter. The AVG procedure also shows good behaviour, but like BEM/OEM it has tuning parameters, and when these are chosen suboptimally performance is not as good as IOEM (Figs. 9 and 10). BEM/OEM/AVG all make use of a single learning schedule \(\{\gamma _{t}\}\), and for more complex models a single learning schedule generally cannot achieve optimal convergence rates for all parameters, as we have shown for the 2-dimensional AR example. In addition, AVG works by post-hoc averaging of noisy estimates, and since the inferences depend on the noisy estimates themselves, this implicitly relies on the model being sufficiently linear around the true parameter value. We expect IOEM to be more resilient to strong nonlinearities than AVG, but we have not explored this idea further here.

IOEM finds parameter-specific learning schedules, resulting in better performance than standard methods with a single learning rate parameter are able to achieve. IOEM can be applied with minimal prior knowledge of the model’s behavior, and requires no user supervision, while retaining the convergence guarantees of BEM/OEM, therefore providing an efficient, practical approach to parameter estimation in SMC methods. While not the focus of this paper, application to a financial time series suggests that IOEM may be useful in informally assessing model fit; it would be interesting to investigate whether this could be made rigorous.

## Notes

### Acknowledgements

This work was supported by Wellcome Trust grants 090532/Z/09/Z and 102423/Z/13/Z, and Medical Research Council Strategic Alliance Funding MC_UU_12025.

## References

- Baum LE (1972) An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3:1–8
- Bottou L (2012) Stochastic gradient descent tricks. In: Montavon G, Orr GB, Müller KR (eds) Neural networks: tricks of the trade. Lecture notes in computer science, vol 7700, pp 421–436
- Cappé O (2009) Online sequential Monte Carlo EM algorithm. In: IEEE/SP 15th Workshop on statistical signal processing, 2009. SSP’09, pp 37–40. IEEE
- Cappé O, Godsill SJ, Moulines E (2007) An overview of existing methods and recent advances in sequential Monte Carlo. Proc IEEE 95(5):899–924
- Cappé O, Moulines E (2005) On the use of particle filtering for maximum likelihood parameter estimation. In: 13th European signal processing conference, pp 1–4. IEEE
- Cappé O, Moulines E (2009) On-line expectation–maximization algorithm for latent data models. J R Stat Soc Ser B 71(3):593–613
- Celeux G, Chauveau D, Diebolt J (1995) On stochastic versions of the EM algorithm. Technical Report 2514, INRIA
- Celeux G, Diebolt J (1985) The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput Stat Q 2:73–82
- Celeux G, Diebolt J (1992) A stochastic approximation type EM algorithm for the mixture problem. Stoch Stoch Rep 2(41):119–132
- Chitralekha SB, Prakash J, Raghavan H, Gopaluni R, Shah SL (2010) A comparison of simultaneous state and parameter estimation schemes for a continuous fermentor reactor. J Process Control 20(8):934–943
- Cornett MM, Schwarz TV, Szakmary AC (1995) Seasonalities and intraday return patterns in the foreign currency futures market. J Bank Finance 19:843–869
- Delyon B, Lavielle M, Moulines E (1999) Convergence of a stochastic approximation version of the EM algorithm. Ann Stat 27(1):94–128
- Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39(1):1–38
- Douc R, Cappé O (2005) Comparison of resampling schemes for particle filtering. In: Proceedings of the 4th international symposium on image and signal processing and analysis, pp 64–69. IEEE
- Doucet A, de Freitas N, Gordon N (eds) (2001) Sequential Monte Carlo methods in practice. Springer
- Doucet A, Johansen AM (2009) A tutorial on particle filtering and smoothing: fifteen years later. Handb Nonlinear Filter 12:656–704
- Fearnhead P (2006) Efficient and exact Bayesian inference for multiple changepoint problems. Stat Comput 16:203–213
- Fearnhead P, Vasileiou D (2009) Bayesian analysis of isochores. J Am Stat Assoc 104:132–141
- Jamshidian M, Jennrich RI (1993) Acceleration of the EM algorithm by using Quasi-Newton methods. J Am Stat Assoc 88(421):221–228
- Jordan MI, Jacobs RA (1993) Hierarchical mixtures of experts and the EM algorithm. In: Proceedings of the 1993 international joint conference on neural networks, pp 1339–1344
- Kantas N, Doucet A, Singh SS, Maciejowski JM (2009) An overview of sequential Monte Carlo methods for parameter estimation in general state-space models. In: 15th IFAC symposium on system identification (SYSID), vol 102, p 117
- Kaufman P (1995) Smarter trading: improving performance in changing markets. McGraw-Hill, New York
- Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: 3rd International conference on learning representations, ICLR 2015
- Lange K (1995) A quasi Newton acceleration of the EM algorithm. Stat Sin 5:1–18
- Le Corff S, Fort G (2013) Online expectation maximization based algorithms for inference in hidden Markov models. Electron J Stat 7:763–792
- Lin M, Chen R, Liu JS et al (2013) Lookahead strategies for sequential Monte Carlo. Stat Sci 28(1):69–94
- Lopes HF, Tsay RS (2010) Particle filters and Bayesian inference in financial econometrics. J Forecast 30(1):168–209
- Mandt S, Hoffman MD, Blei DM (2016) A variational analysis of stochastic gradient algorithms. In: Proceedings of the 33rd international conference on machine learning, vol 48
- Mongillo G, Denève S (2008) Online learning with hidden Markov models. Neural Comput 20(7):1706–1716
- Nowlan S (1991) Soft competitive adaptation: neural network learning algorithms based on fitting statistical mixtures. Ph.D. thesis, School of Computer Science, Carnegie Mellon University
- Olsson J, Cappé O, Douc R, Moulines E et al (2008) Sequential Monte Carlo smoothing with application to parameter estimation in nonlinear state space models. Bernoulli 14(1):155–179
- Pitt MK, Shephard N (1999) Filtering via simulation: auxiliary particle filters. J Am Stat Assoc 94(446):590–599
- Polyak BT (1990) A new method of stochastic approximation type. Avtomatika i telemekhanika 7:98–107
- Reddi SJ, Kale S, Kumar S (2018) On the convergence of Adam and beyond. In: Proceedings of ICLR
- Shumway RH, Stoffer DS (1982) An approach to time series smoothing and forecasting using the EM algorithm. J Time Ser Anal 3(4):253–264
- Varadhan R, Roland C (2008) Simple and globally convergent methods for accelerating the convergence of any EM algorithm. Scand J Stat 35(2):335–353
- Wei GCG, Tanner MA (1990) A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. J Am Stat Assoc 85(411):699–704
- Xu L, Jordan M (1996) On convergence properties of the EM algorithm for Gaussian mixtures. Neural Comput 8(1):129–151
- Yildirim S, Singh SS, Doucet A (2013) An online expectation-maximization algorithm for changepoint models. J Comput Graph Stat 22(4):906–926
- Zeiler MD (2012) ADADELTA: an adaptive learning rate method. arXiv:1212.5701

## Copyright information

**Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.