
Layered adaptive importance sampling

Statistics and Computing

Abstract

Monte Carlo methods are the de facto standard for approximating complicated integrals involving multidimensional target distributions. In order to generate random realizations from the target distribution, Monte Carlo techniques use simpler proposal probability densities to draw candidate samples. The performance of any such method is closely tied to the choice of the proposal distribution: an unfortunate choice can easily wreck the resulting estimators. In this work, we introduce a layered (i.e., hierarchical) procedure to generate samples employed within a Monte Carlo scheme. This approach guarantees that a suitable equivalent proposal density is always obtained automatically (thus eliminating the risk of catastrophic performance), at the expense of a moderate increase in complexity. Furthermore, we provide a general unified importance sampling (IS) framework, where multiple proposal densities are employed and several IS schemes are introduced by applying the so-called deterministic mixture approach. Finally, given these schemes, we also propose a novel class of adaptive importance samplers using a population of proposals, where the adaptation is driven by independent parallel or interacting Markov chain Monte Carlo (MCMC) chains. The resulting algorithms efficiently combine the benefits of both IS and MCMC methods.


Notes

  1. Note that, as both \(\bar{\pi }(\mathbf{x})\) and Z depend on the observations \(\mathbf{y}\), the use of \(\bar{\pi }(\mathbf{x}|\mathbf{y})\) and \(Z(\mathbf{y})\) would be more precise. However, since the observations are fixed, in the sequel we remove the dependence on \(\mathbf{y}\) to simplify the notation.

  2. Note that, in the ideal case described here, each \({\varvec{\mu }}_j\) is also independent of the other \({\varvec{\mu }}\)’s. However, in the rest of this work, we also consider cases where correlation among the mean vectors (\({\varvec{\mu }}_1,\ldots ,{\varvec{\mu }}_J\)) is introduced.

  3. Given a function \(f(\mathbf{x})\), the optimal proposal q minimizing the variance of the IS estimator is \(\widetilde{q}(\mathbf{x}|\mathbf{C}) \propto |f(\mathbf{x})| \bar{\pi }(\mathbf{x})\). However, in practical applications, we are often interested in computing expectations w.r.t. several f’s. In this context, a more appropriate strategy is to minimize the variance of the importance weights. In this case, the minimum variance is attained when \(\widetilde{q}(\mathbf{x}|\mathbf{C})= \bar{\pi }(\mathbf{x})\) (Doucet and Johansen 2008).

  4. The standard PMC method (Cappé et al. 2004) is described in Sect. 1.

  5. These values have been obtained with a deterministic, expensive, and exhaustive numerical integration method, using a fine grid.

References

  • Ali, A.M., Yao, K., Collier, T.C., Taylor, E., Blumstein, D., Girod, L.: An empirical study of collaborative acoustic source localization. In: Proceedings of the Information Processing in Sensor Networks (IPSN07), Boston (2007)

  • Andrieu, C., de Freitas, N., Doucet, A., Jordan, M.: An introduction to MCMC for machine learning. Mach. Learn. 50, 5–43 (2003)


  • Andrieu, C., Doucet, A., Holenstein, R.: Particle Markov chain Monte Carlo methods. J. R. Stat. Soc. B 72(3), 269–342 (2010)


  • Andrieu, C., Thoms, J.: A tutorial on adaptive MCMC. Stat. Comput. 18, 343–373 (2008)

  • Beaujean, F., Caldwell, A.: Initializing adaptive importance sampling with Markov chains. arXiv:1304.7808 (2013)

  • Botev, Z.I., Kroese, D.P.: An efficient algorithm for rare-event probability estimation, combinatorial optimization, and counting. Methodol. Comput. Appl. Probab. 10(4), 471–505 (2008)


  • Botev, Z.I., L'Ecuyer, P., Tuffin, B.: Markov chain importance sampling with applications to rare event probability estimation. Stat. Comput. 23, 271–285 (2013)

  • Brockwell, A., Del Moral, P., Doucet, A.: Interacting Markov chain Monte Carlo methods. Ann. Stat. 38(6), 3387–3411 (2010)


  • Bugallo, M.F., Martino, L., Corander, J.: Adaptive importance sampling in signal processing. Digit. Signal Process. 47, 36–49 (2015)


  • Caldwell, A., Liu, C.: Target density normalization for Markov Chain Monte Carlo algorithms. arXiv:1410.7149 (2014)

  • Cappé, O., Douc, R., Guillin, A., Marin, J.M., Robert, C.P.: Adaptive importance sampling in general mixture classes. Stat. Comput. 18, 447–459 (2008)


  • Cappé, O., Guillin, A., Marin, J.M., Robert, C.P.: Population Monte Carlo. J. Comput. Graph. Stat. 13(4), 907–929 (2004)


  • Chib, S., Jeliazkov, I.: Marginal likelihood from the Metropolis–Hastings output. J. Am. Stat. Assoc. 96, 270–281 (2001)

  • Chopin, N.: A sequential particle filter for static models. Biometrika 89, 539–552 (2002)


  • Cornuet, J.M., Marin, J.M., Mira, A., Robert, C.P.: Adaptive multiple importance sampling. Scand. J. Stat. 39(4), 798–812 (2012)


  • Craiu, R., Rosenthal, J., Yang, C.: Learn from thy neighbor: parallel-chain and regional adaptive MCMC. J. Am. Stat. Assoc. 104(448), 1454–1466 (2009)


  • Del Moral, P., Doucet, A., Jasra, A.: Sequential Monte Carlo samplers. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 68(3), 411–436 (2006)


  • Douc, R., Guillin, A., Marin, J.M., Robert, C.P.: Convergence of adaptive mixtures of importance sampling schemes. Ann. Stat. 35, 420–448 (2007a)

  • Douc, R., Guillin, A., Marin, J.M., Robert, C.P.: Minimum variance importance sampling via population Monte Carlo. ESAIM Probab. Stat. 11, 427–447 (2007b)

  • Doucet, A., Johansen, A.M.: A tutorial on particle filtering and smoothing: fifteen years later. Technical report (2008)

  • Doucet, A., Wang, X.: Monte Carlo methods for signal processing. IEEE Signal Process. Mag. 22(6), 152–170 (2005)


  • Elvira, V., Martino, L., Luengo, D., Bugallo, M.: Efficient multiple importance sampling estimators. IEEE Signal Process. Lett. 22(10), 1757–1761 (2015)


  • Elvira, V., Martino, L., Luengo, D., Bugallo, M.F.: Generalized multiple importance sampling. arXiv:1511.03095 (2015)

  • Fearnhead, P., Taylor, B.M.: An adaptive sequential Monte Carlo sampler. Bayesian Anal. 8(2), 411–438 (2013)


  • Fitzgerald, W.J.: Markov chain Monte Carlo methods with applications to signal processing. Signal Process. 81(1), 3–18 (2001)


  • Friel, N., Wyse, J.: Estimating the model evidence: a review. arXiv:1111.1957 (2011)

  • Geyer, C.J.: Markov chain Monte Carlo maximum likelihood. In: Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pp. 156–163 (1991)

  • Haario, H., Saksman, E., Tamminen, J.: An adaptive Metropolis algorithm. Bernoulli 7(2), 223–242 (2001)


  • Ihler, A.T., Fisher, J.W., Moses, R.L., Willsky, A.S.: Nonparametric belief propagation for self-localization of sensor networks. IEEE Trans. Sel. Areas Commun. 23(4), 809–819 (2005)


  • Jacob, P., Robert, C.P., Smith, M.H.: Using parallel computation to improve Independent Metropolis–Hastings based estimation. J. Comput. Graph. Stat. 3(20), 616–635 (2011)


  • Liang, F., Liu, C., Carroll, R.: Advanced Markov Chain Monte Carlo Methods: Learning from Past Samples. Wiley Series in Computational Statistics. Wiley, England (2010)

  • Liesenfeld, R., Richard, J.F.: Improving MCMC, using efficient importance sampling. Comput. Stat. Data Anal. 53, 272–288 (2008)


  • Liu, J.S.: Monte Carlo Strategies in Scientific Computing. Springer, Berlin (2004)


  • Liu, J.S., Liang, F., Wong, W.H.: The multiple-try method and local optimization in Metropolis sampling. J. Am. Stat. Assoc. 95(449), 121–134 (2000)

  • Luengo, D., Martino, L.: Fully adaptive Gaussian mixture Metropolis–Hastings algorithm. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2013)

  • Marin, J.M., Pudlo, P., Sedki, M.: Consistency of the adaptive multiple importance sampling. arXiv:1211.2548 (2012)

  • Marinari, E., Parisi, G.: Simulated tempering: a new Monte Carlo scheme. Europhys. Lett. 19(6), 451–458 (1992)


  • Martino, L., Elvira, V., Luengo, D., Artes, A., Corander, J.: Orthogonal MCMC algorithms. In: IEEE Workshop on Statistical Signal Processing (SSP), pp. 364–367 (2014)

  • Martino, L., Elvira, V., Luengo, D., Artes, A., Corander, J.: Smelly parallel MCMC chains. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2015)

  • Martino, L., Elvira, V., Luengo, D., Corander, J.: An adaptive population importance sampler: learning from the uncertainty. IEEE Trans. Signal Process. 63(16), 4422–4437 (2015)

  • Martino, L., Elvira, V., Luengo, D., Corander, J.: MCMC-driven adaptive multiple importance sampling. In: Interdisciplinary Bayesian Statistics Springer Proceedings in Mathematics & Statistics, vol. 118, Chap. 8, pp. 97–109 (2015)

  • Martino, L., Míguez, J.: A generalization of the adaptive rejection sampling algorithm. Stat. Comput. 21(4), 633–647 (2011)

  • Mendes, E.F., Scharth, M., Kohn, R.: Markov Interacting Importance Samplers. arXiv:1502.07039 (2015)

  • Neal, R.: MCMC using ensembles of states for problems with fast and slow variables such as Gaussian process regression. arXiv:1101.0387 (2011)

  • Neal, R.M.: Annealed importance sampling. Stat. Comput. 11(2), 125–139 (2001)


  • Owen, A.: Monte Carlo theory, methods and examples. http://statweb.stanford.edu/~owen/mc/ (2013)

  • Owen, A., Zhou, Y.: Safe and effective importance sampling. J. Am. Stat. Assoc. 95(449), 135–143 (2000)


  • Robert, C.P., Casella, G.: Monte Carlo Statistical Methods. Springer, Berlin (2004)

  • Schäfer, C., Chopin, N.: Sequential Monte Carlo on large binary sampling spaces. Stat. Comput. 23(2), 163–184 (2013)


  • Skilling, J.: Nested sampling for general Bayesian computation. Bayesian Anal. 1(4), 833–860 (2006)


  • Veach, E., Guibas, L.: Optimally combining sampling techniques for Monte Carlo rendering. In: SIGGRAPH 1995 Proceedings, pp. 419–428 (1995)

  • Wand, M.P., Jones, M.C.: Kernel Smoothing. Chapman and Hall, London (1994)

  • Wang, X., Chen, R., Liu, J.S.: Monte Carlo Bayesian signal processing for wireless communications. J. VLSI Signal Process. 30, 89–105 (2002)


  • Warnes, G.R.: The Normal Kernel Coupler: an adaptive Markov Chain Monte Carlo method for efficiently sampling from multi-modal distributions. Technical Report (2001)

  • Weinberg, M.D.: Computing the Bayes factor from a Markov chain Monte Carlo simulation of the posterior distribution. arXiv:0911.1777 (2010)

  • Yuan, X., Lu, Z., Yue, C.Z.: A novel adaptive importance sampling algorithm based on Markov chain and low-discrepancy sequence. Aerosp. Sci. Technol. 29, 253–261 (2013)


Download references

Acknowledgments

This work has been supported by the projects COMONSENS (CSD2008-00010), ALCIT (TEC2012-38800-C03-01), DISSECT (TEC2012-38058-C03-01), OTOSiS (TEC2013-41718-R), and COMPREHENSION (TEC2012-38883-C02-01), by the BBVA Foundation through the “I Convocatoria de Ayudas Fundación BBVA a Investigadores, Innovadores y Creadores Culturales” (MG-FIAR project), by the ERC Grant 239784 and AoF Grant 251170, and by the European Union 7th Framework Programme through the Marie Curie Initial Training Network “Machine Learning for Personalized Medicine” (MLPM2012, Grant No. 316861).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to L. Martino.

Appendices

Appendix 1: Consistency of GAMIS estimators

First of all, we remark that a complete analysis should take into account the chosen adaptive procedure since, in general, the adaptation uses the information in the previous weighted samples. However, in this work we consider an adaptation procedure completely independent of the estimation steps, as clarified in Sects. 3.4 and 5.1. This substantially simplifies the analysis, as described in Sect. 5.1.

The consistency of the global estimators in Eq. (29) provided by GAMIS can be studied as the number of samples per time step (\(M\times N\)) and the number of iterations of the algorithm (T) grow to infinity. For exhaustive studies of specific cases, see the analyses in Robert and Casella (2004), Douc et al. (2007a) and Marin et al. (2012). Here we provide some brief arguments explaining why the estimators \(\hat{I}_T\) and \({\hat{Z}}_T\) obtained by a GAMIS scheme are, in general, consistent. Let us assume that the \(q_{n,t}\)’s have heavier tails than \(\bar{\pi }(\mathbf{x}) \propto \pi (\mathbf{x})\). Note that the global estimator \(\hat{I}_T\) can be seen as the result of a static batch MIS estimator involving L different mixture-proposals \(\varPhi _{n,t}(\mathbf{x})\) and a total of \(J=\textit{NMT}\) samples. The weights \(w_{n,t}^{(m)}\), built using \(\varPhi _{n,t}(\mathbf{x})\) in the denominator of the IS ratio, are suitable importance weights yielding consistent estimators, as explained in detail in Appendix 2. Hence, for a finite number of iterations \(T < \infty \), the consistency is guaranteed by standard IS arguments, since it is well known that \(\hat{Z}_{T} \rightarrow Z\) and \(\hat{I}_{T} \rightarrow I\) as \(M \rightarrow \infty \) or \(N\rightarrow \infty \) (Robert and Casella 2004).

Furthermore, for \(T \rightarrow \infty \) and \(N,M < \infty \), we have a convex combination, given in Eq. (31), of conditionally independent (consistent but biased) IS estimators (Robert and Casella 2004). Indeed, although in an adaptive scheme the proposals depend on the previous configurations of the population, the samples drawn at each iteration are conditionally independent of the previous ones, and independent of each other within the same iteration. The bias is due to the unknown Z (see Eq. 4), which is replaced by \(\hat{Z}_{T}\). However, \(\hat{Z}_{T} \rightarrow Z\) as \(T \rightarrow \infty \), as discussed in Robert and Casella (2004, Chap. 14); hence, \(\hat{I}_{T}\) is asymptotically unbiased as \(T \rightarrow \infty \).

Appendix 2: Importance sampling with multiple proposals

Recall that our goal is to efficiently compute the integral \(I = \frac{1}{Z} \int _{\mathcal {X}} f(\mathbf{x}) \pi (\mathbf{x}) d\mathbf{x}\), where f is any square-integrable function of \(\mathbf{x}\) (w.r.t. \(\bar{\pi }(\mathbf{x})\)), and \(Z=\int _{ \mathcal {X}} \pi (\mathbf{x}) d\mathbf{x}<\infty \) with \(\pi (\mathbf{x}) \ge 0\) for all \(\mathbf{x}\in \mathcal {X}\subseteq \mathbb {R}^{D_x}\). Let us assume that we have two proposal pdfs, \(q_1(\mathbf{x})\) and \(q_2(\mathbf{x})\), from which we intend to draw \(M_1\) and \(M_2\) samples, respectively:

$$\begin{aligned} \mathbf{x}_1^{(1)},\ldots ,\mathbf{x}_{1}^{(M_1)}\sim q_1(\mathbf{x}) \quad \text {and} \quad \mathbf{x}_2^{(1)},\ldots ,\mathbf{x}_{2}^{(M_2)}\sim q_2(\mathbf{x}). \end{aligned}$$

There are at least two procedures to build a joint IS estimator: the standard MIS approach and the full deterministic mixture (DM-MIS) scheme.

1.1 Standard IS approach

The simplest approach (Robert and Casella 2004, Chap. 14) is computing the classical IS weights:

$$\begin{aligned} w_1^{(i)} = \frac{\pi (\mathbf{x}_1^{(i)})}{q_1(\mathbf{x}_1^{(i)})}, \quad w_2^{(k)} = \frac{\pi (\mathbf{x}_2^{(k)})}{q_2(\mathbf{x}_2^{(k)})}, \end{aligned}$$
(42)

with \( i=1,\ldots , M_1\) and \(k=1,\ldots , M_2\). The IS estimator is then built by normalizing them jointly, i.e., computing

$$\begin{aligned} \hat{I}_{IS} = \frac{1}{S_{tot}}\left( \sum _{i=1}^{M_1} w_1^{(i)} f(\mathbf{x}_1^{(i)})+\sum _{k=1}^{M_2} w_2^{(k)} f(\mathbf{x}_2^{(k)}) \right) , \end{aligned}$$
(43)

where \(S_{tot}=\sum _{i=1}^{M_1} w_1^{(i)}+\sum _{k=1}^{M_2} w_2^{(k)}\). For \(J>2\) proposal pdfs and \(\mathbf{x}_j^{(1)},\ldots ,\mathbf{x}_{j}^{(M_j)}\sim q_j(\mathbf{x})\), for \(j=1,\ldots ,J\), we have

$$\begin{aligned} \left\{ \begin{array}{l} w_j^{(m_j)} = \frac{\pi (\mathbf{x}_j^{(m_j)})}{q_j(\mathbf{x}_j^{(m_j)})}, \quad \text { and } \\ \hat{I}_{IS} = \frac{1}{\sum _{j=1}^{J} \sum _{m_j=1}^{M_j}w_{j}^{(m_j)}} \sum _{j=1}^{J} \sum _{m_j=1}^{M_j}w_{j}^{(m_j)} f(\mathbf{x}_{j}^{(m_j)}). \end{array} \right. \end{aligned}$$

In this case, \(S_{tot}=\sum _{j=1}^{J} \sum _{m_j=1}^{M_j}w_{j}^{(m_j)}\).
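As a concrete illustration of the standard MIS weighting of Eqs. (42)–(43), the following sketch uses a toy unnormalized 1-D Gaussian target and three Gaussian proposals; all numerical values here are our own illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unnormalized target (assumption for illustration):
# pi(x) ∝ exp(-0.5 * (x - 3)^2), so E[x] = 3.
pi = lambda x: np.exp(-0.5 * (x - 3.0) ** 2)
f = lambda x: x

# J = 3 Gaussian proposals q_j = N(mu_j, sigma_j^2), with M_j samples each.
mus, sigmas, Ms = [0.0, 2.0, 5.0], [2.0, 2.0, 2.0], [500, 500, 500]

num, S_tot = 0.0, 0.0
for mu, sig, M in zip(mus, sigmas, Ms):
    x = rng.normal(mu, sig, size=M)                 # x_j^(1..M_j) ~ q_j
    q = np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))
    w = pi(x) / q                                   # classical IS weights, Eq. (42)
    num += np.sum(w * f(x))
    S_tot += np.sum(w)

I_hat = num / S_tot                                 # jointly normalized estimator, Eq. (43)
print(I_hat)
```

Note that each weight uses only its own proposal \(q_j\) in the denominator; the J partial weight sets are normalized jointly at the end.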

1.2 Deterministic mixture approach

An alternative approach is based on the deterministic mixture sampling idea (Owen and Zhou 2000; Veach and Guibas 1995; Elvira et al. 2015). Considering \(J=2\) proposals \(q_1, q_2\), and setting

$$\begin{aligned} \mathcal {Z}= \left\{ \mathbf{x}_{1}^{(1)},\ldots ,\mathbf{x}_{1}^{(M_1)},\mathbf{x}_{2}^{(1)},\ldots ,\mathbf{x}_{2}^{(M_2)}\right\} , \end{aligned}$$

with \(\mathbf{x}_{j}^{(m_j)} \in \mathbb {R}^{D_x}\) (\(j \in \{1,2\}\) and \(1 \le m_j \le M_j\)), the weights are now defined as

$$\begin{aligned} w_{j}^{(m_j)} = \frac{\pi (\mathbf{x}_{j}^{(m_j)})}{\frac{M_1}{M_1+M_2} q_1(\mathbf{x}_{j}^{(m_j)})+\frac{M_2}{M_1+M_2} q_2(\mathbf{x}_{j}^{(m_j)})}. \end{aligned}$$
(44)

In this case, the complete proposal is considered to be a mixture of \(q_1\) and \(q_2\), weighted according to the number of samples drawn from each one. Note that, unlike in the standard procedure for sampling from a mixture, a deterministic and fixed number of samples are drawn from each proposal in the DM approach (Elvira et al. 2015). It can be shown that the set \(\mathcal {Z}\) of samples drawn in this deterministic way is distributed according to the mixture \(q(\mathbf{z})=\frac{M_1}{M_1+M_2} q_1(\mathbf{z})+\frac{M_2}{M_1+M_2} q_2(\mathbf{z})\) (Owen 2013, Chap. 9, Sect. 11). The DM estimator is finally given by

$$\begin{aligned} \hat{I}_{DM} = \frac{1}{S_{tot}}\sum _{j=1}^{2} \sum _{m_j=1}^{M_j}{w_{j}^{(m_j)} f(\mathbf{x}_{j}^{(m_j)})}, \end{aligned}$$
(45)

where \(S_{tot} =\sum _{j=1}^{2} \sum _{m_j=1}^{M_j}w_{j}^{(m_j)}\) and the \(w_{j}^{(m_j)}\) are given by (44). For \(J>2\) proposal pdfs, the DM estimator can also be easily generalized:

$$\begin{aligned} \left\{ \begin{array}{l} w_{i}^{(m_i)} = \frac{\pi (\mathbf{x}_{i}^{(m_i)})}{\sum _{j=1}^{J}{\frac{M_j}{M_{tot}} q_j(\mathbf{x}_{i}^{(m_i)})}},\quad \text { and } \\ \hat{I}_{DM} = \frac{1}{\sum _{j=1}^{J} \sum _{m_j=1}^{M_j}w_{j}^{(m_j)}} \sum \limits _{j=1}^{J} \sum \limits _{m_j=1}^{M_j}w_{j}^{(m_j)} f(\mathbf{x}_{j}^{(m_j)}), \end{array} \right. \end{aligned}$$

with \(i=1,\ldots ,J\), \(M_{tot}=M_1+M_2+\cdots +M_J\) and \(S_{tot} =\sum _{j=1}^{J} \sum _{m_j=1}^{M_j}w_{j}^{(m_j)}\). On the one hand, the DM approach is more efficient than the standard MIS method, providing better performance in terms of a reduced variance of the corresponding estimator, as shown in the following section. On the other hand, it needs to evaluate every proposal \(M_{tot}\) times instead of only \(M_j\) times (as in the standard MIS procedure), and is therefore more costly from a computational point of view. However, this increased computational cost is negligible when the proposal is much cheaper to evaluate than the target, as often happens in practical applications.
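A minimal sketch of the DM weighting of Eqs. (44)–(45): exactly \(M_j\) samples are drawn from each proposal, but every sample is weighted against the full mixture. The 1-D Gaussian target and proposal parameters are our own illustrative assumptions, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy unnormalized target (assumption): pi(x) ∝ exp(-0.5 * (x - 3)^2), E[x] = 3.
pi = lambda x: np.exp(-0.5 * (x - 3.0) ** 2)
f = lambda x: x

mus = np.array([0.0, 2.0, 5.0])
sigmas = np.array([2.0, 2.0, 2.0])
Ms = np.array([500, 500, 500])
M_tot = Ms.sum()

def q_pdf(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

# Deterministic mixture: draw exactly M_j samples from each q_j ...
samples = np.concatenate([rng.normal(m, s, size=M)
                          for m, s, M in zip(mus, sigmas, Ms)])

# ... but weight every sample against the whole mixture, Eq. (44):
# w = pi(x) / sum_j (M_j / M_tot) q_j(x)
mix = sum((M / M_tot) * q_pdf(samples, m, s)
          for m, s, M in zip(mus, sigmas, Ms))
w = pi(samples) / mix

I_hat_dm = np.sum(w * f(samples)) / np.sum(w)     # DM estimator, Eq. (45)
print(I_hat_dm)
```

The only change w.r.t. the standard MIS weights is the denominator, which here requires evaluating all J proposals at every sample.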

1.3 Convex combination of partial IS estimators

Regardless of the type of weights employed in the IS scheme [either as in Eq. (42) or as in Eq. (44)], the resulting estimators can be written as a convex combination of simpler ones. First of all, let us consider again the use of \(J=2\) proposals, \(q_1\) and \(q_2\). We draw \(M_j\) samples from each one, \(\mathbf{x}_j^{(1)},\ldots ,\mathbf{x}_{j}^{(M_j)}\sim q_j(\mathbf{x})\), with \(j\in \{1,2\}\). The two partial sums of the weights, corresponding only to the samples drawn from \(q_1\) and \(q_2\), are \(S_1=\sum _{i=1}^{M_1} w_1^{(i)}\) and \(S_2=\sum _{k=1}^{M_2} w_2^{(k)}\). The partial IS estimators, obtained by considering only one proposal pdf, are \(\hat{I}_1=\sum _{i=1}^{M_1} \bar{w}_1^{(i)} f(\mathbf{x}_1^{(i)})\) and \(\hat{I}_2=\sum _{k=1}^{M_2} \bar{w}_2^{(k)} f(\mathbf{x}_2^{(k)})\), where the normalized weights are \(\bar{w}_1^{(i)}=\frac{w_1^{(i)}}{S_1}\) and \(\bar{w}_2^{(k)}=\frac{w_2^{(k)}}{S_2}\), respectively. The complete IS estimator, taking into account all \(M_1+M_2\) samples jointly, is

$$\begin{aligned} \hat{I}_{tot}= & {} \frac{1}{S_1+S_2}\left( S_1 \hat{I}_1+S_2 \hat{I}_2 \right) \nonumber \\= & {} \frac{S_1}{S_1+S_2}\hat{I}_1+\frac{S_2}{S_1+S_2}\hat{I}_2. \end{aligned}$$
(46)

This procedure can easily be extended to \(J>2\) different proposal pdfs, obtaining the complete estimator as the convex combination of the J partial estimators:

$$\begin{aligned} \begin{aligned}&\hat{I}_{tot} = \frac{\sum _{j=1}^{J}{S_j \hat{I}_j}}{\sum _{j=1}^{J}{S_j}}, \\&\hat{Z}_{tot} =\frac{1}{\sum _{j=1}^{J} M_j}\sum _{j=1}^{J}{S_j}=\frac{1}{\sum _{j=1}^{J} M_j} \sum _{j=1}^{J}M_j {\hat{Z}_j} , \end{aligned} \end{aligned}$$
(47)

where \(\mathbf{x}_j^{(1)},\ldots ,\mathbf{x}_{j}^{(M_j)} \sim q_j(\mathbf{x}), \hat{I}_j = \sum _{k=1}^{M_j}{w_j^{(k)} f(\mathbf{x}_j^{(k)})}, S_j = \sum _{k=1}^{M_j}{w_j^{(k)}}\) and \(\hat{Z}_j=\frac{1}{M_j} \sum _{k=1}^{M_j}w_j^{(k)}\).
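The convex-combination identity in Eq. (47) can be checked numerically: the partial estimators \(\hat{I}_j\), weighted by their partial sums \(S_j\), reproduce the jointly normalized estimator. The toy target and proposals below are our own illustrative assumptions (target \(\propto \exp (-0.5(x-3)^2)\), so \(E[x]=3\) and \(Z=\sqrt{2\pi }\)):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy unnormalized target (assumption): pi(x) ∝ exp(-0.5 * (x - 3)^2).
pi = lambda x: np.exp(-0.5 * (x - 3.0) ** 2)
f = lambda x: x

S, I_part, Z_part, Ms = [], [], [], [400, 600]
for mu, M in zip([1.0, 4.0], Ms):                  # two Gaussian proposals N(mu, 2^2)
    x = rng.normal(mu, 2.0, size=M)
    q = np.exp(-0.5 * ((x - mu) / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))
    w = pi(x) / q                                  # classical IS weights
    S.append(w.sum())                              # partial sum S_j
    I_part.append(np.sum(w * f(x)) / w.sum())      # partial estimator I_j
    Z_part.append(w.mean())                        # partial estimate Z_j

S, I_part = np.array(S), np.array(I_part)
I_tot = np.sum(S * I_part) / np.sum(S)             # convex combination, Eq. (47)
Z_tot = np.sum(np.array(Ms) * np.array(Z_part)) / sum(Ms)
print(I_tot, Z_tot)                                # I_tot ≈ 3, Z_tot ≈ sqrt(2*pi)
```

The weights \(S_j/\sum _j S_j\) automatically downweight proposals that match the target poorly.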

Fig. 7: (Ex-Sect. 6.2) Graphical representation of the results in Table 12 (except for the last column): the curve \(\log (\text {MSE})\) versus \(\log (\sigma )\) with \(\sigma \in \{0.5,1,2,3,5,10,70\}\) for the different methods; (a) worst results and (b) best results

Appendix 3: Hierarchical interpretation of PMC

The standard population Monte Carlo (PMC) method (Cappé et al. 2004) can be interpreted as a hierarchical procedure. Although the two different layers can be recognized, there are some differences w.r.t. the hierarchical procedure in Sect. 3. The first is that in PMC the generation of the \({\varvec{\mu }}\)’s is not independent of the previously generated \(\mathbf{x}\)’s. The second is that the prior is instead \(h({\varvec{\mu }})=\hat{\pi }_t^{(N)}({\varvec{\mu }})\), where \(\hat{\pi }_t^{(N)}\) is an approximation of the measure of \(\bar{\pi }({\varvec{\mu }})\) obtained using the previously generated samples (in the second level of the hierarchical approach). More specifically, the standard PMC method (Cappé et al. 2004) is an adaptive importance sampler using a population of proposals \(q_1, \ldots , q_N\). Given an initial set of mean vectors, \({\varvec{\mu }}_{1,0}, \ldots , {\varvec{\mu }}_{N,0}\), PMC consists of the following steps:

  1. For \(t=0,\ldots ,T-1\):

     (a) Draw \(\mathbf{x}_{n,t}\sim q_{n,t}(\mathbf{x}|{\varvec{\mu }}_{n,t},\mathbf{C}_n)\), for \(n=1,\ldots ,N\).

     (b) Assign to each sample \(\mathbf{x}_{n,t}\) the weight

      $$\begin{aligned} w_{n,t}=\frac{\pi (\mathbf{x}_{n,t})}{q_{n,t}(\mathbf{x}_{n,t}|{\varvec{\mu }}_{n,t},\mathbf{C}_n)}. \end{aligned}$$
      (48)
     (c) Resampling: draw N independent samples \({\varvec{\mu }}_{n,t+1}\), \(n=1,\ldots , N\), according to the particle approximation

      $$\begin{aligned} \hat{\pi }_t^{(N)}({\varvec{\mu }}|\mathbf{x}_{1:N,t})=\frac{1}{\sum _{n=1}^N w_{n,t}} \sum _{n=1}^N w_{n,t} \delta ({\varvec{\mu }}-\mathbf{x}_{n,t}), \end{aligned}$$
      (49)

      where we have denoted \(\mathbf{x}_{1:N,t}=[\mathbf{x}_{1,t},\ldots ,\mathbf{x}_{N,t}]^{\top }\). Note that each \({\varvec{\mu }}_{n,t+1} \in \{\mathbf{x}_{1,t},\ldots , \mathbf{x}_{N,t}\}\), for all n.

  2. Return all the pairs \(\{\mathbf{x}_{n,t}, w_{n,t}\}\), for \(n=1,\ldots ,N\) and \(t=0,\ldots ,T-1\).
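The PMC loop above can be sketched in a few lines. The 1-D Gaussian target, population size, scale, and number of iterations below are illustrative assumptions of ours, not the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy unnormalized target (assumption): pi(x) ∝ exp(-0.5 * (x - 3)^2), E[x] = 3.
pi = lambda x: np.exp(-0.5 * (x - 3.0) ** 2)

N, T, sig = 50, 20, 1.5                     # population size, iterations, scale C_n
mu = rng.uniform(-1, 1, size=N)             # initial means mu_{n,0}
all_x, all_w = [], []

for t in range(T):
    x = mu + sig * rng.standard_normal(N)   # step (a): x_{n,t} ~ q_{n,t}(.|mu_{n,t}, C_n)
    q = np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))
    w = pi(x) / q                           # step (b): IS weights, Eq. (48)
    all_x.append(x); all_w.append(w)
    p = w / w.sum()                         # step (c): multinomial resampling of means
    mu = rng.choice(x, size=N, p=p)         # mu_{n,t+1} ~ \hat\pi_t^{(N)}, Eq. (49)

x, w = np.concatenate(all_x), np.concatenate(all_w)
I_hat = np.sum(w * x) / np.sum(w)           # self-normalized estimate of E[x]
print(I_hat)
```

Note how the resampled means play exactly the role of the prior draws in the hierarchical interpretation: each \({\varvec{\mu }}_{n,t+1}\) is one of the previous samples \(\mathbf{x}_{1:N,t}\).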

Fixing an iteration t, the generating procedure used in one iteration of the standard PMC method can be cast in the hierarchical formulation:

  1. Draw N samples \({\varvec{\mu }}_{1,t},\ldots ,{\varvec{\mu }}_{N,t}\) from \(\hat{\pi }_{t-1}^{(N)}({\varvec{\mu }}| \mathbf{x}_{1:N,t-1})\).

  2. Draw \(\mathbf{x}_{n,t}\sim q_{n,t}(\mathbf{x}|{\varvec{\mu }}_{n,t},\mathbf{C}_n)\), for \(n=1,\ldots ,N\).

Note that \(\hat{\pi }_{t-1}^{(N)}\) plays the role of the prior h in the hierarchical scheme above. Unlike in the novel proposed scheme, the two levels of the hierarchical procedure are not independent, since the pdf \(\hat{\pi }_{t}^{(N)}({\varvec{\mu }}|\mathbf{x}_{1:N,t})\) depends on the samples drawn in the lower level. Furthermore, \(\hat{\pi }_{t}^{(N)}\) also varies with t and N, whereas in our procedure we consider a fixed prior h. However, note that \(\hat{\pi }_{t}^{(N)}\) is an empirical measure approximation of \({\bar{\pi }}\) that improves as N grows. An equivalent formulation of the hierarchical scheme for PMC is given below, involving a probability of generating a new mean \({\varvec{\mu }}\) given the previous ones, \({\varvec{\mu }}_{1:N,t-1}=[{\varvec{\mu }}_{1,t-1},\ldots ,{\varvec{\mu }}_{N,t-1}]^{\top }\), denoted as \(K_t^{(N)}({\varvec{\mu }}|{\varvec{\mu }}_{1:N,t-1})\).

1.1 Distribution after one resampling step

Consider the t-th iteration of PMC. Let us define

$$\begin{aligned} \mathbf{m}_{\lnot n}=[\mathbf{x}_{1,t},\ldots , \mathbf{x}_{n-1,t},\mathbf{x}_{n+1,t},\ldots , \mathbf{x}_{N,t}]^{\top }, \end{aligned}$$

as the vector containing all the generated samples except for the n-th. Let us also denote by \({\varvec{\mu }}_{i,t+1}\in \{\mathbf{x}_{1,t},\ldots ,\mathbf{x}_{N,t}\}\) a generic mean vector, i.e., \(i\in \{1,\ldots ,N\}\), at iteration \(t+1\), after applying one resampling step (i.e., a multinomial sampling according to the normalized weights). Hence, the distribution of \({\varvec{\mu }}\) given the previous means \({\varvec{\mu }}_{1:N,t}\) is

$$\begin{aligned}&K_{t+1}^{(N)}({\varvec{\mu }}_{i,t+1}|{\varvec{\mu }}_{1,t},\dots ,{\varvec{\mu }}_{N,t}) \nonumber \\&\quad =\int _{\mathcal {X}^{N}} \hat{\pi }_t^{(N)}({\varvec{\mu }}_{i,t+1}|\mathbf{x}_{1:N,t}) \left[ \prod _{n=1}^{N}{q_{n,t}(\mathbf{x}_{n,t}|{\varvec{\mu }}_{n,t},\mathbf{C}_n)}\right] d\mathbf{x}_{1:N,t}, \end{aligned}$$
(50)

where \(\hat{\pi }_t^{(N)}({\varvec{\mu }}|\mathbf{x}_{1:N,t})\) is given in Eq. (49). For simplicity, below we denote

$$\begin{aligned} q_{n}(\mathbf{x})=q_{n,t}(\mathbf{x}|{\varvec{\mu }}_{n,t},\mathbf{C}_n), \quad \text { and }\quad {\varvec{\mu }}={\varvec{\mu }}_{i,t}. \end{aligned}$$

Then, after some straightforward rearrangements, Eq. (50) can be rewritten as

$$\begin{aligned}&K_{t+1}^{(N)}({\varvec{\mu }}|{\varvec{\mu }}_{1,t},\dots ,{\varvec{\mu }}_{N,t})\\&\quad =\sum _{j=1}^{N}\left( \int _{\mathcal {X}^{N-1}} \frac{\pi (\mathbf{x}_{j,t})}{\sum _{n=1}^{N}{\frac{\pi (\mathbf{x}_{n,t})}{q_n(\mathbf{x}_{n,t})}}}\left[ \prod _{\begin{array}{c} n=1 \\ n \ne j \end{array}}^{N}{q_n(\mathbf{x}_{n,t})}\right] d\mathbf{m}_{\lnot j}\right) \delta ({\varvec{\mu }}-\mathbf{x}_{j,t}). \end{aligned}$$

Finally, we can write

$$\begin{aligned}&K_{t+1}^{(N)}({\varvec{\mu }}|{\varvec{\mu }}_{1,t},\dots ,{\varvec{\mu }}_{N,t})\nonumber \\&\quad = \pi ({\varvec{\mu }}) \sum _{j=1}^{N} \left( {\int _{\mathcal {X}^{N-1}} \frac{1}{N\hat{Z}} \left[ \prod _{\begin{array}{c} n=1 \\ n \ne j \end{array}}^N q_n(\mathbf{x}_{n,t})\right] d\mathbf{m}_{\lnot j}} \right) , \end{aligned}$$
(51)

where \(\hat{Z}= \frac{1}{N}\sum _{n=1}^N\frac{\pi (\mathbf{x}_n)}{q_n(\mathbf{x}_n)}\) is the estimate of the normalizing constant of the target obtained using the classical IS weights. The hierarchical formulation of PMC can be rewritten as:

  1. Draw N samples \({\varvec{\mu }}_{1,t},\ldots ,{\varvec{\mu }}_{N,t}\) from \(K_{t}^{(N)}({\varvec{\mu }}|{\varvec{\mu }}_{1:N,t-1})\) in Eq. (50) or (51).

  2. Draw \(\mathbf{x}_{n,t}\sim q_{n,t}(\mathbf{x}|{\varvec{\mu }}_{n,t},\mathbf{C}_n)\), for \(n=1,\ldots ,N\).

Fig. 8: (Ex-Sect. 6.3) The curve \(\log (\text {MSE})\) as a function of the dimension of the problem, \(D_x\in \{2,3,5,10,12,15,20,25,35,40,50\}\), for the different methods. We test (a) \(N=100\) and (b) \(N=500\), keeping fixed the same number of target evaluations, \(E=2\times 10^5\); hence the total number of iterations (of the different algorithms) is greater in (a) than in (b)

When \(N \rightarrow \infty \), then \(\hat{Z} \rightarrow Z\) (Robert and Casella 2004), and thus \(K_t^{(N)}({\varvec{\mu }}|{\varvec{\mu }}_{1:N,t-1}) \rightarrow \frac{1}{Z} \pi ({\varvec{\mu }}) = \bar{\pi }({\varvec{\mu }})\), for all \(t=1,\ldots ,T\). Namely, as N grows, the hierarchical scheme above tends to have \(h({\varvec{\mu }})={\bar{\pi }}({\varvec{\mu }})\) as the prior in the upper level. Figure 5 shows three different examples of the conditional pdf \(K_t^{(N)}\) (obtained via numerical approximation) for a fixed t and different \(N\in \{2,20,1000\}\). We can observe that \(K_t^{(N)}\) becomes closer to the target \(\bar{\pi }\) (depicted with a solid line) as N grows.

1.1.1 Differences between PMC and MAIS algorithms

In the MAIS schemes described in Sect. 5, since we use MCMC methods for drawing from \(h({\varvec{\mu }})={\bar{\pi }}({\varvec{\mu }})\), we actually also have a current prior \(K_t^{(N)}({\varvec{\mu }}_{1:N,t}|{\varvec{\mu }}_{1:N,t-1})\), determined by the kernels of the MCMC algorithms considered. For instance, in PI-MAIS we have

$$\begin{aligned} K_t^{(N)}({\varvec{\mu }}_{1:N,t}|{\varvec{\mu }}_{1:N,t-1})=\prod _{n=1}^N A_n({\varvec{\mu }}_{n,t}|{\varvec{\mu }}_{n,t-1}), \end{aligned}$$

where \(A_n({\varvec{\mu }}_{n,t}|{\varvec{\mu }}_{n,t-1})\) is the kernel of the n-th chain. Unlike in PMC, since we use ergodic chains with invariant pdf \({\bar{\pi }}\), we know that \(K_t^{(N)}({\varvec{\mu }}_{1:N,t}|{\varvec{\mu }}_{1:N,t-1})\rightarrow \prod _{n=1}^N{\bar{\pi }}({\varvec{\mu }}_{n})\) as \(t\rightarrow \infty \), for fixed N, whereas PMC requires increasing N to obtain the same result (Figs. 6, 7, 8).
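The two-level structure just described can be sketched compactly. Everything below is a toy assumption of ours (1-D Gaussian target, plain Metropolis kernels, Gaussian proposals with fixed scales), not the paper's experimental setup: the upper level runs N parallel Metropolis chains with invariant pdf \(\bar{\pi }\), and the lower level performs importance sampling around the chain states with a deterministic-mixture denominator:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy unnormalized target (assumption): pi(x) ∝ exp(-0.5 * (x - 3)^2), E[x] = 3.
pi = lambda x: np.exp(-0.5 * (x - 3.0) ** 2)

N, T, M = 10, 50, 20        # chains, iterations, IS samples per proposal
sig_mh, sig_q = 1.0, 1.0    # MH step size and IS proposal scale
mu = rng.uniform(-2, 2, size=N)                  # initial chain states mu_{n,0}
num = den = 0.0

for t in range(T):
    # Upper level: one Metropolis step per chain (symmetric random-walk proposal).
    prop = mu + sig_mh * rng.standard_normal(N)
    accept = rng.random(N) < pi(prop) / pi(mu)
    mu = np.where(accept, prop, mu)

    # Lower level: draw M samples from each q_n(.|mu_n, sig_q) and weight
    # against the full mixture (deterministic-mixture denominator).
    x = (mu[:, None] + sig_q * rng.standard_normal((N, M))).ravel()
    mix = np.mean(np.exp(-0.5 * ((x[None, :] - mu[:, None]) / sig_q) ** 2)
                  / (sig_q * np.sqrt(2 * np.pi)), axis=0)
    w = pi(x) / mix
    num += np.sum(w * x)
    den += np.sum(w)

I_hat_pimais = num / den    # self-normalized estimate of E[x]
print(I_hat_pimais)
```

The key point of the construction is visible here: the IS estimates never feed back into the chains, so the adaptation of the means is driven purely by the MCMC dynamics.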



Cite this article

Martino, L., Elvira, V., Luengo, D. et al. Layered adaptive importance sampling. Stat Comput 27, 599–623 (2017). https://doi.org/10.1007/s11222-016-9642-5
