Conditionally structured variational Gaussian approximation with importance weights

Abstract

We develop flexible methods for deriving variational inference for models with complex latent variable structure. By splitting the variables in these models into “global” parameters and “local” latent variables, we define a class of variational approximations that exploit this partitioning and go beyond Gaussian variational approximation. This approximation is motivated by the fact that in many hierarchical models, there are global variance parameters that determine the scale of local latent variables in their posterior conditional on the global parameters. We also consider parsimonious parametrizations that exploit conditional independence structure, as well as improved estimation of the log marginal likelihood and of the variational density using importance weights. These methods are shown to improve significantly on Gaussian variational approximation methods at a similar computational cost. Application of the methodology is illustrated using generalized linear mixed models and state space models.




Acknowledgements

We wish to thank the editor and reviewer for their time in reviewing this manuscript and for their constructive comments.

Author information

Correspondence to Linda S. L. Tan.

Additional information


Linda Tan and Aishwarya Bhaskaran are supported by the start-up Grant R-155-000-190-133.

Appendices

Appendix A: Derivation of stochastic gradient

Let \(\otimes \) denote the Kronecker product between any two matrices. We have

$$\begin{aligned} r_\lambda (s) = \begin{bmatrix} \theta _G \\ \theta _L \end{bmatrix} = \begin{bmatrix} \mu _1 + C_1^{-T} s_1 \\ d+ C_2^{-T} (s_2 - DC_1^{-T} s_1) \end{bmatrix}, \end{aligned}$$

where \(v(C_2^*) = f + F(\mu _1 + C_1^{-T} s_1)\). Differentiating \(r_\lambda (s)\) with respect to \(\lambda \), \(\nabla _\lambda r_\lambda (s) \) is given by

$$\begin{aligned} \begin{bmatrix} \nabla _{\mu _1} \theta _G & \nabla _{\mu _1} \theta _L \\ \nabla _{v(C_1^*)} \theta _G & \nabla _{v(C_1^*)} \theta _L \\ \nabla _{d} \theta _G & \nabla _{d} \theta _L \\ \nabla _{\mathrm{vec}(D)} \theta _G & \nabla _{\mathrm{vec}(D)} \theta _L \\ \nabla _{f} \theta _G & \nabla _{f} \theta _L \\ \nabla _{\mathrm{vec}(F)} \theta _G & \nabla _{\mathrm{vec}(F)} \theta _L \end{bmatrix}. \end{aligned}$$

Since \(\theta _G\) does not depend on d, D, f and F, we have

$$\begin{aligned} \begin{aligned} \nabla _{d} \theta _G&= 0_{nL \times G}, \quad \nabla _{\mathrm{vec}(D)} \theta _G = 0_{nLG \times G} \\ \nabla _{f} \theta _G&= 0_{nL(nL+1)/2 \times G}, \;\; \nabla _{\mathrm{vec}(F)} \theta _G = 0_{nLG(nL+1)/2 \times G}. \end{aligned} \end{aligned}$$

It is easy to see that \(\nabla _{\mu _1} \theta _G = I_G\) and \(\nabla _d \theta _L = I_{nL}\). The rest of the terms are derived as follows.
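For concreteness, the map \(r_\lambda (s)\) can also be written out directly in code. The NumPy sketch below is an illustrative reconstruction rather than the authors' implementation; it assumes, consistently with the definitions of \(D_1^*\) and \(D_2^*\) used later, that the starred quantities store the logarithm of the diagonal entries of \(C_1\) and \(C_2\), that \(v(\cdot )\) stacks the lower-triangular part column by column, and it ignores any sparsity structure imposed on \(C_2\) in the main text.

```python
import numpy as np

def unvech_lower(v, m):
    """Rebuild an m x m lower-triangular matrix from its half-vectorisation v(.)."""
    A = np.zeros((m, m)); k = 0
    for j in range(m):
        A[j:, j] = v[k:k + m - j]
        k += m - j
    return A

def r_lambda(s, mu1, C1, d, D, f, F):
    """Map s ~ N(0, I_{G+nL}) to theta = (theta_G, theta_L); C1, C2 are precision factors."""
    G, nL = mu1.size, d.size
    s1, s2 = s[:G], s[G:]
    theta_G = mu1 + np.linalg.solve(C1.T, s1)             # mu_1 + C_1^{-T} s_1
    C2 = unvech_lower(f + F @ theta_G, nL)                # v(C_2^*) = f + F theta_G
    np.fill_diagonal(C2, np.exp(np.diag(C2)))             # assumed: C_2^* stores the log-diagonal of C_2
    theta_L = d + np.linalg.solve(C2.T, s2 - D @ np.linalg.solve(C1.T, s1))
    return theta_G, theta_L
```

The derivatives below are the (transposed) Jacobian blocks of this map, so each of them can be checked against central differences of \(r_\lambda (s)\).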

Differentiating \(\theta _G\) with respect to \(v(C_1^*)\),

$$\begin{aligned} \begin{aligned} \mathrm{d}\theta _G&= -C_1^{-T} \mathrm{d}(C_1^T) C_1^{-T} s_1\\&= - (s_1^T C_1^{-1} \otimes C_1^{-T}) K_G E_G^T D_1^* \mathrm{d}v(C_1^*) \\&= - (C_1^{-T} \otimes s_1^T C_1^{-1}) E_G^T D_1^* \mathrm{d}v(C_1^*). \\ \therefore \; \nabla _{v(C_1^*)} \theta _G&= - D_1^* E_G (C_1^{-1} \otimes C_1^{-T} s_1 ). \end{aligned} \end{aligned}$$
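This expression can be verified numerically. The sketch below is illustrative only: it constructs the elimination matrix \(E_G\) explicitly, takes \(D_1^* = \mathrm{diag}\{ v(\mathrm{dg}(C_1) + \mathbf{1}_G\mathbf{1}_G^T - I_G)\}\) by analogy with \(D_L^*\) in Appendix B (an assumption about the parametrization of \(C_1^*\)), and compares the formula with a central-difference Jacobian of \(\theta _G\) as a function of \(v(C_1^*)\).

```python
import numpy as np

rng = np.random.default_rng(3)
G = 3
q = G * (G + 1) // 2                     # length of v(C_1^*)

def vech(A):
    """v(A): lower-triangular half-vectorisation, column by column."""
    return np.concatenate([A[j:, j] for j in range(A.shape[0])])

def unvech_lower(v, m):
    A = np.zeros((m, m)); k = 0
    for j in range(m):
        A[j:, j] = v[k:k + m - j]; k += m - j
    return A

def elim(m):
    """Elimination matrix E_m with E_m vec(A) = v(A) (column-major vec)."""
    E = np.zeros((m * (m + 1) // 2, m * m)); r = 0
    for j in range(m):
        for i in range(j, m):
            E[r, j * m + i] = 1.0; r += 1
    return E

mu1, s1 = rng.standard_normal(G), rng.standard_normal(G)
vC1_star = rng.standard_normal(q)

def theta_G_of(v):
    C1 = unvech_lower(v, G)
    np.fill_diagonal(C1, np.exp(np.diag(C1)))    # assumed: C_1^* holds the log of C_1's diagonal
    return C1, mu1 + np.linalg.solve(C1.T, s1)

C1, _ = theta_G_of(vC1_star)
D1_star = np.diag(vech(np.diag(np.diag(C1)) + np.ones((G, G)) - np.eye(G)))
grad = -D1_star @ elim(G) @ np.kron(np.linalg.inv(C1), np.linalg.solve(C1.T, s1)[:, None])

# central-difference Jacobian of theta_G w.r.t. v(C_1^*); grad should equal its transpose
eps = 1e-6
J = np.column_stack([(theta_G_of(vC1_star + eps * e)[1]
                      - theta_G_of(vC1_star - eps * e)[1]) / (2 * eps) for e in np.eye(q)])
print(np.max(np.abs(grad - J.T)))        # small (finite-difference error only)
```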

Differentiating \(\theta _L\) with respect to f,

$$\begin{aligned} \begin{aligned} \mathrm{d}\theta _L&= -C_2^{-T} \mathrm{d}(C_2^T) C_2^{-T} (s_2 - DC_1^{-T} s_1) \\&= - \{ (s_2 - DC_1^{-T} s_1)^T C_2^{-1} \otimes C_2^{-T} \} \\&\quad \times K_{nL} E_{nL}^T D_2^* \mathrm{d}f \\ \therefore \; \nabla _{f} \theta _L&= - D_2^* E_{nL} \{ C_2^{-1} \otimes C_2^{-T} (s_2 - DC_1^{-T} s_1) \}. \end{aligned} \end{aligned}$$

Differentiating \(\theta _L\) with respect to F,

$$\begin{aligned} \begin{aligned} \mathrm{d}\theta _L&= (\nabla _{f} \theta _L)^T \mathrm{d}F \theta _G \\&= \{ \theta _G^T \otimes (\nabla _{f} \theta _L)^T \} \mathrm{d}\mathrm{vec}(F). \\ \therefore \; \nabla _{\mathrm{vec}(F)} \theta _L&= \theta _G \otimes \nabla _{f} \theta _L. \end{aligned} \end{aligned}$$

Differentiating \(\theta _L\) with respect to D,

$$\begin{aligned} \begin{aligned} \mathrm{d}\theta _L&= -C_2^{-T} \mathrm{d}D C_1^{-T} s_1 \\&= - (s_1^T C_1^{-1} \otimes C_2^{-T}) \mathrm{d}\mathrm{vec}(D). \\ \therefore \; \nabla _{\mathrm{vec}(D)} \theta _L&= - (C_1^{-T} s_1 \otimes C_2^{-1}). \end{aligned} \end{aligned}$$

Differentiating \(\theta _L\) with respect to \(\mu _1\),

$$\begin{aligned} \begin{aligned} \mathrm{d}\theta _L&= (\nabla _{f} \theta _L)^T F \mathrm{d}\mu _1 \\ \therefore \; \nabla _{\mu _1} \theta _L&= F^T (\nabla _{f} \theta _L). \end{aligned} \end{aligned}$$

Differentiating \(\theta _L\) with respect to \(v(C_1^*)\),

$$\begin{aligned} \begin{aligned} \mathrm{d}\theta _L&= -C_2^{-T}\mathrm{d}(C_2^T) C_2^{-T}(s_2 - DC_1^{-T} s_1) \\&\quad - C_2^{-T} D \mathrm{d}(C_1^{-T}) s_1 \\&= (\nabla _{f} \theta _L)^T F\mathrm{d}(C_1^{-T}) s_1 - C_2^{-T} D \mathrm{d}(C_1^{-T}) s_1 \\&= \{ (\nabla _{f} \theta _L)^T F - C_2^{-T} D\} (\nabla _{v(C_1^*)} \theta _G )^T \mathrm{d}v(C_1^*) \\ \therefore \; \nabla _{v(C_1^*)} \theta _L&= \nabla _{v(C_1^*)}\theta _G \{ F^T \nabla _{f} \theta _L - D^T C_2^{-1} \} \\&= \nabla _{v(C_1^*)}\theta _G \{ \nabla _{\mu _1} \theta _L - D^T C_2^{-1} \}. \end{aligned} \end{aligned}$$

Since \(s_1 = C_1^T(\theta _G - \mu _1)\) and \(s_2 = C_2^T (\theta _L - \mu _2)\), we have

$$\begin{aligned} \begin{aligned} \log q_\lambda (\theta )&= \log q(\theta _G) + \log q(\theta _L|\theta _G) \\&= -\frac{G}{2} \log (2\pi ) + \log |C_1| \\&\quad - \frac{1}{2}(\theta _G - \mu _1)^T C_1 C_1^T (\theta _G - \mu _1) \\&\quad -\frac{nL}{2} \log (2\pi ) + \log |C_2| \\&\quad - \frac{1}{2}(\theta _L - \mu _2)^T C_2 C_2^T (\theta _L - \mu _2) \\&= - \frac{nL+G}{2} \log (2\pi ) + \log |C_1C_2| - \frac{1}{2} s^T s. \end{aligned} \end{aligned}$$

As \(\mu _2 = d + C_2^{-T} D(\mu _1 - \theta _G)\) and \(v(C_2^*) = f + F \theta _G\), differentiating \(\log q_\lambda (\theta )\) with respect to \(\theta _G\),

$$\begin{aligned} \begin{aligned} \mathrm{d}\log q_\lambda (\theta )&= - (\theta _G - \mu _1)^T C_1 C_1^T \mathrm{d}\theta _G - (\theta _L - \mu _2)^T C_2 C_2^T(-\mathrm{d}\mu _2)\\&\quad - (\theta _L - \mu _2)^T \mathrm{d}C_2 s_2 + \mathrm{tr}(C_2^{-1} \mathrm{d}C_2) \\&= -s_1^T C_1^T \mathrm{d}\theta _G + s_2^T C_2^T\{- C_2^{-T} D \mathrm{d}\theta _G \\&\quad + \mathrm{d}(C_2^{-T}) D(\mu _1 - \theta _G)\} \\&\quad - \mathrm{vec}(C_2^{-T} s_2 s_2^T)^T \mathrm{d}\mathrm{vec}(C_2) + \mathrm{vec}(C_2^{-T})^T d\mathrm{vec}(C_2) \\&= \mathrm{vec}(C_2^{-T} - \{C_2^{-T} s_2 + (\mu _2 -d)\}s_2^T)^T d\mathrm{vec}(C_2) \\&\quad -s_1^T C_1^T \mathrm{d}\theta _G - s_2^T D \mathrm{d}\theta _G \\&= \mathrm{vec}(C_2^{-T} - (\theta _L -d) s_2^T)^T E_{nL}^T D_2^* F \mathrm{d}\theta _G \\&\quad -s_1^T C_1^T \mathrm{d}\theta _G - s_2^T D \mathrm{d}\theta _G. \end{aligned} \end{aligned}$$

Therefore

$$\begin{aligned} \begin{aligned} \nabla _{\theta _G} \log q_\lambda (\theta )&=F^T D_2^* v(C_2^{-T} - (\theta _L -d) s_2^T) \\&\quad - C_1 s_1 - D^T s_2. \end{aligned} \end{aligned}$$

Note that \(D_2^* v(C_2^{-T}) = v(I_{nL})\) as \(C_2^{-T}\) is upper triangular and \(v(C_2^{-T})\) only retains the diagonal elements of \(C_2^{-T}\).

Differentiating \(\log q_\lambda (\theta )\) with respect to \(\theta _L\),

$$\begin{aligned} \begin{aligned} \mathrm{d}\log q_\lambda (\theta )&= - (\theta _L - \mu _2)^T C_2 C_2^T \mathrm{d}\theta _L \\&= - s_2^T C_2^T \mathrm{d}\theta _L. \\ \therefore \; \nabla _{\theta _L} \log q_\lambda (\theta )&= - C_2 s_2. \end{aligned} \end{aligned}$$
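Both gradients can be checked against central finite differences of \(\log q_\lambda (\theta )\), treating \(C_2\), \(\mu _2\) and \(s\) as functions of \(\theta \) through \(v(C_2^*) = f + F\theta _G\) and \(\mu _2 = d + C_2^{-T}D(\mu _1 - \theta _G)\). The sketch below is illustrative only and assumes, as in the note on \(D_2^*\) above, that \(C_2^*\) carries the log-transformed diagonal of \(C_2\).

```python
import numpy as np

rng = np.random.default_rng(0)
G, nL = 2, 3

def vech(A):
    return np.concatenate([A[j:, j] for j in range(A.shape[0])])

def unvech_lower(v, m):
    A = np.zeros((m, m)); k = 0
    for j in range(m):
        A[j:, j] = v[k:k + m - j]; k += m - j
    return A

def build_C2(theta_G, f, F):
    """C_2 as a function of theta_G: v(C_2^*) = f + F theta_G, diagonal on the log scale."""
    C2 = unvech_lower(f + F @ theta_G, nL)
    np.fill_diagonal(C2, np.exp(np.diag(C2)))
    return C2

def log_q(theta_G, theta_L, lam):
    mu1, C1, d, D, f, F = lam
    C2 = build_C2(theta_G, f, F)
    s1 = C1.T @ (theta_G - mu1)
    s2 = C2.T @ (theta_L - d) + D @ (theta_G - mu1)      # = C_2^T (theta_L - mu_2)
    return (-(G + nL) / 2 * np.log(2 * np.pi)
            + np.log(np.diag(C1)).sum() + np.log(np.diag(C2)).sum()
            - 0.5 * (s1 @ s1 + s2 @ s2))

# illustrative variational parameters; C1 lower triangular with positive diagonal
C1 = np.tril(rng.standard_normal((G, G))); np.fill_diagonal(C1, np.exp(np.diag(C1)))
mu1, d = rng.standard_normal(G), rng.standard_normal(nL)
D = rng.standard_normal((nL, G))
f = rng.standard_normal(nL * (nL + 1) // 2)
F = 0.1 * rng.standard_normal((nL * (nL + 1) // 2, G))
lam = (mu1, C1, d, D, f, F)
theta_G, theta_L = rng.standard_normal(G), rng.standard_normal(nL)

# analytic gradients derived above
C2 = build_C2(theta_G, f, F)
s1 = C1.T @ (theta_G - mu1)
s2 = C2.T @ (theta_L - d) + D @ (theta_G - mu1)
D2_star = np.diag(vech(np.diag(np.diag(C2)) + np.ones((nL, nL)) - np.eye(nL)))
grad_G = (F.T @ D2_star @ vech(np.linalg.inv(C2).T - np.outer(theta_L - d, s2))
          - C1 @ s1 - D.T @ s2)
grad_L = -C2 @ s2

# central finite differences
eps = 1e-6
num_G = np.array([(log_q(theta_G + eps * e, theta_L, lam)
                   - log_q(theta_G - eps * e, theta_L, lam)) / (2 * eps) for e in np.eye(G)])
num_L = np.array([(log_q(theta_G, theta_L + eps * e, lam)
                   - log_q(theta_G, theta_L - eps * e, lam)) / (2 * eps) for e in np.eye(nL)])
print(np.max(np.abs(grad_G - num_G)), np.max(np.abs(grad_L - num_L)))   # both negligible
```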

Appendix B: Gradients for generalized linear mixed models

Since \(\theta = [\beta ^T, \omega ^T, {\tilde{b}}_1^T, \dots , {\tilde{b}}_n^T]^T\), we require

$$\begin{aligned}&\nabla _\theta \log p(y, \theta ) = [\nabla _\beta \log p(y, \theta )^T, \nabla _\omega \log p(y, \theta )^T, \\&\quad \nabla _{{\tilde{b}}_1} \log p(y, \theta )^T, \dots , \nabla _{{\tilde{b}}_n} \log p(y, \theta )^T]^T. \end{aligned}$$

For the centered parametrization, the components in \(\nabla _\theta \log p(y, \theta )\) are given below. Note that \(\beta = [\beta _{RG_1}^T, \beta _{G_2}^T]^T\).

$$\begin{aligned} \nabla _{\beta _{G_2}} \log p(y, \theta )&= \sum _{i=1}^n {X_i^{G_2}}^T \{ y_i - h'(\eta _i) \} - \beta _{G_2}/\sigma _\beta ^2, \\ \nabla _{\beta _{RG_1}} \log p(y, \theta )&= \sum _{i=1}^n C_i^T W W^T ({\tilde{b}}_i - C_i \beta _{RG_1}) - \beta _{RG_1}/\sigma _\beta ^2. \end{aligned}$$

Differentiating \(\log p(y, \theta ) \) with respect to \(\omega \),

$$\begin{aligned} \begin{aligned} \mathrm{d}\log p(y, \theta )&= -\sum _{i=1}^n ({\tilde{b}}_i - C_i \beta _{RG_1})^T \mathrm{d}W W^T ({\tilde{b}}_i - C_i \beta _{RG_1}) \\&\quad + n \mathrm{tr}(W^{-1} \mathrm{d}W) - \omega ^T \mathrm{d}\omega /\sigma _\omega ^2 \\&= \mathrm{vec}\bigg \{ - \sum _{i=1}^n ({\tilde{b}}_i - C_i \beta _{RG_1}) ({\tilde{b}}_i - C_i \beta _{RG_1})^T W \\&\quad + nW^{-T} \bigg \}^T E_L^T D_L^* \mathrm{d}\omega - \omega ^T \mathrm{d}\omega /\sigma _\omega ^2, \end{aligned} \end{aligned}$$

where \(\mathrm{d}v(W) = D^*_L \mathrm{d}\omega \) and \(D^*_L = \mathrm{diag}\{ v(\mathrm{dg}(W) + \mathbf{1}_L\mathbf{1}_L^T - I_L) \}\). Hence

$$\begin{aligned} \begin{aligned} \nabla _\omega \log p(y, \theta )&=- D^*_L \sum _{i=1}^n v\{ ({\tilde{b}}_i - C_i \beta _{RG_1}) ({\tilde{b}}_i \\&\quad - C_i \beta _{RG_1}) ^TW \} \\&\quad + n v(I_L) - \omega /\sigma _\omega ^2. \end{aligned} \end{aligned}$$

Note that \(D_L^* v(W^{-T}) = v(I_L)\) because \(W^{-T}\) is upper triangular and \(v(W^{-T})\) only retains the diagonal elements.

$$\begin{aligned} \nabla _{{\tilde{b}}_i} \log p(y, \theta ) = Z_i^T \{ y_i - h'(\eta _i)\} - W W^T ({\tilde{b}}_i - C_i \beta _{RG_1}). \end{aligned}$$
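These formulas can be checked by numerical differentiation of the terms of \(\log p(y, \theta )\) that involve \({\tilde{b}}_i\). The sketch below is illustrative only: it assumes a Poisson model with log link, so \(h'(\eta ) = \mathrm{e}^{\eta }\), a linear predictor \(\eta _i = X_i^{G_2}\beta _{G_2} + Z_i{\tilde{b}}_i\), and the conditional prior \({\tilde{b}}_i \sim N(C_i\beta _{RG_1}, (WW^T)^{-1})\) implied by the quadratic term above; the data and dimensions are hypothetical and not taken from the paper's examples.

```python
import numpy as np

rng = np.random.default_rng(1)
L, p1, p2, m = 2, 2, 3, 5        # dims of b_i, beta_RG1, beta_G2; m observations in cluster i

# hypothetical data for one cluster i of a Poisson log-link GLMM
X_i = rng.standard_normal((m, p2))        # design for beta_{G_2}
Z_i = rng.standard_normal((m, L))         # random-effect design
C_i = rng.standard_normal((L, p1))
y_i = rng.poisson(1.0, size=m).astype(float)

beta_G2 = rng.standard_normal(p2)
beta_RG1 = rng.standard_normal(p1)
W = np.tril(rng.standard_normal((L, L))); np.fill_diagonal(W, np.exp(np.diag(W)))
b_i = rng.standard_normal(L)              # \tilde{b}_i

def log_p_b(b):
    """Terms of log p(y, theta) involving \tilde{b}_i (Poisson log link assumed)."""
    eta = X_i @ beta_G2 + Z_i @ b
    r = b - C_i @ beta_RG1
    return y_i @ eta - np.exp(eta).sum() - 0.5 * r @ (W @ W.T) @ r

eta = X_i @ beta_G2 + Z_i @ b_i
grad = Z_i.T @ (y_i - np.exp(eta)) - W @ W.T @ (b_i - C_i @ beta_RG1)   # formula above

eps = 1e-6
num = np.array([(log_p_b(b_i + eps * e) - log_p_b(b_i - eps * e)) / (2 * eps)
                for e in np.eye(L)])
print(np.max(np.abs(grad - num)))         # negligible up to finite-difference error
```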

Appendix C: Gradients for state space models

Since \(\theta = [\alpha , \kappa , \psi , b_1^T, \dots , b_n^T]^T\), we require

$$\begin{aligned}&\nabla _\theta \log p(y, \theta ) = [\nabla _\alpha \log p(y, \theta ), \nabla _\kappa \log p(y, \theta ), \\&\quad \nabla _\psi \log p(y, \theta ), \nabla _{b_1} \log p(y, \theta ), \dots , \nabla _{b_n} \log p(y, \theta )]^T. \end{aligned}$$

The components in \(\nabla _\theta \log p(y, \theta )\) are given below.

$$\begin{aligned} \nabla _\alpha \log p(y, \theta )&= \frac{1}{2}\sum ^n_{i=1} (b_i y_i^2 \mathrm{e}^{-\sigma b_i - \kappa } - b_i)(1-\mathrm{e}^{-\sigma }) - \alpha /\sigma _{\alpha }^2, \\ \nabla _\kappa \log p(y, \theta )&= \frac{1}{2} \bigg (\sum ^n_{i=1} y_i^2 \mathrm{e}^{-\sigma b_i - \kappa } - n \bigg ) - \kappa /\sigma _{\kappa }^2,\\ \nabla _\psi \log p(y, \theta )&= \bigg \{ \sum _{i=2}^{n} (b_i - \phi b_{i-1})b_{i-1} + b_1^2 \phi - \frac{\phi }{1-\phi ^2} \bigg \} \phi (1-\phi ) - \psi /\sigma _{\psi }^2,\\ \nabla _{b_1} \log p(y, \theta )&= \frac{\sigma }{2} (y_1^2 \mathrm{e}^{-\sigma b_1 - \kappa } - 1) + \phi (b_2 - \phi b_1) - b_1 (1-\phi ^2). \end{aligned}$$

For \(2 \le i \le n-1\) and for \(i = n\),

$$\begin{aligned} \begin{aligned} \nabla _{b_i} \log p(y, \theta )&= \frac{\sigma }{2} (y_i^2 \mathrm{e}^{-\sigma b_i - \kappa } - 1) +\phi (b_{i+1} - \phi b_i)\\&\quad - (b_i - \phi b_{i-1}). \\ \nabla _{b_n} \log p(y, \theta )&= \frac{\sigma }{2} (y_n^2 \mathrm{e}^{-\sigma b_n - \kappa } - 1) - (b_n - \phi b_{n-1}). \end{aligned} \end{aligned}$$
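As a check, these expressions agree with central finite differences of a joint log-density reconstructed from them, namely \(y_i \mid b_i \sim N(0, \mathrm{e}^{\sigma b_i + \kappa })\), \(b_1 \sim N(0, 1/(1-\phi ^2))\), \(b_i \mid b_{i-1} \sim N(\phi b_{i-1}, 1)\), with \(\sigma = \log (1 + \mathrm{e}^{\alpha })\) and \(\phi = (1 + \mathrm{e}^{-\psi })^{-1}\) inferred from the Jacobian factors \(1 - \mathrm{e}^{-\sigma }\) and \(\phi (1-\phi )\). These model forms are assumptions reverse-engineered from the gradients and should be compared with the model definition in the main text; the sketch below is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
y = rng.standard_normal(n)               # arbitrary data; only the derivatives are checked
sig_a2 = sig_k2 = sig_p2 = 10.0          # prior variances (illustrative values)

def log_p(alpha, kappa, psi, b):
    """Assumed joint log-density (up to a constant) consistent with the listed gradients."""
    sigma = np.log1p(np.exp(alpha))       # sigma = log(1 + e^alpha), assumed
    phi = 1.0 / (1.0 + np.exp(-psi))      # phi = sigmoid(psi), assumed
    obs = np.sum(-(sigma * b + kappa) / 2 - y**2 * np.exp(-sigma * b - kappa) / 2)
    states = (0.5 * np.log(1 - phi**2) - b[0]**2 * (1 - phi**2) / 2
              - np.sum((b[1:] - phi * b[:-1])**2) / 2)
    priors = -alpha**2 / (2 * sig_a2) - kappa**2 / (2 * sig_k2) - psi**2 / (2 * sig_p2)
    return obs + states + priors

alpha, kappa, psi = rng.standard_normal(3)
b = rng.standard_normal(n)
sigma = np.log1p(np.exp(alpha)); phi = 1.0 / (1.0 + np.exp(-psi))
w = y**2 * np.exp(-sigma * b - kappa)

# analytic gradients from Appendix C
g_alpha = 0.5 * np.sum(b * w - b) * (1 - np.exp(-sigma)) - alpha / sig_a2
g_kappa = 0.5 * (np.sum(w) - n) - kappa / sig_k2
g_psi = ((np.sum((b[1:] - phi * b[:-1]) * b[:-1]) + b[0]**2 * phi - phi / (1 - phi**2))
         * phi * (1 - phi) - psi / sig_p2)
g_b = 0.5 * sigma * (w - 1)
g_b[0] += phi * (b[1] - phi * b[0]) - b[0] * (1 - phi**2)
g_b[1:-1] += phi * (b[2:] - phi * b[1:-1]) - (b[1:-1] - phi * b[:-2])
g_b[-1] += -(b[-1] - phi * b[-2])

# central finite differences
eps = 1e-6
fd = lambda f: (f(eps) - f(-eps)) / (2 * eps)
print(abs(g_alpha - fd(lambda e: log_p(alpha + e, kappa, psi, b))),
      abs(g_kappa - fd(lambda e: log_p(alpha, kappa + e, psi, b))),
      abs(g_psi - fd(lambda e: log_p(alpha, kappa, psi + e, b))))
num_b = np.array([fd(lambda e, i=i: log_p(alpha, kappa, psi, b + e * np.eye(n)[i]))
                  for i in range(n)])
print(np.max(np.abs(g_b - num_b)))       # all differences negligible
```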


Cite this article

Tan, L.S.L., Bhaskaran, A. & Nott, D.J. Conditionally structured variational Gaussian approximation with importance weights. Stat Comput 30, 1255–1272 (2020). https://doi.org/10.1007/s11222-020-09944-8


Keywords

  • Gaussian variational approximation
  • Sparse precision matrix
  • Stochastic variational inference
  • Importance weighted lower bound
  • Rényi’s divergence