
Asynchronous Stochastic Variational Inference

Conference paper in Recent Advances in Big Data and Deep Learning (INNSBDDL 2019)

Part of the book series: Proceedings of the International Neural Networks Society (INNS, volume 1)

Abstract

Stochastic variational inference (SVI) employs stochastic optimization to scale up Bayesian computation to massive data. Since SVI is at its core a stochastic gradient-based algorithm, horizontal parallelism can be harnessed to allow larger scale inference. We propose a lock-free parallel implementation of SVI which allows distributed computations over multiple slaves in an asynchronous style. We show that our implementation leads to linear speed-up while guaranteeing an asymptotic ergodic convergence rate of \(O(1/\sqrt{T})\), provided the number of slaves is bounded by \(\sqrt{T}\) (where T is the total number of iterations). The implementation is carried out in a high-performance computing environment using the Message Passing Interface for Python (MPI4py). The empirical evaluation shows that our parallel SVI is lossless, performing comparably to its serial counterpart while achieving linear speed-up.

A. Bouchachia was supported by the European Commission under the Horizon 2020 Grant 687691 related to the project: PROTEUS: Scalable Online Machine Learning for Predictive Analytics and Real-Time Interactive Visualization.


Notes

  1. https://github.com/proteus-h2020/proteus-solma/tree/master/src/main/scala/eu/proteus/solma/asvi.

References

  1. Andrieu, C., De Freitas, N., Doucet, A., Jordan, M.I.: An introduction to MCMC for machine learning. Mach. Learn. 50(1–2), 5–43 (2003)

  2. Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 1(1–2), 1–305 (2008)

  3. Hoffman, M.D., Blei, D.M., Wang, C., Paisley, J.: Stochastic variational inference. J. Mach. Learn. Res. 14(1), 1303–1347 (2013)

  4. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)

  5. Recht, B., Re, C., Wright, S., Niu, F.: Hogwild!: a lock-free approach to parallelizing stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 693–701 (2011)

  6. Agarwal, A., Duchi, J.C.: Distributed delayed stochastic optimization. In: Advances in Neural Information Processing Systems, pp. 873–881 (2011)

  7. Zhang, R., Kwok, J.T.: Asynchronous distributed ADMM for consensus optimization. In: ICML, pp. 1701–1709 (2014)

  8. Feyzmahdavian, H.R., Aytekin, A., Johansson, M.: An asynchronous mini-batch algorithm for regularized stochastic optimization. IEEE Trans. Autom. Control 61(12), 3740–3754 (2016)

  9. Mania, H., Pan, X., Papailiopoulos, D., Recht, B., Ramchandran, K., Jordan, M.I.: Perturbed iterate analysis for asynchronous stochastic optimization. arXiv preprint arXiv:1507.06970 (2015)

  10. Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation: Numerical Methods, vol. 23. Prentice Hall, Englewood Cliffs (1989)

  11. Lian, X., Huang, Y., Li, Y., Liu, J.: Asynchronous parallel stochastic gradient for nonconvex optimization. In: Advances in Neural Information Processing Systems, pp. 2737–2745 (2015)

  12. Raman, P., Zhang, J., Yu, H.-F., Ji, S., Vishwanathan, S.V.N.: Extreme stochastic variational inference: distributed and asynchronous. arXiv preprint arXiv:1605.09499 (2016)

  13. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

  14. Hoffman, M., Bach, F.R., Blei, D.M.: Online learning for latent Dirichlet allocation. In: Advances in Neural Information Processing Systems, pp. 856–864 (2010)

  15. Lichman, M.: UCI Machine Learning Repository (2013)

  16. Honkela, A., Valpola, H.: On-line variational Bayesian learning. In: 4th International Symposium on Independent Component Analysis and Blind Signal Separation, pp. 803–808 (2003)

  17. Broderick, T., Boyd, N., Wibisono, A., Wilson, A.C., Jordan, M.I.: Streaming variational Bayes. In: Advances in Neural Information Processing Systems, pp. 1727–1735 (2013)

  18. Neiswanger, W., Wang, C., Xing, E.: Embarrassingly parallel variational inference in nonconjugate models. arXiv preprint arXiv:1510.04163 (2015)

Author information

Correspondence to Saad Mohamad.

Appendices

A Background

We derive the model family studied in this paper and review SVI following the same pattern as in [3].

Model Family. Our family of models consists of three sets of random variables: observations \(\varvec{x}=\varvec{x}_{1:n}\), local hidden variables \(\varvec{z}=\varvec{z}_{1:n}\) and global hidden variables \(\varvec{\beta }\), together with fixed parameters \(\varvec{\alpha }\). The model assumes that the n pairs \((\varvec{x}_i,\varvec{z}_i)\) are conditionally independent given \(\varvec{\beta }\). Further, their distribution and the prior distribution of \(\varvec{\beta }\) are in an exponential family:

$$\begin{aligned} p(\varvec{\beta },\varvec{x},\varvec{z}|\varvec{\alpha })=p(\varvec{\beta }|\varvec{\alpha })\prod _{i=1}^n p(\varvec{z}_i,\varvec{x}_i|\varvec{\beta }), \end{aligned}$$
(6)
$$\begin{aligned} p(\varvec{z}_i,\varvec{x}_i|\varvec{\beta })=h(\varvec{x}_i,\varvec{z}_i)\exp \big (\varvec{\beta }^Tt(\varvec{x}_i,\varvec{z}_i)-a(\varvec{\beta })\big ), \end{aligned}$$
(7)
$$\begin{aligned} p(\varvec{\beta }|\varvec{\alpha })=h(\varvec{\beta })\exp \big (\varvec{\alpha }^Tt(\varvec{\beta })-a(\varvec{\alpha }) \big ) \end{aligned}$$
(8)

Here, we overload the notation for the base measures h(.), sufficient statistics t(.) and log normalizer a(.). While the proposed approach is generic, we assume a conjugacy relationship between \((\varvec{x}_i,\varvec{z}_i)\) and \(\varvec{\beta }\). That is, the distribution \(p(\varvec{\beta }|\varvec{x},\varvec{z})\) is in the same family as the prior \(p(\varvec{\beta }| \varvec{\alpha })\).

Note that this innocent looking family of models includes (but is not limited to) latent Dirichlet allocation [13], Bayesian Gaussian mixture, probabilistic matrix factorization, hidden Markov models, hierarchical linear and probit regression, and many Bayesian non-parametric models.
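As a deliberately minimal instance, not discussed in the paper and used here purely for illustration (it also serves as the running example in the sketches further below), consider a Beta-Bernoulli model in which the local hidden variables \(\varvec{z}_i\) are empty: \(\varvec{\beta }\sim \mathrm{Beta}(\alpha _1,\alpha _2)\) and \(x_i|\varvec{\beta }\sim \mathrm{Bernoulli}(\varvec{\beta })\). Writing the Bernoulli likelihood in exponential-family form with sufficient statistics \(t(x_i)=(x_i,1-x_i)\), the Beta prior is conjugate and the posterior simply accumulates counts:

$$\begin{aligned} p(\varvec{\beta }|\varvec{x})=\mathrm{Beta}\Big (\alpha _1+\sum _{i=1}^n x_i,\ \alpha _2+\sum _{i=1}^n (1-x_i)\Big ) \end{aligned}$$

In richer members of the family (LDA, Bayesian Gaussian mixtures), the \(\varvec{z}_i\) are non-trivial and the sufficient statistics must be averaged over them, which is what the variational distributions introduced next provide.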

Mean-Field Variational Inference. Variational inference (VI) approximates the intractable posterior \(p(\varvec{\beta },\varvec{z}|\varvec{x})\) by positing a family of simple distributions \(q(\varvec{\beta },\varvec{z})\) and finding the member of the family that is closest to the posterior, where closeness is measured by the KL divergence. The resulting optimization problem is equivalent to maximizing the evidence lower bound (ELBO):

$$\begin{aligned} \mathcal {L}(q)=E_q[\log p(\varvec{x},\varvec{z},\varvec{\beta })]-E_q[\log q(\varvec{z},\varvec{\beta })]\le \log p(\varvec{x}) \end{aligned}$$
(9)

Mean-field is the simplest family as it allows the distribution over hidden variables to factorize:

$$\begin{aligned} q(\varvec{\beta },\varvec{z})=q(\varvec{\beta }|\varvec{\lambda })\prod _{i=1}^nq(\varvec{z}_i|\varvec{\phi }_i) \end{aligned}$$
(10)

Each variational factor is assumed to belong to the same exponential family as the corresponding true conditional. Mean-field VI optimizes the resulting ELBO with respect to the local and global variational parameters \(\varvec{\phi }\) and \(\varvec{\lambda }\):

$$\begin{aligned} \mathcal {L}(\varvec{\lambda },\varvec{\phi })=E_q\bigg [\log \frac{p(\varvec{\beta })}{q(\varvec{\beta })}\bigg ]+\sum _{i=1}^nE_q\bigg [\log \frac{p(\varvec{x}_i,\varvec{z}_i|\varvec{\beta })}{q(\varvec{z}_i)} \bigg ] \end{aligned}$$
(11)

It iteratively updates each variational parameter while holding the others fixed. With the assumptions made so far, each update has a closed-form solution. The local parameters are a function of the global ones:

$$\begin{aligned} \varvec{\phi }({\varvec{\lambda }}_t)=\arg \max _{\varvec{\phi }}\mathcal {L}(\varvec{\lambda }_t,\varvec{\phi }) \end{aligned}$$
(12)

The global parameters summarise the dataset (e.g., clusters in a Bayesian Gaussian mixture, topics in LDA). Substituting the optimal local parameters back into the ELBO yields a function of the global parameters alone:

$$\begin{aligned} \mathcal {L}(\varvec{\lambda })=\max _{\varvec{\phi }} \mathcal {L}(\varvec{\lambda },\varvec{\phi }) \end{aligned}$$
(13)

To find the optimal \(\varvec{\lambda }\) given fixed \(\varvec{\phi }\), we compute the natural gradient of \(\mathcal {L}(\varvec{\lambda })\) and set it to zero, which yields:

$$\begin{aligned} \varvec{\lambda }^*=\varvec{\alpha }+\sum _{i=1}^nE_{\varvec{\phi }_i({\varvec{\lambda }}_t)}[t(\varvec{x}_i,\varvec{z}_i)] \end{aligned}$$
(14)

Thus, the new optimal global parameters are \(\varvec{\lambda }_{t+1}=\varvec{\lambda }^*\). The algorithm works by iterating between computing the optimal local parameters given the global ones (Eq. (12)) and vice versa (Eq. (14)).
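For completeness, here is the standard conjugate exponential-family argument behind Eq. (14), following [3]; the convention that \(t(\varvec{\beta })\) contains \(\varvec{\beta }\) and \(-a(\varvec{\beta })\) as components, with \(t(\varvec{x}_i,\varvec{z}_i)\) carrying a matching unit entry, is assumed here rather than stated explicitly above. By Eqs. (6)-(8), the complete conditional of \(\varvec{\beta }\) stays in the prior's family:

$$\begin{aligned} p(\varvec{\beta }|\varvec{x},\varvec{z})\propto p(\varvec{\beta }|\varvec{\alpha })\prod _{i=1}^n p(\varvec{x}_i,\varvec{z}_i|\varvec{\beta })\propto h(\varvec{\beta })\exp \Big (\big (\varvec{\alpha }+\sum _{i=1}^n t(\varvec{x}_i,\varvec{z}_i)\big )^Tt(\varvec{\beta })\Big ) \end{aligned}$$

so its natural parameter is \(\varvec{\alpha }+\sum _{i=1}^n t(\varvec{x}_i,\varvec{z}_i)\). Since \(q(\varvec{\beta }|\varvec{\lambda })\) lies in the same family, maximizing the ELBO over \(\varvec{\lambda }\) sets it to the expectation of this natural parameter under \(q(\varvec{z}|\varvec{\phi })\), which is exactly Eq. (14).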

Stochastic Variational Inference. Rather than analysing all the data to compute \(\varvec{\lambda }^*\) at each iteration, stochastic optimization can be used. Assuming that a data point is sampled uniformly at random from the dataset, a noisy but unbiased estimator of \(\mathcal {L}(\varvec{\lambda },\varvec{\phi })\) can be built from that single data point:

$$\begin{aligned} \mathcal {L}_i(\varvec{\lambda },\varvec{\phi }_i)=E_{q}\bigg [\log \frac{p(\varvec{\beta })}{q(\varvec{\beta })}\bigg ]+nE_q\bigg [\log \frac{p(\varvec{x}_i,\varvec{z}_i|\varvec{\beta })}{q(\varvec{z}_i)} \bigg ] \end{aligned}$$
(15)

The unbiased stochastic approximation of the ELBO as a function of \(\varvec{\lambda }\) can be written as follows:

$$\begin{aligned} \mathcal {L}_i(\varvec{\lambda })=\max _{\varvec{\phi }_i}\mathcal {L}_i(\varvec{\lambda },\varvec{\phi }_i) \end{aligned}$$
(16)

Following the same steps as in the previous section, we obtain a noisy unbiased estimate of Eq. (14):

$$\begin{aligned} \varvec{\hat{\lambda }}=\varvec{\alpha }+nE_{\varvec{\phi }_i({\varvec{\lambda }}_t)}[t(\varvec{x}_i,\varvec{z}_i)] \end{aligned}$$
(17)

Iteratively, we move the global parameters a step-size \(\rho _t\) in the direction of the noisy natural gradient:

$$\begin{aligned} \varvec{\lambda }_{t+1}=(1-\rho _t)\varvec{\lambda }_t+\rho _t\varvec{\hat{\lambda }} \end{aligned}$$
(18)

The algorithm converges provided the step sizes satisfy the Robbins-Monro conditions \(\sum _{t=1}^\infty \rho _t=\infty \) and \(\sum _{t=1}^\infty \rho _t^2<\infty \) [4].
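As a minimal illustration of Eqs. (17)-(18), the following sketch, written under illustrative assumptions (the toy Beta-Bernoulli instance from above, synthetic data, and the common step-size schedule \(\rho _t=(t+\tau )^{-\kappa }\)), runs the stochastic natural-gradient update in the case where the expectation in Eq. (17) reduces to the sufficient statistics of one sampled observation; it is not the authors' code.

    # Minimal sketch (illustrative assumptions): SVI updates (17)-(18) on the
    # toy Beta-Bernoulli instance, which has no local hidden variables.
    import numpy as np

    rng = np.random.default_rng(0)

    n = 10_000                        # dataset size
    x = rng.binomial(1, 0.3, size=n)  # synthetic Bernoulli observations
    alpha = np.array([1.0, 1.0])      # prior natural parameters, Beta(1, 1)

    lam = alpha.copy()                # global variational parameters lambda_0
    T = 5_000                         # number of stochastic updates
    tau, kappa = 1.0, 0.7             # step-size schedule rho_t = (t + tau) ** (-kappa)

    for t in range(T):
        i = rng.integers(n)                          # sample one data point uniformly
        t_stats = np.array([x[i], 1.0 - x[i]])       # sufficient statistics t(x_i)
        lam_hat = alpha + n * t_stats                # noisy estimate, Eq. (17)
        rho = (t + tau) ** (-kappa)                  # satisfies the conditions in [4]
        lam = (1.0 - rho) * lam + rho * lam_hat      # Eq. (18)

    # The exact posterior is Beta(alpha_1 + sum(x), alpha_2 + n - sum(x)).
    print("SVI estimate:   ", lam)
    print("Exact posterior:", alpha + np.array([x.sum(), n - x.sum()]))

With \(\kappa \in (0.5,1]\) the step sizes satisfy the two conditions above, and the estimate concentrates around the natural parameters of the exact posterior.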

B Related Work

Little work has been proposed to scale VI to large datasets. We can distinguish two major classes. The first class is based on the Bayesian filtering approach [16, 17]. That is, the sequential nature of Bayes' theorem is exploited to recursively update an approximation of the posterior. In particular, VI is used between the updates to approximate the posterior, which becomes the prior of the next step. The authors of [16] use forgetting factors to decay the contribution of old data in favour of newer data. The algorithm proposed in [17] considers a sequence of data batches and iterates over the data in each batch until convergence. Relying on a master-slave architecture, the computation of the batch posteriors is done in a distributed and asynchronous manner. That is, the algorithm applies VI by performing asynchronous Bayesian updates to the posterior as data batches arrive continuously.

The second class of work is based on optimization [3, 12, 18]. As already discussed, SVI [3] employs stochastic optimization to scale up Bayesian computation to massive data. SVI is inherently serial and requires the model parameters to fit in the memory of a single processor. The authors of [18] present a VI-based inference algorithm that runs in parallel on data divided across several slaves. However, at each iteration the slaves are synchronized to combine their obtained parameters. Such synchronisation limits scalability and reduces the update speed to that of the slowest slave. To avoid bulk synchronization, the authors of [12] propose an asynchronous and lock-free update. There, vertical parallelism is adopted, where each processor asynchronously updates a subset of the parameters based on a subset of the attributes. In contrast, we adopt horizontal parallelism, where updates are computed from a few (mini-batched or single) data points acquired from distributed sources and the update steps are aggregated to form the global update. Our proposed approach can make use of the mechanism proposed by [12] to achieve hybrid horizontal-vertical parallelism. Unlike [12], our approach is not customised for LDA and can be applied to any model in the family presented in Sect. A.
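To make the horizontal-parallelism scheme above more tangible, here is a schematic mpi4py sketch of an asynchronous master-slave aggregation in the spirit of the paper; it is not the authors' released implementation (see the Notes for their repository), and the toy Beta-Bernoulli model, the message tags, the data sharding and the step-size schedule are illustrative assumptions. Slaves compute noisy estimates \(\varvec{\hat{\lambda }}\) (Eq. (17)) from their local shards, and the master folds each one into the global parameters (Eq. (18)) as soon as it arrives, without waiting for the other slaves.

    # Schematic asynchronous master-slave SVI sketch (illustrative, not the
    # authors' code). Run with, e.g.: mpirun -n 4 python async_svi_sketch.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()  # assumes at least 2 processes

    alpha = np.array([1.0, 1.0])   # prior natural parameters (toy Beta-Bernoulli)
    n_total = 10_000               # total number of data points across all slaves
    T = 2_000                      # total number of global updates
    tau, kappa = 1.0, 0.7          # step-size schedule rho_t = (t + tau) ** (-kappa)

    if rank == 0:                                        # ---- master ----
        lam = alpha.copy()
        status = MPI.Status()
        for t in range(T):
            # Accept an estimate from whichever slave finishes first (lock-free,
            # so the estimate may be based on slightly stale global parameters).
            lam_hat = comm.recv(source=MPI.ANY_SOURCE, tag=1, status=status)
            rho = (t + tau) ** (-kappa)
            lam = (1.0 - rho) * lam + rho * lam_hat      # Eq. (18)
            comm.send(lam, dest=status.Get_source(), tag=2)
        # Drain each slave's in-flight estimate and tell it to stop.
        for _ in range(size - 1):
            comm.recv(source=MPI.ANY_SOURCE, tag=1, status=status)
            comm.send(None, dest=status.Get_source(), tag=2)
        print("final lambda:", lam)
    else:                                                # ---- slave ----
        rng = np.random.default_rng(rank)
        shard = rng.binomial(1, 0.3, size=n_total // (size - 1))  # local data shard
        lam = alpha.copy()
        while True:
            i = rng.integers(len(shard))                 # sample a local data point
            t_stats = np.array([shard[i], 1.0 - shard[i]])
            # Noisy estimate, Eq. (17); in a full model this step would also use
            # lam to update the local variational parameters phi (Eq. (12)).
            lam_hat = alpha + n_total * t_stats
            comm.send(lam_hat, dest=0, tag=1)
            lam = comm.recv(source=0, tag=2)
            if lam is None:                              # stop signal from the master
                break

Because small messages are handled one at a time on the master, the scheme is lock-free in the sense that no slave ever waits for another slave; the price is that some estimates are computed from stale global parameters, which is exactly the bounded-delay setting analysed in the paper.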


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Mohamad, S., Bouchachia, A., Sayed-Mouchaweh, M. (2020). Asynchronous Stochastic Variational Inference. In: Oneto, L., Navarin, N., Sperduti, A., Anguita, D. (eds) Recent Advances in Big Data and Deep Learning. INNSBDDL 2019. Proceedings of the International Neural Networks Society, vol 1. Springer, Cham. https://doi.org/10.1007/978-3-030-16841-4_31
