
Asynchronous Stochastic Variational Inference

Conference paper in Recent Advances in Big Data and Deep Learning (INNSBDDL 2019)

Part of the book series: Proceedings of the International Neural Networks Society (INNS, volume 1)

Abstract

Stochastic variational inference (SVI) employs stochastic optimization to scale up Bayesian computation to massive data. Since SVI is at its core a stochastic gradient-based algorithm, horizontal parallelism can be harnessed to allow larger scale inference. We propose a lock-free parallel implementation of SVI which allows distributed computations over multiple slaves in an asynchronous style. We show that our implementation leads to linear speed-up while guaranteeing an asymptotic ergodic convergence rate of \(O(1/\sqrt{T})\), provided the number of slaves is bounded by \(\sqrt{T}\) (where T is the total number of iterations). The implementation is carried out in a high-performance computing environment using the Message Passing Interface for Python (MPI4py). The empirical evaluation shows that our parallel SVI is lossless, performing comparably to its serial counterpart while achieving linear speed-up.

A. Bouchachia was supported by the European Commission under the Horizon 2020 Grant 687691 related to the project: PROTEUS: Scalable Online Machine Learning for Predictive Analytics and Real-Time Interactive Visualization.


Notes

  1. https://github.com/proteus-h2020/proteus-solma/tree/master/src/main/scala/eu/proteus/solma/asvi.

References

  1. Andrieu, C., De Freitas, N., Doucet, A., Jordan, M.I.: An introduction to MCMC for machine learning. Mach. Learn. 50(1–2), 5–43 (2003)

  2. Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 1(1–2), 1–305 (2008)

  3. Hoffman, M.D., Blei, D.M., Wang, C., Paisley, J.: Stochastic variational inference. J. Mach. Learn. Res. 14(1), 1303–1347 (2013)

  4. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)

  5. Recht, B., Re, C., Wright, S., Niu, F.: Hogwild!: a lock-free approach to parallelizing stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 693–701 (2011)

  6. Agarwal, A., Duchi, J.C.: Distributed delayed stochastic optimization. In: Advances in Neural Information Processing Systems, pp. 873–881 (2011)

  7. Zhang, R., Kwok, J.T.: Asynchronous distributed ADMM for consensus optimization. In: ICML, pp. 1701–1709 (2014)

  8. Feyzmahdavian, H.R., Aytekin, A., Johansson, M.: An asynchronous mini-batch algorithm for regularized stochastic optimization. IEEE Trans. Autom. Control 61(12), 3740–3754 (2016)

  9. Mania, H., Pan, X., Papailiopoulos, D., Recht, B., Ramchandran, K., Jordan, M.I.: Perturbed iterate analysis for asynchronous stochastic optimization. arXiv preprint arXiv:1507.06970 (2015)

  10. Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation: Numerical Methods, vol. 23. Prentice Hall, Englewood Cliffs (1989)

  11. Lian, X., Huang, Y., Li, Y., Liu, J.: Asynchronous parallel stochastic gradient for nonconvex optimization. In: Advances in Neural Information Processing Systems, pp. 2737–2745 (2015)

  12. Raman, P., Zhang, J., Yu, H.-F., Ji, S., Vishwanathan, S.V.N.: Extreme stochastic variational inference: distributed and asynchronous. arXiv preprint arXiv:1605.09499 (2016)

  13. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

  14. Hoffman, M., Bach, F.R., Blei, D.M.: Online learning for latent Dirichlet allocation. In: Advances in Neural Information Processing Systems, pp. 856–864 (2010)

  15. Lichman, M.: UCI Machine Learning Repository (2013)

  16. Honkela, A., Valpola, H.: On-line variational Bayesian learning. In: 4th International Symposium on Independent Component Analysis and Blind Signal Separation, pp. 803–808 (2003)

  17. Broderick, T., Boyd, N., Wibisono, A., Wilson, A.C., Jordan, M.I.: Streaming variational Bayes. In: Advances in Neural Information Processing Systems, pp. 1727–1735 (2013)

  18. Neiswanger, W., Wang, C., Xing, E.: Embarrassingly parallel variational inference in nonconjugate models. arXiv preprint arXiv:1510.04163 (2015)

Author information

Correspondence to Saad Mohamad.

Appendices

A Background

We derive the model family studied in this paper and review SVI following the same pattern as in [3].

Model Family. Our family of models consists of three sets of random variables: observations \(\varvec{x}=\varvec{x}_{1:n}\), local hidden variables \(\varvec{z}=\varvec{z}_{1:n}\) and global hidden variables \(\varvec{\beta }\), together with fixed parameters \(\varvec{\alpha }\). The model assumes that the n pairs \((\varvec{x}_i,\varvec{z}_i)\) are conditionally independent given \(\varvec{\beta }\). Further, their distribution and the prior distribution of \(\varvec{\beta }\) are in an exponential family:

$$\begin{aligned} p(\varvec{\beta },\varvec{x},\varvec{z}|\varvec{\alpha })=p(\varvec{\beta }|\varvec{\alpha })\prod _{i=1}^n p(\varvec{z}_i,\varvec{x}_i|\varvec{\beta }), \end{aligned}$$
(6)
$$\begin{aligned} p(\varvec{z}_i,\varvec{x}_i|\varvec{\beta })=h(\varvec{x}_i,\varvec{z}_i)\exp \big (\varvec{\beta }^Tt(\varvec{x}_i,\varvec{z}_i)-a(\varvec{\beta })\big ), \end{aligned}$$
(7)
$$\begin{aligned} p(\varvec{\beta }|\varvec{\alpha })=h(\varvec{\beta })\exp \big (\varvec{\alpha }^Tt(\varvec{\beta })-a(\varvec{\alpha }) \big ) \end{aligned}$$
(8)

Here, we overload the notation for the base measures h(.), sufficient statistics t(.) and log normalizer a(.). While the proposed approach is generic, we assume a conjugacy relationship between \((\varvec{x}_i,\varvec{z}_i)\) and \(\varvec{\beta }\). That is, the distribution \(p(\varvec{\beta }|\varvec{x},\varvec{z})\) is in the same family as the prior \(p(\varvec{\beta }| \varvec{\alpha })\).

Note that this innocent looking family of models includes (but is not limited to) latent Dirichlet allocation [13], Bayesian Gaussian mixture, probabilistic matrix factorization, hidden Markov models, hierarchical linear and probit regression, and many Bayesian non-parametric models.
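As a deliberately minimal instance, not discussed in the paper and used here purely for illustration (it also serves as the running example in the sketches further below), consider a Beta-Bernoulli model in which the local hidden variables \(\varvec{z}_i\) are empty: \(\varvec{\beta }\sim \mathrm{Beta}(\alpha _1,\alpha _2)\) and \(x_i|\varvec{\beta }\sim \mathrm{Bernoulli}(\varvec{\beta })\). Writing the Bernoulli likelihood in exponential-family form with sufficient statistics \(t(x_i)=(x_i,1-x_i)\), the Beta prior is conjugate and the posterior simply accumulates counts:

$$\begin{aligned} p(\varvec{\beta }|\varvec{x})=\mathrm{Beta}\Big (\alpha _1+\sum _{i=1}^n x_i,\ \alpha _2+\sum _{i=1}^n (1-x_i)\Big ) \end{aligned}$$

In richer members of the family (LDA, Bayesian Gaussian mixtures), the \(\varvec{z}_i\) are non-trivial and the sufficient statistics must be averaged over them, which is what the variational distributions introduced next provide.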

Mean-Field Variational Inference. Variational inference (VI) approximates the intractable posterior \(p(\varvec{\beta },\varvec{z}|\varvec{x})\) by positing a family of simple distributions \(q(\varvec{\beta },\varvec{z})\) and finding the member of the family that is closest to the posterior, where closeness is measured by the KL divergence. The resulting optimization problem is equivalent to maximizing the evidence lower bound (ELBO):

$$\begin{aligned} \mathcal {L}(q)=E_q[\log p(\varvec{x},\varvec{z},\varvec{\beta })]-E_q[\log q(\varvec{z},\varvec{\beta })]\le \log p(\varvec{x}) \end{aligned}$$
(9)

Mean-field is the simplest family as it allows the distribution over hidden variables to factorize:

$$\begin{aligned} q(\varvec{\beta },\varvec{z})=q(\varvec{\beta }|\varvec{\lambda })\prod _{i=1}^nq(\varvec{z}_i|\varvec{\phi }_i) \end{aligned}$$
(10)

Each variational factor is assumed to belong to the same exponential family as the corresponding true conditional. Mean-field VI optimizes the resulting ELBO with respect to the local and global variational parameters \(\varvec{\phi }\) and \(\varvec{\lambda }\):

$$\begin{aligned} \mathcal {L}(\varvec{\lambda },\varvec{\phi })=E_q\bigg [\log \frac{p(\varvec{\beta })}{q(\varvec{\beta })}\bigg ]+\sum _{i=1}^nE_q\bigg [\log \frac{p(\varvec{x}_i,\varvec{z}_i|\varvec{\beta })}{q(\varvec{z}_i)} \bigg ] \end{aligned}$$
(11)

It iteratively updates each variational parameter while holding the others fixed. With the assumptions made so far, each update has a closed-form solution. The local parameters are a function of the global ones:

$$\begin{aligned} \varvec{\phi }({\varvec{\lambda }}_t)=\arg \max _{\varvec{\phi }}\mathcal {L}(\varvec{\lambda }_t,\varvec{\phi }) \end{aligned}$$
(12)

The global parameters summarise the dataset (e.g., clusters in a Bayesian Gaussian mixture, topics in LDA). Substituting the optimal local parameters back into the ELBO yields a function of the global parameters alone:

$$\begin{aligned} \mathcal {L}(\varvec{\lambda })=\max _{\varvec{\phi }} \mathcal {L}(\varvec{\lambda },\varvec{\phi }) \end{aligned}$$
(13)

To find the optimal \(\varvec{\lambda }\) given fixed \(\varvec{\phi }\), we compute the natural gradient of \(\mathcal {L}(\varvec{\lambda })\) and set it to zero, which yields:

$$\begin{aligned} \varvec{\lambda }^*=\varvec{\alpha }+\sum _{i=1}^nE_{\varvec{\phi }_i({\varvec{\lambda }}_t)}[t(\varvec{x}_i,\varvec{z}_i)] \end{aligned}$$
(14)

Thus, the new optimal global parameters are \(\varvec{\lambda }_{t+1}=\varvec{\lambda }^*\). The algorithm works by iterating between computing the optimal local parameters given the global ones (Eq. (12)) and vice versa (Eq. (14)).
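For completeness, here is the standard conjugate exponential-family argument behind Eq. (14), following [3]; the convention that \(t(\varvec{\beta })\) contains \(\varvec{\beta }\) and \(-a(\varvec{\beta })\) as components, with \(t(\varvec{x}_i,\varvec{z}_i)\) carrying a matching unit entry, is assumed here rather than stated explicitly above. By Eqs. (6)-(8), the complete conditional of \(\varvec{\beta }\) stays in the prior's family:

$$\begin{aligned} p(\varvec{\beta }|\varvec{x},\varvec{z})\propto p(\varvec{\beta }|\varvec{\alpha })\prod _{i=1}^n p(\varvec{x}_i,\varvec{z}_i|\varvec{\beta })\propto h(\varvec{\beta })\exp \Big (\big (\varvec{\alpha }+\sum _{i=1}^n t(\varvec{x}_i,\varvec{z}_i)\big )^Tt(\varvec{\beta })\Big ) \end{aligned}$$

so its natural parameter is \(\varvec{\alpha }+\sum _{i=1}^n t(\varvec{x}_i,\varvec{z}_i)\). Since \(q(\varvec{\beta }|\varvec{\lambda })\) lies in the same family, maximizing the ELBO over \(\varvec{\lambda }\) sets it to the expectation of this natural parameter under \(q(\varvec{z}|\varvec{\phi })\), which is exactly Eq. (14).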

Stochastic Variational Inference. Rather than analysing all the data to compute \(\varvec{\lambda }^*\) at each iteration, stochastic optimization can be used. Assuming that a data point is sampled uniformly at random from the dataset, a noisy but unbiased estimator of \(\mathcal {L}(\varvec{\lambda },\varvec{\phi })\) can be built from that single data point:

$$\begin{aligned} \mathcal {L}_i(\varvec{\lambda },\varvec{\phi }_i)=E_{q}\bigg [\log \frac{p(\varvec{\beta })}{q(\varvec{\beta })}\bigg ]+nE_q\bigg [\log \frac{p(\varvec{x}_i,\varvec{z}_i|\varvec{\beta })}{q(\varvec{z}_i)} \bigg ] \end{aligned}$$
(15)

The unbiased stochastic approximation of the ELBO as a function of \(\varvec{\lambda }\) can be written as follows:

$$\begin{aligned} \mathcal {L}_i(\varvec{\lambda })=\max _{\varvec{\phi }_i}\mathcal {L}_i(\varvec{\lambda },\varvec{\phi }_i) \end{aligned}$$
(16)

Following the same steps as in the previous section, we obtain a noisy unbiased estimate of Eq. (14):

$$\begin{aligned} \varvec{\hat{\lambda }}=\varvec{\alpha }+nE_{\varvec{\phi }_i({\varvec{\lambda }}_t)}[t(\varvec{x}_i,\varvec{z}_i)] \end{aligned}$$
(17)

Iteratively, we move the global parameters a step-size \(\rho _t\) in the direction of the noisy natural gradient:

$$\begin{aligned} \varvec{\lambda }_{t+1}=(1-\rho _t)\varvec{\lambda }_t+\rho _t\varvec{\hat{\lambda }} \end{aligned}$$
(18)

The algorithm converges provided the step sizes satisfy the Robbins-Monro conditions \(\sum _{t=1}^\infty \rho _t=\infty \) and \(\sum _{t=1}^\infty \rho _t^2<\infty \) [4].
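As a minimal illustration of Eqs. (17)-(18), the following sketch, written under illustrative assumptions (the toy Beta-Bernoulli instance from above, synthetic data, and the common step-size schedule \(\rho _t=(t+\tau )^{-\kappa }\)), runs the stochastic natural-gradient update in the case where the expectation in Eq. (17) reduces to the sufficient statistics of one sampled observation; it is not the authors' code.

    # Minimal sketch (illustrative assumptions): SVI updates (17)-(18) on the
    # toy Beta-Bernoulli instance, which has no local hidden variables.
    import numpy as np

    rng = np.random.default_rng(0)

    n = 10_000                        # dataset size
    x = rng.binomial(1, 0.3, size=n)  # synthetic Bernoulli observations
    alpha = np.array([1.0, 1.0])      # prior natural parameters, Beta(1, 1)

    lam = alpha.copy()                # global variational parameters lambda_0
    T = 5_000                         # number of stochastic updates
    tau, kappa = 1.0, 0.7             # step-size schedule rho_t = (t + tau) ** (-kappa)

    for t in range(T):
        i = rng.integers(n)                          # sample one data point uniformly
        t_stats = np.array([x[i], 1.0 - x[i]])       # sufficient statistics t(x_i)
        lam_hat = alpha + n * t_stats                # noisy estimate, Eq. (17)
        rho = (t + tau) ** (-kappa)                  # satisfies the conditions in [4]
        lam = (1.0 - rho) * lam + rho * lam_hat      # Eq. (18)

    # The exact posterior is Beta(alpha_1 + sum(x), alpha_2 + n - sum(x)).
    print("SVI estimate:   ", lam)
    print("Exact posterior:", alpha + np.array([x.sum(), n - x.sum()]))

With \(\kappa \in (0.5,1]\) the step sizes satisfy the two conditions above, and the estimate concentrates around the natural parameters of the exact posterior.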

B Related Work

Little work has been proposed to scale VI to large datasets. We can distinguish two major classes. The first class is based on the Bayesian filtering approach [16, 17]. That is, the sequential nature of Bayes' theorem is exploited to recursively update an approximation of the posterior. In particular, VI is used between the updates to approximate the posterior, which becomes the prior of the next step. The authors of [16] use forgetting factors to decay the contribution of old data in favour of newer data. The algorithm proposed in [17] considers a sequence of data batches and iterates over the data in each batch until convergence. Relying on a master-slave architecture, the computation of the batch posteriors is done in a distributed and asynchronous manner. That is, the algorithm applies VI by performing asynchronous Bayesian updates to the posterior as data batches arrive continuously.

The second class of work is based on optimization [3, 12, 18]. As already discussed, SVI [3] employs stochastic optimization to scale up Bayesian computation to massive data. SVI is inherently serial and requires the model parameters to fit in the memory of a single processor. The authors of [18] present a VI-based inference algorithm that runs in parallel on data divided across several slaves. However, at each iteration the slaves are synchronized to combine their obtained parameters. Such synchronisation limits scalability and reduces the update speed to that of the slowest slave. To avoid bulk synchronization, the authors of [12] propose an asynchronous and lock-free update. There, vertical parallelism is adopted, where each processor asynchronously updates a subset of the parameters based on a subset of the attributes. In contrast, we adopt horizontal parallelism, where updates are computed from a few (mini-batched or single) data points acquired from distributed sources and the update steps are aggregated to form the global update. Our proposed approach can make use of the mechanism proposed by [12] to achieve hybrid horizontal-vertical parallelism. Unlike [12], our approach is not customised for LDA and can be applied to any model in the family presented in Sect. A.
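To make the horizontal-parallelism scheme above more tangible, here is a schematic mpi4py sketch of an asynchronous master-slave aggregation in the spirit of the paper; it is not the authors' released implementation (see the Notes for their repository), and the toy Beta-Bernoulli model, the message tags, the data sharding and the step-size schedule are illustrative assumptions. Slaves compute noisy estimates \(\varvec{\hat{\lambda }}\) (Eq. (17)) from their local shards, and the master folds each one into the global parameters (Eq. (18)) as soon as it arrives, without waiting for the other slaves.

    # Schematic asynchronous master-slave SVI sketch (illustrative, not the
    # authors' code). Run with, e.g.: mpirun -n 4 python async_svi_sketch.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()  # assumes at least 2 processes

    alpha = np.array([1.0, 1.0])   # prior natural parameters (toy Beta-Bernoulli)
    n_total = 10_000               # total number of data points across all slaves
    T = 2_000                      # total number of global updates
    tau, kappa = 1.0, 0.7          # step-size schedule rho_t = (t + tau) ** (-kappa)

    if rank == 0:                                        # ---- master ----
        lam = alpha.copy()
        status = MPI.Status()
        for t in range(T):
            # Accept an estimate from whichever slave finishes first (lock-free,
            # so the estimate may be based on slightly stale global parameters).
            lam_hat = comm.recv(source=MPI.ANY_SOURCE, tag=1, status=status)
            rho = (t + tau) ** (-kappa)
            lam = (1.0 - rho) * lam + rho * lam_hat      # Eq. (18)
            comm.send(lam, dest=status.Get_source(), tag=2)
        # Drain each slave's in-flight estimate and tell it to stop.
        for _ in range(size - 1):
            comm.recv(source=MPI.ANY_SOURCE, tag=1, status=status)
            comm.send(None, dest=status.Get_source(), tag=2)
        print("final lambda:", lam)
    else:                                                # ---- slave ----
        rng = np.random.default_rng(rank)
        shard = rng.binomial(1, 0.3, size=n_total // (size - 1))  # local data shard
        lam = alpha.copy()
        while True:
            i = rng.integers(len(shard))                 # sample a local data point
            t_stats = np.array([shard[i], 1.0 - shard[i]])
            # Noisy estimate, Eq. (17); in a full model this step would also use
            # lam to update the local variational parameters phi (Eq. (12)).
            lam_hat = alpha + n_total * t_stats
            comm.send(lam_hat, dest=0, tag=1)
            lam = comm.recv(source=0, tag=2)
            if lam is None:                              # stop signal from the master
                break

Because small messages are handled one at a time on the master, the scheme is lock-free in the sense that no slave ever waits for another slave; the price is that some estimates are computed from stale global parameters, which is exactly the bounded-delay setting analysed in the paper.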


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Mohamad, S., Bouchachia, A., Sayed-Mouchaweh, M. (2020). Asynchronous Stochastic Variational Inference. In: Oneto, L., Navarin, N., Sperduti, A., Anguita, D. (eds) Recent Advances in Big Data and Deep Learning. INNSBDDL 2019. Proceedings of the International Neural Networks Society, vol 1. Springer, Cham. https://doi.org/10.1007/978-3-030-16841-4_31
