Abstract
Stochastic variational inference (SVI) employs stochastic optimization to scale up Bayesian computation to massive data. Since SVI is at its core a stochastic gradient-based algorithm, horizontal parallelism can be harnessed to allow larger-scale inference. We propose a lock-free parallel implementation of SVI that distributes computation asynchronously over multiple slaves. We show that our implementation leads to linear speed-up while guaranteeing an asymptotic ergodic convergence rate of \(O(1/\sqrt{T})\), provided that the number of slaves is bounded by \(\sqrt{T}\), where \(T\) is the total number of iterations. The implementation is done in a high-performance computing environment using the Message Passing Interface for Python (MPI4py). The empirical evaluation shows that our parallel SVI is lossless, performing comparably to its serial counterpart while achieving linear speed-up.
A. Bouchachia was supported by the European Commission under the Horizon 2020 Grant 687691 related to the project: PROTEUS: Scalable Online Machine Learning for Predictive Analytics and Real-Time Interactive Visualization.
References
1. Andrieu, C., De Freitas, N., Doucet, A., Jordan, M.I.: An introduction to MCMC for machine learning. Mach. Learn. 50(1–2), 5–43 (2003)
2. Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Found. Trends® Mach. Learn. 1(1–2), 1–305 (2008)
3. Hoffman, M.D., Blei, D.M., Wang, C., Paisley, J.: Stochastic variational inference. J. Mach. Learn. Res. 14(1), 1303–1347 (2013)
4. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)
5. Recht, B., Re, C., Wright, S., Niu, F.: Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 693–701 (2011)
6. Agarwal, A., Duchi, J.C.: Distributed delayed stochastic optimization. In: Advances in Neural Information Processing Systems, pp. 873–881 (2011)
7. Zhang, R., Kwok, J.T.: Asynchronous distributed ADMM for consensus optimization. In: ICML, pp. 1701–1709 (2014)
8. Feyzmahdavian, H.R., Aytekin, A., Johansson, M.: An asynchronous mini-batch algorithm for regularized stochastic optimization. IEEE Trans. Autom. Control 61(12), 3740–3754 (2016)
9. Mania, H., Pan, X., Papailiopoulos, D., Recht, B., Ramchandran, K., Jordan, M.I.: Perturbed iterate analysis for asynchronous stochastic optimization. arXiv preprint arXiv:1507.06970 (2015)
10. Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation: Numerical Methods, vol. 23. Prentice Hall, Englewood Cliffs (1989)
11. Lian, X., Huang, Y., Li, Y., Liu, J.: Asynchronous parallel stochastic gradient for nonconvex optimization. In: Advances in Neural Information Processing Systems, pp. 2737–2745 (2015)
12. Raman, P., Zhang, J., Yu, H.-F., Ji, S., Vishwanathan, S.V.N.: Extreme stochastic variational inference: distributed and asynchronous. arXiv preprint arXiv:1605.09499 (2016)
13. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
14. Hoffman, M., Bach, F.R., Blei, D.M.: Online learning for latent Dirichlet allocation. In: Advances in Neural Information Processing Systems, pp. 856–864 (2010)
15. Lichman, M.: UCI Machine Learning Repository (2013)
16. Honkela, A., Valpola, H.: On-line variational Bayesian learning. In: 4th International Symposium on Independent Component Analysis and Blind Signal Separation, pp. 803–808 (2003)
17. Broderick, T., Boyd, N., Wibisono, A., Wilson, A.C., Jordan, M.I.: Streaming variational Bayes. In: Advances in Neural Information Processing Systems, pp. 1727–1735 (2013)
18. Neiswanger, W., Wang, C., Xing, E.: Embarrassingly parallel variational inference in nonconjugate models. arXiv preprint arXiv:1510.04163 (2015)
Appendices
A Background
We derive the model family studied in this paper and review SVI following the same pattern as in [3].
Model Family. Our family of models consists of three sets of random variables, namely observations \(\varvec{x}=\varvec{x}_{1:n}\), local hidden variables \(\varvec{z}=\varvec{z}_{1:n}\) and global hidden variables \(\varvec{\beta }\), together with fixed parameters \(\varvec{\alpha }\). The model assumes that the n pairs \((\varvec{x}_i,\varvec{z}_i)\) are conditionally independent given \(\varvec{\beta }\). Further, their distribution and the prior distribution of \(\varvec{\beta }\) are in an exponential family:
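Written in the overloaded notation of [3], and offered as a sketch of the standard form rather than the paper's exact typesetting, these densities are

\[
p(\varvec{\beta }\mid \varvec{\alpha }) = h(\varvec{\beta })\exp \big (\varvec{\alpha }^{\top } t(\varvec{\beta }) - a(\varvec{\alpha })\big ), \qquad
p(\varvec{x}_i,\varvec{z}_i\mid \varvec{\beta }) = h(\varvec{x}_i,\varvec{z}_i)\exp \big (\varvec{\beta }^{\top } t(\varvec{x}_i,\varvec{z}_i) - a(\varvec{\beta })\big ),
\]

so that, by conditional independence, the joint density factorizes as \(p(\varvec{x},\varvec{z},\varvec{\beta }\mid \varvec{\alpha }) = p(\varvec{\beta }\mid \varvec{\alpha })\prod _{i=1}^{n} p(\varvec{x}_i,\varvec{z}_i\mid \varvec{\beta })\).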
Here, we overload the notation for the base measures \(h(\cdot )\), sufficient statistics \(t(\cdot )\) and log normalizers \(a(\cdot )\). While the proposed approach is generic, we assume a conjugacy relationship between \((\varvec{x}_i,\varvec{z}_i)\) and \(\varvec{\beta }\). That is, the distribution \(p(\varvec{\beta }|\varvec{x},\varvec{z})\) is in the same family as the prior \(p(\varvec{\beta }| \varvec{\alpha })\).
Note that this innocent-looking family of models includes (but is not limited to) latent Dirichlet allocation [13], Bayesian Gaussian mixtures, probabilistic matrix factorization, hidden Markov models, hierarchical linear and probit regression, and many Bayesian non-parametric models.
Mean-Field Variational Inference. Variational inference (VI) approximates the intractable posterior \(p(\varvec{\beta },\varvec{z}|\varvec{x})\) by positing a family of simple distributions \(q(\varvec{\beta },\varvec{z})\) and finding the member of the family that is closest to the posterior, where closeness is measured by the KL divergence. The resulting optimization problem is equivalent to maximizing the evidence lower bound (ELBO):
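In the standard form used in [3], assumed here to match the definition intended above, the ELBO reads

\[
\mathcal {L}(q) = \mathbb {E}_q\big [\log p(\varvec{x},\varvec{z},\varvec{\beta })\big ] - \mathbb {E}_q\big [\log q(\varvec{z},\varvec{\beta })\big ].
\]

Since \(\log p(\varvec{x}) = \mathcal {L}(q) + \mathrm {KL}\big (q(\varvec{z},\varvec{\beta })\,\Vert \,p(\varvec{z},\varvec{\beta }\mid \varvec{x})\big )\) and \(\log p(\varvec{x})\) does not depend on \(q\), maximizing \(\mathcal {L}(q)\) is the same as minimizing the KL divergence to the posterior.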
The mean-field family is the simplest such family, in which the distribution over the hidden variables factorizes:
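With one factor for the global variables and one per data point, as in [3] (the indexing of the local parameters by \(i\) is an assumption of this sketch), the factorization can be written as

\[
q(\varvec{z},\varvec{\beta }) = q(\varvec{\beta }\mid \varvec{\lambda })\prod _{i=1}^{n} q(\varvec{z}_i\mid \varvec{\phi }_i),
\]

where \(\varvec{\lambda }\) parameterizes the global factor and each \(\varvec{\phi }_i\) parameterizes the local factor of the \(i\)-th data point.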
Each variational distribution is assumed to come from the same family as the true one. Mean-field VI optimizes the new ELBO with respect to the local and global variational parameters \(\varvec{\phi }\) and \(\varvec{\lambda }\):
It iteratively updates each variational parameter while holding the others fixed. Under the assumptions made so far, each update has a closed-form solution. The local parameters are a function of the global ones:
The global parameters summarise the dataset (clusters in Bayesian Gaussian mixture, topics in LDA):
To find optimal \(\varvec{\lambda }\) given fixed \(\varvec{\phi }\), we compute the natural gradient of \(\mathcal {L}(\varvec{\lambda })\) and set it to zero by setting:
Thus, the new optimal global parameters are \(\varvec{\lambda }_{t+1}=\varvec{\lambda }^*\). The algorithm works by iterating between computing the optimal local parameters given the global ones (Eq. (12)) and vice versa (Eq. (14)).
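The alternation between local and global updates can be sketched on a toy member of the model family. The block below is a minimal illustration and not the paper's implementation: it assumes a one-dimensional Gaussian mixture with unit observation variance, uniform mixing weights and a zero-mean Gaussian prior on each component mean. The function name `cavi_gmm` and all of its arguments are hypothetical.

```python
import math


def cavi_gmm(xs, init_means, prior_var=100.0, num_iters=50):
    """Coordinate-ascent mean-field VI for a toy 1-D Gaussian mixture.

    Assumed model (for illustration only):
        mu_k ~ N(0, prior_var), z_i ~ Uniform{1..K}, x_i | z_i ~ N(mu_{z_i}, 1).
    Variational family:
        q(mu_k) = N(m[k], s2[k]) (global), q(z_i) = Categorical(phi[i]) (local).
    """
    K = len(init_means)
    m = list(init_means)
    s2 = [1.0] * K
    phi = []
    for _ in range(num_iters):
        # Local step: phi[i][k] is proportional to
        # exp(E[mu_k] * x_i - E[mu_k^2] / 2), given the current global factors.
        phi = []
        for x in xs:
            logits = [m[k] * x - 0.5 * (s2[k] + m[k] ** 2) for k in range(K)]
            top = max(logits)  # subtract the max for numerical stability
            weights = [math.exp(v - top) for v in logits]
            total = sum(weights)
            phi.append([w / total for w in weights])
        # Global step: closed-form Gaussian update given the local factors.
        for k in range(K):
            resp = sum(p[k] for p in phi)
            s2[k] = 1.0 / (1.0 / prior_var + resp)
            m[k] = s2[k] * sum(p[k] * x for p, x in zip(phi, xs))
    return m, s2, phi
```

For example, `cavi_gmm([-5.2, -5.0, -4.8, 4.8, 5.0, 5.2], [-1.0, 1.0])` returns posterior means close to -5 and 5. Each pass touches every data point, which is the cost that the stochastic variant is designed to avoid.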
Stochastic Variational Inference. Rather than analysing all the data to compute \(\varvec{\lambda }^*\) at each iteration, stochastic optimization can be used. Assuming that a data point is selected uniformly at random from the dataset, an unbiased noisy estimator of \(\mathcal {L}(\varvec{\lambda },\varvec{\phi })\) can be developed based on that single data point:
The unbiased stochastic approximation of the ELBO as a function of \(\varvec{\lambda }\) can be written as follows:
Following the same steps as in the previous section, we obtain a noisy unbiased estimate of Eq. (14):
Iteratively, we move the global parameters by a step of size \(\rho _t\) in the direction of the noisy natural gradient:
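In the form given in [3], and writing \(\hat{\varvec{\lambda }}_t\) for the noisy estimate of the optimal global parameters at iteration \(t\) (the hat symbol is an assumption of this sketch), the noisy natural gradient is \(\hat{\varvec{\lambda }}_t - \varvec{\lambda }_t\) and the step becomes a convex combination:

\[
\varvec{\lambda }_{t+1} = \varvec{\lambda }_t + \rho _t\big (\hat{\varvec{\lambda }}_t - \varvec{\lambda }_t\big ) = (1-\rho _t)\,\varvec{\lambda }_t + \rho _t\,\hat{\varvec{\lambda }}_t .
\]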
The algorithm converges provided that the step sizes satisfy \(\sum _{t=1}^\infty \rho _t=\infty \) and \(\sum _{t=1}^\infty \rho _t^2<\infty \) [4].
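A minimal serial sketch of this procedure follows, under the same toy assumptions as a one-dimensional Gaussian mixture with unit variance, uniform weights and a zero-mean Gaussian prior on the means. It is not the paper's code. The schedule \(\rho _t=(t+\tau )^{-\kappa }\) with \(\kappa \in (0.5,1]\) is the one suggested in [3] and satisfies both step-size conditions. The names `step_size`, `local_step` and `svi_gmm` are hypothetical, and the global parameters are stored as precision-weighted means `a` and precisions `b`, which are linear in the natural parameters of each Gaussian factor.

```python
import math
import random


def step_size(t, tau=1.0, kappa=0.7):
    """Robbins-Monro schedule: sum(rho) diverges, sum(rho^2) converges."""
    return (t + tau) ** (-kappa)


def local_step(x, a, b):
    """Optimal q(z_i) for one data point given the global parameters."""
    logits = []
    for a_k, b_k in zip(a, b):
        mean, var = a_k / b_k, 1.0 / b_k
        logits.append(mean * x - 0.5 * (var + mean ** 2))
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def svi_gmm(xs, init_means, prior_var=100.0, num_iters=2000, seed=0):
    """Serial SVI for the assumed toy mixture; returns posterior means."""
    rng = random.Random(seed)
    n = len(xs)
    a = list(init_means)   # precision-weighted means (initial precision 1)
    b = [1.0] * len(a)     # precisions
    for t in range(1, num_iters + 1):
        x = rng.choice(xs)             # sample one data point uniformly
        phi = local_step(x, a, b)      # local variational parameters
        rho = step_size(t)
        for k in range(len(a)):
            # Intermediate estimate: pretend the sampled point occurs n times.
            a_hat = n * phi[k] * x
            b_hat = 1.0 / prior_var + n * phi[k]
            # Noisy natural-gradient step as a convex combination.
            a[k] = (1 - rho) * a[k] + rho * a_hat
            b[k] = (1 - rho) * b[k] + rho * b_hat
    return [a_k / b_k for a_k, b_k in zip(a, b)]
```

Because each update is a convex combination of positive precisions, `b` stays positive throughout, so the means `a[k] / b[k]` are always well defined.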
B Related Work
Few approaches have been proposed to scale VI to large datasets. We can distinguish two major classes. The first class is based on the Bayesian filtering approach [16, 17]. That is, the sequential nature of Bayes' theorem is exploited to recursively update an approximation of the posterior. In particular, VI is used between the updates to approximate the posterior, which becomes the prior of the next step. The authors of [16] use forgetting factors to decay the contribution of old data in favour of new data. The algorithm proposed in [17] considers a sequence of data batches and iterates over the data in each batch until convergence. Relying on a master-slave architecture, the computation of the batch posteriors is done in a distributed and asynchronous manner. That is, the algorithm applies VI by performing asynchronous Bayesian updates to the posterior as data batches arrive continuously.
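The recursive use of Bayes' theorem can be illustrated with a conjugate toy model in which the update is exact, so no variational approximation is needed. The sketch below assumes a Beta prior on a Bernoulli success probability; it is not the method of [16] or [17], and the function names are hypothetical.

```python
def update_beta(prior, batch):
    """Return the Beta posterior (alpha, beta) after a batch of 0/1 outcomes."""
    alpha, beta = prior
    successes = sum(batch)
    return (alpha + successes, beta + len(batch) - successes)


def streaming_posterior(prior, batches):
    """Process batches in sequence: each posterior is the next prior."""
    posterior = prior
    for batch in batches:
        posterior = update_beta(posterior, batch)
    return posterior
```

Processing the batches `[[1, 0, 1], [1, 1], [0]]` one at a time from a `(1, 1)` prior gives the same posterior as a single update on the concatenated data. In non-conjugate or latent-variable models the per-batch posterior is no longer available in closed form, which is where VI enters.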
The second class of work is based on optimization [3, 12, 18]. As already discussed, SVI, proposed by [3], employs stochastic optimization to scale up Bayesian computation to massive data. SVI is inherently serial and requires the model parameters to fit in the memory of a single processor. The authors of [18] present a VI-based inference algorithm that runs in parallel on data divided across several slaves. However, at each iteration, the slaves are synchronized to combine their obtained parameters. Such synchronisation limits scalability and reduces the speed of the update to that of the slowest slave. To avoid bulk synchronization, the authors of [12] propose an asynchronous and lock-free update. In this update, vertical parallelism is adopted, where each processor asynchronously updates a subset of the parameters based on a subset of attributes. In contrast, we adopt horizontal parallelism, where each update is based on a few (mini-batched or single) data points acquired from distributed sources. The update steps are aggregated to form the global update. Our proposed approach can make use of the mechanism proposed by [12] to achieve a hybrid horizontal-vertical parallelism. In contrast to [12], our approach is not customised for LDA and can be applied to any model in the family presented in Sect. A.
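The effect of aggregating updates computed from out-of-date parameters can be explored with a purely serial simulation. The sketch below is an illustration under stated assumptions and not the algorithm proposed in the paper: it uses no MPI, it reuses a toy one-dimensional Gaussian mixture (unit variance, uniform weights, zero-mean Gaussian prior on the means), and it models `num_workers` hypothetical workers whose results arrive in round-robin order, each computed from a parameter snapshot that is `num_workers - 1` updates old. The master blends each arriving intermediate estimate into the current parameters, a design choice that keeps the precisions positive.

```python
import math
import random
from collections import deque


def responsibilities(x, a, b):
    """Optimal q(z_i) for one point under global parameters (a, b)."""
    logits = []
    for a_k, b_k in zip(a, b):
        mean, var = a_k / b_k, 1.0 / b_k
        logits.append(mean * x - 0.5 * (var + mean ** 2))
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def delayed_svi(xs, init_means, num_workers=4, num_iters=2000,
                prior_var=100.0, seed=0):
    """Serially simulate SVI updates computed from stale parameters."""
    rng = random.Random(seed)
    n, K = len(xs), len(init_means)
    a = list(init_means)   # precision-weighted means of q(mu_k)
    b = [1.0] * K          # precisions of q(mu_k)
    # Every simulated worker starts from a copy of the initial parameters.
    in_flight = deque((list(a), list(b)) for _ in range(num_workers))
    for t in range(1, num_iters + 1):
        # The next result to arrive was computed from a stale snapshot.
        stale_a, stale_b = in_flight.popleft()
        x = rng.choice(xs)
        phi = responsibilities(x, stale_a, stale_b)
        rho = (t + 1.0) ** -0.7
        for k in range(K):
            # Master blends the worker's intermediate estimate into the
            # current (fresh) global parameters.
            a[k] = (1 - rho) * a[k] + rho * (n * phi[k] * x)
            b[k] = (1 - rho) * b[k] + rho * (1.0 / prior_var + n * phi[k])
        # That worker now fetches the fresh parameters for its next point.
        in_flight.append((list(a), list(b)))
    return [a_k / b_k for a_k, b_k in zip(a, b)]
```

Setting `num_workers=1` removes the delay and recovers ordinary serial SVI on this toy model, which gives a simple baseline for checking how much staleness a given step-size schedule tolerates.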
© 2020 Springer Nature Switzerland AG
Cite this paper
Mohamad, S., Bouchachia, A., Sayed-Mouchaweh, M. (2020). Asynchronous Stochastic Variational Inference. In: Oneto, L., Navarin, N., Sperduti, A., Anguita, D. (eds) Recent Advances in Big Data and Deep Learning. INNSBDDL 2019. Proceedings of the International Neural Networks Society, vol 1. Springer, Cham. https://doi.org/10.1007/978-3-030-16841-4_31
DOI: https://doi.org/10.1007/978-3-030-16841-4_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-16840-7
Online ISBN: 978-3-030-16841-4
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)