An Overview of Restricted Boltzmann Machines
Abstract
The restricted Boltzmann machine (RBM) is a two-layered network of stochastic units with undirected connections between pairs of units in the two layers. The two layers of nodes are called visible and hidden nodes. In an RBM, there are no connections from visible to visible or hidden to hidden nodes. RBMs are used mainly as generative models, but they can be suitably modified to perform classification tasks as well. They are among the basic building blocks of other deep learning models such as the deep Boltzmann machine and deep belief networks. The aim of this article is to give a tutorial introduction to restricted Boltzmann machines and to review the evolution of this model.
1 Introduction
In 1982, Hopfield introduced a fully connected network of interacting units with which one can store and retrieve binary patterns [21]. This network can be regarded as a dynamical system in which the stable states (associated with minima of a suitably defined energy) correspond to the desired memories. The network is initialized with a random state and each node is allowed to update its state through a simple rule that depends on the states of the nodes connected to it. These dynamics result in a path in the state space along which the energy of the network keeps decreasing. The Hopfield network was very influential in the development of neural network models in the 1980s. Though successful in storing and retrieving the desired patterns, the Hopfield model was observed to have a limited capacity to store memories and to suffer from spurious minima. To mitigate some of these problems, a stochastic version of the Hopfield model, called the Boltzmann machine, was proposed [17]. A Boltzmann machine (BM) is a model of pairwise interacting units where each unit updates its state over time in a probabilistic way depending on the states of the neighboring units. Unlike a Hopfield model, the BM can also have hidden units. However, learning the parameters of the BM is computationally intensive. To reduce the complexity of learning, the connectivity structure of the BM is restricted [15, 19, 42]. This model is called the restricted Boltzmann machine (RBM). The RBM has played an important role in the current resurgence of neural networks. The algorithm [18] for unsupervised pretraining of deep belief networks using RBMs is a very significant development in the field, in the sense that it rekindled interest in deep neural networks.
The RBM is a probabilistic energy-based model (EBM) with a two-layer architecture in which the visible stochastic units are connected to the hidden stochastic units as shown in Fig. 1. There are no connections from visible to visible or hidden to hidden nodes. In a Boltzmann machine, as in a Hopfield model, every unit can be connected to every other unit. The RBM is a special case of the Boltzmann machine in which a bipartite structure is imposed by removing the intralayer connections [15, 19, 42]. Even though the RBM is a generative model, it can also be used as a discriminative model with suitable modifications.
The parameters, \(\theta \), of the RBM include all the connection weights and the biases. The probability distribution represented by the model is determined by these parameters. Given training data, we need to learn the parameters so that the distribution represented by the model closely matches the desired distribution as indicated by the training data.
Maximum likelihood estimation is the most popular method for learning RBM parameters. However, the gradient of the log-likelihood (w.r.t. the parameters of the model) is intractable, since it contains an expectation w.r.t. the model distribution. Computing this expectation exactly is expensive: the cost is exponential in the minimum of the number of visible and hidden units in the model. Therefore, this expectation is approximated by taking an average over samples from the model distribution. The samples are obtained by exploiting the bipartite connectivity structure of the RBM.
To avoid evaluating the intractable expectation in the maximum likelihood method, other methods such as pseudolikelihood, ratio matching and generalized score matching have been proposed. These methods use conditional distributions, which avoids the intractability referred to above. However, the empirical analysis of Marlin et al. [28] showed that the maximum likelihood method performs better than these methods even though it is computationally intensive.
We first discuss the model briefly before discussing the learning algorithms.
1.1 The RBM model
1.2 Representational Power
The representational power of a model defines its ability to capture a class of distributions. As seen from Eq. (8), the RBM distribution is a product of mixtures, where each hidden unit contributes to a mixture which is a product of two distributions. Increasing the number of hidden units guarantees an improvement in the training log-likelihood or, equivalently, a reduction in the KL divergence between the data and the model distributions [24]. Le Roux and Bengio [24] further showed that any distribution over \(\{0, 1\}^n\) can be approximated arbitrarily well (in terms of the KL divergence measure) by an RBM with \(k + 1\) hidden units, where k is the number of input vectors with nonzero probability. This result is generalized to show that any distribution can be approximated arbitrarily well by an RBM with \(2^n-1\) hidden units [32]. The results are refined further [33] to show that an RBM with \(\alpha 2^n-1\) hidden units (where \(\alpha < 1\)) is sufficient to approximate any distribution.
2 Learning RBM with Maximum Likelihood
2.1 Markov Chain Monte Carlo Estimation and Gibbs Sampling
 1. \(x_1^{s+1}\sim p(x_1\vert x_2^{s},x_3^{s},\ldots ,x_n^{s}),\)
 2. \(x_2^{s+1}\sim p(x_2\vert x_1^{s+1},x_3^{s},x_4^{s},\ldots ,x_n^{s}),\)
 3. \(x_i^{s+1}\sim p(x_i\vert x_{1:i-1}^{s+1},x_{i+1:n}^{s}),\quad \text {for } i=3,\ldots ,n-1,\)
 4. \(x_n^{s+1}\sim p(x_n\vert x_1^{s+1},x_2^{s+1},\ldots ,x_{n-1}^{s+1}).\)
For the RBM case, all of the hidden units can be sampled simultaneously, because they are conditionally independent of each other given the visible units. Similarly, all of the visible units can be sampled simultaneously, since they are conditionally independent of each other given the hidden units. This sampling method, which updates many variables simultaneously, is called block Gibbs sampling. As seen earlier, the conditionals are given by sigmoid functions.
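The block Gibbs procedure described above can be sketched in a few lines of NumPy. This is only an illustrative sketch for a binary RBM: the names `W` (weights, shape visible x hidden), `b` (visible biases), `c` (hidden biases) and all function names are assumptions chosen for this example, and the sigmoid conditionals assume the usual energy \(E(\mathbf{v},\mathbf{h}) = -\mathbf{b}^T\mathbf{v} - \mathbf{c}^T\mathbf{h} - \mathbf{v}^T W \mathbf{h}\).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, c, rng):
    """Sample every hidden unit in one shot given the visible units."""
    p_h = sigmoid(v @ W + c)                      # p(h_j = 1 | v)
    return (rng.random(p_h.shape) < p_h).astype(float), p_h

def sample_visible(h, W, b, rng):
    """Sample every visible unit in one shot given the hidden units."""
    p_v = sigmoid(h @ W.T + b)                    # p(v_i = 1 | h)
    return (rng.random(p_v.shape) < p_v).astype(float), p_v

def block_gibbs(v0, W, b, c, n_steps, rng):
    """Run n_steps of alternating (hidden, visible) block updates."""
    v = v0
    for _ in range(n_steps):
        h, _ = sample_hidden(v, W, c, rng)
        v, _ = sample_visible(h, W, b, rng)
    return v
```

Each row of `v0` is an independent chain, so a whole batch of chains advances with two matrix products per step; this is the practical benefit of the bipartite structure.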
2.2 Contrastive Divergence
The fixed points of CD differ from those of ML, and CD-k gives a biased estimate of the gradient. This bias depends on the number of units in the RBM and the maximum change in energy that can be produced by changing a single unit [14]. The bias is also affected by the distance in variation between the model distribution and the initial distribution of the Gibbs chain [14]. The studies in [1, 5] reveal that the bias in the CD-k approximation can lead to convergence to parameters that do not reach the maximum likelihood. The CD-k update is not the gradient of any function, and a counterintuitive regularization function that causes CD learning to cycle indefinitely is constructed in [43]. The CD-k approximation can also be viewed as a stochastic approximation algorithm. The convergence conditions are analyzed in [22, 50, 51].
2.2.1 Other Approaches
There are two main classes of approaches to make the learning of RBMs more efficient. The first is to design an efficient MCMC method that yields good representative samples from the model distribution and thereby reduces the variance of the estimated gradient. The persistent contrastive divergence (PCD) algorithm was initially proposed to train Boltzmann machines [49] and was later shown to perform better than CD in the context of RBMs [45]. The algorithm is similar to CD learning; the only change is in the initialization of the Gibbs chain. Specifically, the Gibbs chain is not reinitialized to the training vector after k steps for each parameter update; instead, the chain starts at the samples from the previous iteration (termed fantasy particles). In the fast persistent contrastive divergence (FPCD) algorithm [46], an additional set of weights (the fast weights) is introduced and shown to perform better by improving the mixing rate of the persistent chain. The regular and the fast weights both contribute to the effective weight update. The fast weights are updated in the same way as the regular weights, but with much stronger weight decay and a faster learning rate. Other such algorithms based on modifying the MCMC sampling procedure include population contrastive divergence (pop-CD) [36] and average contrastive divergence (ACD) [26]. Another popular algorithm, parallel tempering (PT) [9], is also based on MCMC. However, in general, such advanced MCMC methods are computationally intensive.
The second approach is to design better optimization strategies that are robust to the noise in the estimated gradient [4, 11, 29]. Most approaches to designing better optimization methods for learning RBMs are second-order optimization techniques that need either an approximate Hessian inverse or an estimate of the inverse Fisher matrix. The AdaGrad [12] method uses a diagonal approximation of the Hessian matrix, while TONGA [38] assumes a block diagonal structure. The Hessian-free (H-F) method [29] is an iterative procedure which approximately solves a linear system to obtain the curvature through matrix-vector products. The H-F method is used to design natural gradient descent for learning Boltzmann machines [11]. A sparse Gaussian graphical model can be used to estimate the inverse Fisher matrix to devise a factorized natural gradient descent procedure [16]. All these methods either need additional computations to solve an auxiliary linear system or are computationally intensive methods that directly estimate the inverse Fisher matrix. The centered gradients (CG) method [30] is motivated by the principle that removing the mean of the training data and the mean of the hidden activations from the visible and the hidden variables, respectively, improves the conditioning of the underlying optimization problem [31]. The RBM log-likelihood function is a difference of convex functions, since both f and g in Eq. (13) can be written in log-sum-exponential form. This property is exploited in [4] to devise a majorization–minimization optimization algorithm called the stochastic spectral descent (SSD) algorithm. A stochastic variation of difference of convex programming (DCP) can also be used to exploit the difference of convex functions property of the RBM log-likelihood function [35, 47].
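The difference between CD-k and PCD described in this paragraph comes down to where the Gibbs chain starts. A minimal NumPy sketch, assuming a binary RBM with the standard sigmoid conditionals, is given below; the function names, the `persistent` flag, the learning rate and the batch size are all illustrative assumptions and not settings taken from the cited papers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b, c, rng):
    """One block Gibbs sweep: sample h given v, then v given h."""
    p_h = sigmoid(v @ W + c)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + b)
    return (rng.random(p_v.shape) < p_v).astype(float)

def gradient_estimate(v_data, chain_start, W, b, c, k, rng):
    """Stochastic log-likelihood gradient: data term minus a k-step chain term."""
    v_model = chain_start
    for _ in range(k):
        v_model = gibbs_step(v_model, W, b, c, rng)
    ph_data = sigmoid(v_data @ W + c)
    ph_model = sigmoid(v_model @ W + c)
    dW = v_data.T @ ph_data / len(v_data) - v_model.T @ ph_model / len(v_model)
    db = v_data.mean(axis=0) - v_model.mean(axis=0)
    dc = ph_data.mean(axis=0) - ph_model.mean(axis=0)
    return dW, db, dc, v_model

def train(data, n_hidden, k=1, persistent=False, lr=0.05, epochs=5, batch=10, seed=0):
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    # Fantasy particles: only used (and only carried over) when persistent=True.
    fantasy = (rng.random((batch, n_visible)) < 0.5).astype(float)
    for _ in range(epochs):
        for start in range(0, len(data), batch):
            v_data = data[start:start + batch]
            # CD-k restarts the chain at the data; PCD continues the old chain.
            chain_start = fantasy if persistent else v_data
            dW, db, dc, v_model = gradient_estimate(v_data, chain_start, W, b, c, k, rng)
            W += lr * dW
            b += lr * db
            c += lr * dc
            if persistent:
                fantasy = v_model
    return W, b, c
```

Calling `train(data, n_hidden=16, k=1)` gives plain CD-1, while `train(data, n_hidden=16, k=1, persistent=True)` switches to the persistent chain without touching the rest of the update.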
 Pseudolikelihood: $$\begin{aligned} L_\theta ^{PL}(\mathbf {v}^{(1)},\mathbf {v}^{(2)},\ldots , \mathbf {v}^{(N)})= \frac{1}{N m}\sum \limits _{i=1}^N \sum \limits _{j=1}^m \log p\left( v_j^{(i)}\vert \mathbf {v}_{-j}^{(i)},\theta \right) \end{aligned}$$ (19)
 Ratio matching: $$\begin{aligned} L_\theta ^{RM}(\mathbf {v}^{(1)},\mathbf {v}^{(2)},\ldots , \mathbf {v}^{(N)})= -\frac{1}{N m}\sum \limits _{i=1}^N \sum \limits _{j=1}^m \left( 1-p\left( v_j^{(i)}\vert \mathbf {v}_{-j}^{(i)},\theta \right) \right) ^2 \end{aligned}$$ (20)
 Generalized score matching: $$\begin{aligned}&L_\theta ^{\text {GSM}}(\mathbf {v}^{(1)},\mathbf {v}^{(2)},\ldots , \mathbf {v}^{(N)})\nonumber \\&\quad = -\frac{1}{N m}\sum \limits _{i=1}^N \sum \limits _{j=1}^m \left( \frac{1}{p\left( v_j^{(i)}\vert \mathbf {v}_{-j}^{(i)},\theta \right) }-\frac{1}{p_\text {data}\left( v_j^{(i)}\vert \mathbf {v}_{-j}^{(i)},\theta \right) }\right) ^2 \end{aligned}$$ (21)
Here \(\mathbf {v}_{-j}^{(i)}\) denotes the i-th training vector with its j-th component removed, N is the number of training vectors and m is the number of visible units.
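As an illustration of why these objectives avoid the partition function, the conditional of one visible unit given all the others can be computed from the ratio of two unnormalized probabilities, so Z cancels. The sketch below evaluates the pseudolikelihood objective of Eq. (19) for a binary RBM. It assumes the standard free energy \(F(\mathbf{v}) = -\mathbf{b}^T\mathbf{v} - \sum_j \log(1+e^{c_j + \mathbf{v}^T W_{\cdot j}})\), with \(p(\mathbf{v}) \propto e^{-F(\mathbf{v})}\); the names `W`, `b`, `c` and the function names are assumptions made for this example.

```python
import numpy as np

def free_energy(v, W, b, c):
    """F(v) with p(v) proportional to exp(-F(v)); works for one vector or a batch."""
    return -(v @ b) - np.logaddexp(0.0, v @ W + c).sum(axis=-1)

def log_conditionals(v, W, b, c):
    """log p(v_j | all other visible units) for every j of one binary vector v."""
    f_v = free_energy(v, W, b, c)
    out = np.empty(len(v))
    for j in range(len(v)):
        v_flip = v.copy()
        v_flip[j] = 1.0 - v_flip[j]               # same vector with bit j flipped
        f_flip = free_energy(v_flip, W, b, c)
        # p = e^{-F(v)} / (e^{-F(v)} + e^{-F(v_flip)}); the partition function cancels.
        out[j] = -np.logaddexp(0.0, f_v - f_flip)
    return out

def pseudo_log_likelihood(data, W, b, c):
    """Average of log p(v_j | rest) over all N vectors and all m units."""
    return float(np.mean([log_conditionals(v, W, b, c).mean() for v in data]))
```

Each conditional costs two free-energy evaluations, so the whole objective is linear in the number of visible units per training vector, in contrast to the exponential cost of the exact likelihood.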
3 Generalization
4 Multilayered Networks with RBM
One of the developments that contributed to the initial success of deep learning is the greedy layerwise training of stacked RBMs, which provided a good initialization for training neural networks with many hidden layers. RBMs can be stacked to form multiple layers, yielding different models. For example, the deep belief network and the deep Boltzmann machine both have the RBM as a building block, but their connections differ.
4.1 Deep Belief Network
A deep belief network (DBN) is a generative model containing many layers of hidden units, in which each layer captures correlations between the activities of hidden units in the layer below. The top two layers form an undirected bipartite graph similar to an RBM. The lower layers form a directed sigmoid belief network, as shown in Fig. 3. A DBN with only one hidden layer is an RBM.
The learnt weights of the DBN are used as the initial weights of an MLP that has an additional classification layer. This MLP is then fine-tuned for the classification task. The initial success of the DBN is due to this algorithm, which provided an efficient way to initialize the weights of an MLP and thereby improve its performance on many discriminative tasks, both in terms of training time and classification accuracy [18].
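The greedy layerwise idea can be sketched as follows: train one RBM, push the data through it, and train the next RBM on the resulting hidden activities. This is a simplified illustration and not the exact procedure of [18]. It assumes full-batch CD-1 for each layer and passes hidden probabilities (rather than binary samples) to the next layer; all function names and hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(data, n_hidden, rng, lr=0.1, epochs=10):
    """Train a single RBM with full-batch CD-1; returns (W, b, c)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    n = len(data)
    for _ in range(epochs):
        ph0 = sigmoid(data @ W + c)                       # positive phase
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + b)                       # one Gibbs step
        v1 = (rng.random(pv1.shape) < pv1).astype(float)
        ph1 = sigmoid(v1 @ W + c)                         # negative phase
        W += lr * (data.T @ ph0 - v1.T @ ph1) / n
        b += lr * (data - v1).mean(axis=0)
        c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

def pretrain_stack(data, layer_sizes, seed=0):
    """Greedily train one RBM per layer, bottom to top."""
    rng = np.random.default_rng(seed)
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, _, c = train_rbm_cd1(x, n_hidden, rng)
        layers.append((W, c))
        x = sigmoid(x @ W + c)      # activities become the next layer's data
    return layers

def mlp_forward(x, layers):
    """Deterministic forward pass of an MLP initialized with the stack."""
    for W, c in layers:
        x = sigmoid(x @ W + c)
    return x
```

For discriminative use, a classification layer would be placed on top of `mlp_forward` and the whole network fine-tuned with a supervised loss; that step is omitted here.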
4.2 Deep Boltzmann Machine
5 Evaluation
The likelihood \(p(\mathbf {v})\) can be written as \(p^{*}(\mathbf {v})/Z\). While \(p^{*}(\mathbf {v})\) is easy to evaluate, the normalizing constant Z, called the partition function, is computationally expensive. Therefore, various sampling-based estimators have been proposed for estimating the log-likelihood. The first approach is to estimate the partition function approximately using samples obtained from the model distribution. Since sampling from the model distribution of an RBM is complicated, a useful technique is importance sampling, where samples obtained from a simple distribution, called the proposal distribution, are used to estimate the partition function.
5.1 Simple Importance Sampling
The expectation of a function f(x) with respect to a given distribution p(x) can be approximated by taking an average over independent samples from p(x). When the distribution is too complex (rugged and high dimensional), generating independent samples is difficult. In such situations, a simple distribution from which independent samples can easily be generated is used to assist in obtaining the approximate expectation. The algorithm works as follows.
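A minimal sketch of simple importance sampling for the RBM partition function is shown below. It uses the identity \(Z = \sum_{\mathbf{v}} p^{*}(\mathbf{v}) = E_{q}[p^{*}(\mathbf{v})/q(\mathbf{v})]\) and, as an assumption made only for this example, takes the proposal q to be uniform over binary vectors. The closed form of \(p^{*}(\mathbf{v})\) assumes a binary RBM with weights `W` and biases `b`, `c`; these names and the function names are illustrative.

```python
import numpy as np

def log_p_star(v, W, b, c):
    """log of the unnormalized marginal p*(v), hidden units summed out."""
    return v @ b + np.logaddexp(0.0, v @ W + c).sum(axis=-1)

def exact_log_z(W, b, c):
    """Brute-force log Z by enumerating all 2^n visible vectors (tiny n only)."""
    n = len(b)
    all_v = ((np.arange(2 ** n)[:, None] >> np.arange(n)) & 1).astype(float)
    return np.logaddexp.reduce(log_p_star(all_v, W, b, c))

def importance_sampling_log_z(W, b, c, n_samples, rng):
    """Estimate log Z from samples of a uniform proposal q(v) = 2^{-n}."""
    n = len(b)
    v = (rng.random((n_samples, n)) < 0.5).astype(float)
    log_w = log_p_star(v, W, b, c) + n * np.log(2.0)    # log of p*(v) / q(v)
    return np.logaddexp.reduce(log_w) - np.log(n_samples)
```

The estimate is reliable only when the proposal is close to the target. With small weights the uniform proposal works well, but for a trained RBM the importance weights typically have very high variance, which is the motivation for annealed importance sampling.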
5.2 Annealed Importance Sampling

\(p_k(\mathbf {x})\ne 0\) whenever \(p_{k+1}(\mathbf {x})\ne 0\).

The unnormalized probability \(p_k^{*}(\mathbf {x})\) is easy to calculate \(\forall k\).

For each k, it is possible to get sample \(\mathbf {x}'\) given \(\mathbf {x}\) through Markov transition \(T_k(\mathbf {x}^{'}\vert \mathbf {x})\) which leaves \(p_k(\mathbf {x})\) invariant.
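The three conditions above are satisfied by a commonly used family of intermediate distributions in which the RBM parameters are scaled by an inverse temperature \(\beta_k\) rising from 0 to 1. The sketch below is an illustration under that assumption, not necessarily the schedule used in the cited work. At \(\beta = 0\) the distribution is uniform with \(Z_0 = 2^{m+n_h}\), every \(p_k^{*}\) has a closed form, and one block Gibbs sweep of the scaled RBM serves as \(T_k\). The names `W`, `b`, `c`, `betas` and the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_p_star(v, W, b, c, beta):
    """log p_k*(v) for the RBM whose parameters are all multiplied by beta."""
    return beta * (v @ b) + np.logaddexp(0.0, beta * (v @ W + c)).sum(axis=-1)

def ais_log_z(W, b, c, betas, n_chains, rng):
    """Annealed importance sampling estimate of log Z for a binary RBM."""
    n_visible, n_hidden = W.shape
    # Exact samples from p_0, which is uniform because beta_0 = 0.
    v = (rng.random((n_chains, n_visible)) < 0.5).astype(float)
    log_w = np.zeros(n_chains)
    for beta_prev, beta in zip(betas[:-1], betas[1:]):
        # Importance weight update: p_k*(v) / p_{k-1}*(v) at the current state.
        log_w += log_p_star(v, W, b, c, beta) - log_p_star(v, W, b, c, beta_prev)
        # Transition T_k: one block Gibbs sweep that leaves p_k invariant.
        p_h = sigmoid(beta * (v @ W + c))
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(beta * (h @ W.T + b))
        v = (rng.random(p_v.shape) < p_v).astype(float)
    log_z0 = (n_visible + n_hidden) * np.log(2.0)
    return log_z0 + np.logaddexp.reduce(log_w) - np.log(n_chains)
```

A typical call would be `ais_log_z(W, b, c, np.linspace(0.0, 1.0, 1001), 100, rng)`. A finer grid of `betas` lowers the variance of the weights at the cost of more Gibbs sweeps.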
There are other approaches that estimate the average test log-likelihood directly by marginalizing over the hidden variables from the model distribution. However, the computational complexity grows exponentially with the minimum of the number of visible and hidden units present in the model. Hence, an approximate method is used, based on a sample-based estimator called the conservative sampling-based likelihood estimator (CSL) [2]. A more efficient method called the reverse annealed importance sampling estimator (RAISE) [3] implements CSL by formulating the problem of marginalization as a partition function estimation problem.
6 Conclusions
The RBM model has played an important role in the recent spectacular developments in neural network models and deep learning. The RBM is a generative model and it can extract patterns from the given data in an unsupervised manner. It is a very useful method for unsupervised feature learning. RBMs have also played an important role in unsupervised pretraining for weight initialization of deep neural networks. As mentioned earlier, the RBM is a building block for DBNs and DBMs. The pretrained networks are used as an efficient initialization of MLPs, which are later fine-tuned for specific tasks. RBMs have been used in a variety of applications such as collaborative filtering [39], analyzing the connectivity structure of the brain using fMRI images [40], constructing topic models from unstructured text data [20], modeling natural images [8], etc.
In this paper, we presented a tutorial introduction to the RBM model and provided some discussion on learning an RBM from data through maximum likelihood estimation. We described the popular CD-k algorithm and also discussed many other methods that have been proposed. We also discussed generalizations of the basic model, both in terms of real-valued visible units and in terms of multilayer networks constructed using RBMs.
Many other successful variants of the RBM have also been proposed to further improve the ability of the model to represent different types of data. For instance, the conditional restricted Boltzmann machine (CRBM) was proposed for collaborative filtering, the Gaussian–Bernoulli restricted Boltzmann machine (GRBM) was proposed for real-valued continuous data, the recurrent temporal Boltzmann machine (RTBM) was devised to represent sequential data, and convolutional RBMs were developed to capture the spatial structure in images and to learn acoustic filters from spectrograms of speech data.
RBMs are useful generative models and, unlike some other generative models (e.g., GANs), they are also very effective as discriminative models. However, learning an RBM is computationally expensive, mainly because of the intractability of the partition function. As discussed here, one uses MCMC sampling techniques to estimate gradients of the likelihood function for learning. Developing better methods for learning RBMs is currently an important research problem. Constructing RBMs with real-valued visible units is also a problem of much current interest, as is designing proper architectures for RBMs with multiple hidden layers. RBMs, along with CNNs, provided the initial push for the deep learning revolution over the past few years and are likely to be important for the field in the years to come.
Notes
References
 1. Bengio Y, Delalleau O (2009) Justifying and generalizing contrastive divergence. Neural Comput 21(6):1601–1621
 2. Bengio Y, Yao L, Cho K (2013) Bounding the test log-likelihood of generative models. arXiv:1311.6184 (arXiv preprint)
 3. Burda Y, Grosse RB, Salakhutdinov R (2014) Accurate and conservative estimates of MRF log-likelihood using reverse annealing. arXiv:1412.8566 (arXiv preprint)
 4. Carlson D, Cevher V, Carin L (2015) Stochastic spectral descent for restricted Boltzmann machines. In: Proceedings of the eighteenth international conference on artificial intelligence and statistics, pp 111–119
 5. Carreira-Perpiñán MA, Hinton GE (2005) On contrastive divergence learning. In: Proceedings of the tenth international workshop on artificial intelligence and statistics. Citeseer, pp 33–40
 6. Cho K, Ilin A, Raiko T (2011) Improved learning of Gaussian–Bernoulli restricted Boltzmann machines. In: Honkela T, Duch W, Girolami M, Kaski S (eds) Artificial neural networks and machine learning–ICANN 2011. Springer, Berlin, pp 10–17 (ISBN 9783642217357)
 7. Courville A, Bergstra J, Bengio Y (2011a) A spike and slab restricted Boltzmann machine. In: Gordon G, Dunson D, Dudík M (eds) Proceedings of the fourteenth international conference on artificial intelligence and statistics, volume 15 of Proceedings of machine learning research, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR, pp 233–241. http://proceedings.mlr.press/v15/courville11a.html
 8. Courville A, Bergstra J, Bengio Y (2011b) Unsupervised models of images by spike-and-slab RBMs. In: Proceedings of the 28th international conference on machine learning, ICML’11, USA. Omnipress, pp 1145–1152. http://dl.acm.org/citation.cfm?id=3104482.3104626 (ISBN 9781450306195)
 9. Desjardins G, Courville A, Bengio Y (2010a) Adaptive parallel tempering for stochastic maximum likelihood learning of RBMs. arXiv:1012.3476 (arXiv preprint)
 10. Desjardins G, Courville AC, Bengio Y, Vincent P, Delalleau O (2010b) Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines. In: International conference on artificial intelligence and statistics, pp 145–152
 11. Desjardins G, Pascanu R, Courville AC, Bengio Y (2013) Metric-free natural gradient for joint-training of Boltzmann machines. CoRR. arXiv:1301.3545
 12. Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12(Jul):2121–2159
 13. Fischer A, Igel C (2010) Empirical analysis of the divergence of Gibbs sampling based learning algorithms for restricted Boltzmann machines. In: Artificial neural networks–ICANN 2010. Springer, pp 208–217
 14. Fischer A, Igel C (2011) Bounding the bias of contrastive divergence learning. Neural Comput 23(3):664–673
 15. Freund Y, Haussler D (1994) Unsupervised learning of distributions of binary vectors using two layer networks. Computer Research Laboratory [University of California, Santa Cruz]
 16. Grosse RB, Salakhutdinov R (2015) Scaling up natural gradient by sparsely factorizing the inverse Fisher matrix. In: Proceedings of the 32nd international conference on machine learning, volume 37, ICML’15, pp 2304–2313. JMLR.org. http://dl.acm.org/citation.cfm?id=3045118.3045363
 17. Hinton GE, Sejnowski TJ (1986) Learning and relearning in Boltzmann machines. In: Parallel distributed processing: explorations in the microstructure of cognition, vol 1. MIT Press, Cambridge, pp 282–317. http://dl.acm.org/citation.cfm?id=104279.104291 (ISBN 026268053X)
 18. Hinton G, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554
 19. Hinton GE (2002) Training products of experts by minimizing contrastive divergence. Neural Comput 14(8):1771–1800
 20. Hinton GE, Salakhutdinov RR (2009) Replicated Softmax: an undirected topic model. In: Bengio Y, Schuurmans D, Lafferty JD, Williams CKI, Culotta A (eds) Advances in neural information processing systems 22. Curran Associates, Inc., pp 1607–1614. http://papers.nips.cc/paper/3856-replicated-softmax-an-undirected-topic-model.pdf
 21. Hopfield JJ (1982) Neural networks and physical systems with emergent collective computational abilities. Proc Natl Acad Sci 79(8):2554–2558. https://doi.org/10.1073/pnas.79.8.2554 (ISSN 0027-8424)
 22. Jiang B, Wu TY, Jin Y, Wong WH (2016) Convergence of contrastive divergence algorithm in exponential family. arXiv:1603.05729 (arXiv e-prints)
 23. Krizhevsky A (2009) Learning multiple layers of features from tiny images. Master’s Thesis. http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
 24. Le Roux N, Bengio Y (2008) Representational power of restricted Boltzmann machines and deep belief networks. Neural Comput 20(6):1631–1649
 25. Lee H, Ekanadham C, Ng AY (2008) Sparse deep belief net model for visual area V2. In: Platt JC, Koller D, Singer Y, Roweis ST (eds) Advances in neural information processing systems 20. Curran Associates, Inc., pp 873–880. http://papers.nips.cc/paper/3313-sparse-deep-belief-net-model-for-visual-area-v2.pdf
 26. Ma X, Wang X (2016) Average contrastive divergence for training restricted Boltzmann machines. Entropy 18(1):35
 27. MacKay DJC (2003) Information theory, inference, and learning algorithms, vol 7. Cambridge University Press, Cambridge
 28. Marlin BM, Swersky K, Chen B, de Freitas N (2010) Inductive principles for restricted Boltzmann machine learning. In: International conference on artificial intelligence and statistics, pp 509–516
 29. Martens J (2010) Deep learning via Hessian-free optimization. In: ICML
 30. Melchior J, Fischer A, Wiskott L (2016) How to center deep Boltzmann machines. J Mach Learn Res 17(99):1–61
 31. Montavon G, Müller K-R (2012) Deep Boltzmann machines and the centering trick. Springer, Berlin, pp 621–637. https://doi.org/10.1007/978-3-642-35289-8_33 (ISBN 9783642352898)
 32. Montufar G, Ay N (2011) Refinements of universal approximation results for deep belief networks and restricted Boltzmann machines. Neural Comput 23(5):1306–1319. https://doi.org/10.1162/NECO_a_00113 (ISSN 0899-7667)
 33. Montúfar G, Rauh J (2017) Hierarchical models as marginals of hierarchical models. Int J Approx Reason 88:531–546. https://doi.org/10.1016/j.ijar.2016.09.003. http://www.sciencedirect.com/science/article/pii/S0888613X16301414 (ISSN 0888-613X)
 34. Neal RM (2001) Annealed importance sampling. Stat Comput 11(2):125–139
 35. Nitanda A, Suzuki T (2017) Stochastic difference of convex algorithm and its application to training deep Boltzmann machines. In: Singh A, Zhu J (eds) Proceedings of the 20th international conference on artificial intelligence and statistics, vol 54 of Proceedings of machine learning research, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR, pp 470–478. http://proceedings.mlr.press/v54/nitanda17a.html
 36. Krause O, Fischer A, Igel C (2015) Population-contrastive-divergence: does consistency help with RBM training? CoRR. arXiv:1510.01624
 37. Ranzato M, Hinton GE (2010) Modeling pixel means and covariances using factorized third-order Boltzmann machines. In: 2010 IEEE computer society conference on computer vision and pattern recognition, pp 2551–2558. https://doi.org/10.1109/CVPR.2010.5539962
 38. Le Roux N, Manzagol PA, Bengio Y (2008) Topmoumoute online natural gradient algorithm. In: Platt JC, Koller D, Singer Y, Roweis ST (eds) Advances in neural information processing systems 20. Curran Associates, Inc., pp 849–856. http://papers.nips.cc/paper/3234-topmoumoute-online-natural-gradient-algorithm.pdf
 39. Salakhutdinov R, Mnih A, Hinton G (2007) Restricted Boltzmann machines for collaborative filtering. In: Proceedings of the 24th international conference on machine learning, ICML ’07, New York, NY, USA. ACM, pp 791–798. https://doi.org/10.1145/1273496.1273596 (ISBN 9781595937933)
 40. Schmah T, Hinton GE, Small SL, Strother S, Zemel RS (2009) Generative versus discriminative training of RBMs for classification of fMRI images. In: Koller D, Schuurmans D, Bengio Y, Bottou L (eds) Advances in neural information processing systems 21. Curran Associates, Inc., pp 1409–1416. http://papers.nips.cc/paper/3577-generative-versus-discriminative-training-of-rbms-for-classification-of-fmri-images.pdf
 41. Schulz H, Müller A, Behnke S (2010) Investigating convergence of restricted Boltzmann machine learning. In: NIPS 2010 workshop on deep learning and unsupervised feature learning
 42. Smolensky P (1986) Information processing in dynamical systems: foundations of harmony theory
 43. Sutskever I, Tieleman T (2010) On the convergence properties of contrastive divergence. In: International conference on artificial intelligence and statistics, pp 789–795
 44. Theis L, Gerwinn S, Sinz F, Bethge M (2011) In all likelihood, deep belief is not enough. J Mach Learn Res 12:3071–3096. http://dl.acm.org/citation.cfm?id=1953048.2078204 (ISSN 1532-4435)
 45. Tieleman T (2008) Training restricted Boltzmann machines using approximations to the likelihood gradient. In: Proceedings of the 25th international conference on machine learning. ACM, pp 1064–1071
 46. Tieleman T, Hinton G (2009) Using fast weights to improve persistent contrastive divergence. In: Proceedings of the 26th annual international conference on machine learning. ACM, pp 1033–1040
 47. Upadhya V, Sastry PS (2017) Learning RBM with a DC programming approach. In: Proceedings of the ninth Asian conference on machine learning, volume 77 of Proceedings of machine learning research. PMLR, 15–17 Nov 2017, pp 498–513
 48. Wang N, Melchior J, Wiskott L (2014) Gaussian-binary restricted Boltzmann machines on modeling natural image statistics. CoRR. arXiv:1401.5900
 49. Younes L (1989) Parametric inference for imperfectly observed Gibbsian fields. Prob Theory Relat Fields 82(4):625–645
 50. Younes L (1999) On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stoch Stoch Rep 65(3–4):177–228. https://doi.org/10.1080/17442509908834179
 51. Yuille AL (2006) The convergence of contrastive divergences. Department of Statistics, UCLA