A simultaneous perturbation weak derivative estimator for stochastic neural networks

  • Thomas Flynn
  • Felisa Vázquez-Abad
Original Paper


In this paper we study gradient estimation for a network of nonlinear stochastic units known as the Little model. Many machine learning systems can be described as networks of homogeneous units, and the Little model has a particularly general form that includes several popular machine learning architectures as special cases. However, because no closed-form solution for its stationary distribution is known, gradient methods that work for similar models, such as the Boltzmann machine or the sigmoid belief network, cannot be used. To address this, we introduce a method for calculating derivatives in this system based on measure-valued differentiation and simultaneous perturbation. This extends previous work, in which gradient estimation algorithms were presented only for networks with restrictive features such as symmetric or acyclic connectivity.
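To make the setting concrete, the following is a minimal sketch, not the estimator proposed in the paper. It assumes binary units in {0, 1} that are all resampled in parallel with sigmoid firing probabilities (one common formulation of the Little model), and it estimates the gradient of a long-run average cost with a plain two-sided simultaneous perturbation finite difference in the style of Spall (1992). The names `little_step`, `average_cost`, `spsa_gradient`, the cost function, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def little_step(x, W, b, rng):
    """One synchronous update: every unit is resampled in parallel.

    Assumption: unit i fires with probability sigmoid((W x + b)_i).
    """
    p = sigmoid(W @ x + b)
    return (rng.random(x.shape) < p).astype(float)

def average_cost(W, b, cost, steps, seed):
    """Time average of cost(x) along one simulated trajectory."""
    rng = np.random.default_rng(seed)
    x = np.zeros(b.shape)
    total = 0.0
    for _ in range(steps):
        x = little_step(x, W, b, rng)
        total += cost(x)
    return total / steps

def spsa_gradient(W, b, cost, c=0.1, steps=2000, seed=0):
    """Simultaneous perturbation estimate of d(average cost)/dW.

    All weights are perturbed at once along a random +/-1 direction,
    so two simulations suffice regardless of the number of weights.
    This is a finite-difference scheme, not a weak derivative estimator.
    """
    rng = np.random.default_rng(seed)
    delta = rng.choice([-1.0, 1.0], size=W.shape)
    # The same simulation seed is reused for both runs (common random numbers).
    j_plus = average_cost(W + c * delta, b, cost, steps, seed + 1)
    j_minus = average_cost(W - c * delta, b, cost, steps, seed + 1)
    return (j_plus - j_minus) / (2.0 * c * delta)
```

A call such as `spsa_gradient(np.zeros((3, 3)), np.zeros(3), lambda x: x.mean())` returns an array with the same shape as the weight matrix. The estimate is noisy and biased for finite `c` and finite `steps`, which is one reason to look for alternatives such as the measure-valued approach the abstract describes.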




Copyright information

© 2019. This is a U.S. Government work and not under copyright protection in the US; foreign copyright protection may apply.

Authors and Affiliations

  1. Computational Science Initiative, Brookhaven National Laboratory, Upton, USA
  2. Department of Computer Science, Hunter College, New York, USA
