Encyclopedia of Computational Neuroscience

2015 Edition
| Editors: Dieter Jaeger, Ranu Jung

Reinforcement Learning in Cortical Networks

  • Walter Senn
  • Jean-Pascal Pfister
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-6675-8_580



Reinforcement learning represents a basic paradigm of learning in artificial intelligence and biology. The paradigm considers an agent (robot, human, animal) that acts in a typically stochastic environment and receives rewards when reaching certain states. The agent’s goal is to maximize the expected reward by choosing the optimal action at any given state. In a cortical implementation, the states are defined by sensory stimuli that feed into a neuronal network, and after the network activity is settled, an action is read out. Learning consists in adapting the synaptic connection strengths into and within the neuronal network based on a (typically binary) feedback about the appropriateness of the chosen action. Policy gradient and temporal difference learning are two methods for deriving synaptic plasticity rules that maximize the expected reward in...

This is a preview of subscription content, log in to check access.



This work was supported by the Swiss National Science Foundation with grants 31003A_133094 and CRSII2_147636 to W.S and Grants PZ00P3_137200 and PP00P3_150637 to J.-P.P. We thank Robert Urbanczik and Johannes Friedrich for valuable comments on the manuscript.


  1. Baxter J, Bartlett P (2001) Infinite-horizon policy-gradient estimation. J Artif Intell Res 15:319–350Google Scholar
  2. Daw ND, O’Doherty JP, Dayan P, Seymour B, Dolan RJ (2006) Cortical substrates for exploratory decisions in humans. Nature 441:876–879PubMedCentralPubMedGoogle Scholar
  3. Dayan P, Niv Y (2008) Reinforcement learning: the good, the bad and the ugly. Curr Opin Neurobiol 18:185–196PubMedGoogle Scholar
  4. Di Castro D, Volkinshtein S, Meir R (2009) Temporal difference based actor critic learning: convergence and neural implementation. In: Advances in neural information processing systems, vol 21. MIT Press, Cambridge, MA, pp 385–392Google Scholar
  5. Fiete IR, Seung HS (2006) Gradient learning in spiking neural networks by dynamic perturbation of conductances. Phys Rev Lett 97:048104PubMedGoogle Scholar
  6. Florian RV (2007) Reinforcement learning through modulation of spike-timing-dependent synaptic plasticity. Neural Comput 19:1468–1502PubMedGoogle Scholar
  7. Frémaux N, Sprekeler H, Gerstner W (2010) Functional requirements for reward-modulated spike-timing-dependent plasticity. J Neurosci 30:13326–13337PubMedGoogle Scholar
  8. Frémaux N, Sprekeler H, Gerstner W (2013) Reinforcement learning using a continuous time actor-critic framework with spiking neurons. PLoS Comput Biol 9:e1003024PubMedCentralPubMedGoogle Scholar
  9. Friedrich J, Urbanczik R, Senn W (2011) Spatio-temporal credit assignment in neuronal population learning. PLoS Comput Biol 7:e1002092PubMedCentralPubMedGoogle Scholar
  10. Friedrich J, Urbanczik R, Senn W (2014) Code-specific learning rules improve action selection by populations of spiking neurons. Int J of Neural Syst 24:1–17Google Scholar
  11. Izhikevich EM (2007) Solving the distal reward problem through linkage of STDP and dopamine signaling. Cereb Cortex 17:2443–2452PubMedGoogle Scholar
  12. Kolodziejski C, Porr B, Worgotter F (2009) On the asymptotic equivalence between differential hebbian and temporal difference learning. Neural Comput 21:1173–1202PubMedGoogle Scholar
  13. Legenstein R, Pecevski D, Maass W (2008) A learning theory for reward-modulated spike-timing-dependent plasticity with application to biofeedback. PLoS Comput Biol 4:e1000180PubMedCentralPubMedGoogle Scholar
  14. Pfister J, Toyoizumi T, Barber D, Gerstner W (2006) Optimal spike-timing-dependent plasticity for precise action potential firing in supervised learning. Neural Comput 18:1318–1348PubMedGoogle Scholar
  15. Potjans W, Morrison A, Diesmann M (2009) A spiking neural network model of an actor-critic learning agent. Neural Comput 21:301–339PubMedGoogle Scholar
  16. Potjans W, Diesmann M, Morrison A (2011) An imperfect dopaminergic error signal can drive temporal-difference learning. PLoS Comput Biol 7:e1001133PubMedCentralPubMedGoogle Scholar
  17. Schultz W, Dayan P, Montague PR (1997) A neural substrate of prediction and reward. Science 275:1593–1599PubMedGoogle Scholar
  18. Seung HS (2003) Learning in spiking neural networks by reinforcement of stochastic synaptic transmission. Neuron 40:1063–1073PubMedGoogle Scholar
  19. Sjöström J, Gerstner W (2010) Spike-timing dependent plasticity. Scholarpedia 5:1362Google Scholar
  20. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. MIT Press, Cambridge, MAGoogle Scholar
  21. Tanimoto H, Heisenberg M, Gerber B (2004) Experimental psychology: event timing turns punishment to reward. Nature 430:983PubMedGoogle Scholar
  22. Urbanczik R, Senn W (2009) Reinforcement learning in populations of spiking neurons. Nat Neurosci 12:250–252PubMedGoogle Scholar
  23. Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn 8:229–256Google Scholar
  24. Wunderlich K, Dayan P, Dolan RJ (2012) Mapping value based planning and extensively trained choice in the human brain. Nat Neurosci 15:786–791PubMedCentralPubMedGoogle Scholar
  25. Xie X, Seung HS (2004) Learning in neural networks by reinforcement of irregular spiking. Phys Rev E Stat Nonlin Soft Matter Phys 69:041909PubMedGoogle Scholar

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. 1.Department of PhysiologyUniversity of BernBernSwitzerland
  2. 2.Theoretical Neuroscience Group, Institute of NeuroinformaticsUniversity of Zürich and ETH ZürichZürichSwitzerland