Reinforcement Learning in Cortical Networks
Reinforcement learning is a fundamental paradigm of learning in artificial intelligence and biology. The paradigm considers an agent (robot, human, animal) that acts in a typically stochastic environment and receives rewards when reaching certain states. The agent's goal is to maximize the expected reward by choosing the optimal action in any given state. In a cortical implementation, the states are defined by sensory stimuli that feed into a neuronal network, and once the network activity has settled, an action is read out. Learning consists of adapting the synaptic connection strengths into and within the neuronal network based on (typically binary) feedback about the appropriateness of the chosen action. Policy gradient and temporal difference learning are two methods for deriving synaptic plasticity rules that maximize the expected reward in such networks.
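To make the policy-gradient approach concrete, the following is a minimal sketch (not the article's derivation) of a REINFORCE-type update, Δw = η R ∇_W log π(a|s), applied to a one-layer stochastic network on a toy contextual-bandit task. The environment, the one-hot stimulus coding, the softmax action readout, and all parameter values are illustrative assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 4, 2
W = np.zeros((n_actions, n_states))  # "synaptic" weights: stimulus units -> action units
eta = 0.1                            # learning rate (assumed value)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reward(state, action):
    # Hypothetical toy environment (not from the article): binary reward
    # when the chosen action matches the parity of the state index.
    return 1.0 if action == state % 2 else 0.0

for trial in range(5000):
    s = rng.integers(n_states)
    x = np.zeros(n_states)
    x[s] = 1.0                       # one-hot sensory stimulus defining the state
    pi = softmax(W @ x)              # stochastic action-selection policy pi(a|s)
    a = rng.choice(n_actions, p=pi)  # sample an action from the policy
    R = reward(s, a)                 # binary feedback on the chosen action
    # REINFORCE gradient of log pi(a|s) for a softmax readout:
    # d log pi(a|s) / dW = (onehot(a) - pi) x^T
    grad = (np.eye(n_actions)[a] - pi)[:, None] * x[None, :]
    W += eta * R * grad              # reward-modulated, Hebbian-like weight update

# After learning, the greedy action should match the rewarded one in each state.
print([int(np.argmax(W @ np.eye(n_states)[s])) for s in range(n_states)])
```

A temporal difference method would additionally learn a value function and replace the raw reward R by a reward-prediction error, which reduces the variance of the update; the plasticity rule keeps the same reward-modulated, Hebbian-like form.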
This work was supported by the Swiss National Science Foundation with grants 31003A_133094 and CRSII2_147636 to W.S. and grants PZ00P3_137200 and PP00P3_150637 to J.-P.P. We thank Robert Urbanczik and Johannes Friedrich for valuable comments on the manuscript.