
Part 4: Reinforcement learning: Machine learning and natural learning

  • Tutorial Series on Brain-Inspired Computing

Abstract

The theory of reinforcement learning (RL) was originally motivated by animal learning of sequential behavior, but it has been developed and extended in the field of machine learning as an approach to Markov decision processes. Recently, a number of neuroscience studies have suggested a relationship between reward-related activities in the brain and functions necessary for RL. In this article, we review the history of RL, introduce its theory, and present two engineering applications. We then discuss possible implementations of RL in the brain.
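
To make the framework concrete, here is a minimal sketch of tabular Q-learning, a standard temporal-difference (TD) method for Markov decision processes. It is purely illustrative and not drawn from the article: the toy chain environment, the placement of the reward, and the hyper-parameter values are all assumptions chosen for brevity.

```python
# Minimal sketch: tabular Q-learning with an epsilon-greedy policy on a toy
# five-state chain. Illustrative only; the environment, reward placement,
# and hyper-parameters are assumptions, not the article's implementation.
import random

N_STATES = 5            # states 0..4; state 4 is terminal and rewarding
ACTIONS = (-1, +1)      # move left or right along the chain
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Deterministic chain dynamics: reward 1 only on reaching the last state."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

for episode in range(500):
    s, done = 0, False
    while not done:
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)                       # explore
        else:
            best = max(Q[(s, act)] for act in ACTIONS)       # exploit,
            a = random.choice([act for act in ACTIONS        # ties broken
                               if Q[(s, act)] == best])      # at random
        s2, r, done = step(s, a)
        # TD error: discrepancy between received and predicted reward
        best_next = 0.0 if done else max(Q[(s2, act)] for act in ACTIONS)
        td_error = r + GAMMA * best_next - Q[(s, a)]
        Q[(s, a)] += ALPHA * td_error
        s = s2

# After training, the greedy policy moves right everywhere on the chain.
print({s: max(Q[(s, a)] for a in ACTIONS) for s in range(N_STATES)})
```

The TD error computed inside the loop is the kind of reward-prediction signal that the neuroscience studies mentioned above relate to reward-related activity in the brain.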

Author information

Corresponding author

Correspondence to Shin Ishii.

Additional information

Shin Ishii, Ph.D.: He is a professor at the Graduate School of Information Science, Nara Institute of Science and Technology. He received his B.E. in 1986, M.E. in 1988, and Ph.D. in 1997 from the University of Tokyo. His current research interests are computational neuroscience, systems neurobiology, and statistical learning theory.

Wako Yoshida, Ph.D.: She is a researcher at the Graduate School of Information Science, Nara Institute of Science and Technology. She received her B.A. in 1998 from Kobe College, and her M.E. in 2000 and Ph.D. in 2003, both from Nara Institute of Science and Technology. Her research interests include theoretical and experimental approaches to human decision-making processes through learning, memory, and communication.

About this article

Cite this article

Ishii, S., Yoshida, W. Part 4: Reinforcement learning: Machine learning and natural learning. New Gener Comput 24, 325–350 (2006). https://doi.org/10.1007/BF03037338
