New Generation Computing, Volume 24, Issue 3, pp 325–350

Part 4: Reinforcement learning: Machine learning and natural learning

  • Shin Ishii
  • Wako Yoshida
Tutorial Series on Brain-Inspired Computing


The theory of reinforcement learning (RL) was originally motivated by animal learning of sequential behavior, but it has been developed and extended in the field of machine learning as an approach to Markov decision processes. Recently, a number of neuroscience studies have suggested a relationship between reward-related activities in the brain and the functions necessary for RL. With this history in mind, this article introduces the theory of RL and presents two engineering applications, then discusses possible implementations in the brain.


Keywords: Reinforcement Learning, Temporal Difference, Actor-critic, Reward System, Dopamine





Copyright information

© Ohmsha, Ltd. and Springer 2006

Authors and Affiliations

  1. Nara Institute of Science and Technology, Nara, Japan
