Part 4: Reinforcement learning: Machine learning and natural learning

Ishii, Shin; Yoshida, Wako

doi:10.1007/BF03037338

Tutorial Series on Brain-Inspired Computing
Published: 01 September 2006

Volume 24, pages 325–350, (2006)

New Generation Computing

Shin Ishii¹ &
Wako Yoshida¹

Abstract

The theory of reinforcement learning (RL) was originally motivated by animal learning of sequential behavior, but it has since been developed and extended in machine learning as an approach to Markov decision processes. Recently, a number of neuroscience studies have suggested a relationship between reward-related activity in the brain and the functions necessary for RL. With this history in mind, this article introduces the theory of RL, presents two engineering applications, and then discusses possible implementations in the brain.

Author information

Authors and Affiliations

Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, 630-0192, Nara, Japan
Shin Ishii & Wako Yoshida

Corresponding author

Correspondence to Shin Ishii.

Additional information

Shin Ishii, Ph.D.: He is a professor at the Graduate School of Information Science, Nara Institute of Science and Technology. He received his B.E. in 1986, M.E. in 1988, and Ph.D. in 1987 from the University of Tokyo. His current research interests are computational neuroscience, systems neurobiology, and statistical learning theory.

Wako Yoshida, Ph.D.: She is a researcher at the Graduate School of Information Science, Nara Institute of Science and Technology. She received her B.A. in 1998 from Kobe College, and her M.E. in 2000 and Ph.D. in 2003, both from Nara Institute of Science and Technology. Her research interests include theoretical and experimental approaches to human decision-making processes through learning, memory, and communication.

About this article

Cite this article

Ishii, S., Yoshida, W. Part 4: Reinforcement learning: Machine learning and natural learning. New Gener Comput 24, 325–350 (2006). https://doi.org/10.1007/BF03037338

Received: 29 October 2005
Revised: 28 February 2006
Published: 01 September 2006
Issue Date: September 2006
DOI: https://doi.org/10.1007/BF03037338
