
Part of the book series: NATO ASI Series F (volume 144)

Abstract

This paper surveys the historical basis of reinforcement learning and some of the current work from a computer scientist's point of view. It is an outgrowth of a number of talks given by the authors, including a NATO Advanced Study Institute and tutorials at AAAI '94 and Machine Learning '94. Reinforcement learning is a popular model of the learning problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. It has a strong family resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." It is appropriately thought of as a class of problems, rather than as a set of techniques. The paper addresses a variety of subproblems in reinforcement learning, including exploration vs. exploitation, learning from delayed reinforcement, learning and using models, generalization and hierarchy, and hidden state. It concludes with a survey of some practical systems and an assessment of the practical utility of current reinforcement-learning systems.
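Learning from delayed reinforcement, one of the subproblems listed above, is commonly illustrated with tabular Q-learning. The sketch below is a minimal, illustrative example only; the environment interface (reset/step/actions) and the parameter values are assumptions made for the example, not taken from the paper.

```python
# Minimal sketch of tabular Q-learning, a representative algorithm for
# learning from delayed reinforcement.  The `env` interface and the
# parameter values are illustrative assumptions, not from the paper.
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)] -> estimated long-run return
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy action choice: the exploration vs. exploitation
            # trade-off mentioned in the abstract.
            if random.random() < epsilon:
                action = random.choice(env.actions(state))
            else:
                action = max(env.actions(state), key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # One-step temporal-difference backup propagates delayed reward
            # information back through the state-action values.
            best_next = 0.0 if done else max(
                Q[(next_state, a)] for a in env.actions(next_state))
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```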




Copyright information

© 1995 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kaelbling, L.P., Littman, M.L., Moore, A.W. (1995). An Introduction to Reinforcement Learning. In: Steels, L. (ed.) The Biology and Technology of Intelligent Autonomous Agents. NATO ASI Series, vol 144. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-79629-6_5


  • DOI: https://doi.org/10.1007/978-3-642-79629-6_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-79631-9

  • Online ISBN: 978-3-642-79629-6

  • eBook Packages: Springer Book Archive
