Reinforcement Learning of Optimal Controls

As humans, we continually interpret sensory input to make sense of the world around us; that is, we develop mappings from observations to a useful estimate of the “environmental state”. A number of artificial intelligence methods for producing such mappings are described in this book, along with applications showing how they may be used to better understand a physical phenomenon or contribute to a decision support system. However, we do not simply want to understand the world around us; rather, we interact with it to accomplish certain goals, such as obtaining food, water, warmth, shelter, status or wealth. Learning how to accurately estimate the state of our environment is intimately tied to how we then use that knowledge to manipulate it. Our actions change the environmental state and generate positive or negative feedback, which we evaluate and use to inform our future behavior in a continuing cycle of observation, action, environmental change and feedback.

In the field of machine learning, this common human experience is abstracted to that of a “learning agent” whose purpose is to discover, through interacting with its environment, how to act to achieve its goals. In general, no teacher is available to supply correct actions, nor is feedback always immediate. Instead, the learner must use the sequence of experiences resulting from its actions to determine which actions to repeat and which to avoid. In doing so, it must be able to assign credit or blame to actions that may be long past, and it must balance the exploitation of knowledge previously gained against the need to explore untried, possibly superior strategies. Reinforcement learning, also called approximate or neuro-dynamic programming, is the area of machine learning devoted to solving this general learning problem. Although the term “reinforcement learning” has traditionally been used in a number of contexts, the modern field is the result of a synthesis in the 1980s of ideas from optimal control theory, animal learning, and temporal difference methods from artificial intelligence.

Finding a mapping that prescribes actions based on measured environmental states in a way that optimizes some long-term measure of success is the subject of what mathematicians and engineers call “optimal control” problems and psychologists call “planning” problems. There is a deep body of mathematical literature on optimal control theory describing how to analyze a system and develop optimal mappings. In many applications, however, the system is poorly understood, complex, difficult to analyze mathematically, or changing in time. In such cases, a machine learning approach that learns a good control strategy from real or simulated experience may be the only practical option (Si et al. 2004).
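
The agent-environment loop described above can be made concrete by tabular Q-learning (Watkins 1989; Watkins and Dayan 1992), sketched below in Python for illustration. This is a minimal sketch, not code from this chapter: the environment object and its reset()/step()/actions interface are assumed placeholders, and the learning-rate, discount and exploration parameters are arbitrary.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Learn a table of action values Q(s, a) from simulated experience.

    `env` is a hypothetical episodic environment exposing `actions` (a list),
    `reset()` -> initial state, and `step(action)` -> (next_state, reward, done).
    """
    Q = defaultdict(float)  # Q[(state, action)] -> estimated long-term return

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Balance exploiting knowledge already gained with exploring
            # untried, possibly superior actions (epsilon-greedy selection).
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])

            # Act, then observe the resulting state change and scalar feedback.
            next_state, reward, done = env.step(action)

            # Temporal-difference update: credit the action just taken with the
            # immediate reward plus the discounted value of the best follow-on.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

            state = next_state
    return Q
```

The greedy policy read off the learned table, choosing in each state the action with the largest Q value, is the agent's estimate of the optimal mapping from states to actions.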

References

  • Atlas, D. (1982). Adaptively pointing spaceborne radar for precipitation measurements. Journal of Applied Meteorology, 21, 429–443.
  • Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In A. Prieditis & S. J. Russell (Eds.), Proceedings of the Twelfth International Conference on Machine Learning (pp. 30–37). 9–12 July 1995, Tahoe City, CA. San Francisco: Morgan Kaufmann.
  • Baxter, J., & Bartlett, P. L. (2000). Reinforcement learning in POMDPs via direct gradient ascent. In Proceedings of the 17th International Conference on Machine Learning (pp. 41–48). 29 June–2 July 2000, Stanford, CA. San Francisco: Morgan Kaufmann.
  • Bellman, R. E. (1957). Dynamic programming (342 pp.). Princeton, NJ: Princeton University Press.
  • Bertsekas, D. P. (1995). Dynamic programming and optimal control (Vols. 1–2, 387 pp. and 292 pp.). Belmont, MA: Athena Scientific.
  • Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming (491 pp.). Belmont, MA: Athena Scientific.
  • Bertsimas, D., & Patterson, S. S. (1998). The air traffic flow management problem with enroute capacities. Operations Research, 46, 406–422.
  • Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 183–188). 12–16 July 1992, San Jose, CA. Menlo Park, CA: AAAI Press.
  • Dayan, P., & Sejnowski, T. (1994). TD(λ) converges with probability 1. Machine Learning, 14, 295–301.
  • Evans, J. E., Weber, M. E., & Moser, W. R. (2006). Integrating advanced weather forecast technologies into air traffic management decision support. Lincoln Laboratory Journal, 16, 81–96.
  • Hamilton, W. R. (1835). Second essay on a general method in dynamics. Philosophical Transactions of the Royal Society, Part I for 1835, 95–144.
  • Jaakkola, T., Jordan, M., & Singh, S. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6, 1185–1201.
  • Jaakkola, T., Singh, S., & Jordan, M. (1995). Reinforcement learning algorithm for partially observable Markov decision problems. In G. Tesauro, D. S. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems: Proceedings of the 1994 Conference (pp. 345–352). Cambridge, MA: MIT Press.
  • Joint Planning and Development Office (JPDO). (2006). Next generation air transportation system (NGATS): Weather concept of operations (30 pp.). Washington, DC: Weather Integration Product Team.
  • Krozel, J., Andre, A. D., & Smith, P. (2006). Future air traffic management requirements for dynamic weather avoidance routing. In Preprints, 25th Digital Avionics Systems Conference (pp. 1–9). October 2006, Portland, OR: IEEE/AIAA.
  • Kushner, H. J., & Yin, G. G. (1997). Stochastic approximation algorithms and applications (417 pp.). New York: Springer.
  • Lovejoy, W. S. (1991). A survey of algorithmic methods for partially observable Markov decision processes. Annals of Operations Research, 28, 47–66.
  • McLaughlin, D. J., Chandrasekar, V., Droegemeier, K., Frasier, S., Kurose, J., Junyent, F., et al. (2005). Distributed Collaborative Adaptive Sensing (DCAS) for improved detection, understanding, and prediction of atmospheric hazards. In Preprints (CD), AMS Ninth Symposium on Integrated Observing and Assimilation Systems for the Atmosphere, Oceans, and Land Surface, Paper 11.3. 10–13 January 2005, San Diego, CA.
  • Myers, W. L. (2000). Effects of visual representations of dynamic hazard worlds on human navigational performance. Ph.D. thesis, Department of Computer Science, University of Colorado, 64 pp.
  • Peng, J., & Williams, R. J. (1996). Incremental multi-step Q-learning. Machine Learning, 22, 283–290.
  • Precup, D., Sutton, R. S., & Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In C. E. Brodley & A. P. Danyluk (Eds.), Proceedings of the 18th International Conference on Machine Learning (pp. 417–424). 28 June–1 July 2001, Williamstown, MA. San Francisco, CA: Morgan Kaufmann.
  • Puterman, M. L. (2005). Markov decision processes: Discrete stochastic dynamic programming (649 pp.). Hoboken, NJ: Wiley-Interscience.
  • Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400–407.
  • Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3, 211–229.
  • Si, J., Barto, A. G., Powell, W. B., & Wunsch, D. (Eds.). (2004). Handbook of learning and approximate dynamic programming (644 pp.). Piscataway, NJ: Wiley-Interscience.
  • Singh, S. P., & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22, 123–158.
  • Singh, S. P., Jaakkola, T., Littman, M. L., & Szepesvári, C. (2000). Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38, 287–308.
  • Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction (322 pp.). Cambridge, MA: MIT Press.
  • Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, 42, 241–267.
  • Tsitsiklis, J. N. (2002). On the convergence of optimistic policy iteration. Journal of Machine Learning Research, 3, 59–72.
  • Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42, 674–690.
  • Turing, A. M. (1948). Intelligent machinery. National Physical Laboratory report. Reprinted in D. C. Ince (Ed.) (1992), Collected works of A. M. Turing: Mechanical intelligence (227 pp.). New York: Elsevier Science.
  • Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59, 433–460.
  • Watkins, C. J. C. H. (1989). Learning from delayed rewards. Ph.D. thesis, King's College, Cambridge University, Cambridge, 234 pp.
  • Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292.
  • Williams, J. K. (2000). On the convergence of model-free policy iteration algorithms for reinforcement learning: Stochastic approximation under discontinuous mean dynamics. Ph.D. thesis, Department of Mathematics, University of Colorado, 173 pp.
  • Williams, J. K., & Singh, S. (1999). Experimental results on learning stochastic memoryless policies for partially observable Markov decision processes. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems 11: Proceedings of the 1998 Conference (pp. 1073–1079). Cambridge, MA: MIT Press.

Author information

Correspondence to John K. Williams.

Copyright information

© 2009 Springer Science+Business Media B.V.

Cite this chapter

Williams, J.K. (2009). Reinforcement Learning of Optimal Controls. In: Haupt, S.E., Pasini, A., Marzban, C. (eds) Artificial Intelligence Methods in the Environmental Sciences. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-9119-3_15
