Reinforcement Learning of Optimal Controls

As humans, we continually interpret sensory input to make sense of the world around us; that is, we develop mappings from observations to a useful estimate of the “environmental state”. A number of artificial intelligence methods for producing such mappings are described in this book, along with applications showing how they may be used to better understand a physical phenomenon or contribute to a decision support system. However, we do not simply want to understand the world around us; rather, we interact with it to accomplish certain goals, such as obtaining food, water, warmth, shelter, status or wealth. Learning how to accurately estimate the state of our environment is intimately tied to how we then use that knowledge to manipulate it. Our actions change the environmental state and generate positive or negative feedback, which we evaluate and use to inform our future behavior in a continuing cycle of observation, action, environmental change and feedback.

In the field of machine learning, this common human experience is abstracted to that of a “learning agent” whose purpose is to discover, through interacting with its environment, how to act to achieve its goals. In general, no teacher is available to supply correct actions, nor is feedback always immediate. Instead, the learner must use the sequence of experiences resulting from its actions to determine which actions to repeat and which to avoid. In doing so, it must be able to assign credit or blame to actions that may be long past, and it must balance the exploitation of knowledge previously gained against the need to explore untried, possibly superior strategies. Reinforcement learning, also called approximate or neuro-dynamic programming, is the area of machine learning devoted to solving this general learning problem. Although the term “reinforcement learning” has traditionally been used in a number of contexts, the modern field is the result of a synthesis in the 1980s of ideas from optimal control theory, animal learning, and temporal difference methods from artificial intelligence.

Finding a mapping that prescribes actions based on measured environmental states in a way that optimizes some long-term measure of success is the subject of what mathematicians and engineers call “optimal control” problems and psychologists call “planning” problems. There is a deep body of mathematical literature on optimal control theory describing how to analyze a system and develop optimal mappings. In many applications, however, the system is poorly understood, complex, difficult to analyze mathematically, or changing in time. In such cases, a machine learning approach that learns a good control strategy from real or simulated experience may be the only practical option (Si et al. 2004).
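
The agent-environment loop described above can be made concrete by tabular Q-learning (Watkins 1989; Watkins and Dayan 1992), sketched below in Python for illustration. This is a minimal sketch, not code from this chapter: the environment object and its reset()/step()/actions interface are assumed placeholders, and the learning-rate, discount and exploration parameters are arbitrary.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Learn a table of action values Q(s, a) from simulated experience.

    `env` is a hypothetical episodic environment exposing `actions` (a list),
    `reset()` -> initial state, and `step(action)` -> (next_state, reward, done).
    """
    Q = defaultdict(float)  # Q[(state, action)] -> estimated long-term return

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Balance exploiting knowledge already gained with exploring
            # untried, possibly superior actions (epsilon-greedy selection).
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])

            # Act, then observe the resulting state change and scalar feedback.
            next_state, reward, done = env.step(action)

            # Temporal-difference update: credit the action just taken with the
            # immediate reward plus the discounted value of the best follow-on.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

            state = next_state
    return Q
```

The greedy policy read off the learned table, choosing in each state the action with the largest Q value, is the agent's estimate of the optimal mapping from states to actions.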

References

  • Atlas, D. (1982). Adaptively pointing spaceborne radar for precipitation measurements. Journal of Applied Meteorology, 21, 429–443.
  • Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In A. Prieditis & S. J. Russell (Eds.), Proceedings of the Twelfth International Conference on Machine Learning (pp. 30–37). 9–12 July 1995, Tahoe City, CA. San Francisco: Morgan Kaufmann.
  • Baxter, J., & Bartlett, P. L. (2000). Reinforcement learning in POMDPs via direct gradient ascent. In Proceedings of the 17th International Conference on Machine Learning (pp. 41–48). 29 June–2 July 2000, Stanford, CA. San Francisco: Morgan Kaufmann.
  • Bellman, R. E. (1957). Dynamic programming (342 pp.). Princeton, NJ: Princeton University Press.
  • Bertsekas, D. P. (1995). Dynamic programming and optimal control (Vols. 1–2, 387 pp. and 292 pp.). Belmont, MA: Athena Scientific.
  • Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming (491 pp.). Belmont, MA: Athena Scientific.
  • Bertsimas, D., & Patterson, S. S. (1998). The air traffic flow management problem with enroute capacities. Operations Research, 46, 406–422.
  • Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 183–188). 12–16 July 1992, San Jose, CA. Menlo Park, CA: AAAI Press.
  • Dayan, P., & Sejnowski, T. (1994). TD(λ) converges with probability 1. Machine Learning, 14, 295–301.
  • Evans, J. E., Weber, M. E., & Moser, W. R. (2006). Integrating advanced weather forecast technologies into air traffic management decision support. Lincoln Laboratory Journal, 16, 81–96.
  • Hamilton, W. R. (1835). Second essay on a general method in dynamics. Philosophical Transactions of the Royal Society, Part I for 1835, 95–144.
  • Jaakkola, T., Jordan, M., & Singh, S. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6, 1185–1201.
  • Jaakkola, T., Singh, S., & Jordan, M. (1995). Reinforcement learning algorithm for partially observable Markov decision problems. In G. Tesauro, D. S. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems: Proceedings of the 1994 Conference (pp. 345–352). Cambridge, MA: MIT Press.
  • Joint Planning and Development Office (JPDO). (2006). Next generation air transportation system (NGATS): Weather concept of operations (30 pp.). Washington, DC: Weather Integration Product Team.
  • Krozel, J., Andre, A. D., & Smith, P. (2006). Future air traffic management requirements for dynamic weather avoidance routing. In Preprints, 25th Digital Avionics Systems Conference (pp. 1–9). October 2006, Portland, OR: IEEE/AIAA.
  • Kushner, H. J., & Yin, G. G. (1997). Stochastic approximation algorithms and applications (417 pp.). New York: Springer.
  • Lovejoy, W. S. (1991). A survey of algorithmic methods for partially observable Markov decision processes. Annals of Operations Research, 28, 47–66.
  • McLaughlin, D. J., Chandrasekar, V., Droegemeier, K., Frasier, S., Kurose, J., Junyent, F., et al. (2005). Distributed Collaborative Adaptive Sensing (DCAS) for improved detection, understanding, and prediction of atmospheric hazards. In Preprints (CD), AMS Ninth Symposium on Integrated Observing and Assimilation Systems for the Atmosphere, Oceans, and Land Surface, Paper 11.3. 10–13 January 2005, San Diego, CA.
  • Myers, W. L. (2000). Effects of visual representations of dynamic hazard worlds on human navigational performance. Ph.D. thesis, Department of Computer Science, University of Colorado, 64 pp.
  • Peng, J., & Williams, R. J. (1996). Incremental multi-step Q-learning. Machine Learning, 22, 283–290.
  • Precup, D., Sutton, R. S., & Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In C. E. Brodley & A. P. Danyluk (Eds.), Proceedings of the 18th International Conference on Machine Learning (pp. 417–424). 28 June–1 July 2001, Williamstown, MA. San Francisco, CA: Morgan Kaufmann.
  • Puterman, M. L. (2005). Markov decision processes: Discrete stochastic dynamic programming (649 pp.). Hoboken, NJ: Wiley-Interscience.
  • Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400–407.
  • Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3, 211–229.
  • Si, J., Barto, A. G., Powell, W. B., & Wunsch, D. (Eds.). (2004). Handbook of learning and approximate dynamic programming (644 pp.). Piscataway, NJ: Wiley-Interscience.
  • Singh, S. P., & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22, 123–158.
  • Singh, S. P., Jaakkola, T., Littman, M. L., & Szepesvári, C. (2000). Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38, 287–308.
  • Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction (322 pp.). Cambridge, MA: MIT Press.
  • Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, 42, 241–267.
  • Tsitsiklis, J. N. (2002). On the convergence of optimistic policy iteration. Journal of Machine Learning Research, 3, 59–72.
  • Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42, 674–690.
  • Turing, A. M. (1948). Intelligent machinery. National Physical Laboratory report. Reprinted in D. C. Ince (Ed.) (1992), Collected works of A. M. Turing: Mechanical intelligence (227 pp.). New York: Elsevier Science.
  • Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59, 433–460.
  • Watkins, C. J. C. H. (1989). Learning from delayed rewards. Ph.D. thesis, King's College, Cambridge University, Cambridge, 234 pp.
  • Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292.
  • Williams, J. K. (2000). On the convergence of model-free policy iteration algorithms for reinforcement learning: Stochastic approximation under discontinuous mean dynamics. Ph.D. thesis, Department of Mathematics, University of Colorado, 173 pp.
  • Williams, J. K., & Singh, S. (1999). Experimental results on learning stochastic memoryless policies for partially observable Markov decision processes. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems 11: Proceedings of the 1998 Conference (pp. 1073–1079). Cambridge, MA: MIT Press.

Author information

Correspondence to John K. Williams.

Copyright information

© 2009 Springer Science+Business Media B.V.

Cite this chapter

Williams, J.K. (2009). Reinforcement Learning of Optimal Controls. In: Haupt, S.E., Pasini, A., Marzban, C. (eds) Artificial Intelligence Methods in the Environmental Sciences. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-9119-3_15
