Temporal Difference Learning: A Chemical Process Control Application

  • Chapter
Applications of Neural Networks

Abstract

Learning to control can be considered a trial-and-error process in which the controlling agent explores the consequences of various actions. Actions that produce good results are reinforced, while those that produce bad results are suppressed. Eventually, the best control actions become dominant over all others, yielding an optimal solution to the control problem. Central to this approach is an appropriate performance measure, or reinforcement function, that can distinguish good from bad consequences among possible control actions. Often, the control objective is specified as an operating setpoint, suggesting a simple reinforcement function based on distance to the setpoint: actions that move the state closer to the setpoint receive relatively higher reinforcement, while those that move it farther away receive relatively lower reinforcement. Control of dynamical systems is complicated by time lags between control actions and their eventual consequences. In such systems it is sometimes undesirable to move too rapidly toward the setpoint; because of the lag between actions and consequences, it may be impossible to slow down in time once the controlled variable is moving rapidly toward the setpoint, and the result is overshoot. A controller that relies on a reinforcement function based only on distance to the setpoint may never learn to control at all. Rather, it will approach the setpoint from one side, overshoot, approach from the other side, overshoot again, and thus oscillate forever. The problem is that such a reinforcement function considers only the local, short-term consequences of the controller's actions, whereas we really want the controller to choose actions based on their long-term consequences.
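To make the distinction concrete, the sketch below contrasts a purely local, distance-to-setpoint reinforcement signal with a tabular TD(0) value update, in which a delayed consequence such as overshoot is propagated back to earlier states through the discounted value of the successor state. This is a minimal illustration, not the authors' controller: the setpoint, state range, discretization, learning rate, and discount factor are all assumptions made for the example.

```python
import numpy as np

# Minimal sketch, not the authors' implementation: a local distance-to-setpoint
# reinforcement signal plus a tabular TD(0) value update. The setpoint, state
# range, bin count, learning rate, and discount factor are illustrative assumptions.

SETPOINT = 1.0   # desired operating point of the controlled variable (assumed)
ALPHA = 0.1      # learning rate (assumed)
GAMMA = 0.95     # discount factor weighting long-term consequences (assumed)
N_BINS = 50      # discretization of the controlled variable (assumed)

values = np.zeros(N_BINS)  # estimated long-term reinforcement for each state


def reinforcement(x):
    """Local reinforcement based only on distance to the setpoint:
    states closer to the setpoint receive higher values."""
    return -abs(x - SETPOINT)


def state_index(x, lo=0.0, hi=2.0):
    """Map the continuous controlled variable onto a discrete state bin."""
    frac = (np.clip(x, lo, hi) - lo) / (hi - lo)
    return min(int(frac * N_BINS), N_BINS - 1)


def td0_update(x, x_next):
    """One TD(0) step: move the value of the current state toward the
    immediate reinforcement plus the discounted value of the successor,
    so that delayed consequences (e.g. overshoot) propagate back to the
    states and actions that caused them."""
    s, s_next = state_index(x), state_index(x_next)
    target = reinforcement(x_next) + GAMMA * values[s_next]
    values[s] += ALPHA * (target - values[s])
    return values[s]
```

A controller that ranks candidate actions by the learned value of their predicted successor states, rather than by the immediate distance-based reinforcement alone, can in principle learn to slow down before reaching the setpoint instead of oscillating around it.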

Copyright information

© 1995 Springer Science+Business Media New York

About this chapter

Cite this chapter

Miller, S., Williams, R.J. (1995). Temporal Difference Learning: A Chemical Process Control Application. In: Murray, A.F. (eds) Applications of Neural Networks. Springer, Boston, MA. https://doi.org/10.1007/978-1-4757-2379-3_12

  • DOI: https://doi.org/10.1007/978-1-4757-2379-3_12

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4419-5140-3

  • Online ISBN: 978-1-4757-2379-3
