Temporal Difference Learning: A Chemical Process Control Application
Learning to control can be considered a trial-and-error process in which the controlling agent explores the consequences of various actions. Actions that produce good results are reinforced, while those that produce bad results are suppressed. Eventually, the best control actions dominate all others, yielding an optimal solution to the control problem. Central to this approach is an appropriate performance measure, or reinforcement function, that can distinguish good from bad consequences among possible control actions. Often, the control objective is specified as an operating setpoint, suggesting a simple reinforcement function based on distance to the setpoint: actions that move the state closer to the setpoint receive relatively higher reinforcement, while actions that move it further away receive relatively lower reinforcement.

Control of dynamical systems is complicated by time lags between control actions and their eventual consequences. In such systems, it is sometimes undesirable to move too rapidly toward the setpoint: because of these lags, it may be impossible to slow down in time once the controlled variable is moving quickly toward the setpoint, and the result is overshoot. A controller that relies on a reinforcement function based only on distance to the setpoint may never learn to control at all. Instead, it will approach the setpoint from one side, overshoot, approach from the other side, overshoot again, and oscillate forever. The problem is that such a reinforcement function considers only the local, short-term consequences of the controller's actions, whereas we really want the controller to choose actions based on their long-term consequences.
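The idea can be illustrated with a minimal sketch of temporal-difference prediction. The setpoint value, toy first-order dynamics, and state discretization below are hypothetical illustrations, not the paper's process model: a TD(0) learner estimates the long-term (discounted) reinforcement of each state under a distance-to-setpoint reward, rather than judging actions by their immediate distance alone.

```python
# Hypothetical illustration, not the paper's setup: TD(0) prediction of
# long-term reinforcement for a discretized controlled variable in [0, 1].
SETPOINT = 0.5   # assumed operating setpoint
N_STATES = 11    # discrete levels of the controlled variable
GAMMA = 0.9      # discount factor: weights long-term consequences
ALPHA = 0.1      # learning rate

def reward(x):
    """Distance-to-setpoint reinforcement: higher (closer to 0) near the setpoint."""
    return -abs(x - SETPOINT)

def state_of(x):
    """Map a continuous level in [0, 1] to a discrete state index."""
    return min(N_STATES - 1, max(0, round(x * (N_STATES - 1))))

def td0_update(V, s, r, s_next):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma * V(s_next)."""
    V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])

# Simulate a toy first-order process relaxing toward the setpoint and learn V.
V = [0.0] * N_STATES
x = 0.0
for _ in range(2000):
    x_next = x + 0.2 * (SETPOINT - x)   # assumed lag dynamics, for illustration only
    td0_update(V, state_of(x), reward(x_next), state_of(x_next))
    x = x_next
    if abs(x - SETPOINT) < 1e-6:
        x = 0.0                         # restart the episode from the initial level
```

After training, the learned values V rank states by their long-term prospects: states near the setpoint carry higher predicted reinforcement than distant ones, which is the kind of long-range evaluation a purely local distance measure cannot provide.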
Keywords: Control Action, Radial Basis Function, Internal Model, Reinforcement Function, World Model