Temporal Difference Learning: A Chemical Process Control Application

  • Scott Miller
  • Ronald J. Williams


Learning to control can be viewed as a trial-and-error process in which the controlling agent explores the consequences of various actions. Actions that produce good results are reinforced, while those that produce bad results are suppressed. Eventually the best control actions come to dominate all others, yielding an optimal solution to the control problem. Central to this approach is an appropriate performance measure, or reinforcement function, that can distinguish good from bad consequences among the possible control actions. Often the control objective is specified as an operating setpoint, which suggests a simple reinforcement function based on distance to the setpoint: actions that lead to states closer to the setpoint receive relatively higher reinforcement values, and actions that lead to states farther from it receive relatively lower ones.

Control of dynamical systems is complicated by time lags between control actions and their eventual consequences. In such systems it is sometimes undesirable to move too rapidly toward the setpoint: because of the lag, it may be impossible to slow down in time once the controlled variable is moving rapidly, and the result is overshoot. A controller that relies on a reinforcement function based only on distance to the setpoint may never learn to control at all. Instead, it approaches the setpoint from one side, overshoots, approaches from the other side, overshoots again, and so oscillates indefinitely. The problem is that such a reinforcement function considers only the local, short-term consequences of the controller’s actions, whereas we want the controller to choose actions based on their long-term consequences.
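The overshoot argument can be made concrete with a small sketch. Everything in it is an assumption introduced for illustration and is not taken from the chapter: the plant is a toy double integrator (the action changes a velocity, and the velocity changes the controlled variable) rather than a chemical process, and the controller is a hand-coded myopic rule that always pushes toward the setpoint, standing in for a controller shaped only by a distance-to-setpoint reinforcement. No learning takes place here.

```python
def reinforcement(x, setpoint):
    """Distance-to-setpoint reinforcement: values closer to 0 are better."""
    return -abs(x - setpoint)


def step(x, v, a, dt=0.1):
    """Hypothetical lagged plant: the action a changes the velocity v,
    and the velocity in turn changes the controlled variable x."""
    v_next = v + a * dt
    x_next = x + v_next * dt
    return x_next, v_next


def myopic_action(x, setpoint):
    """Assumed short-sighted rule: always push toward the setpoint,
    ignoring how fast x is already moving."""
    return 1.0 if x < setpoint else -1.0


def simulate(x0=0.0, setpoint=1.0, n_steps=400):
    """Run the myopic rule and record the signed error after each step."""
    x, v = x0, 0.0
    errors = []
    for _ in range(n_steps):
        a = myopic_action(x, setpoint)
        x, v = step(x, v, a)
        errors.append(x - setpoint)
    return errors


def count_crossings(errors):
    """Count sign changes of the error, i.e. passes through the setpoint."""
    return sum(1 for e0, e1 in zip(errors, errors[1:]) if e0 * e1 < 0)


errors = simulate()
print("setpoint crossings:", count_crossings(errors))
print("largest error in last 100 steps:", max(abs(e) for e in errors[-100:]))
```

In this toy setting the controlled variable keeps passing through the setpoint, and the late errors remain about as large as the initial one, because the rule never begins to brake before the setpoint is reached. A rule that also took the current velocity into account could slow down in time, which is the kind of long-term consideration the abstract argues the reinforcement signal should capture.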


Keywords (added by machine, not by the authors): Control Action; Radial Basis Function; Internal Model; Reinforcement Function; World Model





Copyright information

© Springer Science+Business Media New York 1995

Authors and Affiliations

  • Scott Miller (1)
  • Ronald J. Williams (1)
  1. College of Computer Science, Northeastern University, Boston, USA
