Play Selection in American Football: A Case Study in Neuro-Dynamic Programming

  • Stephen D. Patek
  • Dimitri P. Bertsekas
Part of the Operations Research/Computer Science Interfaces book series (ORCS, volume 9)

Abstract

We present a computational case study of neuro-dynamic programming, a recent class of reinforcement learning methods. We cast the problem of play selection in American football as a stochastic shortest path Markov Decision Problem (MDP). In particular, we consider the problem faced by a quarterback in attempting to maximize the net score of an offensive drive. The resulting optimization problem serves as a medium-scale testbed for numerical algorithms based on policy iteration.
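
As a point of reference (our own illustrative notation and state encoding, not necessarily the exact formulation used in the paper), the drive can be viewed as a stochastic shortest path problem with state x summarizing the game situation (e.g., field position, down, and yards to go), play choice u from a set U(x), random outcome w, net-score reward g(x, u, w), next state f(x, u, w), and an absorbing terminal state t entered when the drive ends. The optimal reward-to-go J* then satisfies the Bellman equation

    J^*(x) = \max_{u \in U(x)} E_w\big[\, g(x,u,w) + J^*(f(x,u,w)) \,\big], \qquad J^*(t) = 0.

The approximate methods discussed below replace J* (or the reward-to-go of a fixed policy) with a tunable parametric approximation.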

The algorithms we consider evolve as a sequence of approximate policy evaluations and policy updates. An (exact) policy evaluation amounts to computing the reward-to-go function associated with the policy in question. Approximations of reward-to-go are obtained either as the solution of, or as a step toward the solution of, a training problem involving simulated state/reward data pairs. Within this methodological framework there is a great deal of flexibility. In specifying a particular algorithm, one must select a parametric form for estimating the reward-to-go function as well as a training algorithm for tuning the approximation. One example we consider, among many others, is a multilayer perceptron (i.e., a neural network) trained by backpropagation.
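
To make the framework concrete, the following is a minimal sketch (in Python, under our own assumptions) of one member of this family: Monte Carlo evaluation of a fixed policy with a linear least-squares architecture, followed by a one-step-lookahead policy update. The helper names (model, features, simulate_drive) and the state encoding are hypothetical, and the linear fit stands in for the multilayer perceptron trained by backpropagation mentioned above.

    import numpy as np

    rng = np.random.default_rng(0)

    def features(state):
        # Encode (field position, down, yards to go) as a feature vector.  This
        # linear architecture is a stand-in for a multilayer perceptron.
        x, down, ytg = state
        return np.array([1.0, x, x ** 2, down, ytg, x * ytg])

    def simulate_drive(policy, model, start):
        # Simulate one drive under `policy`.  `model(state, play)` is a
        # hypothetical simulator returning (next_state_or_None, reward);
        # None marks the terminal state (end of the drive).
        states, rewards, state = [], [], start
        while state is not None:
            states.append(state)
            state, reward = model(state, policy(state))
            rewards.append(reward)
        # Reward-to-go target for each visited state: sum of subsequent rewards.
        returns = np.cumsum(rewards[::-1])[::-1]
        return states, returns

    def evaluate_policy(policy, model, start_states, n_drives=5000):
        # Approximate policy evaluation: least-squares fit of reward-to-go to
        # simulated state / sample-return pairs.
        X, y = [], []
        for _ in range(n_drives):
            start = start_states[rng.integers(len(start_states))]
            visited, returns = simulate_drive(policy, model, start)
            X.extend(features(s) for s in visited)
            y.extend(returns)
        r, *_ = np.linalg.lstsq(np.asarray(X), np.asarray(y), rcond=None)
        return r

    def greedy_policy(r, model, plays, n_samples=200):
        # Policy update: one-step lookahead (estimated by Monte Carlo) against
        # the fitted reward-to-go approximation.
        def q(state, play):
            total = 0.0
            for _ in range(n_samples):
                nxt, reward = model(state, play)
                total += reward + (0.0 if nxt is None else features(nxt) @ r)
            return total / n_samples
        return lambda state: max(plays, key=lambda play: q(state, play))

An approximate policy iteration run would then alternate evaluate_policy and greedy_policy, starting from some heuristic play-selection policy.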

The objective of this paper is to illustrate the application of neuro-dynamic programming methods in solving a well-defined optimization problem. We will contrast and compare various algorithms mainly in terms of performance, although we will also consider complexity of implementation. Because our version of football leads to a medium-scale Markov decision problem, it is possible to compute the optimal solution numerically, providing a yardstick for meaningful comparison of the approximate methods.
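
For the yardstick itself, a medium-scale problem of this kind can be solved exactly by standard dynamic programming once the states are enumerated. A minimal sketch, again under our own assumptions about how the model is stored (the tables P and g below are hypothetical placeholders), is:

    import numpy as np

    def value_iteration(P, g, tol=1e-9, max_iters=100_000):
        # P[a] is the (n x n) matrix of transition probabilities among the n
        # non-terminal states under play a; probability mass sent to the
        # terminal state (reward-to-go zero) is simply omitted from the rows.
        # g[a] is the n-vector of expected per-stage rewards under play a.
        n = next(iter(P.values())).shape[0]
        J = np.zeros(n)
        for _ in range(max_iters):
            # Bellman update: best expected reward plus reward-to-go over plays.
            J_new = np.max(np.stack([g[a] + P[a] @ J for a in P]), axis=0)
            if np.max(np.abs(J_new - J)) < tol:
                return J_new
            J = J_new
        return J

The resulting reward-to-go vector serves as the optimal benchmark against which the approximate architectures are compared.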

Keywords

American Football · Policy Iteration · Poisson Random Variable · Sample Trajectory · Opposing Team

Copyright information

© Springer Science+Business Media New York 1998

Authors and Affiliations

  • Stephen D. Patek (1)
  • Dimitri P. Bertsekas (1)

  1. Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, USA
