TD-Gammon: A Self-Teaching Backgammon Program
This chapter describes TD-Gammon, a neural network that is able to teach itself to play backgammon solely by playing against itself and learning from the results. TD-Gammon uses a recently proposed reinforcement learning algorithm called TD(λ) (Sutton, 1988), and is apparently the first application of this algorithm to a complex, nontrivial task. Despite starting from random initial weights (and hence a random initial strategy), TD-Gammon achieves a surprisingly strong level of play. With zero knowledge built in at the start of learning (i.e., given only a “raw” description of the board state), the network learns to play the entire game at a strong intermediate level that surpasses not only conventional commercial programs, but also comparable networks trained via supervised learning on a large corpus of human expert games. The hidden units in the network have apparently discovered useful features, a longstanding goal of computer games research.
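To make the training procedure concrete, the following is a minimal sketch of the TD(λ) update of Sutton (1988) applied to a small sigmoidal value network, in the spirit of the setup described above. The network sizes, step size, trace-decay parameter, and the random "episode" are illustrative assumptions, not the program's actual architecture or training data.

```python
# Minimal sketch of a TD(lambda) update for a neural-network value function.
# All sizes and hyperparameters below are assumed for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# Tiny one-hidden-layer value network V(s; w); TD-Gammon's was much larger.
N_IN, N_HID = 8, 4
W1 = rng.normal(scale=0.1, size=(N_HID, N_IN))
W2 = rng.normal(scale=0.1, size=N_HID)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def value_and_grads(s):
    """Forward pass V(s), plus gradients of V with respect to W1 and W2."""
    h = sigmoid(W1 @ s)
    v = sigmoid(W2 @ h)
    dv = v * (1.0 - v)                          # derivative of output sigmoid
    g2 = dv * h                                 # dV/dW2
    g1 = np.outer(dv * W2 * h * (1.0 - h), s)   # dV/dW1 via the chain rule
    return v, g1, g2

alpha, lam = 0.1, 0.7                  # step size and trace decay (assumed values)
e1, e2 = np.zeros_like(W1), np.zeros_like(W2)

# One stand-in "episode": random successor states, terminal outcome z in {0, 1}.
states = [rng.random(N_IN) for _ in range(10)]
z = 1.0                                # e.g. 1 if the learner won the game

for t in range(len(states)):
    v, g1, g2 = value_and_grads(states[t])
    # Eligibility traces accumulate exponentially decayed past gradients.
    e1 = lam * e1 + g1
    e2 = lam * e2 + g2
    if t + 1 < len(states):
        v_next, _, _ = value_and_grads(states[t + 1])
        delta = v_next - v             # TD error; no intermediate reward
    else:
        delta = z - v                  # the final outcome grounds the predictions
    W1 += alpha * delta * e1
    W2 += alpha * delta * e2
```

Each prediction is nudged toward the prediction that follows it, and the final prediction toward the actual game outcome, which is what allows learning to proceed from self-play alone.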
Furthermore, when a set of hand-crafted features is added to the network’s input representation, the result is a truly staggering level of performance: TD-Gammon is now estimated to play at a strong master level that is extremely close to the world’s best human players. We discuss possible principles underlying the success of TD-Gammon, and the prospects for successful real-world applications of TD learning in other domains.
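As a concrete illustration of the augmented input representation mentioned above, one might simply concatenate the raw board encoding with a few expert-derived terms. The two features sketched below, pip count and blot count, are standard backgammon concepts (see Magriel, 1976) chosen as hypothetical stand-ins; the chapter's actual feature set is not reproduced here.

```python
# Hypothetical sketch of adding hand-crafted features to a raw board encoding.
import numpy as np

def augmented_input(board: np.ndarray) -> np.ndarray:
    """board: 24 signed checker counts (+ours, -opponent's); we bear off past point 1."""
    # Pip count: total distance our checkers must still travel.
    pips = float(sum(c * p for p, c in enumerate(board, start=1) if c > 0))
    # Blots: our single, exposed checkers.
    blots = float(np.sum(board == 1))
    raw = board.astype(float)          # stand-in for the "raw" unit encoding
    return np.concatenate([raw, [pips / 167.0, blots]])  # 167 = starting pip count

# Example: the standard opening position from our side.
start = np.zeros(24, dtype=int)
start[[5, 7, 12, 23]] = [5, 3, 5, 2]       # our checkers on points 6, 8, 13, 24
start[[18, 16, 11, 0]] -= [5, 3, 5, 2]     # opponent's checkers, mirrored
print(augmented_input(start).shape)        # (26,)
```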
Keywords: reinforcement learning, hidden units, temporal difference learning, training games, random initial weights
- J. Christensen and R. Korf, “A unified theory of heuristic evaluation functions and its application to learning.” Proc. of AAAI-86, 148–152 (1986).
- P. W. Frey, “Algorithmic strategies for improving the performance of game playing programs.” In: D. Farmer et al. (Eds.), Evolution, Games and Learning. Amsterdam: North Holland (1986).
- P. Magriel, Backgammon. New York: Times Books (1976).
- D. H. Mitchell, “Using features to evaluate positions in experts’ and novices’ Othello games.” Master’s Thesis, Northwestern Univ., Evanston IL (1984).
- J. R. Quinlan, “Learning efficient classification procedures and their application to chess end games.” In: R. S. Michalski, J. G. Carbonell and T. M. Mitchell (Eds.), Machine Learning. Palo Alto CA: Tioga (1983).
- B. Robertie, Advanced Backgammon. Arlington MA: Gammon Press (1991).
- B. Robertie, “Carbon versus silicon: matching wits with TD-Gammon.” Inside Backgammon 2:2, 14–22 (1992).
- D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation.” In: D. Rumelhart and J. McClelland (Eds.), Parallel Distributed Processing, Vol. 1. Cambridge MA: MIT Press (1986).
- R. S. Sutton, “Temporal credit assignment in reinforcement learning.” Ph.D. Thesis, Univ. of Massachusetts, Amherst MA (1984).
- R. S. Sutton, “Learning to predict by the methods of temporal differences.” Machine Learning 3, 9–44 (1988).
- G. Tesauro, “Connectionist learning of expert preferences by comparison training.” In: D. Touretzky (Ed.), Advances in Neural Information Processing Systems 1, 99–106. San Mateo CA: Morgan Kaufmann (1989).
- G. Tesauro, “Neurogammon: a neural network backgammon program.” IJCNN Proceedings III, 33–39 (1990).