Abstract
This chapter describes TD-Gammon, a neural network that teaches itself to play backgammon solely by playing against itself and learning from the results. TD-Gammon uses a recently proposed reinforcement learning algorithm called TD(λ) (Sutton, 1988), and is apparently the first application of this algorithm to a complex, nontrivial task. Despite starting from random initial weights (and hence a random initial strategy), TD-Gammon achieves a surprisingly strong level of play. With zero knowledge built in at the start of learning (i.e., given only a “raw” description of the board state), the network learns to play the entire game at a strong intermediate level that surpasses not only conventional commercial programs, but also comparable networks trained via supervised learning on a large corpus of human expert games. The network’s hidden units appear to have discovered useful features, a longstanding goal of computer games research.
Furthermore, when a set of hand-crafted features is added to the network’s input representation, the result is a truly staggering level of performance: TD-Gammon is now estimated to play at a strong master level that is extremely close to the world’s best human players. We discuss possible principles underlying the success of TD-Gammon, and the prospects for successful real-world applications of TD learning in other domains.
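The learning method the abstract refers to can be sketched in a few lines. The following is a minimal illustration of TD(λ) with eligibility traces (Sutton, 1988) on a *linear* value function; the function names, step size, and trace parameter are illustrative assumptions, and TD-Gammon itself trained a multilayer network by backpropagating the TD error rather than using a linear model.

```python
# Minimal TD(lambda) sketch: linear value function V(s) = w . x(s) with
# eligibility traces.  Hypothetical illustration only -- TD-Gammon used a
# multilayer network, not a linear model.

def td_lambda_episode(features, rewards, w, alpha=0.1, gamma=1.0, lam=0.7):
    """Run one episode of TD(lambda) updates and return the new weights.

    features: feature vectors x_0 .. x_{T-1} for the nonterminal states
    rewards:  rewards r_1 .. r_T received on each transition
              (the terminal state's value is taken to be zero)
    """
    e = [0.0] * len(w)                          # eligibility trace
    for t, (x, r) in enumerate(zip(features, rewards)):
        v = sum(wi * xi for wi, xi in zip(w, x))
        if t + 1 < len(features):               # bootstrap from the next state
            v_next = sum(wi * xi for wi, xi in zip(w, features[t + 1]))
        else:                                   # final transition of the game
            v_next = 0.0
        delta = r + gamma * v_next - v          # TD error
        e = [gamma * lam * ei + xi for ei, xi in zip(e, x)]   # decay + add x_t
        w = [wi + alpha * delta * ei for wi, ei in zip(w, e)] # TD(lambda) update
    return w
```

In the backgammon setting, γ = 1 and the only nonzero reward is the win/loss signal at the end of the game, so repeated self-play episodes push the value estimates toward the probability of winning from each position.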
References
H. Berliner, “Computer backgammon.” Scientific American 243:1, 64–72 (1980).
D. P. Bertsekas, Dynamic Programming: Deterministic and Stochastic Models. Englewood Cliffs, NJ: Prentice Hall (1987).
J. Christensen and R. Korf, “A unified theory of heuristic evaluation functions and its application to learning.” Proc. of AAAI-86, 148–152 (1986).
P. Dayan, “The convergence of TD(λ) for general λ.” Machine Learning 8, 341–362 (1992).
P. W. Frey, “Algorithmic strategies for improving the performance of game playing programs.” In: D. Farmer et al. (Eds.), Evolution, Games and Learning. Amsterdam: North Holland (1986).
A. K. Griffith, “A comparison and evaluation of three machine learning procedures as applied to the game of checkers.” Artificial Intelligence 5, 137–148 (1974).
K. Hornik, M. Stinchcombe and H. White, “Multilayer feedforward networks are universal approximators.” Neural Networks 2, 359–366 (1989).
K.-F. Lee and S. Mahajan, “A pattern classification approach to evaluation function learning.” Artificial Intelligence 36, 1–25 (1988).
P. Magriel, Backgammon. New York: Times Books (1976).
M. L. Minsky and S. A. Papert, Perceptrons. Cambridge MA: MIT Press (1969). (Republished as an expanded edition in 1988).
D. H. Mitchell, “Using features to evaluate positions in experts’ and novices’ Othello games.” Master’s Thesis, Northwestern Univ., Evanston IL (1984).
J. R. Quinlan, “Learning efficient classification procedures and their application to chess end games.” In: R. S. Michalski, J. G. Carbonell and T. M. Mitchell (Eds.), Machine Learning. Palo Alto CA: Tioga (1983).
B. Robertie, Advanced Backgammon. Arlington MA: Gammon Press (1991).
B. Robertie, “Carbon versus silicon: matching wits with TD-Gammon.” Inside Backgammon 2:2, 14–22 (1992).
D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation.” In: D. Rumelhart and J. McClelland (Eds.), Parallel Distributed Processing, Vol. 1. Cambridge MA: MIT Press (1986).
A. Samuel, “Some studies in machine learning using the game of checkers.” IBM J. of Research and Development 3, 210–229 (1959).
A. Samuel, “Some studies in machine learning using the game of checkers, II — recent progress.” IBM J. of Research and Development 11, 601–617 (1967).
R. S. Sutton, “Temporal credit assignment in reinforcement learning.” Ph. D. Thesis, Univ. of Massachusetts, Amherst MA (1984).
R. S. Sutton, “Learning to predict by the methods of temporal differences.” Machine Learning 3, 9–44 (1988).
G. Tesauro and T. J. Sejnowski, “A parallel network that learns to play backgammon.” Artificial Intelligence 39, 357–390 (1989).
G. Tesauro, “Connectionist learning of expert preferences by comparison training.” In: D. Touretzky (Ed.), Advances in Neural Information Processing Systems 1, 99–106. San Mateo, CA: Morgan Kaufmann (1989).
G. Tesauro, “Neurogammon: a neural network backgammon program.” IJCNN Proceedings III, 33–39 (1990).
G. Tesauro, “Practical issues in temporal difference learning.” Machine Learning 8, 257–277 (1992).
N. Zadeh and G. Kobliska, “On optimal doubling in backgammon.” Management Science 23, 853–858 (1977).
© 1995 Springer Science+Business Media New York
Cite this chapter
Tesauro, G. (1995). TD-Gammon: A Self-Teaching Backgammon Program. In: Murray, A.F. (eds) Applications of Neural Networks. Springer, Boston, MA. https://doi.org/10.1007/978-1-4757-2379-3_11
Print ISBN: 978-1-4419-5140-3
Online ISBN: 978-1-4757-2379-3