Speeding up Q(λ)-learning

  • Marco Wiering
  • Jürgen Schmidhuber
Reinforcement Learning
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1398)


Q(λ)-learning uses TD(λ) methods to accelerate Q-learning. The worst-case complexity of a single update step in previous online Q(λ) implementations based on lookup tables is bounded by the size of the state/action space. Our faster algorithm's worst-case complexity is bounded by the number of actions. The algorithm is based on the observation that Q-value updates may be postponed until they are needed.
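The postponed-update idea can be illustrated in tabular form. The sketch below is not the paper's pseudocode — class and variable names are my own, traces are never cut on exploratory actions, and the numerical rescaling needed to keep powers of γλ from underflowing on long runs is omitted. It shows the core trick: instead of decaying every eligibility trace after each step (cost O(|S|·|A|)), keep one global prefix sum of discounted TD errors and bring a Q-value up to date only when it is actually read, so an experience step costs O(number of actions).

```python
from collections import defaultdict


class LazyQLambda:
    """Illustrative tabular Q(lambda) with postponed ("lazy") Q-value updates.

    A global prefix sum S[t] = sum_{k<=t} delta_k * (gamma*lam)**k lets any
    state/action pair catch up with all TD errors since its last visit in
    O(1), instead of touching every trace on every step.
    """

    def __init__(self, n_actions, alpha=0.1, gamma=0.9, lam=0.8):
        self.n_actions = n_actions
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.t = 0                   # global step counter
        self.S = [0.0]               # S[t] = sum_{k<=t} delta_k * (gamma*lam)**k
        self.q = defaultdict(float)  # Q-values, current as of their last sync
        self.last = {}               # (s, a) -> (step of last sync, trace then)

    def value(self, s, a):
        """Read Q(s, a), first applying all TD errors postponed since its last sync."""
        key = (s, a)
        if key in self.last:
            t0, e0 = self.last[key]
            decay = self.gamma * self.lam
            # the trace at step k > t0 was e0 * decay**(k - t0), so the postponed
            # increment is alpha * e0 * sum_{k>t0} delta_k * decay**(k - t0)
            self.q[key] += self.alpha * e0 * (self.S[self.t] - self.S[t0]) / decay ** t0
            self.last[key] = (self.t, e0 * decay ** (self.t - t0))
        return self.q[key]

    def update(self, s, a, r, s2):
        """One experience step; cost is O(n_actions), independent of |S| x |A|."""
        best_next = max(self.value(s2, b) for b in range(self.n_actions))
        delta = r + self.gamma * best_next - self.value(s, a)
        self.t += 1
        self.S.append(self.S[-1] + delta * (self.gamma * self.lam) ** self.t)
        self.q[(s, a)] += self.alpha * delta  # immediate part, trace weight 1
        self.last[(s, a)] = (self.t, 1.0)     # replacing trace: reset to 1
```

Reading `value(s, a)` after several later steps shows the lazy catch-up at work: a TD error that occurred after the pair's last visit is folded in, discounted by the appropriate power of γλ, exactly as if the trace had been decayed eagerly at every step.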


Keywords: Reinforcement learning, Q-learning, TD(λ), online Q(λ), lazy learning



Copyright information

© Springer-Verlag Berlin Heidelberg 1998

Authors and Affiliations

  • Marco Wiering (1)
  • Jürgen Schmidhuber (1)
  1. IDSIA, Lugano, Switzerland
