Temporal Difference Coding in Reinforcement Learning

  • Kazunori Iwata
  • Kazushi Ikeda
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2690)


In this paper, we regard the sequence of returns as outputs from a parametric compound source. The coding rate of the source shows the amount of information on the return, so the information gain concerning future information is given by the sum of the discounted coding rates. We accordingly formulate a temporal difference learning method for estimating the expected information gain, and give a convergence proof of the information gain under certain conditions. As an example application, we propose the ratio w of return loss to information gain for use in probabilistic action selection strategies. In experiments, our w-based strategy performed well compared with the conventional Q-based strategy.
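The idea of estimating a discounted sum of per-step coding rates by temporal difference learning can be illustrated with a minimal sketch. The abstract does not give the update rule, so everything here is an assumption: the tabular SARSA-style TD(0) form, the names td_update, boltzmann_probs and demo_constant_rate, the step size alpha, the discount gamma, and the treatment of the coding rate as a plain number supplied by the caller. It is not the authors' algorithm.

```python
import math


def td_update(gain, state, action, coding_rate, next_state, next_action,
              alpha=0.1, gamma=0.9):
    """One assumed TD(0) step on a table of information-gain estimates.

    gain[state][action] is moved toward coding_rate + gamma * (next estimate),
    i.e. toward a one-step sample of the discounted sum of coding rates.
    """
    target = coding_rate + gamma * gain[next_state][next_action]
    gain[state][action] += alpha * (target - gain[state][action])
    return gain[state][action]


def boltzmann_probs(scores, temperature=1.0):
    """Softmax over per-action scores, a common probabilistic selection rule.

    How the ratio w would be turned into scores (sign, scaling) is not stated
    in the abstract and is left to the caller here.
    """
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def demo_constant_rate(steps=500):
    """Single state, single action, constant coding rate 1.0, gamma 0.5.

    The discounted sum is 1.0 / (1 - 0.5) = 2.0, so the estimate should
    approach 2.0 as the update is repeated.
    """
    gain = {0: {0: 0.0}}
    for _ in range(steps):
        td_update(gain, 0, 0, coding_rate=1.0,
                  next_state=0, next_action=0, alpha=0.1, gamma=0.5)
    return gain[0][0]
```

In the toy chain, each update contracts the error toward the fixed point 2.0 by a factor of 0.95, which is the kind of behavior a convergence result would formalize. The conditions the paper actually requires are not stated in the abstract.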


Reinforcement Learning, Information Gain, Markov Decision Process, Return Loss, Entropy Rate
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Kazunori Iwata (1)
  • Kazushi Ikeda (1)
  1. Department of Systems Science, Graduate School of Informatics, Kyoto University, Kyoto, Japan
