Iteratively Extending Time Horizon Reinforcement Learning

  • Damien Ernst
  • Pierre Geurts
  • Louis Wehenkel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2837)


Reinforcement learning aims to determine an (infinite time horizon) optimal control policy from interaction with a system. The problem can be solved by approximating the so-called Q-function from a sample of four-tuples (x_t, u_t, r_t, x_{t+1}), where x_t denotes the system state at time t, u_t the control action taken, r_t the instantaneous reward obtained and x_{t+1} the successor state of the system, and by then determining the optimal control from the Q-function. Classical reinforcement learning algorithms use an ad hoc version of stochastic approximation which updates the Q-function approximation on a four-tuple by four-tuple basis. In this paper, we reformulate this problem as a sequence of batch mode supervised learning problems which in the limit converges to (an approximation of) the Q-function. Each step of this algorithm uses the full sample of four-tuples gathered from interaction with the system and extends by one step the horizon of the optimality criterion. An advantage of this approach is that it allows the use of standard batch mode supervised learning algorithms, instead of the incremental versions used up to now. In addition to a theoretical justification, the paper provides empirical tests in the context of the “Car on the Hill” control problem, based on the use of ensembles of regression trees. The resulting algorithm is in principle able to handle large scale reinforcement learning problems efficiently.
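To make the batch formulation concrete, here is a minimal sketch of the iteration the abstract describes. It is an illustration under stated assumptions, not the paper's implementation: the discount factor `gamma`, the toy chain system built by `make_chain_sample`, and the per-pair averaging regressor `fit_table` (a stand-in for the ensembles of regression trees) are all hypothetical choices. Each pass rebuilds the regression targets from the full sample of four-tuples and refits, which extends the horizon by one step.

```python
import numpy as np

def fit_table(inputs, targets, n_states, n_actions):
    """Stand-in batch regressor (assumption): mean target per (state, action) pair."""
    sums = np.zeros((n_states, n_actions))
    counts = np.zeros((n_states, n_actions))
    for (x, u), y in zip(inputs, targets):
        sums[x, u] += y
        counts[x, u] += 1
    # Pairs never observed in the sample keep a value of 0.
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

def fitted_q_iteration(four_tuples, n_states, n_actions, gamma=0.9, n_iterations=50):
    """Solve a sequence of batch regression problems, one per horizon step."""
    q = np.zeros((n_states, n_actions))  # zero-step horizon: Q_0 = 0
    inputs = [(x, u) for x, u, _, _ in four_tuples]
    for _ in range(n_iterations):
        # Targets use the previous approximation and the full sample.
        targets = [r + gamma * q[x_next].max() for _, _, r, x_next in four_tuples]
        q = fit_table(inputs, targets, n_states, n_actions)
    return q

def make_chain_sample(n_states=5):
    """Hypothetical toy system: action 1 moves right, action 0 moves left;
    reward 1 whenever the successor state is the last state of the chain."""
    sample = []
    for x in range(n_states):
        for u in (0, 1):
            x_next = min(x + 1, n_states - 1) if u == 1 else max(x - 1, 0)
            r = 1.0 if x_next == n_states - 1 else 0.0
            sample.append((x, u, r, x_next))
    return sample

sample = make_chain_sample()
q = fitted_q_iteration(sample, n_states=5, n_actions=2)
policy = q.argmax(axis=1)  # control derived from the Q-function approximation
```

In this sketch any batch regressor could replace `fit_table`, provided it is refit from scratch on the new targets at each iteration; the tabular average is used only because the toy state space is small and discrete.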



Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Damien Ernst (1)
  • Pierre Geurts (1)
  • Louis Wehenkel (1)
  1. Department of Electrical Engineering and Computer Science, Institut Montefiore, University of Liège, Liège, Belgium
