Abstract
The construction of a suitable set of features to approximate value functions is a central problem in reinforcement learning (RL). A popular approach to this problem is to use high-dimensional feature spaces together with least-squares temporal difference learning (LSTD). Although this combination allows for very accurate approximations, it often exhibits poor prediction performance because of overfitting when the number of samples is small compared to the number of features in the approximation space. In the linear regression setting, regularization is commonly used to overcome this problem. In this paper, we review regularized approaches to policy evaluation and introduce a novel scheme (L21) that uses ℓ2 regularization in the projection operator and an ℓ1 penalty in the fixed-point step. We show that this formulation reduces to a standard Lasso problem. As a result, any off-the-shelf solver can be used to compute its solution, and standardization techniques can be applied to the data. We report experimental results showing that L21 is effective in avoiding overfitting and that it compares favorably to existing ℓ1-regularized methods.
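To make the scheme concrete, below is a minimal sketch of the nested penalization described in the abstract, under the assumption that the ℓ2-regularized empirical projection is Π_ρ = Φ(ΦᵀΦ + ρI)⁻¹Φᵀ applied to the sampled Bellman backup r + γΦ′w. Since the projected residual Φw − Π_ρ(r + γΦ′w) is linear in w, the ℓ1-penalized fixed-point step becomes a standard Lasso with design matrix Φ − γΠ_ρΦ′ and target Π_ρr. The function name `l21_lstd` and the exact scaling of the ℓ1 parameter are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def l21_lstd(Phi, Phi_next, r, gamma, rho, lam):
    """Sketch of the nested L21 scheme (hypothetical helper, not the
    authors' code): an l2-regularized projection followed by an
    l1-penalized fixed-point step, solved as one standard Lasso.

    Phi      -- (n, d) features at the sampled states
    Phi_next -- (n, d) features at the corresponding next states
    r        -- (n,)   sampled rewards
    gamma    -- discount factor
    rho      -- l2 parameter of the projection step
    lam      -- l1 parameter of the fixed-point step
    """
    n, d = Phi.shape
    # l2-regularized empirical projection onto span(Phi):
    # Pi_rho = Phi (Phi^T Phi + rho I)^{-1} Phi^T
    G = Phi.T @ Phi + rho * np.eye(d)
    Pi = Phi @ np.linalg.solve(G, Phi.T)
    # The projected fixed-point residual is linear in w, so the
    # l1-penalized step reduces to a Lasso with:
    X = Phi - gamma * (Pi @ Phi_next)  # design matrix
    y = Pi @ r                         # regression target
    # Any off-the-shelf solver applies. scikit-learn's Lasso minimizes
    # (1/(2n))||y - Xw||^2 + alpha ||w||_1, so an objective of the form
    # ||Xw - y||^2 + lam ||w||_1 corresponds to alpha = lam / (2n).
    lasso = Lasso(alpha=lam / (2 * n), fit_intercept=False)
    lasso.fit(X, y)
    return lasso.coef_
```

Because the reduction yields a plain Lasso, the columns of X can be standardized before fitting, which is one of the practical benefits the abstract highlights; the rescaling of lam above depends on the solver's convention and should be checked against the paper's exact objective.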
Keywords
- Markov Decision Process
- Regularization Scheme
- Policy Iteration
- Projection Step
- Optimal Regularization Parameter
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Hoffman, M.W., Lazaric, A., Ghavamzadeh, M., Munos, R. (2012). Regularized Least Squares Temporal Difference Learning with Nested ℓ2 and ℓ1 Penalization. In: Sanner, S., Hutter, M. (eds.) Recent Advances in Reinforcement Learning. EWRL 2011. Lecture Notes in Computer Science, vol. 7188. Springer, Berlin, Heidelberg.
DOI: https://doi.org/10.1007/978-3-642-29946-9_13
Publisher: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29945-2
Online ISBN: 978-3-642-29946-9