Abstract
Many reinforcement learning methods are based on a function Q(s,a) whose value is the discounted total reward expected after performing the action a in the state s. This paper explores the implications of representing the Q function as $Q(s,a) = s^\top W a$, where W is a matrix that is learned. In this representation, both s and a are real-valued vectors that may have high dimension. We show that action selection can be done using standard linear programming, and that W can be learned using standard linear regression in the algorithm known as fitted Q iteration. Experimentally, the resulting method learns to solve the mountain car task in a sample-efficient way. The same method is also applicable to an inventory management task where the state space and the action space are continuous and high-dimensional.
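To make the pipeline described above concrete, here is a minimal sketch of fitted Q iteration with a bilinear Q function, assuming box-constrained actions and a batch of (s, a, r, s′) transitions. The function names, the box constraint, and the use of scipy.optimize.linprog are illustrative assumptions, not details from the paper.

```python
# Sketch: fitted Q iteration with Q(s, a) = s^T W a.
# Assumes actions lie in a finite box a_lo <= a <= a_hi; all names here
# are hypothetical, chosen for illustration only.

import numpy as np
from scipy.optimize import linprog


def fit_bilinear_q(S, A, y):
    """Least-squares fit of W in Q(s,a) = s^T W a.

    Uses the identity s^T W a = vec(W) . vec(s a^T): each regression row
    is the flattened outer product of a state and its action, so ordinary
    linear regression recovers vec(W).
    """
    n, ds = S.shape
    da = A.shape[1]
    X = np.einsum('ni,nj->nij', S, A).reshape(n, ds * da)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w.reshape(ds, da)


def greedy_action(W, s, a_lo, a_hi):
    """argmax_a s^T W a over the action box, solved as a linear program.

    For a fixed state s the objective c^T a with c = W^T s is linear in a,
    so maximizing it over any polytope of feasible actions is an LP.
    """
    c = W.T @ s
    res = linprog(-c, bounds=list(zip(a_lo, a_hi)))  # linprog minimizes
    return res.x


def fitted_q_iteration(transitions, a_lo, a_hi, gamma=0.99, iters=50):
    """Batch fitted Q iteration: repeatedly regress one-step Bellman targets."""
    S = np.array([t[0] for t in transitions])
    A = np.array([t[1] for t in transitions])
    R = np.array([t[2] for t in transitions])
    S2 = np.array([t[3] for t in transitions])
    W = np.zeros((S.shape[1], A.shape[1]))
    for _ in range(iters):
        # Target: r + gamma * max_a' Q(s', a'), with the max computed by LP.
        q_next = np.array([s2 @ W @ greedy_action(W, s2, a_lo, a_hi)
                           for s2 in S2])
        W = fit_bilinear_q(S, A, R + gamma * q_next)
    return W
```

For the box constraint used here, the LP has a closed form: each action coordinate sits at its upper bound when the corresponding entry of Wᵀs is positive and at its lower bound otherwise; the LP formulation is kept because it extends unchanged to general polytope constraints on the action.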