A Bayesian Sarsa Learning Algorithm with Bandit-Based Method
We propose an efficient reinforcement learning algorithm, Bayesian Sarsa (BS), designed to balance the tradeoff between exploration and exploitation. We use probability distributions to estimate Q-values and compute posterior distributions over Q-values by Bayesian inference, which improves the accuracy of the Q-value function estimate. During learning, a bandit-based method addresses the exploration/exploitation problem: actions are chosen according to the current mean estimate of the Q-values plus an additional reward bonus for state-action pairs that have been visited relatively rarely. We demonstrate that Bayesian Sarsa performs favorably compared with state-of-the-art reinforcement learning approaches.
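The idea described above can be sketched as follows. This is a minimal illustration, not the paper's exact method: it assumes a Gaussian posterior per state-action pair with a fixed, known observation precision, a bonus of the form beta/sqrt(visit count), and a toy chain environment; the paper's actual posterior update, bonus schedule, and benchmarks are not specified here.

```python
import numpy as np

# Illustrative sketch of a Bayesian Sarsa-style agent (assumed details:
# Gaussian posterior per (s, a) with known target precision TAU, and a
# count-based exploration bonus BETA / sqrt(n); the toy chain MDP below
# is invented for demonstration).

N_STATES, N_ACTIONS = 5, 2
GAMMA, BETA = 0.95, 1.0
TAU = 1.0  # assumed precision of each Sarsa target observation

mu = np.zeros((N_STATES, N_ACTIONS))      # posterior means of Q-values
prec = np.ones((N_STATES, N_ACTIONS))     # posterior precisions of Q-values
counts = np.zeros((N_STATES, N_ACTIONS))  # visit counts for the bonus

def select_action(s):
    """Pick the action maximizing mean Q plus a bonus that shrinks with visits."""
    bonus = BETA / np.sqrt(counts[s] + 1.0)
    return int(np.argmax(mu[s] + bonus))

def update(s, a, r, s2, a2):
    """Conjugate Gaussian update toward the Sarsa target r + gamma * Q(s', a')."""
    target = r + GAMMA * mu[s2, a2]
    new_prec = prec[s, a] + TAU
    mu[s, a] = (prec[s, a] * mu[s, a] + TAU * target) / new_prec
    prec[s, a] = new_prec
    counts[s, a] += 1

def step(s, a):
    """Toy chain: action 1 moves right, action 0 moves left; the last
    state pays reward 1 (the episode then restarts from state 0)."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r

s = 0
a = select_action(s)
for t in range(5000):
    s2, r = step(s, a)
    a2 = select_action(s2)
    update(s, a, r, s2, a2)
    # Restart the episode after the goal reward, otherwise continue on-policy.
    s, a = (0, select_action(0)) if r > 0 else (s2, a2)
```

Because the bonus decays as a state-action pair is visited, early behavior is exploratory and later behavior concentrates on the actions with the highest posterior mean, which is the exploration/exploitation mechanism the abstract describes.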
Keywords: Reinforcement learning · Probability distribution · Bayesian inference · Bandit-based method · Exploration/exploitation
This work was funded by the National Natural Science Foundation of China (61272005, 61303108, 61373094, 61472262, 61502323), the Natural Science Foundation of Jiangsu (BK2012616), the High School Natural Foundation of Jiangsu (13KJB520020), the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University (93K172014K04), and the Suzhou Industrial Application of Basic Research Program (SYG201422). We would also like to thank the reviewers for their helpful comments.