A Bayesian Sarsa Learning Algorithm with Bandit-Based Method

  • Shuhua You
  • Quan Liu
  • Qiming Fu
  • Shan Zhong
  • Fei Zhu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9489)


We propose an efficient algorithm, Bayesian Sarsa (BS), that balances the tradeoff between exploration and exploitation in reinforcement learning. BS represents each Q-value as a probability distribution and computes the posterior distribution over Q-values by Bayesian inference, which improves the accuracy of Q-value function estimation. During learning, a bandit-based method handles the exploration/exploitation problem: it chooses actions according to the current mean estimate of the Q-values plus an additional reward bonus for state-action pairs that have been observed relatively rarely. We demonstrate that Bayesian Sarsa performs favorably compared with state-of-the-art reinforcement learning approaches.
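The two ingredients named in the abstract, a posterior over each Q-value and a bonus-based action choice, can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a Gaussian posterior with known observation noise (a simple conjugate update) and a UCB1-style bonus in the spirit of Auer et al. The names `GaussianQ`, `select_action`, `sarsa_target`, and the parameters `noise_var` and `c` are hypothetical and chosen for illustration only.

```python
import math
from collections import defaultdict


class GaussianQ:
    """Normal posterior over one Q-value (assumption: known observation noise)."""

    def __init__(self, prior_mean=0.0, prior_var=1.0, noise_var=1.0):
        self.mean = prior_mean
        self.var = prior_var
        self.noise_var = noise_var
        self.count = 0  # number of updates seen for this state-action pair

    def update(self, target):
        # Conjugate normal-normal update with a single observed target.
        precision = 1.0 / self.var + 1.0 / self.noise_var
        self.mean = (self.mean / self.var + target / self.noise_var) / precision
        self.var = 1.0 / precision
        self.count += 1


def select_action(q, state, actions, total_visits, c=1.0):
    """Pick the action with the largest posterior mean plus exploration bonus."""

    def score(action):
        n = q[(state, action)].count
        if n == 0:
            return math.inf  # unvisited pairs are tried first
        # UCB1-style bonus: shrinks as the pair is observed more often.
        bonus = c * math.sqrt(2.0 * math.log(total_visits) / n)
        return q[(state, action)].mean + bonus

    return max(actions, key=score)


def sarsa_target(q, reward, next_state, next_action, gamma=0.95):
    """On-policy (Sarsa) target built from the posterior mean of the next pair."""
    return reward + gamma * q[(next_state, next_action)].mean


# Usage: a table of posteriors, created lazily per (state, action) pair.
q = defaultdict(GaussianQ)
q[("s", "a")].update(1.0)
chosen = select_action(q, "s", ["a", "b"], total_visits=1)
```

In this sketch the unvisited action `"b"` is selected, because a pair with no observations receives an unbounded bonus; once every pair has been tried, the bonus decays and the posterior mean dominates the choice.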


Keywords: Reinforcement learning, Probability distribution, Bayesian inference, Bandit-based method, Exploration/exploitation



This work was funded by the National Natural Science Foundation (61272005, 61303108, 61373094, 61502323, 61472262), the Natural Science Foundation of Jiangsu (BK2012616), the High School Natural Foundation of Jiangsu (13KJB520020), the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04), and the Suzhou Industrial application of basic research program part (SYG201422). We would also like to thank the reviewers for their helpful comments.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Shuhua You (1)
  • Quan Liu (1)
  • Qiming Fu (1)
  • Shan Zhong (1)
  • Fei Zhu (1)
  1. School of Computer Science and Technology, Soochow University, Suzhou, China
