Accurate Q-Learning

Hu, Zhihui; Jiang, Yubin; Ling, Xinghong; Liu, Quan

doi:10.1007/978-3-030-04182-3_49

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11303)

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

Q-learning can suffer from large overestimations in some stochastic environments. To address this problem, we first propose a new form of Q-learning, prove that it is equivalent to the incremental form, and analyze why positive bias affects the convergence rate of Q-learning. We then generalize the new form so that it can be adapted easily. By using the current value in place of the bias term, we obtain an accurate Q-learning algorithm and show that it converges to an optimal policy. Experimentally, the new algorithm avoids the effect of positive bias and converges faster than Q-learning and its variants on several MDP problems.
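The positive bias mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the authors' algorithm; it only shows, under assumed settings, why taking a maximum over noisy value estimates tends to overestimate. All names (`max_of_sample_means`, `estimate_bias`) and all parameters (10 actions, 5 samples per action, unit-variance Gaussian rewards) are hypothetical choices made for illustration.

```python
import random
import statistics


def max_of_sample_means(n_actions, n_samples, rng):
    """Return the maximum over actions of the sample-mean reward.

    Assumption for this sketch: every action has true expected reward 0
    (rewards drawn from N(0, 1)), so the true maximum expected value is 0.
    """
    means = []
    for _ in range(n_actions):
        rewards = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        means.append(statistics.fmean(rewards))
    return max(means)


def estimate_bias(n_actions=10, n_samples=5, n_trials=2000, seed=0):
    """Average the max-of-means estimate over many independent trials."""
    rng = random.Random(seed)
    estimates = [
        max_of_sample_means(n_actions, n_samples, rng) for _ in range(n_trials)
    ]
    return statistics.fmean(estimates)


if __name__ == "__main__":
    # With several actions the average estimate sits clearly above 0;
    # with a single action there is no max to inflate it.
    print(f"10 actions: {estimate_bias(n_actions=10):.3f} (true max is 0.0)")
    print(f" 1 action : {estimate_bias(n_actions=1):.3f} (true max is 0.0)")
```

Because the standard Q-learning target takes a maximum over estimated action values in the next state, the same effect can push its value estimates upward when rewards or transitions are noisy.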

Acknowledgments

This work was funded by the National Natural Science Foundation (61272005, 61303108, 61373094, 61502323, 61472262), the Natural Science Foundation of Jiangsu (BK2012616), the High School Natural Foundation of Jiangsu (13KJB520020), and the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University. We would also like to thank the reviewers for their helpful comments.

Author information

Authors and Affiliations

School of Computer Science and Technology, Soochow University, Suzhou, 215006, Jiangsu, China
Zhihui Hu, Yubin Jiang, Xinghong Ling & Quan Liu
Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing, 210000, China
Quan Liu
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, 130012, China
Quan Liu

Corresponding author

Correspondence to Quan Liu.

Editor information

Editors and Affiliations

The Chinese Academy of Sciences, Beijing, China
Long Cheng
City University of Hong Kong, Kowloon, Hong Kong
Andrew Chi Sing Leung
Kobe University, Kobe, Japan
Seiichi Ozawa

About this paper

Cite this paper

Hu, Z., Jiang, Y., Ling, X., Liu, Q. (2018). Accurate Q-Learning. In: Cheng, L., Leung, A., Ozawa, S. (eds) Neural Information Processing. ICONIP 2018. Lecture Notes in Computer Science, vol 11303. Springer, Cham. https://doi.org/10.1007/978-3-030-04182-3_49

DOI: https://doi.org/10.1007/978-3-030-04182-3_49
Published: 18 November 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-04181-6
Online ISBN: 978-3-030-04182-3
eBook Packages: Computer Science, Computer Science (R0)
