Abstract
Learning optimal policies under stochastic rewards is a challenge for well-known reinforcement learning algorithms such as Q-learning. Q-learning suffers from a positive bias that inhibits it from learning under inconsistent rewards. Actor-critic methods do not suffer from this bias, but they may still fail to acquire the optimal policy when rewards have high variance. We propose a reward shaping function that minimizes the variance of stochastic rewards. By reformulating Q-learning as a deterministic actor-critic, we show that this reward shaping function improves the acquisition of optimal policies under stochastic reinforcements.
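The abstract's idea of combining a Q-learning-style update with a reward shaping function can be sketched as follows. The paper's own shaping function is not reproduced on this page, so this sketch instead uses the classic potential-based shaping of Ng et al. (1999), F(s, s') = γΦ(s') − Φ(s), which is known to preserve the optimal policy; `phi`, `GAMMA`, and `ALPHA` are hypothetical choices for illustration.

```python
# Tabular Q-learning with potential-based reward shaping
# (Ng et al., 1999). The potential phi is a hypothetical choice;
# the paper's specific variance-reducing shaping function differs.

GAMMA, ALPHA = 0.9, 0.1  # discount factor and learning rate

def shaped_reward(r, s, s_next, phi):
    # F(s, s') = gamma * phi(s') - phi(s); adding F to r leaves
    # the optimal policy unchanged while altering reward variance.
    return r + GAMMA * phi(s_next) - phi(s)

def q_update(Q, s, a, r, s_next, actions, phi):
    # Standard Q-learning target, computed on the shaped reward.
    target = shaped_reward(r, s, s_next, phi) + \
        GAMMA * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```

With a zero potential the update reduces to plain Q-learning; a well-chosen Φ can cancel much of the variance in the observed reward before it enters the temporal-difference target.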
Y. Okesanjo—Code at https://github.com/ev0/Dac-mdp.
Acknowledgements
The authors would like to thank Prof. Guerzhoy for the helpful guidance and discussions.
© 2017 Springer International Publishing AG
Cite this paper
Okesanjo, Y., Kofia, V. (2017). A Deterministic Actor-Critic Approach to Stochastic Reinforcements. In: Peng, W., Alahakoon, D., Li, X. (eds) AI 2017: Advances in Artificial Intelligence. AI 2017. Lecture Notes in Computer Science(), vol 10400. Springer, Cham. https://doi.org/10.1007/978-3-319-63004-5_6
DOI: https://doi.org/10.1007/978-3-319-63004-5_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63003-8
Online ISBN: 978-3-319-63004-5