Abstract
Learning optimal policies under stochastic rewards is a challenge for well-known reinforcement learning algorithms such as Q-learning. Q-learning suffers from a positive bias that inhibits it from learning under inconsistent rewards. Actor-critic methods do not suffer from this bias, but they may still fail to acquire the optimal policy when rewards have high variance. We propose a reward shaping function that minimizes the variance of stochastic rewards. By reformulating Q-learning as a deterministic actor-critic, we show that this reward shaping function improves the acquisition of optimal policies under stochastic reinforcements.
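The abstract's idea of combining a Q-learning-style update with a reward shaping function can be sketched as follows. The paper's own shaping function is not reproduced on this page, so this sketch instead uses the classic potential-based shaping of Ng et al. (1999), F(s, s') = γΦ(s') − Φ(s), which is known to preserve the optimal policy; `phi`, `GAMMA`, and `ALPHA` are hypothetical choices for illustration.

```python
# Tabular Q-learning with potential-based reward shaping
# (Ng et al., 1999). The potential phi is a hypothetical choice;
# the paper's specific variance-reducing shaping function differs.

GAMMA, ALPHA = 0.9, 0.1  # discount factor and learning rate

def shaped_reward(r, s, s_next, phi):
    # F(s, s') = gamma * phi(s') - phi(s); adding F to r leaves
    # the optimal policy unchanged while altering reward variance.
    return r + GAMMA * phi(s_next) - phi(s)

def q_update(Q, s, a, r, s_next, actions, phi):
    # Standard Q-learning target, computed on the shaped reward.
    target = shaped_reward(r, s, s_next, phi) + \
        GAMMA * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```

With a zero potential the update reduces to plain Q-learning; a well-chosen Φ can cancel much of the variance in the observed reward before it enters the temporal-difference target.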
Y. Okesanjo—Code at https://github.com/ev0/Dac-mdp.
Acknowledgements
The authors would like to thank Prof. Guerzhoy for the helpful guidance and discussions.
© 2017 Springer International Publishing AG
Cite this paper
Okesanjo, Y., Kofia, V. (2017). A Deterministic Actor-Critic Approach to Stochastic Reinforcements. In: Peng, W., Alahakoon, D., Li, X. (eds) AI 2017: Advances in Artificial Intelligence. AI 2017. Lecture Notes in Computer Science(), vol 10400. Springer, Cham. https://doi.org/10.1007/978-3-319-63004-5_6
DOI: https://doi.org/10.1007/978-3-319-63004-5_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63003-8
Online ISBN: 978-3-319-63004-5