Abstract
In some reinforcement learning problems an agent may be provided with a set of input policies, perhaps learned from prior experience or provided by advisors. We present the Reinforcement Learning with Policy Advice (RLPA) algorithm, which leverages this input set and learns to use the best policy in the set for the reinforcement learning task at hand. We prove that RLPA has a sub-linear regret of \(\widetilde O(\sqrt{T})\) relative to the best input policy, and that both this regret and the algorithm's computational complexity are independent of the size of the state and action spaces. Our empirical simulations support the theoretical analysis, suggesting that RLPA may offer significant advantages in large domains where some good prior policies are available.
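The core idea of competing with the best policy in a fixed input set can be illustrated with a simple optimism-based selection loop. The sketch below is not the authors' RLPA algorithm; it is a hypothetical UCB-style bandit over the input policies, where `run_policy(pi, n)` is an assumed interface that executes policy `pi` for `n` environment steps and returns the total reward collected. Each input policy is treated as an arm, and the agent repeatedly runs the policy with the highest optimistic estimate of its average reward.

```python
import math

def ucb_policy_selection(policies, run_policy, total_steps, phase_len=100):
    """Illustrative UCB-style selection among a fixed set of input policies.

    `run_policy(pi, n)` is a hypothetical interface: execute policy `pi`
    for `n` environment steps and return the total reward collected.
    """
    m = len(policies)
    counts = [0] * m       # environment steps executed under each policy
    rewards = [0.0] * m    # cumulative reward collected under each policy
    steps = 0
    while steps < total_steps:
        # Optimistic index: empirical mean reward plus a confidence bonus
        # that shrinks as a policy accumulates more execution steps.
        def ucb(i):
            if counts[i] == 0:
                return float('inf')  # force each policy to be tried once
            bonus = math.sqrt(2.0 * math.log(max(steps, 2)) / counts[i])
            return rewards[i] / counts[i] + bonus
        i = max(range(m), key=ucb)
        n = min(phase_len, total_steps - steps)
        rewards[i] += run_policy(policies[i], n)
        counts[i] += n
        steps += n
    # Return the policy with the best empirical average reward.
    best = max(range(m), key=lambda i: rewards[i] / max(counts[i], 1))
    return policies[best], counts, rewards
```

Running policies in phases rather than single steps reflects the fact that a policy's value in an MDP is only revealed over extended execution; the confidence bonus ensures suboptimal policies are eventually abandoned, so most of the budget is spent on the best input policy.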
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
Cite this paper
Azar, M.G., Lazaric, A., Brunskill, E. (2013). Regret Bounds for Reinforcement Learning with Policy Advice. In: Blockeel, H., Kersting, K., Nijssen, S., Železný, F. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2013. Lecture Notes in Computer Science(), vol 8188. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40988-2_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40987-5
Online ISBN: 978-3-642-40988-2