Abstract
We consider the restless Markov bandit problem, in which the state of each arm evolves according to a Markov process independently of the learner’s actions. We suggest an algorithm that after T steps achieves \(\tilde{O}(\sqrt{T})\) regret with respect to the best policy that knows the distributions of all arms. No assumptions on the Markov chains are made except that they are irreducible. In addition, we show that index-based policies are necessarily suboptimal for the considered problem.
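To make the "restless" setting concrete, here is a minimal simulation sketch. It is purely illustrative and not the paper's algorithm: each arm is assumed to be an irreducible two-state Markov chain whose binary state is its reward, a naive round-robin learner stands in for a real policy, and all names and parameters (`p_stay`, `simulate`, the horizon) are hypothetical choices for this sketch.

```python
import random

def step(state, p_stay):
    """Advance a two-state chain: stay in `state` w.p. p_stay, else flip.
    With 0 < p_stay < 1 the chain is irreducible, as the abstract requires."""
    return state if random.random() < p_stay else 1 - state

def simulate(num_arms=3, horizon=1000, seed=0):
    """Run a round-robin learner on restless two-state Markov arms."""
    random.seed(seed)
    p_stay = [0.9, 0.7, 0.5][:num_arms]   # illustrative per-arm parameters
    states = [0] * num_arms               # an arm's reward = its current state
    total_reward = 0
    for t in range(horizon):
        arm = t % num_arms                # naive stand-in for a learning policy
        total_reward += states[arm]       # learner sees only the pulled arm
        # The defining feature of the restless setting: *every* chain
        # evolves each step, independently of the learner's action.
        states = [step(s, p) for s, p in zip(states, p_stay)]
    return total_reward

print(simulate())
```

Because the unobserved arms keep evolving, the optimal policy must reason about how long each arm has gone unpulled, which is why (as the paper shows) simple index-based policies are necessarily suboptimal here.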
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Ortner, R., Ryabko, D., Auer, P., Munos, R. (2012). Regret Bounds for Restless Markov Bandits. In: Bshouty, N.H., Stoltz, G., Vayatis, N., Zeugmann, T. (eds) Algorithmic Learning Theory. ALT 2012. Lecture Notes in Computer Science(), vol 7568. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34106-9_19
Print ISBN: 978-3-642-34105-2
Online ISBN: 978-3-642-34106-9