PAC Bounds for Multi-armed Bandit and Markov Decision Processes
The bandit problem is revisited and considered under the PAC model. Our main contribution in this part is to show that, given n arms, it suffices to pull the arms a total of O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability at least 1 − δ. This is in contrast to the naive bound of O((n/ε²) log(n/δ)). We derive another algorithm whose complexity depends on the specific setting of the rewards, rather than on the worst case setting. We also provide a matching lower bound.
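As a concrete illustration (not taken from the paper's text), here is a minimal Python sketch of a median-elimination style strategy consistent with the O((n/ε²) log(1/δ)) bound: in each phase every surviving arm is sampled to a fixed accuracy, the empirically worse half is discarded, and the per-phase accuracy and confidence budgets shrink geometrically so the total number of pulls stays linear in n. The `arms` interface and all constants are illustrative assumptions.

```python
import math

def median_elimination(arms, epsilon, delta):
    """Return the index of an epsilon-optimal arm with probability >= 1 - delta.

    Assumes arms[i]() draws an independent reward in [0, 1]. This is a
    hedged sketch of a median-elimination style strategy; the paper's
    exact phase schedule and constants may differ.
    """
    surviving = list(range(len(arms)))
    eps_l, delta_l = epsilon / 4.0, delta / 2.0
    while len(surviving) > 1:
        # Hoeffding bound: this many pulls estimate each surviving arm's
        # mean to within eps_l / 2, with small per-arm failure probability.
        pulls = int(math.ceil((2.0 / (eps_l / 2.0) ** 2) * math.log(6.0 / delta_l)))
        means = {i: sum(arms[i]() for _ in range(pulls)) / pulls
                 for i in surviving}
        # Keep the empirically better half (at least one arm survives).
        surviving.sort(key=lambda i: means[i], reverse=True)
        surviving = surviving[: (len(surviving) + 1) // 2]
        # Geometric decay of the per-phase budgets keeps the total
        # sample complexity O((n / epsilon^2) * log(1 / delta)).
        eps_l *= 0.75
        delta_l *= 0.5
    return surviving[0]
```

For instance, with Bernoulli arms one could call `median_elimination([lambda p=p: float(random.random() < p) for p in [0.2, 0.5, 0.8]], 0.1, 0.05)` (after `import random`). Halving the arm set each phase is what removes the log n factor that a naive union bound over all n arms would incur.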
We show how, given an algorithm for the multi-armed bandit problem in the PAC model, one can derive a batch learning algorithm for Markov Decision Processes. This is done essentially by simulating Value Iteration and, in each iteration, invoking the multi-armed bandit algorithm. Using our PAC algorithm for the multi-armed bandit problem, we improve the dependence on the number of actions.
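The reduction can be pictured as follows: in each simulated Value Iteration step, choosing an action at a state is itself a bandit problem whose arm payoffs are sampled one-step backups. The sketch below is an assumed reading of that construction, not the paper's algorithm verbatim; `sample_next`, `reward`, `best_arm`, and all parameter choices are hypothetical names and placeholders.

```python
def bandit_value_iteration(states, actions, sample_next, reward, gamma,
                           iterations, best_arm, eps=0.1, delta=0.01, m=100):
    """Batch MDP learning by simulating Value Iteration with a PAC bandit
    subroutine. Hypothetical interface:
      sample_next(s, a) -> next state drawn from P(. | s, a)
      reward(s, a)      -> expected immediate reward in [0, 1]
      best_arm(arms, eps, delta) -> index of a near-optimal arm, e.g. the
                                    median_elimination sketch above
    Rescaling backups into [0, 1] and tuning eps, delta, and m per
    iteration (which an overall PAC guarantee would require) are omitted.
    """
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        new_V = {}
        for s in states:
            # One "arm" per action: pulling it samples a one-step backup
            # r(s, a) + gamma * V(s') with s' ~ P(. | s, a).
            arms = [lambda a=a, s=s: reward(s, a) + gamma * V[sample_next(s, a)]
                    for a in actions]
            a_star = best_arm(arms, eps, delta)
            # Re-estimate the chosen action's backup to form the update.
            new_V[s] = sum(arms[a_star]() for _ in range(m)) / m
        V = new_V
    return V
```

Since the bandit subroutine is invoked once per state-action-selection step, shaving the log factor in its sample complexity (log(1/δ) instead of log(|A|/δ)) is what yields the improved dependence on the number of actions claimed above.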
Keywords: Optimal Policy, Failure Probability, Markov Decision Process, Bandit Problem, State Action Pair