PAC Bounds for Multi-armed Bandit and Markov Decision Processes
The bandit problem is revisited and considered under the PAC model. Our main contribution in this part is to show that, given n arms, it suffices to pull the arms a total of O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability at least 1 − δ. This is in contrast to the naive bound of O((n/ε²) log(n/δ)). We derive another algorithm whose complexity depends on the specific setting of the rewards, rather than on the worst case setting. We also provide a matching lower bound.
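As a concrete illustration (not taken from the paper's text), here is a minimal Python sketch of a median-elimination style strategy consistent with the O((n/ε²) log(1/δ)) bound: in each phase every surviving arm is sampled to a fixed accuracy, the empirically worse half is discarded, and the per-phase accuracy and confidence budgets shrink geometrically so the total number of pulls stays linear in n. The `arms` interface and all constants are illustrative assumptions.

```python
import math

def median_elimination(arms, epsilon, delta):
    """Return the index of an epsilon-optimal arm with probability >= 1 - delta.

    Assumes arms[i]() draws an independent reward in [0, 1]. This is a
    hedged sketch of a median-elimination style strategy; the paper's
    exact phase schedule and constants may differ.
    """
    surviving = list(range(len(arms)))
    eps_l, delta_l = epsilon / 4.0, delta / 2.0
    while len(surviving) > 1:
        # Hoeffding bound: this many pulls estimate each surviving arm's
        # mean to within eps_l / 2, with small per-arm failure probability.
        pulls = int(math.ceil((2.0 / (eps_l / 2.0) ** 2) * math.log(6.0 / delta_l)))
        means = {i: sum(arms[i]() for _ in range(pulls)) / pulls
                 for i in surviving}
        # Keep the empirically better half (at least one arm survives).
        surviving.sort(key=lambda i: means[i], reverse=True)
        surviving = surviving[: (len(surviving) + 1) // 2]
        # Geometric decay of the per-phase budgets keeps the total
        # sample complexity O((n / epsilon^2) * log(1 / delta)).
        eps_l *= 0.75
        delta_l *= 0.5
    return surviving[0]
```

For instance, with Bernoulli arms one could call `median_elimination([lambda p=p: float(random.random() < p) for p in [0.2, 0.5, 0.8]], 0.1, 0.05)` (after `import random`). Halving the arm set each phase is what removes the log n factor that a naive union bound over all n arms would incur.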
We show how, given an algorithm for the multi-armed bandit problem in the PAC model, one can derive a batch learning algorithm for Markov Decision Processes. This is done essentially by simulating Value Iteration and, in each iteration, invoking the multi-armed bandit algorithm. Using our PAC algorithm for the multi-armed bandit problem, we improve the dependence on the number of actions.
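The reduction can be pictured as follows: in each simulated Value Iteration step, choosing an action at a state is itself a bandit problem whose arm payoffs are sampled one-step backups. The sketch below is an assumed reading of that construction, not the paper's algorithm verbatim; `sample_next`, `reward`, `best_arm`, and all parameter choices are hypothetical names and placeholders.

```python
def bandit_value_iteration(states, actions, sample_next, reward, gamma,
                           iterations, best_arm, eps=0.1, delta=0.01, m=100):
    """Batch MDP learning by simulating Value Iteration with a PAC bandit
    subroutine. Hypothetical interface:
      sample_next(s, a) -> next state drawn from P(. | s, a)
      reward(s, a)      -> expected immediate reward in [0, 1]
      best_arm(arms, eps, delta) -> index of a near-optimal arm, e.g. the
                                    median_elimination sketch above
    Rescaling backups into [0, 1] and tuning eps, delta, and m per
    iteration (which an overall PAC guarantee would require) are omitted.
    """
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        new_V = {}
        for s in states:
            # One "arm" per action: pulling it samples a one-step backup
            # r(s, a) + gamma * V(s') with s' ~ P(. | s, a).
            arms = [lambda a=a, s=s: reward(s, a) + gamma * V[sample_next(s, a)]
                    for a in actions]
            a_star = best_arm(arms, eps, delta)
            # Re-estimate the chosen action's backup to form the update.
            new_V[s] = sum(arms[a_star]() for _ in range(m)) / m
        V = new_V
    return V
```

Since the bandit subroutine is invoked once per state-action-selection step, shaving the log factor in its sample complexity (log(1/δ) instead of log(|A|/δ)) is what yields the improved dependence on the number of actions claimed above.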
Keywords: Optimal Policy, Failure Probability, Markov Decision Process, Bandit Problem, State Action Pair