PAC Bounds for Multi-armed Bandit and Markov Decision Processes

  • Eyal Even-Dar
  • Shie Mannor
  • Yishay Mansour
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2375)


The bandit problem is revisited and considered under the PAC model. Our main contribution in this part is to show that, given n arms, it suffices to pull the arms O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability at least 1 − δ. This is in contrast to the naive bound of O((n/ε²) log(n/δ)). We derive another algorithm whose complexity depends on the specific setting of the rewards, rather than the worst-case setting. We also provide a matching lower bound.
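The gap between the two bounds can be made concrete with a small sketch. The first routine below is the naive approach: sample every arm equally often and use Hoeffding's inequality with a union bound over the n arms, which is where the log(n/δ) factor comes from. The second routine is a median-elimination scheme, one known way to obtain a log(1/δ) dependence; the abstract does not name the improved algorithm, so treating it as such, along with the constants, the function names, and the `pull` callback, is an assumption made for illustration. Rewards are assumed to lie in [0, 1].

```python
import math


def naive_sample_count(n_arms, eps, delta):
    """Pulls per arm so that, by Hoeffding plus a union bound over the arms,
    every empirical mean is within eps/2 of its true mean with probability
    at least 1 - delta (rewards assumed to lie in [0, 1])."""
    return math.ceil((2.0 / eps ** 2) * math.log(2.0 * n_arms / delta))


def naive_pac_bandit(pull, n_arms, eps, delta):
    """Sample every arm equally often and return (best empirical arm, total pulls).

    `pull(a)` is a hypothetical callback returning one reward for arm `a`.
    Total pulls grow as (n / eps^2) * log(n / delta).
    """
    m = naive_sample_count(n_arms, eps, delta)
    means = [sum(pull(a) for _ in range(m)) / m for a in range(n_arms)]
    best = max(range(n_arms), key=means.__getitem__)
    return best, m * n_arms


def median_elimination(pull, n_arms, eps, delta):
    """Illustrative median-elimination sketch (constants are assumptions).

    Each round samples the surviving arms, keeps the better half by empirical
    mean, and tightens the per-round accuracy and confidence parameters.
    """
    arms = list(range(n_arms))
    eps_l, delta_l = eps / 4.0, delta / 2.0
    total = 0
    while len(arms) > 1:
        m = math.ceil((4.0 / eps_l ** 2) * math.log(3.0 / delta_l))
        means = {a: sum(pull(a) for _ in range(m)) / m for a in arms}
        total += m * len(arms)
        # Keep the top half (rounded up) of the surviving arms.
        arms.sort(key=means.__getitem__, reverse=True)
        arms = arms[: (len(arms) + 1) // 2]
        eps_l *= 0.75
        delta_l /= 2.0
    return arms[0], total
```

Because the number of surviving arms halves each round while the per-arm sample count grows only geometrically, the total number of pulls in the second routine sums to a quantity linear in n, with no log n factor. For a small number of arms the constants can still make it costlier than the naive routine. A Bernoulli arm with success probability `p[a]` could be simulated by passing `lambda a: float(random.random() < p[a])` as `pull`.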

We show how, given an algorithm for the multi-armed bandit problem in the PAC model, one can derive a batch learning algorithm for Markov Decision Processes. This is done essentially by simulating Value Iteration and invoking the multi-armed bandit algorithm in each iteration. Using our PAC algorithm for the multi-armed bandit problem, we improve the dependence on the number of actions.
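A minimal sketch of this reduction, under stated assumptions, might look as follows. It assumes access to a generative model `sample(s, a)` that returns a reward and a next state, treats each action in a state as an arm whose payoff is the sampled reward plus the discounted current value of the next state, and uses a plain uniform-sampling routine `best_arm` as a stand-in for the PAC bandit algorithm. All names, the interface, and the fixed sample size are hypothetical; the abstract does not specify them.

```python
def best_arm(pull, n_arms, n_samples):
    """Stand-in bandit routine: sample each arm n_samples times and return
    (arm with the highest empirical mean, that empirical mean)."""
    means = [sum(pull(a) for _ in range(n_samples)) / n_samples
             for a in range(n_arms)]
    a = max(range(n_arms), key=means.__getitem__)
    return a, means[a]


def bandit_value_iteration(sample, n_states, n_actions, gamma,
                           n_iters, n_samples):
    """Simulate Value Iteration, replacing the max over actions in each
    backup by a call to the bandit routine.

    `sample(s, a)` is an assumed generative model returning
    (reward, next_state). Returns (value estimates, greedy policy).
    """
    V = [0.0] * n_states
    policy = [0] * n_states
    for _ in range(n_iters):
        new_V = []
        for s in range(n_states):
            # One "pull" of action a in state s: a sampled one-step backup
            # using the value estimates from the previous iteration.
            def pull(a, s=s):
                r, s_next = sample(s, a)
                return r + gamma * V[s_next]

            policy[s], v = best_arm(pull, n_actions, n_samples)
            new_V.append(v)
        V = new_V  # synchronous update, as in standard Value Iteration
    return V, policy
```

In this sketch the number of samples spent per state and iteration is whatever the bandit routine uses, so a bandit algorithm with a better dependence on the number of arms translates directly into a better dependence on the number of actions.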


Keywords (added by machine, not by the authors): Optimal Policy, Failure Probability, Markov Decision Process, Bandit Problem, State Action Pair





Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Eyal Even-Dar (1)
  • Shie Mannor (2)
  • Yishay Mansour (1)

  1. School of Computer Science, Tel Aviv University, Tel-Aviv, Israel
  2. Department of Electrical Engineering, Technion, Haifa, Israel
