Multi-Armed Bandit Processes

  • Xiaoqiang Cai
  • Xianyi Wu
  • Xian Zhou
Part of the International Series in Operations Research & Management Science book series (ISOR, volume 207)


This chapter studies a powerful tool for stochastic scheduling: the theoretically elegant multi-armed bandit processes, used to maximize expected total discounted rewards. Multi-armed bandit models form a particular type of optimal resource allocation problem, in which a number of machines or processors are to be allocated to serve a set of competing projects (arms). We introduce the classical theory of multi-armed bandit processes in Section 6.1, and consider open bandit processes, in which infinitely many arms are allowed, in Section 6.2. An extension to generalized open bandit processes is given in Section 6.3. Finally, a concise account of closed bandit processes in continuous time is presented in Section 6.4.


Keywords: Optimal Policy, Reward Rate, Index Policy, Bandit Problem, Decision Epoch



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Xiaoqiang Cai (1)
  • Xianyi Wu (2)
  • Xian Zhou (3)

  1. Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR
  2. Department of Statistics and Actuarial Science, East China Normal University, Shanghai, People’s Republic of China
  3. Department of Applied Finance and Actuarial Studies, Macquarie University, North Ryde, Australia
