Machine Learning, Volume 108, Issue 11, pp 1919–1949

Asymptotically optimal algorithms for budgeted multiple play bandits

  • Alex Luedtke
  • Emilie Kaufmann
  • Antoine Chambaz


Abstract

We study a generalization of the multi-armed bandit problem with multiple plays where there is a cost associated with pulling each arm and the agent has a budget at each time that dictates how much she can expect to spend. We derive an asymptotic regret lower bound for any uniformly efficient algorithm in our setting. We then study a variant of Thompson sampling for Bernoulli rewards and a variant of KL-UCB for both single-parameter exponential families and bounded, finitely supported rewards. We show these algorithms are asymptotically optimal, both in rate and leading problem-dependent constants, including in the thick margin setting where multiple arms fall on the decision boundary.


Keywords: Budgeted bandits · KL-UCB · Knapsack bandits · Multiple-play bandits · Thompson sampling
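The abstract describes a Thompson sampling variant for Bernoulli rewards under a per-round budget. A minimal illustrative sketch of that idea is below; it is not the authors' exact algorithm. It assumes known, strictly positive arm costs, uniform Beta(1, 1) priors, and a greedy knapsack selection by sampled-reward-to-cost ratio (in the spirit of Dantzig's fractional-knapsack relaxation, here kept integral for simplicity); all function names are hypothetical.

```python
import random

def budgeted_thompson_round(successes, failures, costs, budget, rng):
    """Select a set of arms for one round: sample from each arm's Beta
    posterior, then greedily fill the budget by sampled reward per unit cost.
    Assumes every cost is > 0."""
    K = len(costs)
    # Posterior sample for each arm's mean reward (Beta(1+s, 1+f) posterior).
    theta = [rng.betavariate(1 + successes[k], 1 + failures[k]) for k in range(K)]
    # Greedy knapsack: arms in decreasing order of sampled reward/cost ratio.
    order = sorted(range(K), key=lambda k: theta[k] / costs[k], reverse=True)
    chosen, spent = [], 0.0
    for k in order:
        if spent + costs[k] <= budget:
            chosen.append(k)
            spent += costs[k]
    return chosen

def run(true_means, costs, budget, horizon, seed=0):
    """Simulate Bernoulli arms for `horizon` rounds; return the total reward
    and the per-arm success/failure counts."""
    rng = random.Random(seed)
    K = len(true_means)
    succ, fail = [0] * K, [0] * K
    total_reward = 0.0
    for _ in range(horizon):
        for k in budgeted_thompson_round(succ, fail, costs, budget, rng):
            r = 1 if rng.random() < true_means[k] else 0
            succ[k] += r
            fail[k] += 1 - r
            total_reward += r
    return total_reward, succ, fail
```

With equal unit costs and a budget of 2, the sketch reduces to multiple-play Thompson sampling with 2 plays per round, and the posterior concentrates pulls on the best two arms; the paper's analysis covers the harder regime where costs differ and the budget binds.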



The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR) under Grants ANR-13-BS01-0005 (Project SPADRO) and ANR-16-CE40-0002 (Project BADASS). Alex Luedtke gratefully acknowledges the support of a Berkeley Fellowship.

Supplementary material

Supplementary material 1: 10994_2019_5799_MOESM1_ESM.pdf (282 KB)



Copyright information

© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Statistics, University of Washington, Seattle, USA
  2. Univ. Lille, CRIStAL (UMR 9189), Inria Lille Nord Europe, CNRS, Villeneuve d'Ascq, France
  3. Laboratoire MAP5, Université Paris Descartes, Paris Cedex 06, France
