Asymptotically optimal algorithms for budgeted multiple play bandits
We study a generalization of the multi-armed bandit problem with multiple plays in which pulling each arm incurs a cost and the agent has a per-round budget dictating how much she can expect to spend. We derive an asymptotic regret lower bound for any uniformly efficient algorithm in this setting. We then study a variant of Thompson sampling for Bernoulli rewards, and a variant of KL-UCB both for single-parameter exponential families and for bounded, finitely supported rewards. We show that these algorithms are asymptotically optimal, both in rate and in the leading problem-dependent constants, including in the thick-margin setting where multiple arms fall on the decision boundary.
Keywords: Budgeted bandits · KL-UCB · Knapsack bandits · Multiple-play bandits · Thompson sampling
The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR), under Grant ANR-13-BS01-0005 (Project SPADRO) and ANR-16-CE40-0002 (Project BADASS). Alex Luedtke gratefully acknowledges the support of a Berkeley Fellowship.
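To make the setting concrete, here is a minimal sketch of a Thompson-sampling-style strategy for budgeted multiple plays with Bernoulli rewards: sample a mean for each arm from its Beta posterior, greedily pull arms by sampled reward-to-cost ratio until the per-round budget is exhausted, then update the posteriors. This is an illustrative sketch under simplifying assumptions (known fixed costs, a hard per-round budget, greedy arm selection), not the algorithm analyzed in the paper; the function and parameter names are hypothetical.

```python
import random


def budgeted_thompson_sampling(true_means, costs, budget, horizon, rng=None):
    """Hypothetical sketch: Thompson sampling with a per-round budget
    constraining the total cost of the arms pulled each round."""
    rng = rng or random.Random(0)
    K = len(true_means)
    successes = [1] * K  # Beta(1, 1) priors on each arm's mean reward
    failures = [1] * K
    total_reward = 0.0
    for _ in range(horizon):
        # Sample a plausible mean for each arm from its Beta posterior.
        samples = [rng.betavariate(successes[k], failures[k]) for k in range(K)]
        # Greedily select arms by sampled reward-to-cost ratio while the
        # round's budget allows.
        order = sorted(range(K), key=lambda k: samples[k] / costs[k], reverse=True)
        spent, chosen = 0.0, []
        for k in order:
            if spent + costs[k] <= budget:
                chosen.append(k)
                spent += costs[k]
        # Pull the chosen arms, observe Bernoulli rewards, update posteriors.
        for k in chosen:
            reward = 1 if rng.random() < true_means[k] else 0
            total_reward += reward
            successes[k] += reward
            failures[k] += 1 - reward
    return total_reward
```

In this sketch the greedy knapsack step is a crude stand-in for whatever oracle selects the arm set; the paper's analysis concerns the sampling and index rules, not this selection heuristic.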