
Infinite Horizon Multi-armed Bandits with Reward Vectors: Exploration/Exploitation Trade-off

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9494)

Abstract

We focus on the effect of exploration/exploitation trade-off strategies on the algorithmic design of multi-armed bandits (MAB) with reward vectors. The Pareto dominance relation assesses the quality of reward vectors in infinite-horizon MABs, such as the UCB1 and UCB2 algorithms. In single-objective MABs, there is a trade-off between exploration of the suboptimal arms and exploitation of a single optimal arm. Pareto dominance based MABs fairly exploit all Pareto optimal arms and explore the suboptimal arms. We study the exploration vs exploitation trade-off for two UCB-like algorithms for reward vectors. We analyse the properties of the proposed MAB algorithms in terms of upper regret bounds, and we experimentally compare their exploration vs exploitation trade-off on a bi-objective Bernoulli environment from control theory.
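The Pareto-dominance-based UCB approach the abstract describes can be illustrated with a minimal sketch: a UCB1-style confidence bonus is added to each objective of an arm's empirical mean-reward vector, and an arm is drawn uniformly at random from the Pareto front of the resulting index vectors, so that all Pareto-optimal arms are exploited fairly. The function names (`pareto_front`, `pareto_ucb1_choose`) and the exact form of the bonus are illustrative assumptions, not the paper's algorithms verbatim.

```python
import math
import random

def dominates(u, v):
    """True if vector u Pareto-dominates v: u >= v component-wise and u != v."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pareto_front(vectors):
    """Indices of the vectors not dominated by any other vector."""
    return [i for i, u in enumerate(vectors)
            if not any(dominates(v, u) for j, v in enumerate(vectors) if j != i)]

def pareto_ucb1_choose(counts, sums, t):
    """Select an arm at round t.

    counts[i] -- number of times arm i was pulled so far
    sums[i]   -- component-wise sum of the reward vectors observed on arm i
    Plays each arm once, then picks uniformly at random among the arms whose
    UCB index vectors (empirical mean + confidence bonus per objective) lie
    on the Pareto front.
    """
    for i, n in enumerate(counts):
        if n == 0:          # initialisation: pull every arm once
            return i
    indices = []
    for n, s in zip(counts, sums):
        bonus = math.sqrt(2.0 * math.log(t) / n)   # UCB1-style bonus (assumed form)
        indices.append(tuple(si / n + bonus for si in s))
    return random.choice(pareto_front(indices))
```

For a bi-objective Bernoulli environment like the one in the experiments, each pull would return a vector in {0, 1}^2; `sums` then accumulates per-objective success counts.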



Acknowledgements

Madalina M. Drugan was supported by the IWT-SBO project PERPETUAL (gr. nr. 110041).

Author information

Correspondence to Madalina M. Drugan.


Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Drugan, M.M. (2015). Infinite Horizon Multi-armed Bandits with Reward Vectors: Exploration/Exploitation Trade-off. In: Duval, B., van den Herik, J., Loiseau, S., Filipe, J. (eds) Agents and Artificial Intelligence. ICAART 2015. Lecture Notes in Computer Science, vol 9494. Springer, Cham. https://doi.org/10.1007/978-3-319-27947-3_7


  • DOI: https://doi.org/10.1007/978-3-319-27947-3_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27946-6

  • Online ISBN: 978-3-319-27947-3

  • eBook Packages: Computer Science, Computer Science (R0)
