
Improving the Exploration Strategy in Bandit Algorithms

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5313)

Abstract

The K-armed bandit problem is a formalization of the exploration-versus-exploitation dilemma, a well-known issue in stochastic optimization tasks. In a K-armed bandit problem, a player faces a gambling machine with K arms, each associated with an unknown reward distribution, and the goal is to maximize the sum of rewards (or, equivalently, minimize the sum of losses). Several approaches have been proposed in the literature to deal with the K-armed bandit problem. Most of them combine a greedy exploitation strategy with a random exploration phase. This paper focuses on improving the exploration step by relying on the notion of probability of correct selection (PCS), a well-known concept in the simulation literature that has been overlooked in the optimization domain. The rationale of our approach is to sample, at each exploration step, the arm that maximizes the probability of selecting the optimal arm (i.e., the PCS) at the following step. This strategy is implemented by a bandit algorithm, called ε-PCSgreedy, which integrates the PCS exploration approach with the classical ε-greedy scheme. A set of numerical experiments on artificial and real datasets shows that a more effective exploration can improve the performance of the entire bandit strategy.
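The abstract does not include code, so the following Python sketch only illustrates the idea it summarizes, under assumptions of our own. The function names (pcs_estimate, eps_pcs_greedy), the Gaussian approximation of each arm's mean estimate, and the Monte Carlo evaluation of the PCS are hypothetical choices made for illustration; the authors' ε-PCSgreedy algorithm may compute the PCS differently.

```python
import numpy as np

rng = np.random.default_rng(0)

def pcs_estimate(means, variances, counts, extra_arm, n_mc=200):
    """Monte Carlo estimate of the probability of correct selection (PCS)
    if one extra sample were taken from `extra_arm`.

    Each arm's mean estimate is approximated (a simplifying assumption of
    this sketch) by a normal with the empirical mean and variance / n; the
    PCS is the probability that the empirically best arm is the truly best
    arm under these approximations."""
    counts = counts.copy()
    counts[extra_arm] += 1  # pretend this arm received one more sample
    std = np.sqrt(variances / counts)
    draws = rng.normal(means, std, size=(n_mc, len(means)))
    best = np.argmax(means)
    return np.mean(np.argmax(draws, axis=1) == best)

def eps_pcs_greedy(arms, horizon=1000, eps=0.1):
    """ε-greedy bandit where the exploratory pull goes to the arm whose
    extra sample maximizes the estimated next-step PCS (an illustrative
    re-implementation of the idea described in the abstract)."""
    K = len(arms)
    rewards = [[arms[a]() for _ in range(2)] for a in range(K)]  # init: 2 pulls per arm
    total = sum(sum(r) for r in rewards)
    for _ in range(horizon - 2 * K):
        means = np.array([np.mean(r) for r in rewards])
        variances = np.array([np.var(r, ddof=1) + 1e-9 for r in rewards])
        counts = np.array([len(r) for r in rewards], dtype=float)
        if rng.random() < eps:
            # exploration step: choose the arm that maximizes the estimated PCS
            a = int(np.argmax([pcs_estimate(means, variances, counts, j)
                               for j in range(K)]))
        else:
            # exploitation step: choose the empirically best arm
            a = int(np.argmax(means))
        r = arms[a]()
        rewards[a].append(r)
        total += r
    return total

# usage: three Gaussian arms with hidden means 0.2, 0.5, 0.55
arms = [lambda m=m: rng.normal(m, 1.0) for m in (0.2, 0.5, 0.55)]
print("cumulative reward:", eps_pcs_greedy(arms))
```

The key design point, mirroring the abstract, is that the exploratory pull is no longer spent uniformly at random: it goes to the arm whose additional observation is expected to most increase the chance of identifying the best arm at the following step.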





Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Caelen, O., Bontempi, G. (2008). Improving the Exploration Strategy in Bandit Algorithms. In: Maniezzo, V., Battiti, R., Watson, JP. (eds) Learning and Intelligent Optimization. LION 2007. Lecture Notes in Computer Science, vol 5313. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-92695-5_5


  • DOI: https://doi.org/10.1007/978-3-540-92695-5_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-92694-8

  • Online ISBN: 978-3-540-92695-5

  • eBook Packages: Computer Science, Computer Science (R0)
