Meta-learning of Exploration/Exploitation Strategies: The Multi-armed Bandit Case

  • Francis Maes
  • Louis Wehenkel
  • Damien Ernst
Part of the Communications in Computer and Information Science book series (CCIS, volume 358)


The exploration/exploitation (E/E) dilemma arises naturally in many subfields of science. Multi-armed bandit problems formalize this dilemma in its canonical form. Most current research in this field focuses on generic solutions that can be applied to a wide range of problems. In practice, however, some prior information is often available about the specific class of target problems. Current solutions rarely use this prior knowledge because a systematic approach for incorporating it into the E/E strategy is lacking.

To address a specific class of E/E problems, we propose to proceed in three steps: (i) model prior knowledge in the form of a probability distribution over the target class of E/E problems; (ii) choose a large hypothesis space of candidate E/E strategies; and (iii) solve an optimization problem to find a candidate E/E strategy of maximal average performance over a sample of problems drawn from the prior distribution.
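The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the uniform prior over arm means, the one-parameter UCB-style index, the grid search (standing in for the optimization algorithms proposed in the chapter), and the names `sample_problem`, `play`, and `meta_learn` are all hypothetical choices made for the example.

```python
import math
import random


def sample_problem(rng):
    """Step (i): draw a two-armed Bernoulli problem.

    Assumption: arm means are independent and uniform on [0, 1].
    """
    return [rng.random(), rng.random()]


def play(means, c, horizon, rng):
    """Play one episode with the index mean_k + sqrt(c * ln(t) / n_k).

    Step (ii): the hypothesis space is the family of such strategies,
    indexed by the single constant c (an assumption for this sketch).
    Returns the expected cumulative regret of the episode.
    """
    n_arms = len(means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    best_mean = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: play each arm once
        else:
            arm = max(
                range(n_arms),
                key=lambda k: sums[k] / counts[k]
                + math.sqrt(c * math.log(t) / counts[k]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best_mean - means[arm]
    return regret


def meta_learn(candidates, n_problems, horizon, seed=0):
    """Step (iii): pick the candidate with the lowest average regret
    over problems sampled from the prior (plain grid search)."""
    rng = random.Random(seed)
    problems = [sample_problem(rng) for _ in range(n_problems)]

    def average_regret(c):
        # Same random stream for every candidate, to reduce comparison noise.
        episode_rng = random.Random(seed + 1)
        total = sum(play(p, c, horizon, episode_rng) for p in problems)
        return total / n_problems

    scores = {c: average_regret(c) for c in candidates}
    best = min(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    best_c, all_scores = meta_learn([0.1, 0.5, 2.0], n_problems=200, horizon=100)
    print("selected c:", best_c)
    print("average regret per candidate:", all_scores)
```

Minimizing average regret is equivalent here to maximizing average performance, since the regret is measured against the best arm of each sampled problem. A richer hypothesis space or a different prior would slot into `play` and `sample_problem` without changing the overall structure.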

We illustrate this meta-learning approach with two different hypothesis spaces: one where E/E strategies are numerically parameterized and another where E/E strategies are represented as small symbolic formulas. We propose appropriate optimization algorithms for both cases. Our experiments, with two-armed “Bernoulli” bandit problems and various playing budgets, show that the meta-learnt E/E strategies outperform generic strategies from the literature (UCB1, UCB1-Tuned, UCB-V, KL-UCB and ε_n-Greedy). The experiments also evaluate the robustness of the learnt E/E strategies through tests carried out on arms whose rewards follow a truncated Gaussian distribution.
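As a rough illustration of this kind of experiment, the sketch below runs UCB1, one of the generic baselines named above, on a two-armed Bernoulli problem and on a two-armed problem with truncated Gaussian rewards. The UCB1 index (empirical mean plus sqrt(2 ln t / n_k)) is the standard one; everything else is an assumption made for the example: the truncation interval [0, 1], the rejection sampler, the arm parameters, and the helper names `bernoulli_arm`, `truncated_gaussian_arm`, and `ucb1`.

```python
import math
import random


def bernoulli_arm(p):
    """Arm that pays 1 with probability p and 0 otherwise."""
    return lambda rng: 1.0 if rng.random() < p else 0.0


def truncated_gaussian_arm(mu, sigma, low=0.0, high=1.0):
    """Arm whose reward is a Gaussian draw truncated to [low, high].

    Assumption: truncation by rejection sampling. Note that the mean of
    the truncated distribution generally differs from mu.
    """
    def draw(rng):
        while True:
            x = rng.gauss(mu, sigma)
            if low <= x <= high:
                return x
    return draw


def ucb1(arms, horizon, rng):
    """Run UCB1 for `horizon` plays; return how often each arm was played."""
    n_arms = len(arms)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: play each arm once
        else:
            arm = max(
                range(n_arms),
                key=lambda k: sums[k] / counts[k]
                + math.sqrt(2.0 * math.log(t) / counts[k]),
            )
        sums[arm] += arms[arm](rng)
        counts[arm] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    bernoulli = [bernoulli_arm(0.9), bernoulli_arm(0.1)]
    gaussian = [truncated_gaussian_arm(0.8, 0.1), truncated_gaussian_arm(0.2, 0.1)]
    print("Bernoulli play counts:", ucb1(bernoulli, 1000, rng))
    print("Truncated Gaussian play counts:", ucb1(gaussian, 1000, rng))
```

A robustness test in the spirit of the abstract would replace `ucb1` with a strategy tuned on Bernoulli problems and check how its regret changes when the same strategy is run on the truncated Gaussian arms.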


Keywords: Exploration-exploitation dilemma; Prior knowledge; Multi-armed bandit problems; Reinforcement learning





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Francis Maes (1)
  • Louis Wehenkel (1)
  • Damien Ernst (1)
  1. Dept. of Electrical Engineering and Computer Science, Institut Montefiore, University of Liège, Liège, Belgium
