Secure Best Arm Identification in Multi-armed Bandits

  • Radu Ciucanu
  • Pascal Lafourcade
  • Marius Lombard-Platet
  • Marta Soare
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11879)


The stochastic multi-armed bandit is a classical decision-making model in which an agent repeatedly chooses an action (pulls a bandit arm) and the environment responds with a stochastic outcome (a reward) drawn from an unknown distribution associated with the chosen action. A popular objective for the agent is to identify the arm with the maximum expected reward, a task known as the best-arm identification problem. We address the inherent privacy concerns that arise in best-arm identification when the data and computations are outsourced to an honest-but-curious cloud.
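To make the problem statement concrete, here is a minimal sketch of best-arm identification in the clear, with no privacy protection. It uses a naive uniform-allocation baseline (split the pull budget evenly, then return the arm with the highest empirical mean). This is an illustrative assumption and is not claimed to be the algorithm used in the paper; the names `identify_best_arm` and `make_bernoulli_bandit` are hypothetical.

```python
import random
from typing import Callable, List, Sequence


def identify_best_arm(pull: Callable[[int], float], num_arms: int, budget: int) -> int:
    """Uniform-allocation baseline: pull every arm equally often and
    return the index of the arm with the highest empirical mean reward."""
    pulls_per_arm = budget // num_arms
    if pulls_per_arm == 0:
        raise ValueError("budget must allow at least one pull per arm")
    means: List[float] = []
    for arm in range(num_arms):
        total = sum(pull(arm) for _ in range(pulls_per_arm))
        means.append(total / pulls_per_arm)
    # Ties are broken in favour of the lowest arm index.
    return max(range(num_arms), key=lambda a: means[a])


def make_bernoulli_bandit(probs: Sequence[float], seed: int = 0) -> Callable[[int], float]:
    """Hypothetical environment: arm i pays 1.0 with probability probs[i], else 0.0."""
    rng = random.Random(seed)

    def pull(arm: int) -> float:
        return 1.0 if rng.random() < probs[arm] else 0.0

    return pull


if __name__ == "__main__":
    pull = make_bernoulli_bandit([0.2, 0.5, 0.8])
    print("estimated best arm:", identify_best_arm(pull, num_arms=3, budget=3000))
```

In this unprotected version a single party sees both every reward and the final ranking, which is exactly the exposure the abstract is concerned with.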

Our main contribution is a distributed protocol that computes the best arm while guaranteeing that (i) no cloud node can simultaneously learn information about both the rewards and the ranking of the arms, and (ii) no information about the rewards or about the ranking can be learned by analyzing the messages exchanged between the cloud nodes. Together, these two properties ensure that the protocol has no single point of failure with respect to security. We rely on the partially homomorphic property of the well-known Paillier cryptosystem as a building block in our protocol. We prove the correctness of our protocol and present proof-of-concept experiments suggesting its practical feasibility.
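The partially homomorphic property mentioned above can be illustrated with a toy Paillier implementation: multiplying two ciphertexts yields an encryption of the sum of the plaintexts, and raising a ciphertext to a power multiplies the plaintext by a constant. The sketch below is only an illustration under stated assumptions: it uses the common simplification g = n + 1, tiny hard-coded primes that offer no security, and hypothetical function names. It does not reproduce the paper's protocol.

```python
import math
import secrets


def keygen(p: int, q: int):
    """Toy key generation from two primes (assumes generator g = n + 1)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse of lam mod n (Python 3.8+)
    return n, (lam, mu)


def encrypt(n: int, m: int) -> int:
    """Encrypt an integer 0 <= m < n under public modulus n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random value in 1..n-1
        if math.gcd(r, n) == 1:
            break
    # (n + 1)^m is congruent to 1 + m*n modulo n^2
    return ((1 + m * n) % n2) * pow(r, n, n2) % n2


def decrypt(n: int, priv, c: int) -> int:
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n) * mu % n


def add(n: int, c1: int, c2: int) -> int:
    """Homomorphic addition: decrypts to (m1 + m2) mod n."""
    return c1 * c2 % (n * n)


def scale(n: int, c: int, k: int) -> int:
    """Homomorphic multiplication by a plaintext constant k."""
    return pow(c, k, n * n)


if __name__ == "__main__":
    n, priv = keygen(1009, 1013)  # toy primes, NOT secure
    c1, c2 = encrypt(n, 12), encrypt(n, 30)
    print(decrypt(n, priv, add(n, c1, c2)))   # sum of the two plaintexts
    print(decrypt(n, priv, scale(n, c1, 3)))  # three times the first plaintext
```

One plausible use of this property, stated here as an assumption rather than as the paper's design, is that a node can accumulate encrypted rewards for an arm without ever seeing an individual reward in the clear.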


Multi-armed bandits · Best arm identification · Privacy · Distributed computation · Paillier cryptosystem



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Radu Ciucanu (1)
  • Pascal Lafourcade (2)
  • Marius Lombard-Platet (3, 4)
  • Marta Soare (1)

  1. INSA Centre Val de Loire, Univ. Orléans, LIFO EA 4022, Orléans, France
  2. Université Clermont Auvergne, LIMOS CNRS UMR 6158, Aubière, France
  3. Département d’informatique de l’ENS, École normale supérieure, CNRS, PSL Research University, Paris, France
  4. Be-Studys, Geneva, Switzerland
