Abstract
This paper presents a new contextual bandit algorithm, NeuralBandit, which requires no stationarity hypothesis on contexts or rewards. Several neural networks are trained to model the expected reward given the context. Two variants, based on a multi-expert approach, are proposed to choose the parameters of the multi-layer perceptrons online. The proposed algorithms are successfully tested on a large dataset, both with and without stationarity of rewards.
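To make the abstract's idea concrete, below is a minimal illustrative sketch of a neural contextual bandit: one small multi-layer perceptron per arm estimates the expected reward given the context, and an epsilon-greedy rule selects an arm. This is an assumption-laden toy, not the paper's NeuralBandit algorithm; all class names, hyperparameters, and the epsilon-greedy choice are invented here for illustration.

```python
import numpy as np

class ArmMLP:
    """One-hidden-layer perceptron estimating E[reward | context] for one arm.
    (Illustrative only; layer sizes and learning rate are arbitrary choices.)"""
    def __init__(self, n_in, n_hidden, lr, rng):
        self.lr = lr
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, n_hidden)
        self.b2 = 0.0

    def predict(self, x):
        h = np.tanh(self.W1 @ x + self.b1)
        return self.W2 @ h + self.b2

    def update(self, x, r):
        # One online SGD step on the squared error (prediction - r)^2,
        # i.e. stochastic backpropagation on the observed reward.
        h = np.tanh(self.W1 @ x + self.b1)
        err = (self.W2 @ h + self.b2) - r
        grad_h = err * self.W2 * (1.0 - h ** 2)   # backprop through tanh
        self.W2 -= self.lr * err * h
        self.b2 -= self.lr * err
        self.W1 -= self.lr * np.outer(grad_h, x)
        self.b1 -= self.lr * grad_h

class NeuralBanditSketch:
    """Epsilon-greedy contextual bandit over per-arm reward models (toy sketch)."""
    def __init__(self, n_arms, n_in, n_hidden=8, lr=0.05, eps=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.eps = eps
        self.arms = [ArmMLP(n_in, n_hidden, lr, self.rng) for _ in range(n_arms)]

    def select(self, x):
        # Explore uniformly with probability eps, otherwise exploit the
        # arm whose network predicts the highest reward for this context.
        if self.rng.random() < self.eps:
            return int(self.rng.integers(len(self.arms)))
        return int(np.argmax([a.predict(x) for a in self.arms]))

    def update(self, arm, x, r):
        # Only the played arm's network sees the (context, reward) pair.
        self.arms[arm].update(x, r)
```

Because each arm's model is updated online with stochastic gradient steps, it can track a drifting reward function, which is the motivation for relaxing the stationarity hypothesis; the paper's multi-expert variants additionally select among several such networks with different parameterizations.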
© 2014 Springer International Publishing Switzerland
Cite this paper
Allesiardo, R., Féraud, R., Bouneffouf, D. (2014). A Neural Networks Committee for the Contextual Bandit Problem. In: Loo, C.K., Yap, K.S., Wong, K.W., Teoh, A., Huang, K. (eds) Neural Information Processing. ICONIP 2014. Lecture Notes in Computer Science, vol 8834. Springer, Cham. https://doi.org/10.1007/978-3-319-12637-1_47
Print ISBN: 978-3-319-12636-4
Online ISBN: 978-3-319-12637-1