On the bandit problem
In this paper we first propose an approach to the so-called two-armed bandit problem based essentially on the theory of random systems with complete connections. We then apply stochastic approximation techniques to find an optimal strategy. For detailed proofs, see [2–5].
In Section 1 we present some basic definitions and several results from the theory of random systems with complete connections. We then adapt to the present setting several concepts concerning general control systems that we developed in a previous paper. Further, we treat the two-armed bandit problem under two possible decision procedures: the first is based on learning techniques, the second on sequential techniques. In both cases we examine the expediency and the optimality of these procedures. In Section 2 we propose an optimal strategy for the two-armed bandit problem by making use of the Kiefer-Wolfowitz stochastic approximation procedure, and we apply the same technique to a market pricing problem.
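To fix ideas, the Kiefer-Wolfowitz procedure locates the maximizer of a regression function that can only be observed with noise, by iterating on a two-sided finite-difference estimate of the slope. The following is a minimal sketch only, not the strategy developed in Section 2; the toy objective, the initial point, and the gain sequences a_n = a/n and c_n = c·n^(-1/3) (one classical choice satisfying the convergence conditions of [7]) are illustrative assumptions.

```python
import random

def kiefer_wolfowitz(noisy_f, x0, n_steps, a=1.0, c=1.0):
    """Kiefer-Wolfowitz stochastic approximation: iterate toward the
    maximizer of a regression function observed only through noise.
    Gains a_n = a/n and widths c_n = c * n**(-1/3) are one classical
    choice satisfying the convergence conditions."""
    x = x0
    for n in range(1, n_steps + 1):
        a_n = a / n
        c_n = c * n ** (-1.0 / 3.0)
        # two-sided finite-difference estimate of the slope at x
        slope = (noisy_f(x + c_n) - noisy_f(x - c_n)) / (2.0 * c_n)
        x += a_n * slope  # move in the estimated ascent direction
    return x

if __name__ == "__main__":
    random.seed(0)
    # toy regression function with maximum at x = 2, observed with noise
    f = lambda x: -(x - 2.0) ** 2 + random.gauss(0.0, 0.1)
    x_hat = kiefer_wolfowitz(f, x0=0.0, n_steps=5000)
    print(x_hat)
```

In the bandit setting of Section 2, the role of `noisy_f` is played by the expected payoff as a function of the decision parameter, which the controller can likewise only sample with noise.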
AMS 1970 subject classification: Primary 93A10, 62L20; Secondary 93C55, 90A15.
Key words and phrases: control systems, learning automata, learning algorithms, optimality, expediency, two-armed bandit problem, stochastic approximation.
- Blum, J.R., Approximation methods which converge with probability one. Ann. Math. Statist. 25 (1954), 382–386.
- Herkenrath, U., Theodorescu, R., General control systems. Information Sci. 14 (1978), 57–73.
- —, On certain aspects of the two-armed bandit problem. Elektron. Informationsverarbeit. Kybernetik (1978).
- —, Expediency and optimality for general control systems. Coll. Internat. C.N.R.S., Cachan, July 4–8, 1977.
- —, On a stochastic approximation procedure applied to the bandit problem. (Submitted for publication.)
- Iosifescu, M., Theodorescu, R., Random processes and learning. Springer, New York 1969.
- Kiefer, J., Wolfowitz, J., Stochastic estimation of the maximum of a regression function. Ann. Math. Statist. 23 (1952), 462–466.
- Norman, M.F., On the linear model with two absorbing barriers. J. Math. Psychology 5 (1968), 225–241.
- —, Markov processes and learning models. Academic Press, New York 1972.
- Rothschild, M., A two-armed bandit theory of market pricing. J. Econom. Theory 9 (1974), 185–202.
- Vogel, W., A sequential design for the two armed bandit. Ann. Math. Statist. 31 (1960), 430–443.
- —, An asymptotic minimax theorem for the two armed bandit problem. Ann. Math. Statist. 31 (1960), 444–451.
- Wasan, M.T., Stochastic approximation. Cambridge University Press, Cambridge 1969.
- Witten, I.H., Finite-time performance of some two-armed bandit controllers. IEEE Trans. Syst., Man, Cybern. SMC-3 (1973), 194–197.