Exploration and exploitation of scratch games
Abstract
We consider a variant of the multi-armed bandit model, which we call scratch games, where the sequences of rewards are finite and drawn in advance with unknown starting dates. This new problem is motivated by online advertising applications where the number of ad displays is fixed according to a contract between the advertiser and the publisher, and where a new ad may appear at any time. The drawn-in-advance assumption is natural for the adversarial approach, where an oblivious adversary is supposed to choose the reward sequences in advance. For the stochastic setting, it is functionally equivalent to an urn where draws are performed without replacement. The non-replacement assumption is suited to the sequential design of non-reproducible experiments, which is often the case in the real world. By adapting the standard multi-armed bandit algorithms to take advantage of this setting, we propose three new algorithms: the first one is designed for adversarial rewards; the second one assumes a stochastic urn model; and the last one is based on a Bayesian approach. For the adversarial and stochastic approaches, we provide upper bounds on the regret which compare favorably with those of Exp3 and UCB1. We also confirm experimentally, by simulation with synthetic models and ad-serving data, that these algorithms compare favorably with Exp3, UCB1 and Thompson Sampling.
Keywords
Adversarial multi-armed bandits, Stochastic multi-armed bandits, Finite sequences, Scratch games
1 Introduction
In its most basic formulation, the multi-armed bandit problem can be stated as follows: there are K arms, each having an unknown and infinite sequence of bounded rewards. At each step, a player chooses an arm and receives a reward issued from the corresponding sequence of rewards. The player needs to explore the arms to find profitable actions, but on the other hand the player would like to exploit the best identified arm as soon as possible. Which policy should the player adopt in order to minimize the regret against the best arm?
The stochastic formulation of this problem assumes that each arm delivers rewards that are independently drawn from an unknown distribution. Efficient solutions based on optimism in the face of uncertainty have been proposed for this setting (Lai and Robbins 1985; Agrawal 1995). They compute an upper confidence index for each arm and choose the arm with the highest index. In this case, it can be shown that the regret, the cumulative difference between the optimal reward and the expectation of reward, is bounded by a logarithmic function of time, which is the best possible. Subsequent work introduced simpler policies, proven to achieve a logarithmic bound uniformly over time (Auer et al. 2002a). Recently, different variants of these policies have been proposed: some take into account the observed variance (Audibert et al. 2009), some are based on the Kullback-Leibler divergence (Garivier and Cappé 2011), and others exploit the tree structure of arms (Kocsis and Szepesvári 2006; Bubeck et al. 2008) or the dependence between arms (Pandey et al. 2007).
Another approach to solving the multi-armed bandit problem is to use a randomized algorithm. The Thompson Sampling algorithm, one of the oldest multi-armed bandit algorithms (Thompson 1933), is based on a Bayesian approach. At each step an arm is drawn according to its probability of being optimal. The observation of the reward updates this probability. Recent papers have shown its accuracy on real problems (Chapelle and Li 2011), that it achieves logarithmic expected regret (Agrawal and Goyal 2012), and that it is asymptotically optimal (Kaufman et al. 2012b).
There are however several applications, including online advertising, where the rewards are far from being stationary random sequences. A solution to cope with non-stationarity is to drop the stochastic reward assumption and assume the reward sequences to be chosen by an adversary. Even with this adversarial formulation of the multi-armed bandit problem, a randomized strategy like Exp3 provides the guarantee of a minimal regret (Auer et al. 2002b; Cesa-Bianchi and Lugosi 2006).
Another usual assumption which does not fit the reality of online advertising well is unlimited access to actions. Indeed, an ad server must control the ad displays in order to respect the advertisers' budgets, or specific requirements like "this ad has to be displayed only on Saturday". To model limited access to actions, sleeping bandits have been proposed (Kleinberg et al. 2008). At each time step, a set of available actions is drawn according to an unknown probability distribution or selected by an adversary. The player then observes the set and plays an available action. This setting was later extended to take into account adversarial rewards (Kanade et al. 2009). Another way to model limited access to actions is to consider that each arm has a finite lifetime. In this mortal bandits setting, each appearing or disappearing arm changes the set of available actions. Several algorithms were proposed and analyzed by Chakrabarti et al. (2008) for mortal bandits under stochastic reward assumptions.
In this paper, we propose a variant of mortal and sleeping bandits, which we call scratch games, where the sequences of rewards are finite and drawn in advance with known lengths and unknown starting dates. We assume the sequence lengths to be known in advance: indeed, the maximum display counts are usually fixed in advance by a contract between the advertiser and the publisher. This knowledge makes our setting different from sleeping bandits, where the sequences of rewards are infinite. Ad serving optimization is a continuous process which has to cope with appearing and disappearing ads along the way. Over a long period of time, it is not possible to know in advance the number of ads that the ad server has to display, since it depends on new contracts. To fit these application constraints, in the scratch games setting the starting dates of new scratch games and the maximum number of scratch games are unknown to the player. This point differs from mortal bandits.
We consider both an adversarial reward setting, where the sequences are determined by an oblivious adversary, and a stochastic setting, where each sequence is assumed to be drawn without replacement from a finite urn. These two settings extend and complete the work of Chakrabarti et al. (2008). The non-replacement assumption is better suited to the sequential design of non-reproducible experiments. This is the case for telemarketing, where each targeted client is only reachable once for each campaign. This is also the case for targeted online advertising when the number of displays of a banner to an individual is limited. This limit, called capping, leads to an urn model (formally when the capping is set to one).

We propose three algorithms for this setting:

the first one (in Sect. 3), E3FAS, is a randomized algorithm based on a deterministic assumption: an adversary has chosen the sequences of rewards,

the second one (in Sect. 4), UCBWR, is a deterministic algorithm based on a stochastic assumption: the sequences of rewards are drawn without replacement according to unknown distributions,

the last one (in Sect. 5), TSWR, is a randomized algorithm based on a Bayesian assumption: the mean reward of each scratch game is distributed according to a beta-binomial law.
For the first two, we will provide regret bounds for scratch games, which compare favorably to the UCB1 and Exp3 bounds. In Sect. 6, we will test these policies on synthetic problems to study their behavior with respect to different factors coming from application constraints. We will complete these tests with a realistic ad serving simulation.
2 Problem setup: scratch games
We consider a set of \(K\) scratch games. Each game \(i\) has a finite number of tickets \(N_{i}\), including \(M_{i}\) winning tickets, and a starting date \(t_{i}\). Let \(x_{i,j}\) be the reward of the \(j\)th ticket of game \(i\). We assume that the reward of each ticket is bounded: \(0 \le x_{i,j} \le 1\). A winning ticket is defined as a ticket which has a reward greater than zero.
The numbers of tickets \(N_{i}\) of the current games are known to the player.
The number of winning tickets \(M_{i}\), the starting dates \(t_{i}\) of new scratch games, the total number of scratch games \(K\), and the sequence of rewards \(x_{i,j}\) of each game are unknown to the player.
At each time step \(t\), the player chooses a scratch game \(i\) in the set of current scratch games \([K_{t}]\) and receives the reward \(x_{i,n_{i}(t)}\), where \(n_{i}(t)\) is the number of scratched tickets of game \(i\) at time \(t\).
The scratch game problem is different from the multi-armed bandit problem. Indeed, in the multi-armed bandit setting, to maximize his gain the player has to find the best arm as soon as possible, and then exploit it. In the scratch game setting, the number of tickets is finite. When the player has found the best game, he knows that this game will expire at a given date. The player needs to re-explore before the best game finishes in order to find the next best game. Moreover, a new best game may appear. The usual trade-off between exploration and exploitation has to be revisited. In the next sections, we detail adversarial, stochastic, and Bayesian adaptations to scratch games of the well-known multi-armed bandit algorithms Exp3 (Auer et al. 2002b), UCB1 (Auer et al. 2002a), and TS (Thompson 1933), respectively.
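The setup above can be made concrete with a minimal sketch of the environment. The class name `ScratchGame`, its fields, and the helper `current_games` are illustrative assumptions, not names taken from the paper; the sketch only encodes what the text states: a finite reward sequence drawn in advance, a starting date, and a counter of scratched tickets.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScratchGame:
    """One scratch game (hypothetical helper): rewards are fixed in advance."""
    rewards: List[float]   # x_{i,1}, ..., x_{i,N_i}; hidden from the player
    start: int = 0         # starting date t_i
    scratched: int = 0     # n_i(t): number of tickets scratched so far

    @property
    def n_tickets(self) -> int:
        # N_i is the only quantity the player is allowed to know.
        return len(self.rewards)

    def is_current(self, t: int) -> bool:
        # A game is playable once it has started and while tickets remain.
        return t >= self.start and self.scratched < self.n_tickets

    def scratch(self) -> float:
        # Reveal the next ticket of the pre-drawn sequence.
        reward = self.rewards[self.scratched]
        self.scratched += 1
        return reward


def current_games(games: List[ScratchGame], t: int) -> List[int]:
    """Indices of the games the player may choose at time t."""
    return [i for i, g in enumerate(games) if g.is_current(t)]
```

A policy would call `current_games` at each step, pick one index, and call `scratch` on it; the set of playable games shrinks when a game runs out of tickets and grows when a new game starts.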
3 Adversarial bandit algorithms for scratch games
3.1 Introduction
3.2 E3FAS
Exp3 (Exponential weight algorithm for Exploration and Exploitation) is a powerful and popular algorithm for adversarial bandits (Auer et al. 2002b). The Achilles heel of this algorithm is its sensitivity to the exploration parameter γ: a value too low or too high for this parameter leads to a bad tradeoff between exploitation and exploration. The intuition used to adapt Exp3 to scratch games is the following: when a new game starts the exploration term γ has to increase in order to explore it, and when a game ends, the number of games decreases and the exploration term γ has to decrease.
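To make the role of γ concrete, here is a minimal sketch of one step of standard Exp3 as described by Auer et al. (2002b). It is not E3FAS itself, whose pseudocode is not reproduced in this text; the function names are illustrative assumptions. The parameter `gamma` mixes the exponential-weight distribution with a uniform one, which is why it controls the amount of exploration.

```python
import math
from typing import List


def exp3_probabilities(weights: List[float], gamma: float) -> List[float]:
    """Mix the normalized weights with the uniform distribution."""
    total = sum(weights)
    k = len(weights)
    return [(1.0 - gamma) * w / total + gamma / k for w in weights]


def exp3_update(weights: List[float], probs: List[float],
                chosen: int, reward: float, gamma: float) -> List[float]:
    """Return new weights after observing `reward` in [0, 1] for `chosen`."""
    k = len(weights)
    # Importance-weighted reward estimate: unbiased for the played arm.
    estimated = reward / probs[chosen]
    new_weights = list(weights)
    new_weights[chosen] *= math.exp(gamma * estimated / k)
    return new_weights
```

Each arm is played with probability at least `gamma / k`, so when the number of games changes, keeping the same `gamma` changes the exploration rate per game. This is one way to read the intuition that γ should be re-evaluated whenever a game starts or ends.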
Let \(T_{m}\) be the time when a game starts or ends, \(T_{m+1}\) be the next time a game starts or ends, \(K_{m}\) be the number of games during the time period \([T_{m},T_{m+1})\), \(\varDelta_{m}=G_{T_{m+1}}-G_{T_{m}}\) be the gain between times \(T_{m}\) and \(T_{m+1}\), and \(\varDelta^{*}_{m}=G^{*}_{T_{m+1}}-G^{*}_{T_{m}}\) be the optimal gain between times \(T_{m}\) and \(T_{m+1}\). We first give an upper bound for the expected regret obtained by E3FAS for a given period and a given \(\gamma_{m}\).
Theorem 1
If the optimal gain \(\varDelta^{*}_{m}\) is known, we can use this bound to evaluate the value of \(\gamma_{m}\) each time a game ends or starts. We use \(\gamma_{m}^{*}\) to denote the value of the parameter \(\gamma_{m}\) which optimizes the upper bound given by Theorem 1.
Corollary 1.1
Corollary 1.2
The proofs of Theorem 1 and Corollaries 1.1 and 1.2 can be found in the Appendix. The obtained bound (the first inequality) is less than or equal to that of Exp3 for scratch games (the second inequality). Notice that these upper bounds on the weak regret for scratch games are theoretical: Exp3 requires the knowledge of \(G_{T}^{*}\) and \(K\), while E3FAS requires the knowledge of the number of time periods \(L\), the numbers of scratch games \(K_{m}\), and the values of \(\varDelta_{m}^{*}\).
In real applications, the order of magnitude of the mean reward of all games \(\mu\) is usually known. For example, \(\mu \approx 1/1000\) for the click-through rate on banners, \(\mu \approx 1/100\) for emailing campaigns, and \(\mu \approx 5/100\) for telemarketing. In these cases, it is reasonable to assume that \(G^{*}_{T} \approx\mu T \ll T\). If this prior knowledge is not available, it is possible to use \(T\) to bound \(G_{T}^{*}\).
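As an illustration of how such prior knowledge can be used, the sketch below computes the standard Exp3 exploration parameter of Auer et al. (2002b), \(\gamma=\min\{1,\sqrt{K\ln K/((e-1)g)}\}\), where \(g\) is an upper bound on the best cumulative gain. This is the classical Exp3 tuning, given here as an assumption-laden example; the per-period value \(\gamma_{m}^{*}\) of E3FAS comes from Corollary 1.1, whose statement is not reproduced in this text. The function name and the numerical values are illustrative.

```python
import math


def exp3_gamma(n_games: int, gain_bound: float) -> float:
    """Standard Exp3 tuning: min(1, sqrt(K ln K / ((e - 1) g)))."""
    if gain_bound <= 0:
        raise ValueError("gain_bound must be positive")
    ratio = n_games * math.log(n_games) / ((math.e - 1.0) * gain_bound)
    return min(1.0, math.sqrt(ratio))


# Illustrative values: 10 games, horizon T = 10^6, click-through rate ~ 1/1000.
horizon = 1_000_000
mu = 1.0 / 1000.0
gamma_with_prior = exp3_gamma(10, mu * horizon)   # uses G*_T ~ mu * T
gamma_without_prior = exp3_gamma(10, horizon)     # falls back to G*_T <= T
```

Because \(\mu T \ll T\), the bound based on \(\mu\) yields a larger γ than the fallback bound \(T\), that is, more exploration when rewards are known to be rare.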
4 Stochastic bandit algorithms for scratch games
4.1 Introduction
Here \(G_{t}(i)\) is the cumulative reward of game \(i\). The expectation is taken over the sequences of draws \(x_{i_{1}} (1), x_{i_{2}} (2),\ldots, x_{i_{t}} (t)\) according to the probability law of each game \(i\). Knowing the number of tickets \(N_{i}\) and the starting date \(t_{i}\) of each current scratch game \(i\), we would like to find an efficient policy \(\pi\) that optimizes the expected gain \(E[G_{t}(\pi)]\) uniformly over all the sequences of draws.
4.2 UCBWR
With the UCBWR policy, the mean reward is balanced with a confidence interval weighted by one minus the sampling rate of the game. As a result, when the number of plays \(n_{i}(t)\) of scratch game \(i\) increases, the confidence interval term decreases faster than with the UCB1 policy. The exploration term tends to zero when the sampling rate tends to 1. The decrease of the exploration term is justified by the fact that the potential reward decreases as the sampling rate increases. Notice that if all games have the same sampling rate, the rankings provided by UCB1 and UCBWR are the same. The difference between the rankings provided by UCBWR and UCB1 increases as the dispersion of sampling rates increases. We can therefore expect different performance when the initial numbers of tickets \(N_{i}\) differ. In this case, scratching a ticket from a game with a low number of tickets has more impact on its upper bound than scratching one from a game with a high number of tickets.
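The behaviour described above can be sketched numerically. The UCB1 index below is the standard one of Auer et al. (2002a). The second function is only an assumed form of the UCBWR index, obtained by multiplying the squared confidence radius by one minus the sampling rate \(n_{i}(t)/N_{i}\); the exact definition of the sampling rate and of the constants in the paper's index is not reproduced in this text, so the function names and this form are hypothetical.

```python
import math


def ucb1_index(mean: float, n_plays: int, t: int) -> float:
    """Standard UCB1 index: empirical mean plus sqrt(2 ln t / n)."""
    return mean + math.sqrt(2.0 * math.log(t) / n_plays)


def ucbwr_index_sketch(mean: float, n_plays: int, n_tickets: int, t: int) -> float:
    """Assumed UCBWR-style index: the radius shrinks with the sampling rate."""
    sampling_rate = n_plays / n_tickets  # assumption: rate taken as n_i(t) / N_i
    radius = math.sqrt((1.0 - sampling_rate) * 2.0 * math.log(t) / n_plays)
    return mean + radius
```

With this form, a game whose tickets have all been scratched has an index equal to its empirical mean, and for the same number of plays a small game gets a lower index than a large one, which matches the qualitative description in the text.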
Theorem 2

equal at the initialization when the expected sampling rate is zero,

and lower when the expected sampling rate increases.
Corollary 2.1
The proofs of Theorem 2 and Corollary 2.1 can be found in the Appendix.
5 Bayesian bandit algorithms for scratch games
5.1 Introduction
5.2 Thompson sampling without replacement
The Thompson sampling algorithm is a generic algorithm which can be applied with various priors. In the case of a Bernoulli distribution of rewards, recent papers have shown that it achieves logarithmic expected regret (Agrawal and Goyal 2012) and that it is asymptotically optimal (Kaufman et al. 2012b). We propose to apply the Thompson sampling algorithm to scratch games, using a beta-binomial law to model the likelihood of the parameters rather than the beta law used for multi-armed bandits (Chapelle and Li 2011).
To speed up the convergence of the algorithm, we can use the value of the mean reward of all games \(\mu\), which is often known in real applications, to initialize the values of \(m_{i}\), \(n_{i}\), and \(\mu_{i}\) of each game \(i\): we choose the initial value \(\mu_{i}=\mu\) so that the prior has the order of magnitude of the true value, and we choose \(m_{i}=1\) and \(n_{i}=1/\mu_{i}\) in order to begin with a large variance around this initial prior. For example, the click-through rate on a web portal is approximately 1/1000. In this case, we initialize game \(i\) with \(\mu_{i}=1/1000\), \(m_{i}=1\), \(n_{i}=1000\). When \(\mu\) is unknown, \(\mu_{i}\) can be initialized to any value less than 1, with \(m_{i}=0\) and \(n_{i}=0\).
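The initialization above, followed by one posterior draw, can be sketched as follows. The TSWR pseudocode is not reproduced in this text, so the sampling step is an assumption: it draws a win probability from a Beta(m + 1, n - m + 1) law and then a number of winners among the remaining tickets from a binomial law, which together form a beta-binomial draw. Function names are hypothetical.

```python
import random
from typing import Optional, Tuple


def init_prior(mu: Optional[float] = None) -> Tuple[float, float]:
    """Initial pseudo-counts (m_i, n_i): (1, 1/mu) if mu is known, else (0, 0)."""
    if mu is None:
        return 0.0, 0.0
    return 1.0, 1.0 / mu


def sample_remaining_mean(m: float, n: float, remaining: int,
                          rng: random.Random) -> float:
    """Draw a plausible mean reward for the tickets still in the urn."""
    if remaining <= 0:
        return 0.0
    # ASSUMPTION: Beta(m + 1, n - m + 1) posterior on the win probability.
    theta = rng.betavariate(m + 1.0, n - m + 1.0)
    # Binomial draw over the remaining tickets, done ticket by ticket.
    winners = sum(rng.random() < theta for _ in range(remaining))
    return winners / remaining
```

A Thompson-style policy would draw one such value per current game and play the game with the largest draw, then add the observed reward to `m` and one play to `n`.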
We do not provide regret bounds for Thompson Sampling Without Replacement; to the best of our knowledge, this is an open problem.
6 Experimental results
6.1 Experimental setup
6.2 Synthetic problem
1. Due to the budget constraints, the scratch games have finite sequences of rewards.
2. Due to the continuous optimization, the scratch games have different and unknown starting dates.
3. Due to the competition between ads on the same target (cookies, profiles …), to the relevance of the ad to the page content, which can change, and to unknown external factors, the mean reward of a scratch game can change over time.
We have chosen a Pareto distribution to draw the number of tickets of 100 scratch games, with parameters \(x_{m}=200\) and \(k=1\). This choice is driven by the concern to be as close as possible to our application: a lot of small sequences and a small number of very large sequences. The number of winning tickets of each scratch game is drawn according to a Bernoulli distribution, with parameter \(p_{i}\) drawn from a uniform distribution between 0 and 0.25. In total, 210314 tickets, including 33688 winning tickets, spread over 100 scratch games are drawn. For each trial and for each scratch game \(i\), a sequence of rewards is drawn according to the urn model parametrized by the number of winning tickets \(m_{i}\) and the number of tickets \(n_{i}\).
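The generation procedure can be sketched as below. This is an illustrative reading of the paragraph above, not the authors' code: the function name is hypothetical, the number of winning tickets is obtained by one Bernoulli draw of parameter \(p_{i}\) per ticket, and the urn model is simulated by shuffling the tickets, which is equivalent to drawing them without replacement.

```python
import random
from typing import List


def draw_scratch_game(rng: random.Random, x_m: float = 200.0,
                      k: float = 1.0, p_max: float = 0.25) -> List[float]:
    """Return one reward sequence for a synthetic scratch game."""
    # paretovariate(k) has minimum 1, so scaling by x_m gives a Pareto(x_m, k) size.
    n_tickets = int(x_m * rng.paretovariate(k))
    # Per-game win probability, uniform on [0, p_max].
    p = rng.uniform(0.0, p_max)
    # Assumption: one Bernoulli(p) draw per ticket fixes the number of winners.
    n_winners = sum(rng.random() < p for _ in range(n_tickets))
    # Urn model: a random order of the fixed set of tickets.
    rewards = [1.0] * n_winners + [0.0] * (n_tickets - n_winners)
    rng.shuffle(rewards)
    return rewards
```

Redrawing only the shuffle for each trial, while keeping the ticket and winner counts fixed, would correspond to the per-trial sequences mentioned in the text.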
Mean regret and rank (in parentheses). Asynchronous starts: 50 % of games begin after t=100000; nonstationary: for each game the probability of reward changes at time t=N/2; ad serving: simulation on ad-serving data.

Problem             | UCB1      | UCBWR     | Exp3     | E3FAS    | TS        | TSWR
--------------------|-----------|-----------|----------|----------|-----------|----------
finite budget       | 2030 (6)  | 1648 (5)  | 1498 (4) | 1433 (3) | 1381 (2)  | 1354 (1)
asynchronous starts | 1450 (6)  | 1177 (2)  | 1358 (5) | 1241 (4) | 1187 (3)  | 992 (1)
nonstationary       | 1154 (4)  | 324 (1)   | 709 (3)  | 596 (2)  | 1313 (6)  | 1303 (5)
ad serving          | 14233 (4) | 13049 (3) | 5962 (2) | 5651 (1) | 25681 (6) | 19004 (5)

for games with even index, during the time period [0,N/2] the probability of reward is multiplied by two, and during the time period [N/2,N] the probability of reward is divided by two.

otherwise, during the time period [0,N/2] the probability of reward is divided by two, and during the time period [N/2,N] the probability of reward is multiplied by two.
In these illustrative synthetic problems, we observe that Thompson Sampling Without Replacement is the best algorithm when its prior holds, and one of the worst for the tested nonstationary distributions of rewards. UCBWR is an efficient and robust algorithm which performs well on the three synthetic problems. E3FAS also performs well, and it provides more guarantees on nonstationary data. Finally, we observe that the algorithms designed for scratch games outperform those designed for multi-armed bandits: for the adversarial approach E3FAS outperforms Exp3, for the stochastic approach UCBWR outperforms UCB1, and for the Bayesian approach TSWR outperforms TS. In the next section we test the same algorithms on complex real data.
6.3 Test on ad serving data
In the pay-per-click model, the ad server displays different ads in different contexts (profiles × web pages) in order to maximize the click-through rate. To evaluate the impact of E3FAS, Exp3, UCBWR, UCB1, TS and TSWR on the ad server optimization, we have simulated it.
Each ad is considered as a scratch game, with a finite number of tickets corresponding to the number of displays, including a finite number of winning tickets corresponding to the clicks, and with a sequence of rewards corresponding to sequences of ad displays and clicks. When a scratch game is selected by a policy, the reward is read from its sequence of rewards. In the simulation, we consider that the observed displays of each ad represent their total inventories on all web pages. The goal of the optimization is then to increase the number of displays of the ads with a high click-through rate and to decrease the number of displays of the others.
In this experiment, we have considered a simple context: the optimization of a single page for all profiles of cookies during one week. The optimization over a long period of time of many pages for several profiles of cookies will generate many scratch games. As the number of scratch games \(K\) increases, the difference between the regret bound of Exp3 and that of E3FAS increases (see Corollary 1.2). We can therefore expect more significant differences in gain between Exp3 and E3FAS. Notice that for an optimization over a long period of time, such as several months, tuning the exploration factor of E3FAS requires a sliding time horizon \(T\), which is itself a parameter of the optimization that has to be tuned. UCBWR does not suffer from this drawback.
7 Conclusion
We have proposed a new problem to take into account finite sequences of rewards drawn in advance with unknown starting dates: the scratch games. This problem corresponds to applications where the budget is limited and where an action cannot be repeated identically, which is often the case in the real world. We have proposed three new versions of well-known algorithms, which take advantage of this problem setup. In our experiments on synthetic problems and on real data, the three proposed algorithms outperformed those designed for multi-armed bandits. For E3FAS and UCBWR, we have shown that the proposed upper bounds on the weak regret are less than or equal to those of Exp3 and UCB1, respectively. For TSWR the upper bound on the weak regret is an open problem. Among the three policies proposed for this problem, our experiments lead us to conclude that TSWR is the best when its prior holds, and E3FAS is the best for complex distributions of rewards, which correspond to the data collected on our ad server. This preliminary work on scratch games has shown its interest for online advertising. In future work, we will consider extensions of this setting to better fit application constraints. Indeed, due to the information system and application constraints, there is a delay between the choice of the game and the reception of the reward. Moreover, in this work we have considered the optimization of ads on a single page. To optimize the ad server policy, we need to optimize the ad displays on many pages having a structure of dependence.
Acknowledgements
We would like to thank anonymous reviewers and our colleagues Vincent Lemaire and Dominique Gay for their comments, which were helpful to improve the quality of this paper.
References
Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27, 1054–1078.
Agrawal, S., & Goyal, N. (2012). Analysis of Thompson sampling for the multi-armed bandit problem. In COLT.
Audibert, J. Y., Munos, R., & Szepesvári, C. (2009). Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19), 1876–1902.
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002a). Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47(2–3), 235–256.
Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (2002b). The nonstochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1), 48–77.
Briggs, W. M., & Zaretzki, R. (2009). A new look at inference for the hypergeometric distribution. Tech. rep. www.wmbriggs.com/public/HGDAmstat4.pdf.
Bubeck, S., Munos, R., Stoltz, G., & Szepesvári, C. (2008). Online optimization in X-armed bandits. In Neural information processing systems, Vancouver, Canada.
Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning and games. Cambridge: Cambridge University Press.
Chakrabarti, D., Kumar, R., Radlinski, F., & Upfal, E. (2008). Mortal multi-armed bandits. In NIPS (pp. 273–280).
Chapelle, O., & Li, L. (2011). An empirical evaluation of Thompson sampling. In NIPS.
Féraud, R., & Urvoy, T. (2012). A stochastic bandit algorithm for scratch games. In ACML (Vol. 25, pp. 129–145).
Kleinberg, R. D., Niculescu-Mizil, A., & Sharma, Y. (2008). Regret bounds for sleeping experts and bandits. In COLT.
Kanade, V., McMahan, B., & Bryan, B. (2009). Sleeping experts and bandits with stochastic action availability and adversarial rewards. In AISTATS.
Garivier, A., & Cappé, O. (2011). The KL-UCB algorithm for bounded stochastic bandits and beyond. In COLT.
Kaufman, E., Cappé, O., & Garivier, A. (2012a). On Bayesian upper confidence bounds for bandit problems. In AISTATS.
Kaufman, E., Korda, N., & Munos, R. (2012b). Thompson sampling: an asymptotically optimal finite time analysis. In COLT.
Kocsis, L., & Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In ECML (pp. 282–293).
Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4–22.
Pandey, S., Agarwal, D., & Chakrabarti, D. (2007). Multi-armed bandit problems with dependent arms. In ICML (pp. 721–728).
Serfling, R. J. (1974). Probability inequalities for the sum in sampling without replacement. The Annals of Statistics, 2, 39–48.
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25, 285–294.