Incentivizing Exploration with Heterogeneous Value of Money

Han, Li; Kempe, David; Qiang, Ruixin

doi:10.1007/978-3-662-48995-6_27

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9470)

Included in the following conference series: International Conference on Web and Internet Economics

Abstract

Recently, Frazier et al. proposed a natural model for crowdsourced exploration of different a priori unknown options: a principal is interested in the long-term welfare of a population of agents who arrive one by one in a multi-armed bandit setting. However, each agent is myopic, so in order to incentivize him to explore options with better long-term prospects, the principal must offer the agent money. Frazier et al. showed that a simple class of policies called time-expanded are optimal in the worst case, and characterized their budget-reward tradeoff. The previous work assumed that all agents are equally and uniformly susceptible to financial incentives. In reality, agents may have different utility for money. We therefore extend the model of Frazier et al. to allow agents that have heterogeneous and non-linear utilities for money. The principal is informed of the agent’s tradeoff via a signal that could be more or less informative.

Our main result is to show that a convex program can be used to derive a signal-dependent time-expanded policy which achieves the best possible Lagrangian reward in the worst case. The worst-case guarantee is matched by so-called “Diamonds in the Rough” instances; the proof that the guarantees match is based on showing that two different convex programs have the same optimal solution for these specific instances.
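
The agent model described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that an agent converts money to utility linearly at a rate r, whereas the paper allows non-linear utilities. The function names `myopic_choice` and `min_payment`, and the example numbers, are hypothetical and do not come from the paper.

```python
from typing import Sequence


def myopic_choice(expected_rewards: Sequence[float],
                  payments: Sequence[float],
                  r: float) -> int:
    """Return the index of the arm a myopic agent pulls.

    Assumption (not from the paper): the agent's utility for arm i is
    expected_rewards[i] + r * payments[i], i.e. money is converted to
    utility linearly at rate r. Ties go to the lowest index.
    """
    utilities = [mu + r * c for mu, c in zip(expected_rewards, payments)]
    return max(range(len(utilities)), key=lambda i: utilities[i])


def min_payment(expected_rewards: Sequence[float], target: int, r: float) -> float:
    """Smallest payment that makes `target` weakly best for an agent with rate r > 0."""
    if r <= 0:
        raise ValueError("r must be positive")
    return (max(expected_rewards) - expected_rewards[target]) / r


rewards = [0.6, 0.4]   # current expected rewards of two arms
offer = [0.0, 0.3]     # the principal pays 0.3 for pulling arm 1

print(myopic_choice(rewards, offer, r=1.0))   # this agent accepts and pulls arm 1
print(myopic_choice(rewards, offer, r=0.5))   # this agent stays with arm 0
print(min_payment(rewards, target=1, r=0.5))  # payment needed for the second agent
```

Under this assumption the same offer moves one agent but not the other, which is why a principal who observes a signal about r may want to condition her payments on it.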

Notes

1. To avoid ambiguity, we consistently refer to the principal as female and the agents as male.
2. Both Frazier et al. [4] and our work in fact consider a generalization in which each arm constitutes an independent Markov chain with martingale rewards.
3. We use the terms “round” and “time” interchangeably.
4. When the signal space is uncountable, defining the posterior probability density requires the use of Radon-Nikodym derivatives, and raises computational and representational issues. In Sect. 6, we consider what is perhaps the most interesting special case: that the signal reveals the precise value of r to the principal.
5. In Eq. (1), if the support of r is finite, f(r) can be replaced by the probability mass function.
6. A natural justification for having the same discount factor is that after each round, with probability \(1-\gamma\), the game ends.
7. Note that \(R^{(\gamma)}(\mathcal{A})\), \(C^{(\gamma)}(\mathcal{A})\), and \(\mathrm{OPT}_\gamma\) all depend on the MAB instance.
8. As in [4], in order to facilitate the analysis, this may include myopic and non-myopic pulls of arm i. For instance, if arm 1 was pulled as a non-myopic arm at times 1 and 6, and a myopic pull of arm 1 occurred at time 3, then we would use the state of arm 1 after the pulls at times 1 and 3.
9. This is in contrast to the case where the performance of a policy is evaluated on a class of instances rather than a single instance.
10. Note that a priori, it is not clear that this threshold will not change in subsequent rounds; hence, we cannot yet state that a threshold policy is optimal.
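
The justification in note 6 can be spelled out with a short worked equation. This is a sketch under stated assumptions: rounds are indexed from 0, \(T\) denotes the (random) number of rounds actually played, and \(x_t\) is any per-round quantity (a reward or a payment) that does not depend on when the game ends. The symbols \(T\) and \(x_t\) are introduced here for illustration and are not taken from the paper.

```latex
% Round t is played iff the game survived the first t end-of-round coin flips:
\Pr[T > t] = \gamma^{t}, \qquad t = 0, 1, 2, \ldots

% Hence the expected undiscounted total equals the discounted sum:
\mathbb{E}\Bigl[\sum_{t=0}^{T-1} x_t\Bigr]
  = \sum_{t=0}^{\infty} \Pr[T > t]\, x_t
  = \sum_{t=0}^{\infty} \gamma^{t} x_t .
```

Because the same survival probability multiplies rewards and payments alike, this interpretation yields one common discount factor for both.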

References

1. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: Gambling in a rigged casino: the adversarial multi-armed bandit problem. In: Proceedings of the 36th IEEE Symposium on Foundations of Computer Science, pp. 322–331 (1995)
2. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1), 48–77 (2003)
3. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
4. Frazier, P., Kempe, D., Kleinberg, J., Kleinberg, R.: Incentivizing exploration. In: Proceedings of the 16th ACM Conference on Economics and Computation, pp. 5–22 (2014)
5. Gittins, J.C.: Multi-Armed Bandit Allocation Indices. Wiley, New York (1989)
6. Gittins, J.C., Glazebrook, K.D., Weber, R.: Multi-Armed Bandit Allocation Indices, 2nd edn. Wiley, New York (2011)
7. Gittins, J.C., Jones, D.M.: A dynamic allocation index for the sequential design of experiments. In: Gani, J. (ed.) Progress in Statistics, pp. 241–266 (1974)
8. Ho, C.J., Slivkins, A., Vaughan, J.W.: Adaptive contract design for crowdsourcing markets: bandit algorithms for repeated principal-agent problems. In: Proceedings of the 16th ACM Conference on Economics and Computation, pp. 359–376 (2014)
9. Katehakis, M.N., Veinott Jr., A.F.: The multi-armed bandit problem: decomposition and computation. Math. Oper. Res. 12(2), 262–268 (1987)
10. Kremer, I., Mansour, Y., Perry, M.: Implementing the “wisdom of the crowd”. In: Proceedings of the 15th ACM Conference on Electronic Commerce, pp. 605–606 (2013)
11. Lai, T.L., Robbins, H.E.: Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1), 4–22 (1985)
12. Mansour, Y., Slivkins, A., Syrgkanis, V.: Bayesian incentive-compatible bandit exploration. In: Proceedings of the 17th ACM Conference on Economics and Computation, pp. 565–582 (2015)
13. Robbins, H.E.: Some aspects of the sequential design of experiments. Bull. Am. Math. Soc. 58, 527–535 (1952)
14. Singla, A., Krause, A.: Truthful incentives in crowdsourcing tasks using regret minimization mechanisms. In: 22nd International World Wide Web Conference, pp. 1167–1178 (2013)
15. Slivkins, A., Wortman Vaughan, J.: Online decision making in crowdsourcing markets: theoretical challenges (position paper). ACM SIGecom Exch. 12(2), 4–23 (2013)
16. Spence, M.: Job market signaling. Q. J. Econ. 87, 355–374 (1973)
17. Whittle, P.: Multi-armed bandits and the Gittins index. J. Roy. Stat. Soc. Ser. B (Methodol.) 42(2), 143–149 (1980)

Author information

Authors and Affiliations

University of Southern California, Los Angeles, USA
Li Han, David Kempe & Ruixin Qiang

Corresponding author

Correspondence to Ruixin Qiang.

Editor information

Editors and Affiliations

Athens University of Economics and Business, Athens, Greece
Evangelos Markakis
CWI and VU Amsterdam, Amsterdam, The Netherlands
Guido Schäfer

About this paper

Cite this paper

Han, L., Kempe, D., Qiang, R. (2015). Incentivizing Exploration with Heterogeneous Value of Money. In: Markakis, E., Schäfer, G. (eds) Web and Internet Economics. WINE 2015. Lecture Notes in Computer Science, vol 9470. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48995-6_27

DOI: https://doi.org/10.1007/978-3-662-48995-6_27
Published: 30 December 2015
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-48994-9
Online ISBN: 978-3-662-48995-6
eBook Packages: Computer Science, Computer Science (R0)