Abstract
We study on-line play of repeated matrix games in which the observations of the opponent's past actions and of the obtained rewards are partial and stochastic. We define the Partial Observation Bayes Envelope (POBE) as the best reward against the worst-case stationary strategy of the opponent that is consistent with past observations. Our goal is to keep the (unobserved) average reward above the POBE. For the case where the observations (but not necessarily the rewards) depend on the opponent's play alone, we derive an algorithm that attains the POBE. The algorithm combines an application of approachability theory with a worst-case view of the unobserved rewards. We also suggest a simplified solution concept for general signaling structures, which may fall short of the POBE.
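To give a concrete feel for the no-regret machinery the paper builds on, the following is a minimal sketch of regret matching in a repeated 2×2 matrix game under *full* monitoring. This is an illustrative simplification only: the paper's POBE algorithm additionally handles partial, stochastic observations of actions and rewards, which this sketch does not attempt. The payoff matrix, opponent model, and function names are ours, not the paper's.

```python
import random

# Row player's payoff matrix for "matching pennies": REWARD[i][j] is the
# row player's reward when it plays i and the opponent plays j.
REWARD = [[1.0, -1.0],
          [-1.0, 1.0]]

def regret_matching(opponent_actions):
    """Play against a fixed opponent action sequence; return average reward."""
    n = len(REWARD)           # number of row-player actions
    regret = [0.0] * n        # cumulative regret for each fixed action
    total = 0.0
    for j in opponent_actions:
        # Mix proportionally to positive regrets (uniform if none positive).
        pos = [max(r, 0.0) for r in regret]
        s = sum(pos)
        probs = [p / s for p in pos] if s > 0 else [1.0 / n] * n
        i = random.choices(range(n), weights=probs)[0]
        r = REWARD[i][j]
        total += r
        # Update regrets: how much better each fixed action would have done.
        for a in range(n):
            regret[a] += REWARD[a][j] - r
    return total / len(opponent_actions)

random.seed(0)
# A stationary opponent that plays action 0 with probability 0.8; the best
# response (always action 0) then earns 0.8 * 1 + 0.2 * (-1) = 0.6 on average.
opp = [0 if random.random() < 0.8 else 1 for _ in range(20000)]
avg = regret_matching(opp)
print(round(avg, 2))  # close to the best-response value of 0.6
```

Against a stationary opponent the average reward converges to the best-response value, which is the full-observation analogue of the Bayes envelope that the POBE generalizes to partial observations.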
© 2003 Springer-Verlag Berlin Heidelberg
Cite this paper
Mannor, S., Shimkin, N. (2003). On-Line Learning with Imperfect Monitoring. In: Schölkopf, B., Warmuth, M.K. (eds) Learning Theory and Kernel Machines. Lecture Notes in Computer Science(), vol 2777. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45167-9_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40720-1
Online ISBN: 978-3-540-45167-9