Self-Optimizing and Pareto-Optimal Policies in General Environments Based on Bayes-Mixtures

  • Conference paper
  • Computational Learning Theory (COLT 2002)
  • Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2375)

Abstract

The problem of making sequential decisions in unknown probabilistic environments is studied. In cycle \( t \), action \( y_t \) results in perception \( x_t \) and reward \( r_t \), where all quantities in general may depend on the complete history. The perception \( x_t \) and reward \( r_t \) are sampled from the (reactive) environmental probability distribution \( \mu \). This very general setting includes, but is not limited to, (partially observable, \( k \)-th order) Markov decision processes. Sequential decision theory tells us how to act in order to maximize the total expected reward, called value, if \( \mu \) is known. Reinforcement learning is usually used if \( \mu \) is unknown. In the Bayesian approach one defines a mixture distribution \( \xi \) as a weighted sum of distributions \( \nu \in \mathcal{M} \), where \( \mathcal{M} \) is any class of distributions including the true environment \( \mu \). We show that the Bayes-optimal policy \( p^\xi \) based on the mixture \( \xi \) is self-optimizing in the sense that the average value converges asymptotically for all \( \mu \in \mathcal{M} \) to the optimal value achieved by the (infeasible) Bayes-optimal policy \( p^\mu \), which knows \( \mu \) in advance. We show that the necessary condition that \( \mathcal{M} \) admits self-optimizing policies at all is also sufficient. No other structural assumptions are made on \( \mathcal{M} \). As an example application, we discuss ergodic Markov decision processes, which allow for self-optimizing policies. Furthermore, we show that \( p^\xi \) is Pareto-optimal in the sense that there is no other policy yielding higher or equal value in all environments \( \nu \in \mathcal{M} \) and a strictly higher value in at least one.
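
To make the objects in the abstract concrete, here is a minimal sketch, in the inline-LaTeX notation already used above, of the central definitions as they are standardly written in Hutter's Bayes-mixture framework; the precise conditioning on the action sequence, the weight normalization, and the horizon convention are assumptions of this sketch and may differ in detail from the paper. The mixture and the value of a policy \( p \) over \( m \) cycles are
\[
  \xi(x_{1:m} \,|\, y_{1:m}) \;:=\; \sum_{\nu \in \mathcal{M}} w_\nu\, \nu(x_{1:m} \,|\, y_{1:m}), \qquad w_\nu > 0, \quad \sum_{\nu \in \mathcal{M}} w_\nu \le 1,
\]
\[
  V^{p}_{\mu}(m) \;:=\; \mathbf{E}^{p}_{\mu}\bigl[\, r_1 + \dots + r_m \,\bigr], \qquad p^{\xi} \;:=\; \arg\max_{p} V^{p}_{\xi}(m),
\]
and the self-optimizing property asserted for \( p^\xi \) reads
\[
  \tfrac{1}{m}\, V^{p^\xi}_{\mu}(m) \;\to\; \tfrac{1}{m}\, V^{p^\mu}_{\mu}(m) \quad (m \to \infty) \quad \text{for all } \mu \in \mathcal{M},
\]
whenever \( \mathcal{M} \) admits self-optimizing policies at all. Pareto-optimality means there is no policy \( p \) with \( V^{p}_{\nu} \ge V^{p^\xi}_{\nu} \) for all \( \nu \in \mathcal{M} \) and \( V^{p}_{\nu} > V^{p^\xi}_{\nu} \) for at least one \( \nu \).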

This work was supported by SNF grant 2000-61847.00 to Jürgen Schmidhuber.

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hutter, M. (2002). Self-Optimizing and Pareto-Optimal Policies in General Environments Based on Bayes-Mixtures. In: Kivinen, J., Sloan, R.H. (eds) Computational Learning Theory. COLT 2002. Lecture Notes in Computer Science, vol. 2375. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45435-7_25

  • DOI: https://doi.org/10.1007/3-540-45435-7_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43836-6

  • Online ISBN: 978-3-540-45435-9
