# Self-Optimizing and Pareto-Optimal Policies in General Environments Based on Bayes-Mixtures

## Abstract

The problem of making sequential decisions in unknown probabilistic environments is studied. In cycle \( t \), action \( y_t \) results in perception \( x_t \) and reward \( r_t \), where all quantities may in general depend on the complete history. The perception \( x_t \) and reward \( r_t \) are sampled from the (reactive) environmental probability distribution \( \mu \). This very general setting includes, but is not limited to, (partially observable, k-th order) Markov decision processes. Sequential decision theory tells us how to act in order to maximize the total expected reward, called value, if \( \mu \) is known. Reinforcement learning is usually used if \( \mu \) is unknown. In the Bayesian approach one defines a mixture distribution \( \xi \) as a weighted sum of distributions \( \nu \in \mathcal{M} \), where \( \mathcal{M} \) is any class of distributions including the true environment \( \mu \). We show that the Bayes-optimal policy \( p^\xi \) based on the mixture \( \xi \) is self-optimizing in the sense that the average value converges asymptotically for all \( \mu \in \mathcal{M} \) to the optimal value achieved by the (infeasible) Bayes-optimal policy \( p^\mu \), which knows \( \mu \) in advance. We show that the necessary condition that \( \mathcal{M} \) admits self-optimizing policies at all is also sufficient. No other structural assumptions are made on \( \mathcal{M} \). As an example application, we discuss ergodic Markov decision processes, which allow for self-optimizing policies. Furthermore, we show that \( p^\xi \) is Pareto-optimal in the sense that there is no other policy yielding higher or equal value in *all* environments \( \nu \in \mathcal{M} \) and a strictly higher value in at least one.
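The mixture and the Pareto criterion described above can be illustrated with a small sketch. It is a toy under strong assumptions, not the paper's construction: the class `M` is a hypothetical two-element set of environments whose binary reward depends only on the current action (far narrower than the history-dependent setting of the abstract), the prior weights are chosen arbitrarily, and the names `likelihood`, `xi`, `posterior`, and `pareto_dominates` are introduced here purely for illustration. The sketch does not compute the Bayes-optimal policy itself.

```python
from fractions import Fraction

# Hypothetical finite class M: each environment maps an action to P(reward = 1).
# Assumption: rewards depend only on the current action, not on the full history.
M = {
    "nu_a": {0: Fraction(9, 10), 1: Fraction(1, 10)},
    "nu_b": {0: Fraction(1, 10), 1: Fraction(9, 10)},
}
# Assumed prior weights: positive and summing to one.
prior = {"nu_a": Fraction(1, 2), "nu_b": Fraction(1, 2)}


def likelihood(env, history):
    """Probability that env assigns to the observed rewards, given the actions."""
    p = Fraction(1)
    for action, reward in history:
        q = env[action]
        p *= q if reward == 1 else 1 - q
    return p


def xi(history):
    """Bayes mixture: prior-weighted sum of the environments' probabilities."""
    return sum(prior[name] * likelihood(env, history) for name, env in M.items())


def posterior(history):
    """Posterior weight of each environment after observing the history."""
    z = xi(history)
    return {name: prior[name] * likelihood(env, history) / z
            for name, env in M.items()}


def pareto_dominates(v_p, v_q):
    """True if v_p is at least v_q in every environment and strictly higher in one."""
    return (all(v_p[n] >= v_q[n] for n in v_q)
            and any(v_p[n] > v_q[n] for n in v_q))


# Action 0 taken twice, reward 1 received both times.
history = [(0, 1), (0, 1)]
print(xi(history))                 # 41/100
print(posterior(history)["nu_a"])  # 81/82
```

In this reading, a policy is Pareto-optimal when no other policy's per-environment values make `pareto_dominates` return `True` against it. Exact rational arithmetic via `Fraction` is used only so the printed values can be checked by hand.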

## Keywords

Optimal Policy, Reinforcement Learning, Markov Decision Process, Pareto Optimality, Probabilistic Policy
