A Geometric Approach to Find Nondominated Policies to Imprecise Reward MDPs
Markov Decision Processes (MDPs) provide a mathematical framework for modelling the decision-making of agents acting in stochastic environments, in which transition probabilities model the environment's dynamics and a reward function evaluates the agent's behaviour. Recently, special attention has been given to the difficulty of specifying the reward function precisely, which has motivated research on MDPs with imprecisely specified rewards. Some of this work exploits nondominated policies, i.e., policies that are optimal for some instantiation of the imprecise reward function. The πWitness algorithm computes nondominated policies, which are then used to make decisions under the minimax regret criterion. An interesting question is how to select a small subset of nondominated policies so that the minimax regret can be computed faster, yet still accurately; we modified πWitness to do so. We also present the πHull algorithm, which computes nondominated policies using a geometric approach. Under the assumption that reward functions are linear in a set of features, we show empirically that πHull can be faster than our modified version of πWitness.
Keywords: Imprecise Reward MDP · Minimax Regret · Preference Elicitation
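To make the minimax regret criterion concrete, here is a minimal sketch, not the paper's algorithm. It assumes each policy is summarised by its feature-expectation vector φ (so its value under reward weights w is w·φ), and it approximates the imprecise reward set by a finite collection of candidate weight vectors; the actual formulation optimises over a reward polytope via linear programming. All names and data below are hypothetical.

```python
import numpy as np

def max_regret(phi, W):
    """Worst-case regret of each policy over candidate reward weights.

    phi : (n_policies, n_features) feature expectations, one row per policy
    W   : (n_weights, n_features) candidate reward weight vectors
    Returns an array with the max regret of each policy over all w in W.
    """
    V = W @ phi.T                         # V[i, j] = value of policy j under w_i
    best = V.max(axis=1, keepdims=True)   # best achievable value for each w_i
    return (best - V).max(axis=0)         # worst-case regret per policy

# Hypothetical example: three policies, two reward features.
phi = np.array([[1.0, 0.0],    # policy A: good only on feature 0
                [0.0, 1.0],    # policy B: good only on feature 1
                [0.6, 0.6]])   # policy C: balanced
W = np.array([[1.0, 0.0],      # reward hypothesis favouring feature 0
              [0.0, 1.0]])     # reward hypothesis favouring feature 1

mr = max_regret(phi, W)        # → [1.0, 1.0, 0.4]
minimax_choice = int(mr.argmin())  # policy C minimises worst-case regret
```

The balanced policy C is never optimal for either reward hypothesis, yet it is the minimax-regret choice; this is why restricting attention to a small but well-chosen subset of nondominated policies, as the paper proposes, can still yield an accurate minimax regret.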