Discretetimemarkovdecisionprocesses: Total Reward
This chapter studies a discrete time Markov decision process with the total reward criterion, where the state space is countable, the action sets are measurable, the reward function is extended real-valued, and the discount factor β Î (–∞,+∞) may be any real number although β Î [0, 1] used to be required in the literature. Two conditions are presented, which are necessary for studying MDPs and are weaker than those presented in the literature. By eliminating some worst actions, the state space S can be partitioned into subsets S∞, S?∞, S0, on which the optimal value function equals +∞,?∞, or is finite, respectively. Furthermore, the validity of the optimality equation is shown when its right-hand side is well defined, especially, when it is restricted to the subset S0. The reward function r(i, a) becomes finite and bounded above in a for each i Î S0. Then, the optimal value function is characterized as a solution of the optimality equation in S0 and the structure of optimal policies is studied. Moreover, successive approximation is studied. Finally, some sufficient conditions for the necessary conditions are presented. The method we use here is elementary. In fact, only some basic concepts from MDPs and discrete time Markov chains are used.
KeywordsOptimal Policy Discount Factor Markov Decision Process Reward Function Optimality Equation
Unable to display preview. Download preview PDF.