Abstract
In this paper, we propose a simulation-based policy iteration algorithm for Markov decision process (MDP) problems with the average cost criterion under the unichain assumption, a weaker assumption than those made in previous work. In this algorithm, 1) the problem is converted to a stochastic shortest path problem, and the reference state can be chosen as any recurrent state under the current policy, so the reference state need not be the same from iteration to iteration; 2) the differential costs are evaluated indirectly by a temporal-difference learning scheme; 3) transient states are selected as the initial states for sample paths, and the inverse of the visit count is chosen as the stepsize to improve performance. Numerical results applying the algorithm to an inventory control problem are also provided.
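To illustrate points 2) and 3) of the abstract, the following is a minimal sketch of tabular temporal-difference evaluation of differential costs under a fixed policy, using the inverse of the per-state visit count as the stepsize. The three-state chain, its transition matrix `P`, and the costs `c` are hypothetical; this is not the paper's algorithm, only an illustration of the evaluation step it describes.

```python
import random

# Hypothetical 3-state unichain Markov chain under a fixed policy:
# P[s][s'] are transition probabilities, c[s] are one-step costs.
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.4, 0.0, 0.6]]
c = [2.0, 1.0, 3.0]

def td_evaluate(num_steps=200_000, seed=0):
    """TD(0) evaluation of differential costs along one sample path.
    The stepsize for state s is 1 / (number of visits to s); mu is a
    running estimate of the average cost."""
    rng = random.Random(seed)
    n = len(c)
    h = [0.0] * n            # differential-cost estimates
    visits = [0] * n
    mu = 0.0                 # average-cost estimate
    s = 0
    for t in range(1, num_steps + 1):
        visits[s] += 1
        s_next = rng.choices(range(n), weights=P[s])[0]
        delta = c[s] - mu + h[s_next] - h[s]   # TD error
        h[s] += delta / visits[s]              # 1/visit-count stepsize
        mu += (c[s] - mu) / t                  # running average cost
        s = s_next
    # normalize so the reference state (state 0) has differential cost 0
    h0 = h[0]
    return mu, [x - h0 for x in h]

mu, h = td_evaluate()
```

In a full policy iteration loop, the estimates `h` would feed a policy improvement step; here only the simulation-based evaluation is sketched.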
Copyright information
© 2000 Springer Science+Business Media New York
About this chapter
Cite this chapter
He, Y., Fu, M.C., Marcus, S.I. (2000). A Simulation-Based Policy Iteration Algorithm for Average Cost Unichain Markov Decision Processes. In: Laguna, M., Velarde, J.L.G. (eds) Computing Tools for Modeling, Optimization and Simulation. Operations Research/Computer Science Interfaces Series, vol 12. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-4567-5_9
Print ISBN: 978-1-4613-7062-8
Online ISBN: 978-1-4615-4567-5