In Chapter 2, we introduced the basic principles of PA and used them to derive performance-derivative formulas for queueing networks and for Markov and semi-Markov systems. In Chapter 3, we developed sample-path-based (on-line learning) algorithms for estimating these performance derivatives, together with sample-path-based optimization schemes. In this chapter, we show that the performance-sensitivity-based view leads to a unified approach to both PA and Markov decision processes (MDPs).
Keywords: Optimal Policy · Poisson Equation · Markov Decision Process · Reward Function · Optimality Equation
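To make the connection between the keywords concrete, the following is a minimal sketch of average-reward policy iteration on a toy MDP. The transition matrices and rewards are illustrative assumptions, not taken from the text. Policy evaluation solves the Poisson equation for the performance potentials, and policy improvement compares actions via the quantity appearing in the performance-difference formula; this is the sensitivity-based reading of the optimality equation.

```python
import numpy as np

# A toy 2-state, 2-action average-reward MDP (all numbers are illustrative).
# P[a][i] is the transition row out of state i under action a; r[a][i] is
# the one-step reward for taking action a in state i.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
              [[0.5, 0.5], [0.7, 0.3]]])   # action 1
r = np.array([[1.0, 0.0],                  # action 0
              [0.5, 0.8]])                 # action 1
n = 2  # number of states

def evaluate(policy):
    """Average reward eta and potentials g for a deterministic policy."""
    Pd = np.array([P[policy[i], i] for i in range(n)])
    rd = np.array([r[policy[i], i] for i in range(n)])
    # Stationary distribution: pi (I - Pd) = 0 with sum(pi) = 1.
    A = np.vstack([(np.eye(n) - Pd).T, np.ones(n)])
    pi = np.linalg.lstsq(A, np.r_[np.zeros(n), 1.0], rcond=None)[0]
    eta = pi @ rd
    # Poisson equation (I - Pd) g = rd - eta*e, normalized by pi g = 0,
    # solved through the fundamental-matrix form (I - Pd + e pi) g = rd - eta*e.
    g = np.linalg.solve(np.eye(n) - Pd + np.outer(np.ones(n), pi), rd - eta)
    return eta, g

def policy_iteration(policy):
    """Improve the policy state-by-state using r(a) + P(a) g."""
    while True:
        eta, g = evaluate(policy)
        new = np.array([np.argmax([r[a, i] + P[a, i] @ g
                                   for a in range(P.shape[0])])
                        for i in range(n)])
        if np.array_equal(new, policy):
            return policy, eta
        policy = new

best, eta = policy_iteration(np.zeros(n, dtype=int))
```

Starting from the all-zero policy, the iteration switches state 1 to action 1 and stops, since no single-state comparison against the potentials yields a further improvement.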