Markov Decision Processes in Practice, pp. 131–186

# Structures of Optimal Policies in MDPs with Unbounded Jumps: The State of Our Art

## Abstract

The derivation of structural properties of countable-state Markov decision processes (MDPs) is generally based on sample-path methods or value iteration arguments. In the latter case, the method is to inductively prove the structural properties of interest for the *n*-horizon value function. A limit argument should then allow one to deduce the structural properties for the infinite-horizon value function.

In the case of discrete-time MDPs with the objective of minimising the total expected *α*-discounted cost, this procedure is justified under mild conditions. When the objective is to minimise the long-run average expected cost, value iteration does not necessarily converge. Allowing time to be continuous does not generate any further complications when the jump rates are bounded as a function of the state, owing to the applicability of uniformisation. However, when the jump rates are *unbounded* as a function of the state, uniformisation is applicable only after a suitable perturbation of the jump rates that does not destroy the desired structural properties. Thus a second limit argument is also required.
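The bounded-rate case described above can be made concrete with a small sketch (not taken from the chapter): a single-server queue with a two-point service-rate control, bounded rates, holding cost *c*(*x*) = *x*, and discount rate *α*. Uniformisation with constant Λ turns the continuous-time problem into a discrete-time one with discount factor Λ∕(*α* + Λ), on which value iteration is a contraction. All numbers (rates, the extra cost 0.5 for the fast server, the truncation level) are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: uniformisation + value iteration for a bounded-rate
# M/M/1-type queue with a controllable service rate.  States 0..N (finite
# truncation for computation), arrival rate lam, service rates mu_low/mu_high
# (the fast rate incurs an extra cost), holding cost c(x) = x.
lam, mu_low, mu_high, alpha = 1.0, 1.2, 2.0, 0.1
N = 60                        # truncation level (assumption, for computation)
Lam = lam + mu_high           # uniformisation constant: bound on total rate
beta = Lam / (alpha + Lam)    # discrete-time discount factor < 1

V = np.zeros(N + 1)
for _ in range(5000):
    V_new = np.empty_like(V)
    for x in range(N + 1):
        up = V[min(x + 1, N)]                 # arrival (reflected at N)
        best = np.inf
        for mu, extra in ((mu_low, 0.0), (mu_high, 0.5)):
            down = V[x - 1] if x > 0 else V[0]
            # fictitious self-transition completes the uniformised chain
            ev = (lam * up + mu * down + (Lam - lam - mu) * V[x]) / Lam
            best = min(best, (x + extra) / (alpha + Lam) + beta * ev)
        V_new[x] = best
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

# The n-horizon iterates converge geometrically (contraction factor beta),
# and the limit inherits monotonicity in the state -- a typical structural
# property one proves inductively for each V_n.
assert np.all(np.diff(V) >= -1e-9)
```

With *unbounded* rates no finite Λ exists, which is exactly why the chapter's perturbation-plus-second-limit machinery is needed.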

The importance of unbounded-rate countable-state MDPs has increased lately, due to applications modelling customer or patient impatience and abandonment. However, the theory validating the required limit arguments does not seem to be complete, and results are scattered across the literature.

In this chapter our objective is to provide a systematic way to tackle this problem under relatively mild conditions, and to provide the theory validating the presented approach. The base model is a parametrised Markov process (MP): both perturbed MPs and MDPs are special cases of a parametrised MP. The advantage is that the parameter can simultaneously model a policy and a perturbation.
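The parametrisation idea can be sketched as follows. This is our own minimal illustration, not the chapter's formal construction: the parameter φ = (*a*, *N*) of a birth-death process bundles a policy choice *a* with a perturbation level *N*, here a smoothed-rate-truncation-style damping of the arrival rate (the specific form λ(1 − *x*∕*N*)⁺ and all numbers are assumptions).

```python
# Hedged sketch of a parametrised Markov process: one parameter
# phi = (action, truncation level) selects the transition rates, so a policy
# and a perturbation are handled by the same limit machinery.
def rates(x, phi, lam=1.0, mus=(1.2, 2.0)):
    a, N = phi
    up = lam * max(1.0 - x / N, 0.0)    # perturbed (smoothly truncated) arrivals
    down = mus[a] if x > 0 else 0.0     # policy-dependent service rate
    return up, down

# As the perturbation parameter N grows, the perturbed rates converge to the
# unperturbed ones, so properties proved for each N can be passed to the limit.
up_perturbed, _ = rates(3, (0, 10))      # damped: 1.0 * (1 - 3/10) = 0.7
up_limit, _ = rates(3, (0, 10**9))       # essentially the original rate
```

The point of the joint parameter is that continuity results need to be proved only once, in φ, rather than separately for policies and for perturbations.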
