Structures of Optimal Policies in MDPs with Unbounded Jumps: The State of Our Art

  • H. Blok
  • F. M. Spieksma
Chapter, part of the International Series in Operations Research & Management Science book series (ISOR, volume 248)

Abstract

The derivation of structural properties of countable-state Markov decision processes (MDPs) is generally based on sample-path methods or on value iteration arguments. In the latter case, the method is to prove the structural properties of interest inductively for the n-horizon value function. A limit argument should then allow one to deduce the same structural properties for the infinite-horizon value function.
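
To make the inductive step concrete, the following minimal sketch (our illustration, not code from the chapter) runs value iteration for a uniformised, truncated M/M/1 admission-control MDP under the discounted cost criterion, and verifies at every iteration that the n-horizon value function retains the structural property of interest, here monotonicity in the queue length. The model, all parameters, and the truncation level N are hypothetical choices made for this example.

    import numpy as np

    # Illustrative sketch: n-horizon value iteration for a uniformised,
    # truncated M/M/1 admission-control MDP with discounted cost. The
    # inductive step checks that every V_n is monotone in the queue length.
    lam, mu, alpha = 1.0, 1.5, 0.1   # arrival rate, service rate, discount rate
    h, R = 1.0, 5.0                  # holding cost rate, lump rejection penalty
    N = 200                          # truncation level of the countable state space
    Lam = lam + mu                   # uniformisation constant (rates are bounded)

    V = np.zeros(N + 1)              # V_0 = 0 is trivially monotone
    x = np.arange(N + 1)
    for n in range(2000):
        up = V[np.minimum(x + 1, N)]    # admitted arrival, reflected at N
        down = V[np.maximum(x - 1, 0)]  # service completion; dummy jump at x = 0
        V_new = (h * x + lam * np.minimum(up, R + V) + mu * down) / (alpha + Lam)
        assert np.all(np.diff(V_new) >= -1e-12), "monotonicity broken"
        if np.max(np.abs(V_new - V)) < 1e-10:
            break
        V = V_new
    print(f"converged after {n} iterations; V(0..5) = {np.round(V[:6], 3)}")

Since every V_n is monotone and V_n converges pointwise, the infinite-horizon value function inherits monotonicity; this is exactly the first limit argument referred to above.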

For discrete-time MDPs with the objective of minimising the total expected α-discounted cost, this procedure is justified under mild conditions. When the objective is to minimise the long-run average expected cost, value iteration does not necessarily converge. Allowing time to be continuous creates no further complications when the jump rates are bounded as a function of the state, since uniformisation is then applicable. However, when the jump rates are unbounded as a function of the state, uniformisation is applicable only after a suitable perturbation of the jump rates, one that does not destroy the desired structural properties. Thus a second limit argument is required as well.
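
For reference, a standard formulation of the uniformisation step, in notation chosen here rather than taken from the chapter: if the jump rates q(y | x, a) have total rate q_x(a) = \sum_{y \neq x} q(y \mid x, a) bounded by some constant \Lambda < \infty, then the α-discounted continuous-time MDP is equivalent to a discrete-time discounted MDP with

    % Uniformisation of a bounded-rate continuous-time MDP (standard construction)
    \tilde P(y \mid x, a) =
      \begin{cases}
        q(y \mid x, a)/\Lambda, & y \neq x, \\[2pt]
        1 - q_x(a)/\Lambda,     & y = x,
      \end{cases}
    \qquad
    \tilde c(x, a) = \frac{c(x, a)}{\alpha + \Lambda},
    \qquad
    \tilde \beta = \frac{\Lambda}{\alpha + \Lambda}.

When \sup_{x,a} q_x(a) = \infty, no finite \Lambda exists, which is precisely why the rates must first be perturbed and a second limit argument becomes necessary.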

The importance of unbounded-rate, countable-state MDPs has increased in recent years, due to applications modelling customer or patient impatience and abandonment. However, the theory validating the required limit arguments does not seem to be complete, and the relevant results are scattered over the literature.

In this chapter, our objective is to provide a systematic way to tackle this problem under relatively mild conditions, and to provide the theory validating the presented approach. The base model is a parametrised Markov process (MP): both perturbed MPs and MDPs are special cases of a parametrised MP. The advantage is that a single parameter can simultaneously model a policy and a perturbation.
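
As a hypothetical illustration of how one parameter can carry both roles, in our notation rather than the chapter's: for a birth-death process, let the parameter \phi = (a, N) couple a policy a with a truncation level N, damping the birth rates linearly in the spirit of smoothed rate truncation,

    % One possible parametrisation coupling a policy a and a perturbation N
    \phi = (a, N), \qquad
    q^{\phi}(x, x+1) = \lambda_a(x)\,\Bigl(1 - \frac{x}{N}\Bigr)^{+}, \qquad
    q^{\phi}(x, x-1) = \mu_a(x).

For each finite N the birth rate vanishes for x \geq N, so the process is eventually confined to the finite set \{0, \ldots, N\}, where the rates are bounded and uniformisation applies; letting N \to \infty recovers the unperturbed dynamics. Structural properties established uniformly in the parameter then survive both the policy limit and the perturbation limit.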

Acknowledgements

We would like to thank Sandjai Bhulai for introducing us to the illustrative tandem queue model in Sect. 5.2.2. Moreover, he provided us with numerical results for Figs. 5.2 and 5.3.

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Eindhoven University of Technology, Eindhoven, The Netherlands
  2. Leiden University, Leiden, The Netherlands
