Abstract
In this chapter we study Markov decision processes (MDPs) with finite state and action spaces. This is the classical theory, developed mainly since the late 1950s. We consider both finite- and infinite-horizon models. For the finite-horizon model the total expected reward is the commonly used utility function. For the infinite-horizon model the choice of utility function is less obvious, and we consider several criteria: total discounted expected reward, average expected reward, and more sensitive optimality criteria, including the Blackwell optimality criterion. We end with a variety of other subjects.
The emphasis is on computational methods for finding optimal policies under these criteria. These methods are based on concepts such as value iteration, policy iteration and linear programming. This survey covers about three hundred papers. Although the subject of finite state and action MDPs is classical, there are still open problems, and we mention some of them.
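To make the first of these concepts concrete, here is a minimal value-iteration sketch for a finite discounted MDP, in the spirit of the methods surveyed in the chapter. The array layout (`P[s, a, t]` for transition probabilities, `r[s, a]` for immediate rewards) and the two-state example are hypothetical illustrations, not taken from the text:

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, eps=1e-8):
    """Successive approximation of the optimal value vector.

    P[s, a, t] = Pr(next state = t | state = s, action = a);
    r[s, a]    = immediate expected reward; 0 < gamma < 1.
    Returns an (eps-accurate) value vector and a greedy policy.
    """
    v = np.zeros(r.shape[0])
    while True:
        # Bellman optimality operator: (Tv)(s) = max_a [ r(s,a) + gamma * E[v(next)] ]
        q = r + gamma * np.einsum('sat,t->sa', P, v)
        v_new = q.max(axis=1)
        # Standard stopping rule guaranteeing an eps-optimal value vector.
        if np.max(np.abs(v_new - v)) < eps * (1 - gamma) / (2 * gamma):
            return v_new, q.argmax(axis=1)
        v = v_new

# Hypothetical two-state example: state 1 pays reward 1 under every action;
# action 0 stays in the current state, action 1 switches states.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0   # action 0: stay
P[0, 1, 1] = P[1, 1, 0] = 1.0   # action 1: switch
r = np.array([[0.0, 0.0],
              [1.0, 1.0]])
v, pi = value_iteration(P, r)
# With gamma = 0.9: v ≈ [9, 10], and the greedy policy switches in
# state 0 and stays in state 1.
```

Policy iteration and linear programming compute the same value vector by different routes; value iteration is shown here only because it is the shortest to state.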
References
S.C. Albright, (1979): “Structural results for partially observable Markov decision processes”, Operations Research 27, 1041–1053.
E. Altman, (1999): “Constrained Markov decision processes”, Chapman & Hall/CRC, Boca Raton, Florida.
E. Altman, A. Hordijk and L.C.M. Kallenberg (1996): “On the value function in constrained control of Markov chains”, Mathematical Methods of Operations Research 44, 387–399.
E. Altman and A. Shwartz (1991a): “Sensitivity of constrained Markov decision processes”, Annals of Operations Research 33, 1–22.
E. Altman and A. Shwartz (1991b): “Adaptive control of constrained Markov chains”, IEEE-Transactions on Automatic Control 36, 454–462.
E. Altman and A. Shwartz (1991c): “Adaptive control of constrained Markov decision chains: criteria and policies”, Annals of Operations Research 28, 101–134.
E. Altman and A. Shwartz (1991): “Sensitivity of constrained Markov decision processes”, Annals of Operations Research 33, 1–22.
E. Altman and F.M. Spieksma (1995): “The linear program approach in Markov decision processes”, Mathematical Methods of Operations Research 42, 169–188.
J.S. Baras, D.J. Ma and A.M. Makowsky (1985): “K competing queues with linear costs and geometric service requirements: the µc-rule is always optimal” Systems Control Letters 6, 173–180.
J. Bather (1973a): “Optimal decision procedures for finite Markov chains. Part II: Communicating systems”, Advances in Applied Probability 5, 521–540.
J. Bather (1973b): “Optimal decision procedures for finite Markov chains. Part III: General convex systems”, Advances in Applied Probability 5, 541–553.
M. Bayal-Gursoy and K.W. Ross (1992): “Variability-sensitivity Markov decision processes”, Mathematics of Operations Research 17, 558–571.
R. Bellman (1957): “Dynamic programming”, Princeton University Press, Princeton.
A. Ben-Israel and S.D. Flam (1990): “A bisection/successive approximation method for computing Gittins indices”, Zeitschrift für Operations Research 34, 411–422.
D.P. Bertsekas (1976): “Dynamic programming and stochastic control”, Academic Press, New York.
D.R Bertsekas (1976b): “On error bounds for successive approximation methods”, IEEE Transactions on Automatic Control 21, 394–396.
D.R Bertsekas (1987): “Dynamic programming: deterministic and stochastic models”, Prentice-Hall, Englewood Cliff.
D.R Bertsekas (1995): “Dynamic programming and optimal control I”, Athena Scientific, Belmont, Massachusetts.
D.R Bertsekas (1995): “Dynamic programming and optimal control II”, Athena Scientific, Belmont, Massachusetts+.
D.R. Bertsekas (1995c): “Generic rank-one corrections for value iteration in Markovian decision problems”, OR Letters 17, 111–119.
D.R. Bertsekas (1998): “A new value iteration method for the average cost dynamic programming problem”, SIAM Journal on Control and Optimization 36, 742–759.
D.R Bertsekas and S.E. Shreve (1978) “Stochastic Optimal Control”, Academic Press, New York.
D.P. Bertsekas and J.N. Tsitsiklis (1991): “An analysis of stochastic shortest path problems”, Mathematics of Operations Research 16, 580–595.
D. Bertsimas and J. Niño-Mora (1996): “Conservations laws, extended polymatroids and multi-armed bandit problems; a polyhedral approach to indexable systems”, Mathematics of Operations Research 21, 257–306.
F.J. Beutler and K.W. Ross (1985): “Optimal policies for controlled Markov chains with a constraint”, Journal of Mathematical Analysis and Applications 112, 236–252.
K.-J. Bierth (1987): “An expected average reward criterion”, Stochastic Processes and Applications 26, 133–140.
D. Blackwell (1962): “Discrete dynamic programming”, Annals of Mathematical Statistics, 719–726.
L. Breiman (1964): “Stopping-rule problems”, in: E.F. Beckenbach (ed.), Applied Combinatorial Mathematics”, Wiley, New York, 284–319.
B.W. Brown (1965): “On the iterative method of dynamic programming on a finite space discrete time Markov process”, Annals of Mathematical Statistics 36, 1279–1285.
J. Bruno, P. Downey and G.N. Frederickson (1981): “Sequencing tasks with exponential service times to minimize the expected flowtime or makespan”, Journal of the Association for Computing Machinery 28, 100–113.
A.N. Burnetas, and M.N. Katehakis (1997): “Optimal adaptive policies for Markov decision processes”, Mathematics of Op. Research 22, 222–255.
R. Cavazos-Cadena (1991): “Nonparametric estimation and adaptive control in a class of finite Markov decision chains”, Annals of Operations Research 28, 169–184.
C.-S. Chang, A. Hordijk, R. Righter and G. Weiss (1994): “The stochastic optimality of SEPT in parallel machine scheduling”, Probability in the Engineering and Information Sciences 8, 179–188.
M.C. Chen, Jr. (1973): “Optimal stopping in a discrete search problem”, Operations Research 21, 741–747.
Y.-R. Chen and M.N. Katehakis (1986): “Linear programming for finite state bandit problems”, Mathematics of Operations Research 11, 180–183.
Y.S. Chow and H. Robbins (1961): “A martingale system theorem and applications” in: J. Neyman (ed), “Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability”, Vol.1, University of Berkeley Press, Berkeley, 93–104.
K.-J. Chung (1989): “A note on maximal mean/standard deviation ratio in an undiscounted MDP”, OR Letters 8, 201–204.
K.-J. Chung (1992): “Remarks on maximal mean/standard deviation ratio in an undiscounted MDPs”, Opimization 26, 385–392.
K.-J. Chung (1994): “Mean-variance trade-offs in an undiscounted MDP: the unichain case”, Operations Research 42, 184–188.
G.B. Dantzig (1963): “Linear programming and extensions”, Princeton University Press, Princeton, New Jersey.
J.S. De Cani (1964): “A dynamic programming algorithm for embedded Markov chains when the planning horizon is at infinity”, Management Science 10, 716–733.
G.T. De Ghellinck (1960): “Les problèmes de décisions séquentielles”, Cahiers du Centre de Recherche Opérationelle, 161–179.
G.T. De Ghellinck and G.D. Eppen (1967): “Linear programming solu- tions for separable Markovian decision problems”, Management Science 13, 371–394.
R.S. Dembo and M. Haviv (1984): “Truncated policy iteration methods”, OR Letters 3, 243–246.
E.V. Denardo (1967): “Contraction mappings in the theory underlying dynamic programming”, SIAM Review 9, 165–167.
E.V. Denardo (1968): “Separable Markovian decision problems”, Management Science 14, 451–462.
E.V. Denardo (1970): “Computing a bias-optimal policy in a discrete-time Markov decision problem”, Operations Research 18, 279–289.
E.V. Denardo (1971): “Markov renewal programs with small interest rates”, Annals of Mathematical Statistics 42, 477–496.
E.V. Denardo (1973): “A Markov decision problem”, in: T.C. Hu and S.M. Robinson (eds.), “Mathematical Programming”, Academic Press, 33–68.
E.V. Denardo (1982): “Dynamic programming: models and applications”, Prentice-Hall, Englewood Cliff.
E.V. Denardo and B.L. Fox (1968): “Multichain Markov renewal pro- grams”, SIAM Journal on Applied Mathematics 16, 468–487.
E.V. Denardo and B.L. Miller (1968): “An optimality condition for discrete dynamic programming with no discounting”, Annals of Mathematical Statistics 39, 1220–1227.
E.V. Denardo and U.G. Rothblum (1979a): “Optimal stopping, expo- nential utility and linear programming”, Mathematical Programming 16, 228–244.
E.V. Denardo and U.G. Rothblum (1979b): “Overtaking optimality for Markov decision chains”, Mathematics of Operations Research 4, 144–152.
F. D’Epenoux (1960): “Sur un problème de production et de stockage dans l’aléatoire”, Revue Française de Recherche Opérationelle, 3–16.
C. Derman (1962): “On sequential decisions and Markov chains”, Management Science 9, 16–24.
C. Derman (1963): “Optimal replacement rules when changes of states are Markovian”, in: R. Bellman (ed.), “Mathematical optimization techniques”, The Rand Corporation, R-396-PR, 201–212.
C. Derman (1970): “Finite state Markovian decision processes”, Academic Press, New York.
C. Derman and M. Klein (1965): “Some remarks on finite horizon Marko- vian decision models”, Operations Research 13, 272–278.
C. Derman and J. Sacks (1960): “Replacement of periodically inspected equipment (an optimal stopping rule)”, Naval Research Logistics Quarterly 7, 597–607.
C. Derman and R. Strauch (1966): “A note on memoryless rules for controlling sequential control problems”, Annals of Mathematical Statistics 37, 276–278.
C. Derman and A.F. Veinott, Jr. (1972): “Constrained Markov decision chains”, Management Science 19, 389–390.
H.M. Dietz and V. Nollau (1983): “Markov decision problems with countable state space”, Akademie-Verlag, Berlin.
L. Dubins and L.J. Savage (1965): “How to gamble if you must”, McGraw-Hill, New York.
S. Durinovics, H.M. Lee, M.N. Katehakis and J.A. Filar (1986): “Multiobjective Markov decision processes with average reward criterion”, Large Scale Systems 10, 215–226.
E.B. Dynkin (1979): “Controlled Markov process”, Springer-Verlag, New York.
J.H. Eaton and L.A. Zadeh (1962): “Optimal pursuit strategies in discrete state probabilistic systems”, Transactions ASME Series D, Journal of Basic Engineering 84, 23–29.
A. Ephremides, P. Varaiya and J. Walrand (1980): “A simple dynamic routing problem”, IEEE Transactions on Automatic Control AC-25, 690–693.
A. Federgruen (1984): “Markovian control problems: functional equations and algorithms”, Mathematical Centre Tract 97, Mathematical Centre, Amsterdam.
A. Federgruen and P.J. Schweitzer (1978): “Discounted and undiscounted value iteration in Markov decision problems: a survey”, in: M.L. Puterman (ed), “Dynamic programming and its applications”, Academic Press, New York, 23–52.
A. Federgruen and P.J. Schweitzer (1980): “A survey of asymptotic value-iteration for undiscounted Markovian decision processes”, in: R. Hartley, L.C. Thomas and D.J. White (eds.), “Recent development in Markov decision processes”, Academic Press, New York, 73–109.
A. Federgruen and P.J. Schweitzer (1984a): “A fixed-point approach to undiscounted Markov renewal programs”, SIAM Journal on Algebraic Discrete Methods 5, 539–550.
A. Federgruen and P.J. Schweitzer (1984b): “Successive approximation methods for solving nested functional equations in Markov decision problems”, Mathematics of Operations Research 9, 319–344.
A. Federgruen, P.J. Schweitzer and H.C. Tijms (1978): “Contraction map- pings underlying undiscounted Markov decision problems”, Journal of Mathematical Analysis and Applications 65, 711–730.
A. Federgruen and D. Spreen (1980): “A new specification of the multichain policy iteration algorithm in undiscounted Markov renewal programs”, Management Science 26, 1211–1217.
A. Federgruen and P. Zipkin (1984): “An efficient algorithm for computing optimal (s, S) policies”, Operations Research 34, 1268–1285.
E.A. Feinberg and A. Shwartz (1994): “Markov decision models with weighted discounted criteria”, Mathematics of Operations Research 19, 152–168.
J.A. Filar, L.C.M. Kallenberg and H.M. Lee (1989): “Variance-penalized Markov decision processes”, Mathematics of Operations Research 14, 147–161.
J.A. Filar and O. J. Vrieze (1997): “Competitive Markov decision processes”, Springer-Verlag, New York.
B.L. Fox (1968): “(g, w)-optima in Markov renewal programs”, Management Science 15, 210–212.
E. Frostig (1993): “Optimal policies for machine repairmen problems”, Journal of Applied Probability 30, 703–715.
N. Furakawa (1980): “Characterization of optimal policies in vector-valued Markovian decision processes”, Mathematics of Operations Research 5, 271–279.
S. Gal (1984): “An O(N3) algorithm for optimal replacement problems”, SIAM Journal on Control and Optimization 22, 902–910.
R. Garbe and K.D. Glazebrook (1998): “On a new approach to the analysis of complex multi-armed bandit problems”, Mathematical Methods of Operations Research 48, 419–442.
J.C. Gittins (1979): “Bandit processes and dynamic allocation indices”, Journal of the Royal Statistic Society Series B 14, 148–177.
J.C. Gittins and D.M. Jones (1974): “A dynamic allocation index for the sequential design of experiments”, in J. Gani (ed.) “Progress in Statistics”, North Holland, Amsterdam, 241–266.
K.D. Glazebrook and R. Garbe (1996): “Reflections on a new approach to Gittins indexation”, Journal of the Operational Research Society 47, 1301–1309.
K.D. Glazebrook and S. Greatrix (1995): “On transforming an index for generalized bandit problems”, J. of App. Prob. 32, 168–182.
K.D. Glazebrook and R.W. Owen (1991): “New results for generalized bandit problems”, International Journal of System Sciences 22, 479–494.
M.K. Ghosh (1990): “Markov decision processes with multiple costs”, OR Letters 9, 257–260.
R. Grinold (1973): “Elimination of suboptimal actions in Markov decision problems”, Operations Research 21, 848–851.
R. Hartley, A.C. Lavercombe and L.C. Thomas (1986): “Computational comparison of policy iteration algorithms for discounted Markov decision processes”, Computers and Operations Research 13, 411–420.
N.A.J. Hastings (1968): “Some notes on dynamic programming and replacement”, Operational Research Quarterly 19, 453–464.
N.A.J. Hastings (1969): “Optimization of discounted Markov decision problems”, Operations Research Quarterly 20, 499–500.
N.A.J. Hastings (1971): “Bounds on the gain of a Markov decision process”, Operations Research 19, 240–243.
N.A.J. Hastings (1976): “A test for nonoptimal actions in undiscounted finite Markov decision chains”, Management Science 23, 87–92.
N.A.J. Hastings and J.M.C.Mello (1973): “Tests for nonoptimal actions in discounted Markov decision problems”, Management Science 19, 1019–1022.
N.A.J. Hastings and D.Sadjani (1979): “Markov programming with policy constraints”, European Journal of Operations Research 3, 253–255.
N.A.J. Hastings and J.A.E.E. Van Nunen (1977): “The action elimination algorithm for Markov decision processes”, in H.C. Tijms and J. Wessels (eds), “Markov decision theory”, Mathematical Centre Tract 100, 161–170, Mathematical Centre, Amsterdam.
M. Haviv and M.L. Puterman (1991): “An improved algorithm for solving communicating average reward Markov decision processes”, Annals of Operations Research 28, 229–242.
M.I. Henig (1983): “Vector-valued dynamic programming”, SIAM Journal on Control and Optimization 21, 490–499.
O. Hernández-Lerma (1987): “Adaptive Markov control processes”, Springer-Verlag, New York.
O. Hernández-Lerma and J. B. Lasserre (1996): “Discrete-time Markov control processes: Basic optimality criteria”, Springer-Verlag, New York.
O. Hernández-Lerma and J. B. Lasserre (1999): “Further topics on discrete-time Markov control processes”, Springer-Verlag, New York.
M. Herzberg and U. Yechiali (1994): “Accelerating procedures of the value iteration algorithm for discounted Markov decision processes, based on a one-step look-ahead analysis”, Operations Research 42, 940–946.
D.P. Heyman and M. J. Sobel (1984): “Stochastic models in Operations Research, Volume II, MacGraw-Hill, New York.
K. Hinderer (1970): “Foundations of non-stationary dynamic programming with discrete time parameter”, Springer-Verlag, New York.
U.D. Holzbaur (1986a): “Entscheidungsmodelle über angeordneten Körpern”, Optimization 17, 515–524.
U.D. Holzbaur (1986b): “Sensitivitätsanalysen in Entscheidungsmodellen”, Optimization 17, 525–533.
U.D. Holzbaur (1994): “Bounds for the quality and the number of steps in Bellman’s value iteration algorithm”, OR Spektrum 15, 231–234.
A. Hordijk (1971): “A sufficient condition for the existence of an optimal policy with respect to the average cost criterion in Markovian decision processes”, Transactions of the Sixth Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, Academia, Prague, 263–274.
A. Hordijk (1974): “Dynamic programming and Markov potential theory”, Mathematical Centre Tract 51, Amsterdam.
A. Hordijk, R. Dekker and L.C.M. Kallenberg (1985): “Sensitivity-analysis in discounted Markovian decision problems”, OR Spektrum 7, 143–151.
A. Hordijk and L.C.M. Kallenberg (1979): “Linear programming and Markov decision chains”, Management Science 25, 352–362.
A. Hordijk and L.C.M. Kallenberg (1984a): “Transient policies in discrete dynamic programming: linear programming including suboptimality tests and additional constraints”, Mathematical Programming 30, 46–70.
A. Hordijk and L.C.M. Kallenberg (1984b): “Constrained undiscounted stochastic dynamic programming”, Mathematics of Operations Research 9, 276–289.
A. Hordijk and J.A. Loeve (1994): “Undiscounted Markov decision chains with partial information; an algorithm for computing a locally optimal periodic policy”, Mathematical Methods of Operations Research 40, 163–181.
A. Hordijk and H.C. Tijms (1974): “The method of successive approx- imations and Markovian decision problems”, Operations Research 22, 519–521.
A. Hordijk and H.C. Tijms (1975): “A modified form of the iterative method of dynamic programming”, Annals of Statictics 3, 203–208.
A. Hordijk and H.C. Tijms (1975): “On a conjecture of Iglehart”, Management Science 11, 1342–1345.
R.A. Howard (1960): “Dynamic programming and Markov processes”, MIT Press, Cambridge.
R.A. Howard (1963): “Semi-Markovian decision processes”, Proceedings International Statistical Institute, Ottawa, Canada.
Y. Huang and L.C.M. Kallenberg (1994): “On finding optimal policies for Markov decision chains: a unifying framework for mean-variance trade-offs”, Mathematics of Operations Research 19, 434–448.
G. Hübner (1977): “Improved procedures for eliminating suboptimal actions in Markov programming by the use of contraction properties”, Transactions of the 7th Prague Conference on Information Theory, Statistical Decision Functions, Reidel, Dordrecht, 257–263.
G. Hübner (1988): “A unified approach to adaptive control of average reward Markov decision processes”, OR Spektrum 10, 161–166.
D. Iglehart (1963): “Optimality of (s, S)-policies in the infinite horizon dynamic inventory problem”, Management Science 9, 259–267.
T. Ishikida and P. Varaiya (1994): “Multi-armed bandit problem revis- ited”, Journal of Optimization Theory and Applications 83, 113–154.
R.G. Jeroslow (1972): “An algorithm for discrete dynamic programming with interest rates near zero”, Management Science Research Report no. 300, Carnegie-Mellon University, Pittsburgh.
W.S. Jewell (1963a): “Markov renewal programming. I: Formulation, finite return models”, Operations Research 11, 938–948.
W.S. Jewell (1963b): “Markov renewal programming. II: Infinite return models, example”, Operations Research 11, 949–971.
L.C.M. Kallenberg (1981a): “Finite horizon dynamic programming and linear programming”, Methods of Operations Research 43, 105–112.
L.C.M. Kallenberg (1981b): “Unconstrained and constrained dynamic programming over a finite horizon”, Report, University of Leiden, The Netherlands.
L.C.M. Kallenberg (1981c): “Linear programming to compute a bias-optimal policy”, in: B. Fleischmann et al. (eds.) “Operations Research Proceedings”, 433–440.
L.C.M. Kallenberg (1983): “Linear programming and finite Markovian control problems”, Mathematical Centre Tract 148, Mathematical Centre, Amsterdam.
L.C.M. Kallenberg (1986): “Note on M.N.Katehakis and Y.-R.Chen’s computation of the Gittins index”, Mathematics of Operations Research 11, 184–186.
L.C.M. Kallenberg (1992): “Separable Markovian decision problem: the linear programming method in the multichain case”, OR Spektrum 14, 43–52.
L.C.M. Kallenberg (1999): “Combinatorial problems in MDPs”, Report, University of Leiden, The Netherlands (to appear in the Proceedings of the Changsha International Workshop on Markov Processes & Controlled Markov Chains).
P.C. Kao (1973): “Optimal replacement rules when the changes of states are semi-Markovian”, Operations Research 21, 1231–1249.
M.N. Katehakis and C. Derman (1984): “Optimal repair allocation in a series system”, Mathematics of Operations Research 9, 615–623.
M.N. Katehakis and C. Derman (1989): “On the maintenance of systems composed of highly reliable components”, Management Science 35, 551–560.
M.N. Katehakis and A.F. Veinott, Jr. (1987): “The multi-armed bandit problem: decomposition and computation”, Mathematics of Operations Research 12, 262–268.
H. Kawai (1987): “A variance minimization problem for a Markov decision process”, European Journal of Operational Research 31, 140–145.
H. Kawai and N. Katoh (1987): “Variance constrained Markov decision process”, Journal of the Operations Research Society of Japan 30, 88–100.
J.G. Kemeny and J.L. Snell (1960): “Finite Markov chains”, Van Nostrand, Princeton.
M. Klein (1962): “Inspection-maintenance-replacement schedules under Markovian deterioration”, Management Science 9, 25–32.
P. Kolesar (1966): “Minimum-cost replacement under Markovian deterioration”, Management Science 12, 694–706.
M. Kurano (1983): “Adaptive policies in Markov decision processes with uncertain transition matrices”, Journal of Information and Optimization Sciences 4, 21–40.
H. Kushner (1971): “Introduction to stochastic control”, Holt, Rineholt and Winston, New York.
H. Kushner and A.J. Keinmann (1971): “Accelerated procedures for the solution of discrete Markov control problems”, IEEE Transactions on Automatic Control 16, 147–152.
E. Lanery (1967): “Etude asymptotique des systèmes Markovien à commande”, Revue d’Informatique et Recherche Operationelle 1, 3–56.
J.B. Lasserre (1994a): “A new policy iteration scheme for Markov decision processes using Schweitzer’s formula”, Journal of Applied Probability 31, 268–273.
J.B. Lasserre (1994b): “Detecting optimal and non-optimal actions in average-cost Markov decision processes”, Journal of Applied Probability 31, 979–990.
W. Lin and P.R. Kumar (1984): “Optimal control of a queueing system with two heterogeneous servers”, IEEE Transactions on Automatic Control AC-29, 696–705.
S.A. Lippman (1969): “Criterion equivalence in discrete dynamic programming”, Operations Research 17, 920–923.
J.Y. Liu and K. Liu (1994): “An algorithm on the Gittins index”, Systems Science and Mathematical Science 7, 106–114.
Q.-S. Liu and K. Ohno (1992): “Multiobjective undiscounted Markov renewal program and its application to a tool replacement problem in an FMS”, Information and Decision Techniques 18, 67–77.
Q.-S. Liu, K. Ohno and H. Nakayama (1992): “Multi-objective discounted Markov processes with expectation and variance criteria”, International Journal of System Science 23, 903–914.
J.A. Loeve (1995): “Markov decision chains with partial information”, PhD dissertation, University of Leiden, The Netherlands.
W.S. Lovejoy (1987): “Some monotonicity results for partially observed Markov processes”, Operations Research 35, 736–743.
W.S. Lovejoy (1991a): “Computationally feasible bounds for partially observed Markov decision processes”, Operations Research 39, 162–175.
W.S. Lovejoy (1991b) “A survey of algorithmic methods for partially observed Markov decision processes”, Annals of Op. Research 28, 47–66.
J. Macqueen (1966): “A modified programming method for Markovian decision problems”, Journal of Mathematical Analysis and Applications 14, 38–43.
J. Macqueen (1967): “A test for suboptimal actions in Markov decision problems”, Operations Research 15, 559–561.
A.S. Manne (1960): “Linear programming and sequential decisions”, Management Science, 259–267.
U. Meister and U. Holzbaur (1986): “A polynomial time bound for Howard’s policy improvement algorithm”, OR Spektrum 8, 37–40.
B.L. Miller and A.F. Veinott Jr. (1969): “Discrete dynamic programming with a small interest rate”, Annals of Mathematical Statistics 40, 366–370.
H. Mine and S. Osaki (1970): “Markov decision processes”, American Elsevier, New York.
G.E. Monahan (1982): “A survey of partially observable Markov decision processes: theory, models and algorithms”, Management Science 28, 1–16.
T. Morton (1971): “Undiscounted Markov renewal programming via mod- ified successive approximations”, Operations Research 19, 1081–1089.
J.L. Nazareth and R.B. Kulkarni (1986): “Linear programming formulations of Markov decision processes”, OR Letters 5, 13–16.
M.K. Ng (1999): “A note on policy iteration algorithms for discounted Markov decision problems”, OR Letters 25, 195–197.
A. Odoni (1969): “On finding the maximal gain for Markov decision processes”, Operations Research 17, 857–860.
S. Oezekici (1988): “Optimal periodic replacement of multicomponent reliability systems”, Operations Research 36, 542–552.
K. Ohno (1981): “A unified approach to algorithms with a suboptimality test in discounted semi-Markov decision processes”, Journal of the Operations Research Society of Japan 24, 296–323.
S. Osaki and H. Mine (1968): “Linear programming algorithms for semi-Markovian decision processes”, Journal of Mathematical Analysis and Applications 22, 356–381.
T. Parthasarathy, S.H. Tijs and O.J. Vrieze (1984), “Stochastic games with state independent transitions and reparable rewards in: G. Hammer and D. Pallaschke (eds.), Selected Topics in Operations Research and Mathematical Economics.
L.K. Platzman (1977): “Improved conditions for convergence in undiscounted Markov renewal programming”, Op. Research 25, 529–533.
M.A. Pollatschek and B. Avi-Itzhak (1969): “Algorithms for stochastic games with geometric interpretation”, Management Science 15, 399–415.
E.L. Porteus (1971): “Some bounds for discounted sequential decision processes”, Management Science 18, 7–11.
E.L. Porteus (1975): “Bounds and transformations for discounted finite Markov decision chains”, Operations Research 23, 761–784.
E.L. Porteus (1980a): “Improved iterative computation of the expected return in Markov and semi-Markov chains”, Zeitschrift für Operations Research 24, 155–170.
E.L. Porteus (1980b): “Overview of iterative methods for discounted finite Markov and semi-Markov chains”, in: R. Hartley, L.C. Thomas and D.J. White (eds.), “Recent development in Markov decision processes”, Academic Press, New York, 1–20.
E.L. Porteus (1981): “Computing the discounted return in Markov and semi- Markov chains”, Naval Research Logistics Quarterly 28, 567–577.
E. L. Porteus and J.C. Totten (1978): “Accelerated computation of the expected discounted return in a Markov chain”, Operations Research 26, 350–358.
M.L. Puterman (1981): “Computational methods for Markov decision methods”, Proceedings of 1981 Joint Automatic Control Conference.
M.L. Puterman (1994): “Markov decision processes”, Wiley, New York.
M.L. Puterman and S.L. Brumelle (1979): “On the convergence of policy iteration in stationary dynamic programming”, Mathematics of Operations Research 4, 60–69.
M.L. Puterman and M.C. Shin (1978): “Modified policy iteration algorithms for discounted Markov decision chains”, Management Science 24, 1127–1137.
M.L. Puterman and M.C. Shin (1982): “Action elimination procedures for modified policy iteration algorithms” Operations Research 30, 301–318.
D. Reetz (1973): “Solution of a Markovian decision problem by successive overrelaxation”, Zeitschrift für Operations Research 17, 29–32.
D. Reetz (1976): “A decision exclusion algorithm for a class of Markovian decision processes”, Zeitschrift für Operations Research 20, 125–131.
U. Rieder (1991): “Structural results for partially observed control problems”, Zeitschrift für Operations Research 35, 473–490.
R. Righter (1994): “Scheduling”, in: M. Shaked and J.G. Shantikumar (eds.), “Stochastic orders and their applications”, Academic Press, 381–432.
M. Roosta (1982): “Routing through a network with maximum reliability”, Journal of Mathematical Analysis and Applications 88, 341–347.
K.W. Ross (1989): “Randomized and past-dependent policies for Markov decision processes with multiple constraints”, Operations Research 37, 474–477.
K.W. Ross and R. Varadarajan (1991): “Multichain Markov decision processes with a sample path constraint: a decomposition approach”, Mathematics of Operations Research 16, 195–207.
S.M. Ross (1969): “A problem in optimal search and stop”, Operations Research 17, 984–992.
S.M. Ross (1970): “Applied probability models with optimization applications”, Holden-Day, San Francisco.
S.M. Ross (1974): “Dynamic programming and gambling models”, Advances in Applied Probability 6, 593–606.
S.M. Ross (1983): “Introduction to stochastic dynamic programming”, Academic Press, New York.
U.G. Rothblum (1979): “Iterated successive approximation for sequential decision processes”, in J.W.B. van Overhagen and H.C. Tijms (eds.), “Stochastic control and optimization”, Free University, Amsterdam, 30–32.
H. Scarf (1960): “The optimality of (s, S) polices in the dynamic inventory problem”, Chapter 13 in: K.J. Arrow, S. Karlin and P. Suppes (eds.), “Mathematical methods in the social sciences”, Stanford University Press, Stanford.
H. Schellhaas (1974): “Zur extrapolation in Markorffschen Entscheidungsmodellen mit Diskontierung”, Zeitschrift für Operations Research 18, 91–104.
N. Schmitz (1985): “How good is Howard’s policy improvement algorithm?”, Zeitschrift fur Operations Research 29, 315–316.
L. Schrage (1968): “A proof of the optimality of the shortest remaining processing time discipline”, Operations Research 16, 687–690.
P.J. Schweitzer (1965): “Perturbation theory and Markovian decision processes”, Ph.D. dissertation, M.I.T., Op. Research Center Report 15.
P.J. Schweitzer (1968): “Perturbation theory and finite Markov chains” Journal of Applied Probability 5, 401–413.
P.J. Schweitzer (1971a): “Multiple policy improvements in undiscounted Markov renewal programming”, Operations Research 19. 784–793.
P.J. Schweitzer (1971b): “Iterative solution of the functional equations of undiscounted Markov renewal programming”, Journal of Mathematical Analysis and Applications 34, 495–501.
P.J. Schweitzer (1984): “A value-iteration scheme for undiscounted multichain Markov renewal programs”, ZOR—Zeitschrift für Operations Research 28, 143–152.
P.J. Schweitzer (1985): “The variational calculus and approximations in policy space for Markov decision processes”, Journal of Mathematical Analysis and Applications 110, 568–582.
P.J. Schweitzer (1987): “A Brouwer fixed-point mapping approach to communicating Markov decision processes”, Journal of Mathematical Analysis and Applications 123, 117–130.
P.J. Schweitzer (1991): “Block-scaling of value-iteration for discounted Markov renewal programming”, Annals of Op. Research 29, 603–630.
P.J. Schweitzer and A. Federgruen (1977): “The asymptotic behavior of value iteration in Markov decision problems”, Mathematics of Operations Research 2, 360–381.
P.J. Schweitzer and A. Federgruen (1978a): “Foolproof convergence in multichain policy iteration”, Journal of Mathematical Analysis and Applications 64, 360–368.
P.J. Schweitzer and A. Federgruen (1978b): “The functional equations of undiscounted Markov renewal programming”, Mathematics of Operations Research 3, 308–321.
P.J. Schweitzer and A. Federgruen (1979): “Geometric convergence of value iteration in multichain Markov decision problems”, Advances of Applied Probability 11, 188–217.
L.I. Sennott (1999): “Stochastic dynamic programming and the control of queueing systems”, Wiley, New York.
E.L. Sernik and S.I. Marcus (1991): “On the computation of the optimal cost function for discrete time Markov models with partial observations”, Annals of Operations Research 29, 471–512.
J.F. Shapiro (1975): “Brouwer’s fixed point theorem and finite state space Markovian decision theory”, Journal of Mathematical Analysis and Applications 49, 710–712.
L.S. Shapley (1953): “Stochastic games”, Proceedings of the National Academy of Sciences 39, 1095–1100.
Y.S. Sherif and M.L. Smith (1981): “Optimal maintenance policies for systems subject to failure—A review”, Naval Research Logistics Quarterly 28, 47–74.
K. Sladky (1974): “On the set of optimal controls for Markov chains with rewards”, Kybernetika 10, 350–367.
R.D. Smallwood (1966): “Optimum policy regions for Markov processes with discounting”, Operations Research 14, 658–669.
R.D. Smallwood and E. Sondik (1973): “The optimal control of partially observable Markov processes over a finite horizon”, Operations Research 21, 1071–1088.
D.R. Smith (1978): “Optimal repairman allocation—asymptotic results”, Management Science 24, 665–674.
M.J. Sobel (1981): “Myopic solutions of Markov decision processes and stochastic games”, Operations Research 29, 995–1009.
M.J. Sobel (1985): “Maximal mean/standard deviation ratio in an undiscounted MDP”, OR Letters 4, 157–159.
M.J. Sobel (1994): “Mean-variance trade-offs in an undiscounted MDP”, Operations Research 42, 175–183.
E. Sondik (1978): “The optimal control of partially observable Markov processes over the infinite horizon: discounted costs”, Operations Research 26, 282–304.
I.M. Sonin (1999): “The elimination algorithm for the problem of optimal stopping”, Mathematical Methods of Operations Research 49, 111–124.
D. Spreen (1981): “A further anti-cycling rule in multi-chain policy iteration for undiscounted Markov renewal programs”, Zeitschrift für Operations Research 25, 225–234.
J. Stein (1988): “On efficiency of linear programming applied to discounted Markovian decision problems”, OR Spektrum 10, 153–160.
S.S. Stidham, Jr. (1985): “Optimal control of admission to a queueing system”, IEEE Transactions on Automatic Control AC-30, 705–713.
S.S. Stidham, Jr. and R.R. Weber (1993): “A survey of Markov decision models for control of networks of queues”, Queueing Systems 13, 291–314.
J. Stoer and R. Bulirsch (1980): “Introduction to numerical analysis”, Springer-Verlag, New York.
R. Strauch and A.F. Veinott, Jr. (1966): “A property of sequential control processes”, Report, Rand McNally, Chicago.
M. Sun (1993): “Revised simplex algorithm for finite Markov decision processes”, Journal of Optimization Theory and Applications 79, 405–413.
L.C. Thomas (1981): “Second order bounds for Markov decision processes”, Journal of Mathematical Analysis and Applications 80, 294–297.
L.C. Thomas (1983): “Constrained Markov decision processes as multiobjective problems”, in: “Multi-objective decision making”, Academic Press, 77–94.
H.C. Tijms (1986): “Stochastic modelling and analysis: a computational approach”, Wiley, Chichester.
J.N. Tsitsiklis (1986): “A lemma on the multi-armed bandit problem”, IEEE Transactions on Automatic Control 31, 576–577.
J.N. Tsitsiklis (1993): “A short proof of the Gittins index theorem”, Annals of Applied Probability 4, 194–199.
F.A. Van der Duyn Schouten and S.G. Vanneste (1990): “Analysis and computation of (n, N)-strategies for maintenance of a two-component system”, European Journal of Operational Research 48, 260–274.
J. Van der Wal (1980): “The method of value oriented successive approximations for the average reward Markov decision processes”, OR Spektrum 1, 233–242.
J. Van der Wal (1981): “Stochastic dynamic programming”, Mathematical Centre Tract 139, Mathematical Centre, Amsterdam.
K.M. Van Hee (1978): “Markov strategies in dynamic programming”, Mathematics of Operations Research 3, 191–201.
K.M. Van Hee, A. Hordijk and J. Van der Wal (1977): “Successive approximations for convergent dynamic programming”, in: H.C. Tijms and J. Wessels (eds.), “Markov decision theory”, Mathematical Centre Tract no. 93, Mathematical Centre, Amsterdam, 183–211.
J.A.E.E. Van Nunen (1976a): “A set of successive approximation methods for discounted Markovian decision problems”, Zeitschrift für Operations Research 20, 203–208.
J.A.E.E. Van Nunen (1976b): “Contracting Markov decision processes”, Mathematical Centre Tract 71, Mathematical Centre, Amsterdam.
J.A.E.E. Van Nunen (1976c): “Improved successive approximation methods for discounted Markovian decision processes”, in: A. Prekopa (ed.), “Progress in Operations Research”, North Holland, Amsterdam, 667–682.
J.A.E.E. Van Nunen and J. Wessels (1976): “A principle for generating optimization procedures for discounted Markov decision processes”, Colloquia Mathematica Societatis Bolyai Janos, Vol. 12, North Holland, Amsterdam, 683–695.
J.A.E.E. Van Nunen and J. Wessels (1977): “The generation of successive approximations for Markov decision processes using stopping times”, in: “Markov decision theory”, H. Tijms and J. Wessels (eds.), Mathematical Centre Tract 93, Mathematical Centre, Amsterdam, 25–37.
P.P. Varaiya, J.C. Walrand and C. Buyukkoc (1985): “Extensions of the multi-armed bandit problem: the discounted case”, IEEE Transactions on Automatic Control 30, 426–439.
A.F. Veinott, Jr. (1966a): “On the optimality of (s, S) inventory policies: new conditions and a new proof”, SIAM Journal on Applied Mathematics 14, 1067–1083.
A.F. Veinott, Jr. (1966b): “On finding optimal policies in discrete dynamic programming with no discounting”, Annals of Mathematical Statistics 37, 1284–1294.
A.F. Veinott, Jr. (1969): “Discrete dynamic programming with sensitive discount optimality criteria”, Annals of Mathematical Statistics 40, 1635–1660.
A.F. Veinott, Jr. (1974): “Markov decision chains”, in: G.B. Dantzig and B.C. Eaves (eds.), “Studies in Optimization”, Studies in Mathematics, Volume 10, The Mathematical Association of America, 124–159.
R.C. Vergin and M. Scribian (1977): “Maintenance scheduling for multicomponent equipment”, AIIE Transactions 9, 297–305.
O.J. Vrieze (1987): “Stochastic games with finite state and action spaces”, CWI Tract 33, Centre for Mathematics and Computer Science, Amsterdam.
K. Wakuta (1992): “Optimal stationary policies in the vector-valued Markov decision process”, Stochastic Processes and their Applications 42, 149–156.
K. Wakuta (1995): “Vector-valued Markov decision processes and the systems of linear inequalities”, Stochastic Processes and their Applications 56, 159–169.
K. Wakuta (1996): “A new class of policies in vector-valued Markov decision processes”, Journal of Mathematical Analysis and Applications 202, 623–628.
K. Wakuta (1999): “A note on the structure of value spaces in vector-valued Markov decision processes”, Mathematical Methods of Operations Research 49, 77–86.
J. Walrand (1988): “An introduction to queueing networks”, Prentice-Hall, Englewood Cliffs, New Jersey.
R.R. Weber (1982): “Scheduling jobs with stochastic processing requirements on parallel machines to minimize makespan or flowtime”.
R.R. Weber (1992): “On the Gittins index for multi-armed bandits”, Annals of Applied Probability 2, 1024–1033.
R.R. Weber and S.S. Stidham, Jr. (1987): “Optimal control of services rates in networks of queues”, Advances in Applied Probability 19, 202–218.
G. Weiss (1982): “Multiserver stochastic scheduling”, in: M.A.H. Dempster, J.K. Lenstra and A.H.G. Rinnooy Kan (eds.), “Deterministic and stochastic scheduling”, Reidel, Dordrecht, Holland, 157–179.
G. Weiss (1988): “Branching bandit processes”, Probability in the Engineering and Information Sciences 2, 269–278.
J. Wessels and J.A.E.E. Van Nunen (1975): “Discounted semi-Markov decision processes: linear programming and policy iteration”, Statistica Neerlandica 29, 1–7.
J. Wessels (1977): “Stopping times on Markov programming”, in: Transactions of the 7th Prague Conference on Information Theory, Statistical Decision Functions and Random Processes, Academia, Prague, pp. 575–585.
C.C. White, III (1976): “Procedures for the solution of a finite-horizon, partially observed, semi-Markov optimization problem”, Operations Research 24, 348–358.
C.C. White, III (1991): “A survey of solution techniques for the partially observed Markov decision process”, Annals of Operations Research 33, 215–230.
C.C. White, III and W.T. Scherer (1989): “Solution procedures for partially observed Markov decision processes”, Operations Research 37, 791–797.
C.C. White, III and W.T. Scherer (1994): “Finite-memory suboptimal design for partially observed Markov decision processes”, Operations Research 42, 439–455.
D.J. White (1963): “Dynamic programming, Markov chains, and the method of successive approximations”, Journal of Mathematical Analysis and Applications 6, 373–376.
D.J. White (1978): “Elimination of non-optimal actions in Markov decision processes”, in: M.L. Puterman (ed.) Dynamic programming and its applications, Academic Press, New York, 131–160.
D.J. White (1982): “Multi-objective infinite-horizon discounted Markov decision processes”, Journal of Mathematical Analysis and Applications 89, 639–647.
D.J. White (1985): “Real applications of Markov decision theory”, Interfaces 15:6, 73–83.
D.J. White (1988): “Further real applications of Markov decision theory”, Interfaces 18:5, 55–61.
D.J. White (1988): “Mean, variance and probabilistic criteria in finite Markov decision processes: a review”, Journal of Optimization Theory and Applications 56, 1–30.
D.J. White (1992): “Computational approaches to variance-penalized Markov decision processes”, OR Spektrum 14, 79–83.
D.J. White (1993): “A survey of applications of Markov decision processes”, Journal of the Operational Research Society 44, 1073–1096.
D.J. White (1993): “Markov decision processes”, Wiley, Chichester.
D.J. White (1994): “A mathematical programming approach to a problem in variance penalised Markov decision processes”, OR Spektrum 15, 225–230.
D.J. White (1995): “A superharmonic approach to solving infinite horizon partially observable Markov decision problems”, Mathematical Methods of Operations Research 41, 71–88.
P. Whittle (1980): “Multi-armed bandits and the Gittins index”, Journal of the Royal Statistical Society, Series B 42, 143–149.
P. Whittle (1982): “Optimization over time: dynamic programming and stochastic control”, Volume I, Wiley, New York.
P. Whittle (1982): “Optimization over time: dynamic programming and stochastic control”, Volume II, Wiley, New York.
M. Yasuda (1988): “The optimal value of Markov stopping problems with one-step look ahead policy”, Journal of Applied Probability 25, 544–552.
Y.-S. Zheng and A. Federgruen (1991): “Finding optimal (s, S)-policies is about as simple as evaluating a single policy”, Operations Research 39, 654–665.
© 2003 Springer Science+Business Media New York
Kallenberg, L. (2003). Finite State and Action MDPs. In: Feinberg, E.A., Shwartz, A. (eds.) Handbook of Markov Decision Processes. International Series in Operations Research & Management Science, vol. 40. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-0805-2_2
Print ISBN: 978-1-4613-5248-8
Online ISBN: 978-1-4615-0805-2