Abstract
Computing the exact solution of an MDP model is generally difficult and possibly intractable for realistically sized problem instances. A powerful technique for solving large-scale discrete-time multistage stochastic control processes is Approximate Dynamic Programming (ADP). Although ADP is used as an umbrella term for a broad spectrum of methods to approximate the optimal solution of MDPs, the common denominator is typically to combine optimization with simulation, use approximations of the optimal values of the Bellman equations, and use approximate policies. This chapter aims to present and illustrate the basics of these steps through a number of practical and instructive examples. We use three examples (1) to explain the basics of ADP, relying on value iteration with an approximation of the value functions, (2) to provide insight into implementation issues, and (3) to provide test cases for readers to validate their own ADP implementations.
Appendix
1.1 Nomadic Trucker Settings
Transportation takes place in a square area of 1000 × 1000 miles. The locations lie on a 16 × 16 Euclidean grid placed on this area, where each location \(i \in \mathcal{L}\) is described by an \((x_{i}, y_{i})\)-coordinate. The first location has coordinate (0, 0) and the last location (location 256) has coordinate (1000, 1000). The minimum distance between two locations is 1000∕15 ≈ 66.7 miles.
For each location \(i \in \mathcal{L}\), there is a number \(0 \leq b_{i} \leq 1\) representing the probability that a load originating at location i will appear at a given time step. The probability that, on a given day of the week d, a load from i to j will appear is given by \(p_{ij}^{d} = p^{d}\,b_{i}(1 - b_{j})\), where \(p^{d}\) gives the probability of loads appearing on day of the week d. The origin probabilities \(b_{i}\) are given by
\(b_{i} = \rho\,\frac{f(x_{i},y_{i}) - f^{min}}{f^{max} - f^{min}},\)
where ρ gives the arrival intensity of loads, and \(f(x_{i}, y_{i})\) is the six-hump camelback function given by \(f(x_{i},y_{i}) = 4x_{i}^{2} - 2.1x_{i}^{4} + \frac{1}{3}x_{i}^{6} + x_{i}y_{i} - 4y_{i}^{2} + 4y_{i}^{4}\) on the domain \((x_{i}, y_{i}) \in [-1.5, 2] \times [-1, 1]\). The highest value is achieved at coordinate (2, 1), with a value of ≈ 5.73, which we reduce to 5 to create a somewhat smoother function (the second highest value is still ≈ 4.72). Next, the values \(f(x_{i}, y_{i})\) are scaled to the domain \((x_{i}, y_{i}) \in [0, 1000] \times [0, 1000]\). The values \(f^{min} = \min_{i\in \mathcal{L}}f(x_{i},y_{i}) \approx -1.03\) and \(f^{max} = \max_{i\in \mathcal{L}}f(x_{i},y_{i}) = 5\) are used to scale \(f(x_{i},y_{i})\) between [0, 1]. An impression of the resulting origin probabilities \(b_{i}\) is given in Fig. 3.11.
We set ρ = 1, which corresponds to an expectation of approximately 93.14 outgoing loads from the most popular origin location on the busiest day of the week. We use a load probability distribution \(p^{d} = (1, 0.8, 0.6, 0.7, 0.9, 0.2, 0.1)\), for d from Monday through Sunday, which represents the situation in which loads are more likely to appear at the beginning of the week (Mondays) and towards the end of the working week (Fridays).
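The construction above can be sketched as follows. This is an illustrative reconstruction, not the chapter's code: the linear mapping of the mile grid onto the camelback domain and the scaling \(b_{i} = \rho\,(f - f^{min})/(f^{max} - f^{min})\) are assumptions based on the description.

```python
import numpy as np

rho = 1.0
n = 16  # 16 x 16 grid on a 1000 x 1000 mile area

# Grid coordinates in miles, mapped linearly onto the camelback domain
xs = np.linspace(0.0, 1000.0, n)
ys = np.linspace(0.0, 1000.0, n)
X, Y = np.meshgrid(xs, ys, indexing="ij")
u = -1.5 + (X / 1000.0) * 3.5   # [0, 1000] -> [-1.5, 2]
v = -1.0 + (Y / 1000.0) * 2.0   # [0, 1000] -> [-1, 1]

# Six-hump camelback function
f = 4 * u**2 - 2.1 * u**4 + u**6 / 3 + u * v - 4 * v**2 + 4 * v**4
f = np.minimum(f, 5.0)          # cap the peak at 5, as described above

# Origin probabilities: camelback values scaled to [0, 1], times rho
b = rho * (f - f.min()) / (f.max() - f.min())

# Day-of-week load probabilities, Monday..Sunday
p_d = np.array([1.0, 0.8, 0.6, 0.7, 0.9, 0.2, 0.1])

# p_ij^d = p^d * b_i * (1 - b_j), here for Monday (d = 0)
bi = b.ravel()
P_monday = p_d[0] * np.outer(bi, 1.0 - bi)
```

With ρ = 1 the most popular origin has \(b_{i} = 1\), matching the capped camelback maximum.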
The results for the infinite horizon multi-attribute version of the nomadic trucker problem can be found below (Fig. 3.12).
1.2 Freight Consolidation Settings
Either one or two freights arrive each period (i.e., \(\mathcal{F} = \left\{1,2\right\}\)), with probability \(p_{f}^{F} = (0.8, 0.2)\) for \(f \in \mathcal{F}\). Each freight that arrives has destination \(d \in \mathcal{D} = \left\{1,2,3\right\}\) with probability \(p_{d}^{D} = (0.1, 0.8, 0.1)\), is already released for transportation (i.e., \(r \in \mathcal{R} = \left\{0\right\}\) and \(p_{r}^{R} = 1\)), and has time-window length \(k \in \mathcal{K} = \left\{0,1,2\right\}\) with probability \(p_{k}^{K} = (0.2, 0.3, 0.5)\).
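Sampling one period's arrivals from these distributions can be sketched as below; the function and variable names are illustrative, not from the chapter.

```python
import random

# Arrival distributions from the test instance
P_F = {1: 0.8, 2: 0.2}            # number of arriving freights
P_D = {1: 0.1, 2: 0.8, 3: 0.1}    # destination
P_K = {0: 0.2, 1: 0.3, 2: 0.5}    # time-window length

def sample(dist, rng):
    """Draw one value from a discrete distribution {value: probability}."""
    r = rng.random()
    cum = 0.0
    for value, p in dist.items():
        cum += p
        if r < cum:
            return value
    return value  # guard against floating-point round-off

def sample_arrivals(rng):
    """Return one period's freights as (destination, release, time window)."""
    n = sample(P_F, rng)
    # every freight is released immediately (r = 0 with probability 1)
    return [(sample(P_D, rng), 0, sample(P_K, rng)) for _ in range(n)]

rng = random.Random(42)
arrivals = sample_arrivals(rng)
```

Such a sampler is all the exogenous-information model needs here, since the three freight attributes are drawn independently.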
The costs are defined as follows. The long-haul, high-capacity vehicle costs (per subset of destinations visited) are \(C_{\mathcal{D}'} = (250, 350, 450, 900, 600, 700, 1000)\) for \(\mathcal{D}' = (\{1\},\{2\},\{3\},\{1,2\},\{1,3\},\{2,3\},\{1,2,3\})\), respectively. These costs are for the entire long-haul vehicle, independent of the number of freights consolidated. Furthermore, there are no costs for the long-haul vehicle if no freights are consolidated. The alternative, low-capacity mode costs (per freight) are \(B_{d} = (500, 1000, 700)\) for \(d \in \mathcal{D}\). There is no discounting, i.e., γ = 1.
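This cost structure can be encoded as a lookup, sketched below. The decision representation (two lists of freight destinations, one consolidated on the long-haul vehicle and one sent by the alternative mode) is an illustrative assumption, not the chapter's formulation.

```python
# Long-haul cost per subset of destinations visited
C_LONG_HAUL = {
    frozenset({1}): 250, frozenset({2}): 350, frozenset({3}): 450,
    frozenset({1, 2}): 900, frozenset({1, 3}): 600,
    frozenset({2, 3}): 700, frozenset({1, 2, 3}): 1000,
}
# Per-freight cost of the alternative, low-capacity mode
B_ALT = {1: 500, 2: 1000, 3: 700}

def period_cost(consolidated, alternative):
    """consolidated/alternative: lists of freight destinations.

    The long-haul cost depends only on the set of destinations visited,
    not on how many freights are consolidated; it is zero if the
    long-haul vehicle carries no freight.
    """
    dests = frozenset(consolidated)
    long_haul = C_LONG_HAUL[dests] if dests else 0
    return long_haul + sum(B_ALT[d] for d in alternative)

# e.g. consolidate two freights to destination 2, send one to 3 separately:
cost = period_cost([2, 2], [3])  # 350 + 700 = 1050
```

Note how consolidating a second freight to an already-visited destination is free, which is what makes consolidation attractive in this instance.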
We build three different sets of features based on a common “job” description used in transportation settings: MustGo, MayGo, and Future freights. MustGo freights are released freights whose due-day is immediate. MayGo freights are released freights whose due-day is not immediate. Future freights are those that have not yet been released. We use the MustGo, MayGo, and Future adjectives for destinations as well, with a meaning analogous to that for freights. In Table 3.4 we show the three sets of features, which we name Value Function Approximation (VFA) sets 1, 2, and 3. All feature types in this table are related to the freights of a post-decision state. The symbol ‘*’ denotes that a VFA set contains a feature type. All feature types are numerical, and either indicate (i.e., 1 if yes, 0 if no), count (1, 2, …), sum, or multiply (i.e., take the product of two numbers) the different types of freights and destinations. Between parentheses we show the number of basis functions (i.e., independent variables) that a feature type has for the test instance. For example, there is one post-decision state variable per destination, per time-window length; thus there are 3 ⋅ 3 = 9 post-decision state variables. The constant feature equals one for all post-decision states, and the weights \(\theta_{a}^{n}\) are all initialized to one.
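Evaluating such a linear value function approximation can be sketched as below. The concrete feature set here (a constant plus MustGo/MayGo/Future counts) is a simplified, hypothetical stand-in for the feature types of Table 3.4, and the post-decision state encoding is an assumption for illustration.

```python
def features(post_state):
    """Feature vector phi(s) for a post-decision state.

    post_state: list of (released, due_immediate) flags, one per freight.
    Returns [constant, #MustGo, #MayGo, #Future] as a simplified example.
    """
    must_go = sum(1 for rel, due in post_state if rel and due)
    may_go = sum(1 for rel, due in post_state if rel and not due)
    future = sum(1 for rel, _ in post_state if not rel)
    return [1.0, must_go, may_go, future]  # constant feature first

def vfa(theta, post_state):
    """Linear VFA: V(s) ~ sum_a theta_a * phi_a(s)."""
    return sum(t * phi for t, phi in zip(theta, features(post_state)))

theta = [1.0, 1.0, 1.0, 1.0]  # weights initialized to one, as above
state = [(True, True), (True, False), (False, False)]
value = vfa(theta, state)
```

In the ADP loop the weights \(\theta_{a}^{n}\) would then be updated recursively from observed values, while the feature map stays fixed.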
Copyright information
© 2017 Springer International Publishing AG
Cite this chapter
Mes, M.R.K., Rivera, A.P. (2017). Approximate Dynamic Programming by Practical Examples. In: Boucherie, R., van Dijk, N. (eds) Markov Decision Processes in Practice. International Series in Operations Research & Management Science, vol 248. Springer, Cham. https://doi.org/10.1007/978-3-319-47766-4_3
Print ISBN: 978-3-319-47764-0
Online ISBN: 978-3-319-47766-4