
Approximate Dynamic Programming by Practical Examples

A chapter in Markov Decision Processes in Practice

Abstract

Computing the exact solution of an MDP model is generally difficult and possibly intractable for realistically sized problem instances. A powerful technique for solving large-scale discrete-time multistage stochastic control problems is Approximate Dynamic Programming (ADP). Although ADP is used as an umbrella term for a broad spectrum of methods that approximate the optimal solution of MDPs, the common denominator is typically to combine optimization with simulation, to use approximations of the optimal values of the Bellman equations, and to use approximate policies. This chapter presents and illustrates the basics of these steps through a number of practical and instructive examples. We use three examples (1) to explain the basics of ADP, relying on value iteration with an approximation of the value functions, (2) to provide insight into implementation issues, and (3) to provide test cases with which readers can validate their own ADP implementations.




Appendix

1.1 Nomadic Trucker Settings

Transportation takes place in a square area of 1000 × 1000 miles. The locations lie on a 16 × 16 Euclidean grid placed on this area, where each location \(i \in \mathcal{L}\) is described by an \((x_i, y_i)\)-coordinate. The first location has coordinate (0, 0) and the last location (location 256) has coordinate (1000, 1000). The minimum distance between two locations is 1000/15 miles.

For each location \(i \in \mathcal{L}\), there is a number \(0 \leq b_i \leq 1\) representing the probability that a load originating at location i will appear at a given time step. The probability that, on a given day of the week d, a load from i to j will appear is given by \(p_{ij}^{d} = p^{d} b_i (1 - b_j)\), where \(p^{d}\) gives the probability of loads appearing on day of the week d. The origin probabilities \(b_i\) are given by

$$b_{i} = \rho \left(1 - \frac{f(x_{i},y_{i}) - f^{\min}}{f^{\max} - f^{\min}}\right), \qquad (3.34)$$

where ρ gives the arrival intensity of loads, and \(f(x_i, y_i)\) is the six-hump camel back function given by \(f(x_i,y_i) = 4x_i^2 - 2.1x_i^4 + \frac{1}{3}x_i^6 + x_i y_i - 4y_i^2 + 4y_i^4\) on the domain \((x_i, y_i) \in [-1.5, 2] \times [-1, 1]\). The highest value, ≈ 5.73, is achieved at coordinate (2, 1); we reduce it to 5 to create a somewhat smoother function (the second highest value is ≈ 4.72). Next, the function is rescaled so that its domain corresponds to the grid coordinates \((x_i, y_i) \in [0, 1000] \times [0, 1000]\). The values \(f^{\min} = \min_{i\in \mathcal{L}} f(x_i,y_i) \approx -1.03\) and \(f^{\max} = \max_{i\in \mathcal{L}} f(x_i,y_i) = 5\) are used to scale \(f(x_i, y_i)\) to [0, 1]. An impression of the resulting origin probabilities \(b_i\) is given in Fig. 3.11.
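The following minimal sketch in Python (ours, not the chapter's; the linear mapping of grid coordinates onto the function's natural domain is our assumption) reproduces the origin probabilities of Eq. (3.34):

```python
import numpy as np

n, rho = 16, 1.0  # 16 x 16 grid, arrival intensity of loads

# Grid coordinates on the 1000 x 1000 miles area.
xs, ys = np.meshgrid(np.linspace(0.0, 1000.0, n),
                     np.linspace(0.0, 1000.0, n), indexing="ij")

# Map the grid onto the camel back function's natural domain
# [-1.5, 2] x [-1, 1] (linear mapping; an assumption on our part).
u = -1.5 + xs / 1000.0 * 3.5
v = -1.0 + ys / 1000.0 * 2.0

# Six-hump camel back function, with the peak capped at 5 as in the text.
f = 4*u**2 - 2.1*u**4 + u**6/3 + u*v - 4*v**2 + 4*v**4
f = np.minimum(f, 5.0)

# Eq. (3.34): scale f to [0, 1] and invert, so low-f locations
# get high origin probabilities.
b = rho * (1.0 - (f - f.min()) / (f.max() - f.min()))
```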

Fig. 3.11 Origin probabilities for the 256 locations

We set ρ = 1, which corresponds to an expectation of approximately 93.14 outgoing loads from the most popular origin location on the busiest day of the week. We use a load probability distribution \(p^{d} = (1, 0.8, 0.6, 0.7, 0.9, 0.2, 0.1)\) for d from Monday to Sunday, which represents a situation in which loads are most likely to appear at the beginning of the week (Monday) and towards its end (Friday).
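As a rough sanity check of this figure, the sketch below (continuing the previous one, and including j = i in the sum, an assumption on our part) computes the expected number of outgoing loads \(\sum_{j} p^{d} b_i (1 - b_j)\) for the most popular origin on the busiest day:

```python
# Expected outgoing loads from the most popular origin on the busiest day,
# using b from the previous sketch.
p_day = np.array([1.0, 0.8, 0.6, 0.7, 0.9, 0.2, 0.1])  # Monday..Sunday

b_flat = b.ravel()
expected = p_day.max() * b_flat.max() * (1.0 - b_flat).sum()
print(expected)  # on the order of the reported 93.14, given our domain mapping
```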

The results for the infinite horizon multi-attribute version of the nomadic trucker problem can be found below (Fig. 3.12).

Fig. 3.12 Infinite horizon multi-attribute case: resulting estimate \(\overline{V}_{0}^{n}\left(S_{0}^{x,n}\right)\) (left) and realized rewards (right), using N = 25,000, M = 10, O = 1000, and K = 10. For the rewards resulting from the simulations, the 2500 observations are smoothed using a window of 10. For the policies Expl and Eps, the BAKF stepsize is used

1.2 Freight Consolidation Settings

Either one or two freights arrive each period (i.e., \(\mathcal{F} = \{1,2\}\)), with probability \(p_f^F = (0.8, 0.2)\) for \(f \in \mathcal{F}\). Each freight that arrives has destination \(d \in \mathcal{D} = \{1,2,3\}\) with probability \(p_d^D = (0.1, 0.8, 0.1)\), is already released for transportation (i.e., \(r \in \mathcal{R} = \{0\}\) and \(p_r^R = 1\)), and has time-window length \(k \in \mathcal{K} = \{0,1,2\}\) with probability \(p_k^K = (0.2, 0.3, 0.5)\).
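For illustration, a minimal Python sketch (names and structure are ours, not the chapter's) of sampling one period's arrivals from these distributions:

```python
import random

def sample_arrivals(rng=random):
    """Sample one period's freights as (destination, release, time-window length)."""
    n_freights = rng.choices([1, 2], weights=[0.8, 0.2])[0]
    freights = []
    for _ in range(n_freights):
        d = rng.choices([1, 2, 3], weights=[0.1, 0.8, 0.1])[0]  # destination
        r = 0                                                    # already released
        k = rng.choices([0, 1, 2], weights=[0.2, 0.3, 0.5])[0]  # time-window length
        freights.append((d, r, k))
    return freights
```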

The costs are defined as follows. The long-haul, high-capacity vehicle costs (per subset of destinations visited) are \(C_{\mathcal{D}'} = (250, 350, 450, 900, 600, 700, 1000)\) for \(\mathcal{D}' = (\{1\}, \{2\}, \{3\}, \{1,2\}, \{1,3\}, \{2,3\}, \{1,2,3\})\), respectively. These costs are for the entire long-haul vehicle, independent of the number of freights consolidated. Furthermore, there are no costs for the long-haul vehicle if no freights are consolidated. The alternative, low-capacity mode costs (per freight) are \(B_d = (500, 1000, 700)\) for \(d \in \mathcal{D}\). There is no discount factor, i.e., γ = 1.
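The following sketch encodes this cost structure (function and container names are ours):

```python
# Long-haul vehicle cost per subset of destinations visited; zero if
# no freights are consolidated.
LONG_HAUL_COST = {
    frozenset({1}): 250, frozenset({2}): 350, frozenset({3}): 450,
    frozenset({1, 2}): 900, frozenset({1, 3}): 600,
    frozenset({2, 3}): 700, frozenset({1, 2, 3}): 1000,
}
ALT_COST = {1: 500, 2: 1000, 3: 700}  # B_d, per freight on the low-capacity mode

def period_cost(consolidated_destinations, alternative_freights):
    """Cost of one period: one long-haul vehicle for the consolidated
    destinations, plus per-freight costs for the alternative mode."""
    long_haul = (LONG_HAUL_COST[frozenset(consolidated_destinations)]
                 if consolidated_destinations else 0)
    return long_haul + sum(ALT_COST[d] for d in alternative_freights)
```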

We build three different sets of features based on a common “job” description used in transportation settings: MustGo, MayGo, and Future freights. MustGo freights are released freights whose due day is immediate. MayGo freights are released freights whose due day is not immediate. Future freights are those that have not yet been released. We apply the MustGo, MayGo, and Future labels to destinations as well, with meanings analogous to those for freights. In Table 3.4 we show the three sets of features, which we name Value Function Approximation (VFA) 1, 2, and 3. All feature types in this table are related to the freights of a post-decision state. The symbol ‘*’ denotes that a VFA set contains a feature type. All feature types are numerical: they either indicate (1 if yes, 0 if no), count (1, 2, …), sum, or multiply (i.e., take the product of two numbers) the different types of freights and destinations. Between parentheses we show the number of basis functions (i.e., independent variables) that a feature type has for the test instance. For example, there is one post-decision state variable per destination per time-window length, thus there are 3 × 3 = 9 post-decision state variables in total. The constant feature equals one for all post-decision states, and the weights \(\theta_a^n\) are all initialized to one.

Table 3.4 Various sets of features (basis functions of a post-decision state)
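To make the feature construction concrete, the following illustrative sketch (a hypothetical selection of feature types on our part; the exact composition of VFA 1, 2, and 3 is given in Table 3.4) evaluates a few basis functions on a post-decision state:

```python
def basis_functions(state):
    """Evaluate illustrative basis functions phi(S^x) of a post-decision state.

    `state` maps (destination d, time-window length k) to the number of
    scheduled freights; all freights in this instance are released (r = 0).
    """
    phi = [1.0]  # constant feature, equal to one for every state
    # One variable per destination per time-window length: 3 * 3 = 9.
    phi += [state.get((d, k), 0) for d in (1, 2, 3) for k in (0, 1, 2)]
    must_go = sum(v for (d, k), v in state.items() if k == 0)  # due immediately
    may_go = sum(v for (d, k), v in state.items() if k > 0)    # released, not yet due
    phi += [must_go, may_go, must_go * may_go]  # counts and a product feature
    return phi

# The VFA is then the weighted sum over features a:
# V(S^x) ~ sum_a theta_a * phi_a(S^x), with all theta_a initialized to one.
```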

Copyright information

© 2017 Springer International Publishing AG

Cite this chapter

Mes, M.R.K., Rivera, A.P. (2017). Approximate Dynamic Programming by Practical Examples. In: Boucherie, R., van Dijk, N. (eds) Markov Decision Processes in Practice. International Series in Operations Research & Management Science, vol 248. Springer, Cham. https://doi.org/10.1007/978-3-319-47766-4_3
