
Approximate Dynamic Programming by Practical Examples

A chapter in Markov Decision Processes in Practice

Abstract

Computing the exact solution of an MDP model is generally difficult and possibly intractable for realistically sized problem instances. A powerful technique for solving large-scale discrete-time multistage stochastic control problems is Approximate Dynamic Programming (ADP). Although ADP is used as an umbrella term for a broad spectrum of methods that approximate the optimal solution of MDPs, the common denominator is typically to combine optimization with simulation, to use approximations of the optimal values of the Bellman equations, and to use approximate policies. This chapter presents and illustrates the basics of these steps through a number of practical and instructive examples. We use three examples (1) to explain the basics of ADP, relying on value iteration with an approximation of the value functions, (2) to provide insight into implementation issues, and (3) to provide test cases with which readers can validate their own ADP implementations.




Appendix

1.1 Nomadic Trucker Settings

Transportation takes place in a square area of 1000 × 1000 miles. The locations lie on a 16 × 16 Euclidean grid placed on this area, where each location \(i \in \mathcal{L}\) is described by an \((x_i, y_i)\)-coordinate. The first location has coordinate (0, 0) and the last location (location 256) has coordinate (1000, 1000). The minimum distance between two locations is 1000/15 miles.

For each location \(i \in \mathcal{L}\), there is a number \(0 \leq b_i \leq 1\) representing the probability that a load originating at location i will appear at a given time step. The probability that, on a given day of the week d, a load from i to j will appear is given by \(p_{ij}^{d} = p^{d} b_i (1 - b_j)\), where \(p^{d}\) gives the probability of loads appearing on day of the week d. The origin probabilities \(b_i\) are given by

$$b_{i} = \rho \left(1 - \frac{f(x_{i},y_{i}) - f^{\min}}{f^{\max} - f^{\min}}\right), \qquad (3.34)$$

where ρ gives the arrival intensity of loads, and \(f(x_i, y_i)\) is the six-hump camel back function given by \(f(x_i,y_i) = 4x_i^2 - 2.1x_i^4 + \frac{1}{3}x_i^6 + x_i y_i - 4y_i^2 + 4y_i^4\) on the domain \((x_i, y_i) \in [-1.5, 2] \times [-1, 1]\). The highest value, ≈ 5.73, is achieved at coordinate (2, 1); we reduce it to 5 to create a somewhat smoother function (the second highest value is ≈ 4.72). Next, the function is rescaled so that its domain corresponds to the grid coordinates \((x_i, y_i) \in [0, 1000] \times [0, 1000]\). The values \(f^{\min} = \min_{i\in \mathcal{L}} f(x_i,y_i) \approx -1.03\) and \(f^{\max} = \max_{i\in \mathcal{L}} f(x_i,y_i) = 5\) are used to scale \(f(x_i, y_i)\) to [0, 1]. An impression of the resulting origin probabilities \(b_i\) is given in Fig. 3.11.
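The following minimal sketch in Python (ours, not the chapter's; the linear mapping of grid coordinates onto the function's natural domain is our assumption) reproduces the origin probabilities of Eq. (3.34):

```python
import numpy as np

n, rho = 16, 1.0  # 16 x 16 grid, arrival intensity of loads

# Grid coordinates on the 1000 x 1000 miles area.
xs, ys = np.meshgrid(np.linspace(0.0, 1000.0, n),
                     np.linspace(0.0, 1000.0, n), indexing="ij")

# Map the grid onto the camel back function's natural domain
# [-1.5, 2] x [-1, 1] (linear mapping; an assumption on our part).
u = -1.5 + xs / 1000.0 * 3.5
v = -1.0 + ys / 1000.0 * 2.0

# Six-hump camel back function, with the peak capped at 5 as in the text.
f = 4*u**2 - 2.1*u**4 + u**6/3 + u*v - 4*v**2 + 4*v**4
f = np.minimum(f, 5.0)

# Eq. (3.34): scale f to [0, 1] and invert, so low-f locations
# get high origin probabilities.
b = rho * (1.0 - (f - f.min()) / (f.max() - f.min()))
```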

Fig. 3.11 Origin probabilities for the 256 locations

We set ρ = 1, which corresponds to an expectation of approximately 93.14 outgoing loads from the most popular origin location on the busiest day of the week. We use a load probability distribution \(p^{d} = (1, 0.8, 0.6, 0.7, 0.9, 0.2, 0.1)\) for d from Monday to Sunday, which represents a situation in which loads are most likely to appear at the beginning of the week (Monday) and towards its end (Friday).
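As a rough sanity check of this figure, the sketch below (continuing the previous one, and including j = i in the sum, an assumption on our part) computes the expected number of outgoing loads \(\sum_{j} p^{d} b_i (1 - b_j)\) for the most popular origin on the busiest day:

```python
# Expected outgoing loads from the most popular origin on the busiest day,
# using b from the previous sketch.
p_day = np.array([1.0, 0.8, 0.6, 0.7, 0.9, 0.2, 0.1])  # Monday..Sunday

b_flat = b.ravel()
expected = p_day.max() * b_flat.max() * (1.0 - b_flat).sum()
print(expected)  # on the order of the reported 93.14, given our domain mapping
```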

The results for the infinite horizon multi-attribute version of the nomadic trucker problem can be found below (Fig. 3.12).

Fig. 3.12 Infinite horizon multi-attribute case: resulting estimate \(\overline{V}_{0}^{n}\left(S_{0}^{x,n}\right)\) (left) and realized rewards (right), using N = 25,000, M = 10, O = 1000, and K = 10. For the rewards resulting from the simulations, the 2500 observations are smoothed using a window of 10. For the policies Expl and Eps, the BAKF stepsize is used

1.2 Freight Consolidation Settings

Either one or two freights arrive each period (i.e., \(\mathcal{F} = \{1,2\}\)), with probability \(p_f^F = (0.8, 0.2)\) for \(f \in \mathcal{F}\). Each freight that arrives has destination \(d \in \mathcal{D} = \{1,2,3\}\) with probability \(p_d^D = (0.1, 0.8, 0.1)\), is already released for transportation (i.e., \(r \in \mathcal{R} = \{0\}\) and \(p_r^R = 1\)), and has time-window length \(k \in \mathcal{K} = \{0,1,2\}\) with probability \(p_k^K = (0.2, 0.3, 0.5)\).
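For illustration, a minimal Python sketch (names and structure are ours, not the chapter's) of sampling one period's arrivals from these distributions:

```python
import random

def sample_arrivals(rng=random):
    """Sample one period's freights as (destination, release, time-window length)."""
    n_freights = rng.choices([1, 2], weights=[0.8, 0.2])[0]
    freights = []
    for _ in range(n_freights):
        d = rng.choices([1, 2, 3], weights=[0.1, 0.8, 0.1])[0]  # destination
        r = 0                                                    # already released
        k = rng.choices([0, 1, 2], weights=[0.2, 0.3, 0.5])[0]  # time-window length
        freights.append((d, r, k))
    return freights
```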

The costs are defined as follows. The long-haul, high-capacity vehicle costs (per subset of destinations visited) are \(C_{\mathcal{D}'} = (250, 350, 450, 900, 600, 700, 1000)\) for \(\mathcal{D}' = (\{1\}, \{2\}, \{3\}, \{1,2\}, \{1,3\}, \{2,3\}, \{1,2,3\})\), respectively. These costs are for the entire long-haul vehicle, independent of the number of freights consolidated. Furthermore, there are no costs for the long-haul vehicle if no freights are consolidated. The alternative, low-capacity mode costs (per freight) are \(B_d = (500, 1000, 700)\) for \(d \in \mathcal{D}\). There is no discount factor, i.e., γ = 1.
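The following sketch encodes this cost structure (function and container names are ours):

```python
# Long-haul vehicle cost per subset of destinations visited; zero if
# no freights are consolidated.
LONG_HAUL_COST = {
    frozenset({1}): 250, frozenset({2}): 350, frozenset({3}): 450,
    frozenset({1, 2}): 900, frozenset({1, 3}): 600,
    frozenset({2, 3}): 700, frozenset({1, 2, 3}): 1000,
}
ALT_COST = {1: 500, 2: 1000, 3: 700}  # B_d, per freight on the low-capacity mode

def period_cost(consolidated_destinations, alternative_freights):
    """Cost of one period: one long-haul vehicle for the consolidated
    destinations, plus per-freight costs for the alternative mode."""
    long_haul = (LONG_HAUL_COST[frozenset(consolidated_destinations)]
                 if consolidated_destinations else 0)
    return long_haul + sum(ALT_COST[d] for d in alternative_freights)
```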

We build three different sets of features based on a common “job” description used in transportation settings: MustGo, MayGo, and Future freights. MustGo freights are released freights whose due day is immediate. MayGo freights are released freights whose due day is not immediate. Future freights are those that have not yet been released. We apply the MustGo, MayGo, and Future labels to destinations as well, with meanings analogous to those for freights. In Table 3.4 we show the three sets of features, which we name Value Function Approximation (VFA) 1, 2, and 3. All feature types in this table are related to the freights of a post-decision state. The symbol ‘*’ denotes that a VFA set contains a feature type. All feature types are numerical: they either indicate (1 if yes, 0 if no), count (1, 2, …), sum, or multiply (i.e., take the product of two numbers) the different types of freights and destinations. Between parentheses we show the number of basis functions (i.e., independent variables) that a feature type has for the test instance. For example, there is one post-decision state variable per destination per time-window length, thus there are 3 × 3 = 9 post-decision state variables in total. The constant feature equals one for all post-decision states, and the weights \(\theta_a^n\) are all initialized to one.

Table 3.4 Various sets of features (basis functions of a post-decision state)
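To make the feature construction concrete, the following illustrative sketch (a hypothetical selection of feature types on our part; the exact composition of VFA 1, 2, and 3 is given in Table 3.4) evaluates a few basis functions on a post-decision state:

```python
def basis_functions(state):
    """Evaluate illustrative basis functions phi(S^x) of a post-decision state.

    `state` maps (destination d, time-window length k) to the number of
    scheduled freights; all freights in this instance are released (r = 0).
    """
    phi = [1.0]  # constant feature, equal to one for every state
    # One variable per destination per time-window length: 3 * 3 = 9.
    phi += [state.get((d, k), 0) for d in (1, 2, 3) for k in (0, 1, 2)]
    must_go = sum(v for (d, k), v in state.items() if k == 0)  # due immediately
    may_go = sum(v for (d, k), v in state.items() if k > 0)    # released, not yet due
    phi += [must_go, may_go, must_go * may_go]  # counts and a product feature
    return phi

# The VFA is then the weighted sum over features a:
# V(S^x) ~ sum_a theta_a * phi_a(S^x), with all theta_a initialized to one.
```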

Copyright information

© 2017 Springer International Publishing AG

Cite this chapter

Mes, M.R.K., Rivera, A.P. (2017). Approximate Dynamic Programming by Practical Examples. In: Boucherie, R., van Dijk, N. (eds) Markov Decision Processes in Practice. International Series in Operations Research & Management Science, vol 248. Springer, Cham. https://doi.org/10.1007/978-3-319-47766-4_3
