
Stochastic Optimal Control with Perfect State Information over a Finite Horizon

Chapter in Neural Approximations for Optimal Control and Decision

Abstract

Discrete-time stochastic optimal control problems are stated over a finite number of decision stages. The state vector is assumed to be perfectly measurable. Such problems are infinite-dimensional, as one has to find control functions of the state. Because of the general assumptions under which the problems are formulated, two approximation techniques are addressed. The first technique is an approximate version of dynamic programming, in which the state space is discretized. Instead of regular grids, which lead to an exponential growth of the number of samples (and thus to the curse of dimensionality), low-discrepancy sequences (such as quasi-Monte Carlo ones) are considered. The second approximation technique is the application of the “Extended Ritz Method” (ERIM), which consists in substituting the admissible functions with fixed-structure parametrized functions containing vectors of “free” parameters. This reduces the original functional optimization problem to an easier nonlinear programming problem. If suitable regularity assumptions are verified, such problems can be solved by stochastic gradient algorithms. The gradient can be computed by resorting to the classical adjoint equations of deterministic optimal control, with the addition of one term that depends on the chosen family of fixed-structure parametrized functions.
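
To make the first technique concrete, here is a minimal sketch (not the chapter's code) of approximate dynamic programming over a low-discrepancy discretization of the state space: the states at which the cost-to-go functions are fitted are drawn from a Sobol sequence rather than placed on a regular grid. The toy linear-quadratic model, the quadratic feature map used to fit the cost-to-go, and all numerical values are assumptions made only for this illustration.

```python
import numpy as np
from scipy.stats import qmc

rng = np.random.default_rng(1)
d, T, N = 2, 5, 256                          # state dimension, horizon, samples per stage
A = np.array([[1.0, 0.1], [0.0, 1.0]])       # toy dynamics x_{t+1} = A x + B u + xi
B = np.array([[0.0], [0.1]])
Q, R = np.eye(d), 0.1                        # quadratic stage cost x'Qx + R u^2
lo, hi = -np.ones(d), np.ones(d)             # region of the state space to be sampled

# Low-discrepancy (Sobol) samples: the number of points is a free design choice
# and does not grow exponentially with d, unlike the nodes of a regular grid.
X = qmc.scale(qmc.Sobol(d=d, scramble=True, seed=1).random(N), lo, hi)

def features(Z):                             # quadratic features for the cost-to-go fit
    z1, z2 = Z[:, 0], Z[:, 1]
    return np.column_stack([np.ones(len(Z)), z1**2, z2**2, z1 * z2, z1, z2])

# terminal cost-to-go J_T(x) = x'Qx, fitted in the chosen basis
coef = np.linalg.lstsq(features(X), np.einsum('ni,ij,nj->n', X, Q, X), rcond=None)[0]

U = np.linspace(-1.0, 1.0, 21)               # candidate controls (crude minimization)
for t in reversed(range(T)):                 # backward DP pass on the sampled states
    targets = np.empty(N)
    for n, x in enumerate(X):
        xi = rng.normal(0.0, 0.05, size=(8, d))        # Monte Carlo disturbance draws
        best = np.inf
        for u in U:
            nxt = A @ x + B[:, 0] * u + xi             # (8, d) next states
            cost = x @ Q @ x + R * u**2 + (features(nxt) @ coef).mean()
            best = min(best, cost)
        targets[n] = best
    coef = np.linalg.lstsq(features(X), targets, rcond=None)[0]   # refit J_t
```

The second technique, the ERIM, can be sketched in the same spirit: each control function is replaced by a fixed-structure parametrized function (here, a linear combination of fixed features, with one "free" parameter vector per stage), and the expected cost is minimized by a stochastic gradient algorithm whose sample gradient is obtained from a backward adjoint recursion; the costate update involving the derivative of the feature map is the additional term, due to the parametrized policy, mentioned at the end of the abstract. Again, the scalar model, the feature map, and the stepsize rule are illustrative assumptions, not the chapter's choices.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 10                      # number of decision stages
a, b = 1.0, 0.5             # scalar dynamics x_{t+1} = a x_t + b u_t + xi_t
q, r, qT = 1.0, 0.1, 1.0    # stage and terminal cost weights
sigma = 0.2                 # standard deviation of the disturbances xi_t

def phi(x):                 # fixed feature map: mu_t(x; w) = w . phi(x)
    return np.array([x, np.tanh(x)])

def dphi(x):                # derivative of the feature map with respect to x
    return np.array([1.0, 1.0 - np.tanh(x) ** 2])

W = np.zeros((T, 2))        # one "free" parameter vector per stage

def sample_gradient(W):
    """One stochastic-gradient sample: forward simulation, then backward adjoint pass."""
    x = np.empty(T + 1)
    u = np.empty(T)
    x[0] = rng.normal(0.0, 1.0)                 # random initial state
    xi = rng.normal(0.0, sigma, size=T)         # sampled disturbance sequence
    for t in range(T):                          # forward pass (closed loop)
        u[t] = W[t] @ phi(x[t])
        x[t + 1] = a * x[t] + b * u[t] + xi[t]
    cost = q * np.sum(x[:T] ** 2) + r * np.sum(u ** 2) + qT * x[T] ** 2

    grad = np.zeros_like(W)
    lam = 2.0 * qT * x[T]                       # costate at the final stage
    for t in reversed(range(T)):                # backward (adjoint) pass
        dJ_du = 2.0 * r * u[t] + lam * b        # sensitivity of the sample cost to u_t
        grad[t] = dJ_du * phi(x[t])             # gradient w.r.t. the stage-t parameters
        # costate recursion; the last term is the extra contribution of the
        # parametrized policy feeding the state back into the control
        lam = 2.0 * q * x[t] + lam * a + dJ_du * (W[t] @ dphi(x[t]))
    return cost, grad

eta = 0.05                                      # initial stepsize
for k in range(5000):                           # stochastic gradient iterations
    cost, grad = sample_gradient(W)
    W -= eta / (1.0 + 0.01 * k) * grad          # diminishing-stepsize update
```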


Notes

  1.

    As pointed out in Proposition 1.3.1 and in Sect. 1.5 of [11], this proof is “somewhat informal.” In particular, moving the minimum with respect to \( \varvec{\mu }_1,\ldots , \varvec{\mu }_{T-1} \) inside the expectation with respect to \(\varvec{\xi }_0\) is generally not a mathematically correct step of the proof, unless suitable conditions hold. It works, for example, if, among other assumptions, for any \(\varvec{x}_0\) and any admissible \(\varvec{u}_0 \in U_0(\varvec{x}_0)\), the disturbance \(\varvec{\xi }_0\) takes on a finite or a countable number of values in the set \(\varXi _0\), and the expected values of all the cost terms in (7.10) are finite. These expected values can be computed as summations of a finite or countable number of terms weighted by the probabilities of the elements in \(\varXi _0\). Moreover, as the proof continues, measurability assumptions may be required on the functions \(\varvec{f}_t\), \(h_t\), \(h_T\), and \(\varvec{\mu }_t\). Quoting [11, p. 42], we report: “It turns out that these” (i.e., the abovementioned) “difficulties are mainly technical and do not substantially affect the basic results to be obtained. For this reason, we find it convenient to proceed with informal derivations and arguments; this is consistent with most of the literature on the subject.” Finally, it is worth noting that, for a rigorous treatment of DP, [11] refers to [14].
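
    In schematic form (our notation and a simplified rendering, not the chapter's displayed equations), the step discussed above is the interchange \[ \min_{\varvec{\mu }_1,\ldots ,\varvec{\mu }_{T-1}} \mathop {\mathrm {E}}_{\varvec{\xi }_0} \bigl [ c(\varvec{x}_1, \varvec{\mu }_1,\ldots ,\varvec{\mu }_{T-1}) \bigr ] \;=\; \mathop {\mathrm {E}}_{\varvec{\xi }_0} \Bigl [ \min_{\varvec{\mu }_1,\ldots ,\varvec{\mu }_{T-1}} c(\varvec{x}_1, \varvec{\mu }_1,\ldots ,\varvec{\mu }_{T-1}) \Bigr ], \] where \(c\) collects the cost terms of (7.10) from stage 1 onward and \(\varvec{x}_1 = \varvec{f}_0(\varvec{x}_0, \varvec{u}_0, \varvec{\xi }_0)\). When \(\varvec{\xi }_0\) takes the values \(\varvec{\xi }_0^{(i)} \in \varXi _0\) with probabilities \(p_i\) and all the expected costs are finite, the left-hand side is \(\min \sum _i p_i c_i\), where \(c_i\) is the cost associated with the \(i\)th realization; since \(c_i\) depends on \(\varvec{\xi }_0\) only through the realized state \(\varvec{x}_1\), the sum can be minimized term by term (provided the minima are attained), which yields the right-hand side.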

  2.

    Here, q does not refer to the dimensions of the random disturbances.

  3.

    The terminology “reinforcement learning” originates from studies of animal learning in experimental psychology [79], in which it was observed that, when interacting with its own environment, an animal usually learns how to improve the outcomes of future choices from past experiences.

  4.

    Since we have assumed that the system dynamics is known, our use of the term “approximate” in the abbreviation ADP has to be interpreted essentially as “numerically approximate,” without reference to the classical RL situation, in which a further approximation is associated with the lack of knowledge of the model. In that different situation, where the adaptive control context plays a basic role, the abbreviation stands for “adaptive dynamic programming.” However, we shall not use the abbreviation ADP with this meaning.

  5.

    We assume that the system (7.49) is “stabilizable” on some set \({\Omega } \subset {\mathbb R}^d\). In qualitative terms, this means that there exists a control function \(\varvec{\mu }(\varvec{x}_t)\) such that the closed-loop system \(\varvec{x}_{t+1} = {\tilde{\varvec{f}}} (\varvec{x}_t) + F(\varvec{x}_t) \varvec{\mu }(\varvec{x}_t)\) is asymptotically stable on \({\Omega }\). A control function \(\varvec{\mu }(\varvec{x}_t)\) is said to be “admissible” if it is stabilizing and yields a finite cost-to-go \(J^{\varvec{\mu }} (\varvec{x}_t)\).
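
    As a concrete special case (ours, for illustration only): for linear dynamics \({\tilde{\varvec{f}}}(\varvec{x}_t) = A \varvec{x}_t\) and \(F(\varvec{x}_t) = B\), the linear control function \(\varvec{\mu }(\varvec{x}_t) = -K \varvec{x}_t\) is stabilizing whenever the spectral radius of \(A - BK\) is smaller than 1, and, for a quadratic cost per stage \(\varvec{x}_t^\top Q \varvec{x}_t + \varvec{u}_t^\top R \varvec{u}_t\) with \(Q\) and \(R\) positive semidefinite, it is also admissible, since the resulting cost-to-go is the finite quadratic form \(J^{\varvec{\mu }}(\varvec{x}_t) = \varvec{x}_t^\top P \varvec{x}_t\), where \(P\) solves the discrete-time Lyapunov equation \(P = (A - BK)^\top P (A - BK) + Q + K^\top R K\).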

  6.

    See Constraints (1.15) for the case of continuous-time optimal control problems: the equalities \(\varvec{g}_t (\varvec{u}_t) = \varvec{0}\) can be converted into the inequalities \(\varvec{g}_t (\varvec{u}_t) \ge \varvec{0}\) and \(-\varvec{g}_t (\varvec{u}_t) \ge \varvec{0}\).

References

  1. Archibald TW, McKinnon KIM, Thomas LC (1997) An aggregate stochastic dynamic programming model of multireservoir systems. Water Resour Res 33:333–340
  2. Atkeson C, Stephens B (2008) Random sampling of states in dynamic programming. IEEE Trans Syst Man Cybern Part B Cybern 38:924–929
  3. Baglietto M, Cervellera C, Sanguineti M, Zoppoli R (2010) Management of water resources systems in the presence of uncertainties by nonlinear approximators and deterministic sampling. Comput Optim Appl 47:349–376
  4. Bellman R (1957) Dynamic programming. Princeton University Press
  5. Bellman R, Dreyfus S (1959) Functional approximations and dynamic programming. Math Tables Aids Comput 13:247–251
  6. Bellman R, Dreyfus SE (1962) Applied dynamic programming. Princeton University Press
  7. Bellman R, Kalaba R, Kotkin B (1963) Polynomial approximation - a new computational technique in dynamic programming. Math Comput 17:155–161
  8. Benveniste LM, Scheinkman JA (1979) On the differentiability of the value function in dynamic models of economics. Econometrica 47:727–732
  9. Bertsekas DP (1999) Nonlinear programming, 2nd edn. Athena Scientific
  10. Bertsekas DP (2000) Dynamic programming and optimal control, vol 2. Athena Scientific
  11. Bertsekas DP (2005) Dynamic programming and optimal control, vol 1. Athena Scientific
  12. Bertsekas DP (2005) Dynamic programming and suboptimal control: a survey from ADP to MPC. Eur J Control 11:310–334
  13. Bertsekas DP, Borkar VS, Nedic A (2004) Improved temporal difference methods with linear function approximation. In: Si J, Barto AG, Powell WB, Wunsch D (eds) Handbook of learning and approximate dynamic programming. IEEE Press, pp 233–257
  14. Bertsekas DP, Shreve SE (1978) Stochastic optimal control: the discrete-time case. Academic Press
  15. Bertsekas DP, Tsitsiklis JN (1996) Neuro-dynamic programming. Athena Scientific
  16. Busoniu L, Ernst D, De Schutter B, Babuska R (2010) Approximate dynamic programming with a fuzzy parameterization. Automatica 46:804–814
  17. Busoniu L, Babuska R, De Schutter B, Ernst D (2010) Reinforcement learning and dynamic programming using function approximators. CRC Press
  18. Bottou L (2012) Stochastic gradient descent tricks. In: Montavon G, Orr G, Müller K-R (eds) Neural networks: tricks of the trade. Lecture notes in computer science, vol 7700. Springer, pp 421–436
  19. Cervellera C, Chen VCP, Wen A (2006) Optimization of a large-scale water reservoir network by stochastic dynamic programming with efficient state space discretization. Eur J Oper Res 171:1139–1151
  20. Cervellera C, Gaggero M, Macciò D (2012) Efficient kernel models for learning and approximate minimization problems. Neurocomputing 97:74–85
  21. Cervellera C, Gaggero M, Macciò D (2014) Low-discrepancy sampling for approximate dynamic programming with local approximators. Comput Oper Res 43:108–115
  22. Cervellera C, Gaggero M, Macciò D, Marcialis R (2013) Quasi-random sampling for approximate dynamic programming. In: Proceedings of the international joint conference on neural networks, pp 2567–2574
  23. Cervellera C, Gaggero M, Macciò D, Marcialis R (2015) Efficient use of Nadaraya-Watson models and low-discrepancy sequences for approximate dynamic programming. In: Proceedings of the international joint conference on neural networks
  24. Cervellera C, Macciò D (2011) A comparison of global and semi-local approximation in \(T\)-stage stochastic optimization. Eur J Oper Res 208:109–118
  25. Cervellera C, Muselli M (2007) Efficient sampling in approximate dynamic programming. Comput Optim Appl 38:417–443
  26. Cervellera C, Wen A, Chen VCP (2007) Neural network and regression spline value function approximations for stochastic dynamic programming. Comput Oper Res 34:70–90
  27. Chaudhary SK (2005) American options and the LSM algorithm: quasi-random sequences and Brownian bridges. J Comput Financ 8:101–115
  28. Chen VCP, Ruppert D, Shoemaker CA (1999) Applying experimental design and regression splines to high-dimensional continuous-state stochastic dynamic programming. Oper Res 47:38–53
  29. Chen VCP, Tsui KL, Barton RR, Allen JK (2003) A review of design and modeling in computer experiments. In: Khattree R, Rao CR (eds) Handbook in statistics: statistics in industry, vol 22. Elsevier, pp 231–261
  30. Dion M, L'Ecuyer P (2010) American option pricing with randomized quasi-Monte Carlo simulations. In: Johansson B, Jain S, Montoya-Torres J, Hugan J, Yücesan E (eds) Proceedings of the 2010 winter simulation conference, pp 2705–2720
  31. Ernst D, Geurts P, Wehenkel L (2005) Tree-based batch mode reinforcement learning. J Mach Learn Res 6:503–556
  32. Farahmand AM, Ghavamzadeh M, Szepesvári C, Mannor S (2009) Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. In: Proceedings of the American control conference, pp 725–730
  33. Farahmand AM, Munos R, Szepesvári C (2010) Error propagation for approximate policy and value iteration. In: Lafferty J, Williams CKI, Shawe-Taylor J, Zemel RS, Culotta A (eds) Advances in neural information processing systems 23. MIT Press, pp 568–576
  34. Foufoula-Georgiou E, Kitanidis PK (1988) Gradient dynamic programming for stochastic optimal control of multidimensional water resources systems. Water Resour Res 24:1345–1359
  35. Gaggero M, Gnecco G, Sanguineti M (2013) Dynamic programming and value-function approximation in sequential decision problems: error analysis and numerical results. J Optim Theory Appl 156:380–416
  36. Gaggero M, Gnecco G, Sanguineti M (2014) Approximate dynamic programming for stochastic \(N\)-stage optimization with application to optimal consumption under uncertainty. Comput Optim Appl 58:31–85
  37. Gaggero M, Gnecco G, Sanguineti M (2014) Suboptimal policies for stochastic \(N\)-stage optimization problems: accuracy analysis and a case study from optimal consumption. In: El Ouardighi F, Kogan K (eds) Models and methods in economics and management. Springer, pp 27–50
  38. Gal S (1979) Optimal management of a multireservoir water supply system. Water Resour Res 15:737–749
  39. Gale D (1967) On optimal development in a multi-sector economy. Rev Econ Stud 34:1–18
  40. Giuliani M, Quinn JD, Herman JD, Castelletti A, Reed PM (2018) Scalable multiobjective control for large scale water resources systems under uncertainty. IEEE Trans Control Syst Technol 26:1492–1499
  41. Gnecco G, Sanguineti M (2010) Suboptimal solutions to dynamic optimization problems via approximations of the policy functions. J Optim Theory Appl 146:764–794
  42. Gnecco G, Sanguineti M (2016) Neural approximations of the solutions to a class of stochastic optimal control problems. J Neurotechnol 1:1–16
  43. Gnecco G, Sanguineti M (2018) Neural approximations in discounted infinite-horizon stochastic optimal control problems. Eng Appl Artif Intell 74:294–302
  44. Guo W, Si J, Liu F, Mei S (2018) Policy approximation in policy iteration approximate dynamic programming for discrete-time nonlinear systems. IEEE Trans Neural Netw Learn Syst 29:2794–2807
  45. Haykin S (2008) Neural networks and learning machines. Pearson Prentice-Hall
  46. Johnson SA, Stedinger JR, Shoemaker C, Li Y, Tejada-Guibert JA (1993) Numerical solution of continuous-state dynamic programs using linear and spline interpolation. Oper Res 41:484–500
  47. Judd K (1998) Numerical methods in economics. MIT Press
  48. Kamalapurkar R, Walters P, Rosenfeld J, Dixon W (2018) Reinforcement learning for optimal feedback control. Springer
  49. Kitanidis PK (1986) Hermite interpolation on an \(n\)-dimensional rectangular grid. Technical report, St. Anthony Falls Hydraulics Laboratory, University of Minnesota
  50. Kiumarsi B, Vamvoudakis KG, Modares H, Lewis FL (2018) Optimal and autonomous control using reinforcement learning: a survey. IEEE Trans Neural Netw Learn Syst 29:2042–2062
  51. Kleinman DL, Athans M (1968) The design of suboptimal linear time-varying systems. IEEE Trans Autom Control 13:150–159
  52. Kwakernaak H, Sivan R (1972) Linear optimal control systems. Wiley
  53. Larson RE (1968) State increment dynamic programming. Elsevier
  54. Lewis FL, Vrabie D, Vamvoudakis KG (2012) Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers. IEEE Control Syst Mag 32:76–105
  55. Montrucchio L (1987) Lipschitz continuous policy functions for strongly concave optimization problems. J Math Econ 16:259–273
  56. Montrucchio L (1998) Thompson metric, contraction property and differentiability of policy functions. J Econ Behav Organ 33:449–466
  57. Montrucchio L, Boldrin M (1986) On the indeterminacy of capital accumulation paths. J Econ Behav Organ 40:26–36
  58. Munos R, Moore A (2002) Variable resolution discretization in optimal control. Mach Learn 49:291–323
  59. Munos R, Szepesvári C (2008) Finite-time bounds for fitted value iteration. J Mach Learn Res 9:815–857
  60. Nguyen DH, Widrow B (1990) Neural networks for self-learning control systems. IEEE Control Syst Mag 10:18–23
  61. Ormoneit D, Sen S (2002) Kernel-based reinforcement learning. Mach Learn 49:161–178
  62. Parisini T, Zoppoli R (1994) Neural networks for feedback feedforward nonlinear control systems. IEEE Trans Neural Netw 5:436–449
  63. Powell WB (2011) Approximate dynamic programming: solving the curses of dimensionality, 2nd edn. Wiley
  64. Philbrick CR Jr, Kitanidis PK (2001) Improved dynamic programming methods for optimal control of lumped-parameter stochastic systems. Oper Res 49:398–412
  65. Raiffa H, Schlaifer R (1961) Applied statistical decision theory. Harvard University Press
  66. Riedmiller M (2005) Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In: Proceedings of the 16th European conference on machine learning, pp 317–328
  67. Salas JD, Tabios GQ III, Bartolini P (1985) Approaches to multivariate modeling of water resources time series. Water Resour Bull 21:683–708
  68. Samuel AL (1959) Some studies in machine learning using the game of checkers. IBM J Res Dev 3:211–229
  69. Schweitzer PJ, Seidmann A (1985) Generalized polynomial approximations in Markovian decision processes. J Math Anal Appl 110:568–582
  70. Shapiro A, Dentcheva D, Ruszczynski A (2009) Lectures on stochastic programming: modeling and theory. MOS-SIAM series on optimization
  71. Si J, Barto AG, Powell WB, Wunsch D (eds) (2004) Handbook of learning and approximate dynamic programming. IEEE Press
  72. Smith JE, McCardle KF (2002) Structural properties of stochastic dynamic programs. Oper Res 50:796–809
  73. Sokolov Y, Kozma R, Werbos LD, Werbos PJ (2015) Complete stability analysis of a heuristic approximate dynamic programming control design. Automatica 59:9–18
  74. Stokey NL, Lucas RE, Prescott EC (1989) Recursive methods in economic dynamics. Harvard University Press
  75. Sutton RS (1988) Learning to predict by the method of temporal differences. Mach Learn 3:9–44
  76. Sutton RS, Barto AG (1998) Reinforcement learning. MIT Press
  77. Ten Hagen S, Kröse B (2003) Neural Q-learning. Neural Comput Appl 12:81–88
  78. Teytaud O, Gelly S, Mary J (2007) Active learning in regression, with application to stochastic dynamic programming. In: Proceedings of the 4th international conference on informatics in control, automation, and robotics, intelligent control systems in optimization, pp 198–205
  79. Thorndike E (1911) Animal intelligence: experimental studies. Macmillan
  80. Tsai JCC, Chen VCP, Beck MB, Chen J (2004) Stochastic dynamic programming formulation for a wastewater treatment decision-making framework. Ann Oper Res 132:207–221
  81. Tsitsiklis JN, Van Roy B (1996) Feature-based methods for large scale dynamic programming. Mach Learn 22:59–94
  82. Turgeon A (1981) A decomposition method for the long-term scheduling of reservoirs in series. Water Resour Res 17:1565–1570
  83. Wan EA, Beaufays F (1996) Diagrammatic derivation of gradient algorithms for neural networks. Neural Comput 8:182–201
  84. Wang D, He H, Liu D (2017) Adaptive critic nonlinear robust control: a survey. IEEE Trans Cybern 47:3429–3451
  85. Watkins C (1988) Learning from delayed rewards. PhD thesis, Cambridge University, Cambridge, UK
  86. Watkins C, Dayan P (1992) Q-learning. Mach Learn 8:279–292
  87. Werbos PJ (1989) Neural networks for control and system identification. In: Proceedings of the IEEE conference on decision and control, pp 260–265
  88. Werbos PJ (1991) A menu of designs for reinforcement learning over time. In: Miller WT III, Sutton RS, Werbos PJ (eds) Neural networks for control. MIT Press, pp 67–95
  89. Werbos PJ (1992) Approximate dynamic programming for real-time control and neural modeling. In: White DA, Sofge DA (eds) Handbook of intelligent control. Van Nostrand Reinhold
  90. Yakowitz S (1982) Dynamic programming applications in water resources. Water Resour Res 18:673–696


Author information

Correspondence to Riccardo Zoppoli.


Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Zoppoli, R., Sanguineti, M., Gnecco, G., Parisini, T. (2020). Stochastic Optimal Control with Perfect State Information over a Finite Horizon. In: Neural Approximations for Optimal Control and Decision. Communications and Control Engineering. Springer, Cham. https://doi.org/10.1007/978-3-030-29693-3_7

