Abstract
Discrete-time stochastic optimal control problems are stated over a finite number of decision stages. The state vector is assumed to be perfectly measurable. Such problems are infinite-dimensional, since one has to find control functions of the state. Owing to the general assumptions under which the problems are formulated, two approximation techniques are addressed. The first is an approximate version of dynamic programming, obtained by discretizing the state space. Instead of regular grids, which lead to an exponential growth of the number of samples (and thus to the curse of dimensionality), low-discrepancy sequences (such as quasi-Monte Carlo ones) are considered. The second technique is the application of the “Extended Ritz Method” (ERIM), which consists of substituting the admissible functions with fixed-structure parametrized functions containing vectors of “free” parameters. This reduces the original problems to easier nonlinear programming problems. If suitable regularity assumptions are satisfied, such problems can be solved by stochastic gradient algorithms. The gradient can be computed by resorting to the classical adjoint equations that solve deterministic optimal control problems, with the addition of one term that depends on the chosen family of fixed-structure parametrized functions.
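As a concrete illustration of the sampling idea (our sketch, not code from the chapter), the following generates points of the Halton sequence, a standard low-discrepancy construction: each coordinate is the radical-inverse of the point index in a distinct prime base, so the number of samples is chosen freely instead of growing exponentially with the state dimension as it does for a regular grid.

```python
def van_der_corput(n, base):
    """Radical inverse of the integer n in the given base
    (one coordinate of a Halton point)."""
    q, denom = 0.0, 1.0
    while n:
        n, rem = divmod(n, base)
        denom *= base
        q += rem / denom
    return q

def halton(n_points, primes=(2, 3)):
    """First n_points of the Halton sequence in [0, 1)^d,
    one prime base per coordinate (here d = 2 by default)."""
    return [[van_der_corput(i, b) for b in primes]
            for i in range(1, n_points + 1)]

# Four well-spread samples of a 2-dimensional unit state space.
points = halton(4)
```

For production use, library generators (e.g., the quasi-Monte Carlo module of SciPy) provide Halton and Sobol' sequences with scrambling; the sketch above is only meant to show that the sample count is decoupled from the dimension.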
Notes
- 1.
As pointed out in Proposition 1.3.1 and in Sect. 1.5 of [11], this proof is “somewhat informal.” In particular, moving the minimum with respect to \(\varvec{\mu}_1,\ldots,\varvec{\mu}_{T-1}\) inside the expectation with respect to \(\varvec{\xi}_0\) is generally not a mathematically correct step of the proof, unless suitable conditions hold. It works, for example, if, among other assumptions, for any \(\varvec{x}_0\) and any admissible \(\varvec{u}_0 \in U_0(\varvec{x}_0)\), the disturbance \(\varvec{\xi}_0\) takes on a finite or a countable number of values in the set \(\varXi_0\), and the expected values of all the cost terms in (7.10) are finite. These expected values can be computed as summations of a finite or countable number of terms weighted by the probabilities of the elements in \(\varXi_0\). Moreover, as the proof continues, measurability assumptions may be required on the functions \(\varvec{f}_t\), \(h_t\), \(h_T\), and \(\varvec{\mu}_t\). Quoting [11, p. 42], we report: “It turns out that these” (i.e., the abovementioned) “difficulties are mainly technical and do not substantially affect the basic results to be obtained. For this reason, we find it convenient to proceed with informal derivations and arguments; this is consistent with most of the literature on the subject.” Finally, it is worth noting that, for a rigorous treatment of DP, [11] refers to [14].
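Schematically, with \(J\) standing for the total cost in (7.10) (our shorthand), the delicate step is the interchange

```latex
\min_{\varvec{\mu}_1,\ldots,\varvec{\mu}_{T-1}}
\mathop{\mathbb{E}}_{\varvec{\xi}_0} \bigl[ J \bigr]
\;\ge\;
\mathop{\mathbb{E}}_{\varvec{\xi}_0}
\Bigl[ \min_{\varvec{\mu}_1,\ldots,\varvec{\mu}_{T-1}} J \Bigr]
```

where the inequality always holds (minimizing inside the expectation can only decrease the value), while the DP derivation needs equality; the countable-support, finiteness, and measurability conditions above are sufficient for that.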
- 2.
Here, q does not refer to the dimensions of the random disturbances.
- 3.
The terminology “reinforcement learning” originates from studies of animal learning in experimental psychology [79], in which it was observed that, when interacting with its own environment, an animal usually learns how to improve the outcomes of future choices from past experiences.
- 4.
Since we have assumed that the system dynamics is known, our use of the term “approximate” in the abbreviation ADP has to be interpreted essentially as “numerically approximate,” without reference to the classical RL situation, in which a further approximation stems from the lack of model knowledge. In that different situation, where the adaptive control context plays a basic role, the abbreviation stands for adaptive dynamic programming. However, we shall not use the abbreviation ADP with this meaning.
- 5.
We assume that the system (7.49) is “stabilizable” on some set \(\Omega \subset {\mathbb R}^d\). In qualitative terms, this means that there exists a control function \(\varvec{\mu}(\varvec{x}_t)\) such that the closed-loop system \(\varvec{x}_{t+1} = \tilde{\varvec{f}}(\varvec{x}_t) + F(\varvec{x}_t)\,\varvec{\mu}(\varvec{x}_t)\) is asymptotically stable on \(\Omega\). A control function \(\varvec{\mu}(\varvec{x}_t)\) is said to be “admissible” if it is stabilizing and yields a finite cost-to-go \(J^{\varvec{\mu}}(\varvec{x}_t)\).
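A numerical check of these two properties can be sketched as follows (our illustration, on a hypothetical scalar system, not the system of (7.49)): simulate the closed loop under a candidate policy and verify that the state decays while the accumulated quadratic cost stays finite.

```python
def rollout(x0, f_tilde, F, mu, steps=50):
    """Simulate x_{t+1} = f_tilde(x_t) + F(x_t) * mu(x_t) and
    accumulate the quadratic stage cost h(x, u) = x^2 + u^2."""
    x, cost = x0, 0.0
    for _ in range(steps):
        cost += x * x + mu(x) ** 2
        x = f_tilde(x) + F(x) * mu(x)
    return x, cost

# Hypothetical scalar example: the open loop x_{t+1} = 1.2 x_t is unstable,
# but the linear policy mu(x) = -0.9 x gives the closed loop x_{t+1} = 0.3 x_t,
# which is asymptotically stable with a finite cost-to-go.
x_final, J = rollout(2.0,
                     f_tilde=lambda x: 1.2 * x,
                     F=lambda x: 1.0,
                     mu=lambda x: -0.9 * x)
```

Here the policy is admissible in the sense of the note: the state contracts geometrically and the cost series converges.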
- 6.
See Constraints (1.15) for the case of continuous-time optimal control problems: the equalities \(\varvec{g}_t (\varvec{u}_t) = \varvec{0}\) can be converted into the inequalities \(\varvec{g}_t (\varvec{u}_t) \ge \varvec{0}\) and \(-\varvec{g}_t (\varvec{u}_t) \ge \varvec{0}\).
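In code, the conversion described in this note is mechanical. A minimal sketch (our illustration, with a hypothetical scalar constraint) for solvers that only accept inequality constraints of the form \(g(u) \ge 0\):

```python
def split_equalities(g):
    """Convert the equality constraint g(u) = 0 into the equivalent
    pair of inequalities g(u) >= 0 and -g(u) >= 0."""
    return [g, lambda u: -g(u)]

# Hypothetical constraint u - 3 = 0: feasible iff both inequalities hold.
ineqs = split_equalities(lambda u: u - 3.0)
```

A point is feasible for the pair exactly when both values are nonnegative, which forces \(g(u) = 0\).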
References
Archibald TW, McKinnon KIM, Thomas LC (1997) An aggregate stochastic dynamic programming model of multireservoir systems. Water Resour Res 33:333–340
Atkeson C, Stephens B (2008) Random sampling of states in dynamic programming. IEEE Trans Syst Man Cybern Part B Cybern 38:924–929
Baglietto M, Cervellera C, Sanguineti M, Zoppoli R (2010) Management of water resources systems in the presence of uncertainties by nonlinear approximators and deterministic sampling. Comput Optim Appl 47:349–376
Bellman R (1957) Dynamic programming. Princeton University Press
Bellman R, Dreyfus S (1959) Functional approximations and dynamic programming. Math Tables Aids Comput 13:247–251
Bellman R, Dreyfus SE (1962) Applied dynamic programming. Princeton University Press
Bellman R, Kalaba R, Kotkin B (1963) Polynomial approximation: a new computational technique in dynamic programming. Math Comput 17:155–161
Benveniste LM, Scheinkman JA (1979) On the differentiability of the value function in dynamic models of economics. Econometrica 47:727–732
Bertsekas DP (1999) Nonlinear programming, 2nd edn. Athena Scientific
Bertsekas DP (2000) Dynamic programming and optimal control, vol 2. Athena Scientific
Bertsekas DP (2005) Dynamic programming and optimal control, vol 1. Athena Scientific
Bertsekas DP (2005) Dynamic programming and suboptimal control: a survey from ADP to MPC. Eur J Control 11:310–334
Bertsekas DP, Borkar VS, Nedic A (2004) Improved temporal difference methods with linear function approximation. In: Si J, Barto AG, Powell WB, Wunsch D (eds) Handbook of learning and approximate dynamic programming. IEEE Press, pp 233–257
Bertsekas DP, Shreve SE (1978) Stochastic optimal control: the discrete-time case. Academic Press
Bertsekas DP, Tsitsiklis JN (1996) Neuro-dynamic programming. Athena Scientific
Busoniu L, Ernst D, De Schutter B, Babuska R (2010) Approximate dynamic programming with a fuzzy parameterization. Automatica 46:804–814
Busoniu L, Babuska R, De Schutter B, Ernst D (2010) Reinforcement learning and dynamic programming using function approximators. CRC Press
Bottou L (2012) Stochastic gradient descent tricks. In: Montavon G, Orr G, Müller K-R (eds) Neural networks: tricks of the trade. Lecture notes in computer science, vol 7700. Springer, pp 421–436
Cervellera C, Chen VCP, Wen A (2006) Optimization of a large-scale water reservoir network by stochastic dynamic programming with efficient state space discretization. Eur J Oper Res 171:1139–1151
Cervellera C, Gaggero M, Macciò D (2012) Efficient kernel models for learning and approximate minimization problems. Neurocomputing 97:74–85
Cervellera C, Gaggero M, Macciò D (2014) Low-discrepancy sampling for approximate dynamic programming with local approximators. Comput Oper Res 43:108–115
Cervellera C, Gaggero M, Macciò D, Marcialis R (2013) Quasi-random sampling for approximate dynamic programming. In: Proceedings of the international joint conference on neural networks, pp 2567–2574
Cervellera C, Gaggero M, Macciò D, Marcialis R (2015) Efficient use of Nadaraya-Watson models and low-discrepancy sequences for approximate dynamic programming. In: Proceedings of the international joint conference on neural networks
Cervellera C, Macciò D (2011) A comparison of global and semi-local approximation in \(T\)-stage stochastic optimization. Eur J Oper Res 208:109–118
Cervellera C, Muselli M (2007) Efficient sampling in approximate dynamic programming. Comput Optim Appl 38:417–443
Cervellera C, Wen A, Chen VCP (2007) Neural network and regression spline value function approximations for stochastic dynamic programming. Comput Oper Res 34:70–90
Chaudhary SK (2005) American options and the LSM algorithm: quasi-random sequences and Brownian bridges. J Comput Financ 8:101–115
Chen VCP, Ruppert D, Shoemaker CA (1999) Applying experimental design and regression splines to high-dimensional continuous-state stochastic dynamic programming. Oper Res 47:38–53
Chen VCP, Tsui KL, Barton RR, Allen JK (2003) A review of design and modeling in computer experiments. In: Khattree R, Rao CR (eds) Handbook in statistics: statistics in industry, vol 22. Elsevier, pp 231–261
Dion M, L'Ecuyer P (2010) American option pricing with randomized quasi-Monte Carlo simulations. In: Johansson B, Jain S, Montoya-Torres J, Hugan J, Yücesan E (eds) Proceedings of the 2010 winter simulation conference, pp 2705–2720
Ernst D, Geurts P, Wehenkel L (2005) Tree-based batch mode reinforcement learning. J Mach Learn Res 6:503–556
Farahmand AM, Ghavamzadeh M, Szepesvári C, Mannor S (2009) Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. In: Proceedings of the American control conference, pp 725–730
Farahmand AM, Munos R, Szepesvári C (2010) Error propagation for approximate policy and value iteration. In: Lafferty J, Williams CKI, Shawe-Taylor J, Zemel RS, Culotta A (eds) Advances in neural information processing systems 23. MIT Press, pp 568–576
Foufoula-Georgiou E, Kitanidis PK (1988) Gradient dynamic programming for stochastic optimal control of multidimensional water resources systems. Water Resour Res 24:1345–1359
Gaggero M, Gnecco G, Sanguineti M (2013) Dynamic programming and value-function approximation in sequential decision problems: error analysis and numerical results. J Optim Theory Appl 156:380–416
Gaggero M, Gnecco G, Sanguineti M (2014) Approximate dynamic programming for stochastic \(N\)-stage optimization with application to optimal consumption under uncertainty. Comput Optim Appl 58:31–85
Gaggero M, Gnecco G, Sanguineti M (2014) Suboptimal policies for stochastic \(N\)-stage optimization problems: accuracy analysis and a case study from optimal consumption. In: El Ouardighi F, Kogan K (eds) Models and methods in economics and management. Springer, pp 27–50
Gal S (1979) Optimal management of a multireservoir water supply system. Water Resour Res 15:737–749
Gale D (1967) On optimal development in a multi-sector economy. Rev Econ Stud 34:1–18
Giuliani M, Quinn JD, Herman JD, Castelletti A, Reed PM (2018) Scalable multiobjective control for large scale water resources systems under uncertainty. IEEE Trans Control Syst Technol 26:1492–1499
Gnecco G, Sanguineti M (2010) Suboptimal solutions to dynamic optimization problems via approximations of the policy functions. J Optim Theory Appl 146:764–794
Gnecco G, Sanguineti M (2016) Neural approximations of the solutions to a class of stochastic optimal control problems. J Neurotechnol 1:1–16
Gnecco G, Sanguineti M (2018) Neural approximations in discounted infinite-horizon stochastic optimal control problems. Eng Appl Artif Intell 74:294–302
Guo W, Si J, Liu F, Mei S (2018) Policy approximation in policy iteration approximate dynamic programming for discrete-time nonlinear systems. IEEE Trans Neural Netw Learn Syst 29:2794–2807
Haykin S (2008) Neural networks and learning machines. Pearson Prentice-Hall
Johnson SA, Stedinger JR, Shoemaker C, Li Y, Tejada-Guibert JA (1993) Numerical solution of continuous-state dynamic programs using linear and spline interpolation. Oper Res 41:484–500
Judd K (1998) Numerical methods in economics. MIT Press
Kamalapurkar R, Walters P, Rosenfeld J, Dixon W (2018) Reinforcement learning for optimal feedback control. Springer
Kitanidis PK (1986) Hermite interpolation on an \(n\)-dimensional rectangular grid. Technical report, St. Anthony Falls Hydraulics Laboratory, University of Minnesota
Kiumarsi B, Vamvoudakis KG, Modares H, Lewis FL (2018) Optimal and autonomous control using reinforcement learning: a survey. IEEE Trans Neural Netw Learn Syst 29:2042–2062
Kleinman DL, Athans M (1968) The design of suboptimal linear time-varying systems. IEEE Trans Autom Control 13:150–159
Kwakernaak H, Sivan R (1972) Linear optimal control systems. Wiley
Larson RE (1968) State increment dynamic programming. Elsevier
Lewis FL, Vrabie D, Vamvoudakis KG (2012) Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers. IEEE Control Syst Mag 32:76–105
Montrucchio L (1987) Lipschitz continuous policy functions for strongly concave optimization problems. J Math Econ 16:259–273
Montrucchio L (1998) Thompson metric, contraction property and differentiability of policy functions. J Econ Behav Organ 33:449–466
Montrucchio L, Boldrin M (1986) On the indeterminacy of capital accumulation paths. J Econ Theory 40:26–39
Munos R, Moore A (2002) Variable resolution discretization in optimal control. Mach Learn 49:291–323
Munos R, Szepesvári C (2008) Finite-time bounds for fitted value iteration. J Mach Learn Res 9:815–857
Nguyen DH, Widrow B (1990) Neural networks for self-learning control systems. IEEE Control Syst Mag 10:18–23
Ormoneit D, Sen S (2002) Kernel-based reinforcement learning. Mach Learn 49:161–178
Parisini T, Zoppoli R (1994) Neural networks for feedback feedforward nonlinear control systems. IEEE Trans Neural Netw 5:436–449
Philbrick CR Jr, Kitanidis PK (2001) Improved dynamic programming methods for optimal control of lumped-parameter stochastic systems. Oper Res 49:398–412
Powell WB (2011) Approximate dynamic programming: solving the curse of dimensionality, 2nd edn. Wiley
Raiffa H, Schlaifer R (1961) Applied statistical decision theory. Harvard University Press
Riedmiller M (2005) Neural fitted Q iteration. First experiences with a data efficient neural reinforcement learning method. In: Proceedings of the 16th European conference on machine learning, pp 317–328
Salas JD, Tabios GQ III, Bartolini P (1985) Approaches to multivariate modeling of water resources time series. Water Resour Bull 21:683–708
Samuel AL (1959) Some studies in machine learning using the game of checkers. IBM J Res Dev 3:211–229
Schweitzer PJ, Seidmann A (1985) Generalized polynomial approximations in Markovian decision processes. J Math Anal Appl 110:568–582
Shapiro A, Dentcheva D, Ruszczynski A (2009) Lectures on stochastic programming: modeling and theory. MOS-SIAM series on optimization
Si J, Barto AG, Powell WB, Wunsch D (eds) (2004) Handbook of learning and approximate dynamic programming. IEEE Press
Smith JE, McCardle KF (2002) Structural properties of stochastic dynamic programs. Oper Res 50:796–809
Sokolov Y, Kozma R, Werbos LD, Werbos PJ (2015) Complete stability analysis of a heuristic approximate dynamic programming control design. Automatica 59:9–18
Stokey NL, Lucas RE, Prescott EC (1989) Recursive methods in economic dynamics. Harvard University Press
Sutton RS (1988) Learning to predict by the method of temporal differences. Mach Learn 3:9–44
Sutton RS, Barto AG (1998) Reinforcement learning. MIT Press
Ten Hagen S, Kröse B (2003) Neural Q-learning. Neural Comput Appl 12:81–88
Teytaud O, Gelly S, Mary J (2007) Active learning in regression, with application to stochastic dynamic programming. In: Proceedings of the 4th international conference on informatics in control, automation, and robotics, intelligent control systems in optimization, pp 198–205
Thorndike E (1911) Animal intelligence: experimental studies. Macmillan
Tsai JCC, Chen VCP, Beck MB, Chen J (2004) Stochastic dynamic programming formulation for a wastewater treatment decision-making framework. Ann Oper Res 132:207–221
Tsitsiklis JN, Van Roy B (1996) Feature-based methods for large scale dynamic programming. Mach Learn 22:59–94
Turgeon A (1981) A decomposition method for the long-term scheduling of reservoirs in series. Water Resour Res 17:1565–1570
Wan EA, Beaufays F (1996) Diagrammatic derivation of gradient algorithms for neural networks. Neural Comput 8:182–201
Wang D, He H, Liu D (2017) Adaptive critic nonlinear robust control: a survey. IEEE Trans Cybern 47:3429–3451
Watkins C (1989) Learning from delayed rewards. PhD thesis, Cambridge University, Cambridge, UK
Watkins C, Dayan P (1992) Q-learning. Mach Learn 8:279–292
Werbos PJ (1989) Neural networks for control and system identification. In: Proceedings of the IEEE conference on decision and control, pp 260–265
Werbos PJ (1991) A menu of designs for reinforcement learning over time. In: Miller WTI, Sutton RS, Werbos PJ (eds) Neural networks for control. MIT Press, pp 67–95
Werbos PJ (1992) Approximate dynamic programming for real-time control and neural modeling. In: White DA, Sofge DA (eds) Handbook of intelligent control. Von Nostrand Reinhold
Yakowitz S (1982) Dynamic programming applications in water resources. Water Resour Res 18:673–696
© 2020 Springer Nature Switzerland AG
Cite this chapter
Zoppoli, R., Sanguineti, M., Gnecco, G., Parisini, T. (2020). Stochastic Optimal Control with Perfect State Information over a Finite Horizon. In: Neural Approximations for Optimal Control and Decision. Communications and Control Engineering. Springer, Cham. https://doi.org/10.1007/978-3-030-29693-3_7
Print ISBN: 978-3-030-29691-9
Online ISBN: 978-3-030-29693-3
eBook Packages: Intelligent Technologies and Robotics