Abstract
Discrete-time stochastic optimal control problems are stated over a finite number of decision stages. The state vector is assumed to be perfectly measurable. Such problems are infinite-dimensional, since one has to find control functions of the state. Owing to the general assumptions under which the problems are formulated, two approximation techniques are addressed. The first is an approximate version of dynamic programming, obtained by discretizing the state space. Instead of regular grids, which lead to an exponential growth of the number of samples (and thus to the curse of dimensionality), low-discrepancy sequences (such as quasi-Monte Carlo ones) are considered. The second technique is the application of the “Extended Ritz Method” (ERIM), which consists of substituting the admissible functions with fixed-structure parametrized functions containing vectors of “free” parameters. This reduces the original problems to easier nonlinear programming problems. If suitable regularity assumptions are satisfied, such problems can be solved by stochastic gradient algorithms. The gradient can be computed by resorting to the classical adjoint equations that solve deterministic optimal control problems, with the addition of one term that depends on the chosen family of fixed-structure parametrized functions.
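As a concrete illustration of the sampling idea (our sketch, not code from the chapter), the following generates points of the Halton sequence, a standard low-discrepancy construction: each coordinate is the radical-inverse of the point index in a distinct prime base, so the number of samples is chosen freely instead of growing exponentially with the state dimension as it does for a regular grid.

```python
def van_der_corput(n, base):
    """Radical inverse of the integer n in the given base
    (one coordinate of a Halton point)."""
    q, denom = 0.0, 1.0
    while n:
        n, rem = divmod(n, base)
        denom *= base
        q += rem / denom
    return q

def halton(n_points, primes=(2, 3)):
    """First n_points of the Halton sequence in [0, 1)^d,
    one prime base per coordinate (here d = 2 by default)."""
    return [[van_der_corput(i, b) for b in primes]
            for i in range(1, n_points + 1)]

# Four well-spread samples of a 2-dimensional unit state space.
points = halton(4)
```

For production use, library generators (e.g., the quasi-Monte Carlo module of SciPy) provide Halton and Sobol' sequences with scrambling; the sketch above is only meant to show that the sample count is decoupled from the dimension.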
Notes
- 1.
As pointed out in Proposition 1.3.1 and in Sect. 1.5 of [11], this proof is “somewhat informal.” In particular, moving the minimum with respect to \(\varvec{\mu}_1,\ldots,\varvec{\mu}_{T-1}\) inside the expectation with respect to \(\varvec{\xi}_0\) is generally not a mathematically correct step of the proof, unless suitable conditions hold. It works, for example, if, among other assumptions, for any \(\varvec{x}_0\) and any admissible \(\varvec{u}_0 \in U_0(\varvec{x}_0)\), the disturbance \(\varvec{\xi}_0\) takes on a finite or a countable number of values in the set \(\varXi_0\), and the expected values of all the cost terms in (7.10) are finite. These expected values can be computed as summations of a finite or countable number of terms weighted by the probabilities of the elements in \(\varXi_0\). Moreover, as the proof continues, measurability assumptions may be required on the functions \(\varvec{f}_t\), \(h_t\), \(h_T\), and \(\varvec{\mu}_t\). Quoting [11, p. 42], we report: “It turns out that these” (i.e., the abovementioned) “difficulties are mainly technical and do not substantially affect the basic results to be obtained. For this reason, we find it convenient to proceed with informal derivations and arguments; this is consistent with most of the literature on the subject.” Finally, it is worth noting that, for a rigorous treatment of DP, [11] refers to [14].
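Schematically, with \(J\) standing for the total cost in (7.10) (our shorthand), the delicate step is the interchange

```latex
\min_{\varvec{\mu}_1,\ldots,\varvec{\mu}_{T-1}}
\mathop{\mathbb{E}}_{\varvec{\xi}_0} \bigl[ J \bigr]
\;\ge\;
\mathop{\mathbb{E}}_{\varvec{\xi}_0}
\Bigl[ \min_{\varvec{\mu}_1,\ldots,\varvec{\mu}_{T-1}} J \Bigr]
```

where the inequality always holds (minimizing inside the expectation can only decrease the value), while the DP derivation needs equality; the countable-support, finiteness, and measurability conditions above are sufficient for that.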
- 2.
Here, q does not refer to the dimensions of the random disturbances.
- 3.
The terminology “reinforcement learning” originates from studies of animal learning in experimental psychology [79], in which it was observed that, when interacting with its own environment, an animal usually learns how to improve the outcomes of future choices from past experiences.
- 4.
Since we have assumed that the system dynamics is known, our use of the term “approximate” in the abbreviation ADP has to be interpreted essentially as “numerically approximate,” without reference to the classical RL situation, in which a further approximation stems from the lack of model knowledge. In that different situation, where the adaptive control context plays a basic role, the abbreviation stands for adaptive dynamic programming. However, we shall not use the abbreviation ADP with this meaning.
- 5.
We assume that the system (7.49) is “stabilizable” on some set \(\Omega \subset {\mathbb R}^d\). In qualitative terms, this means that there exists a control function \(\varvec{\mu}(\varvec{x}_t)\) such that the closed-loop system \(\varvec{x}_{t+1} = \tilde{\varvec{f}}(\varvec{x}_t) + F(\varvec{x}_t)\,\varvec{\mu}(\varvec{x}_t)\) is asymptotically stable on \(\Omega\). A control function \(\varvec{\mu}(\varvec{x}_t)\) is said to be “admissible” if it is stabilizing and yields a finite cost-to-go \(J^{\varvec{\mu}}(\varvec{x}_t)\).
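A numerical check of these two properties can be sketched as follows (our illustration, on a hypothetical scalar system, not the system of (7.49)): simulate the closed loop under a candidate policy and verify that the state decays while the accumulated quadratic cost stays finite.

```python
def rollout(x0, f_tilde, F, mu, steps=50):
    """Simulate x_{t+1} = f_tilde(x_t) + F(x_t) * mu(x_t) and
    accumulate the quadratic stage cost h(x, u) = x^2 + u^2."""
    x, cost = x0, 0.0
    for _ in range(steps):
        cost += x * x + mu(x) ** 2
        x = f_tilde(x) + F(x) * mu(x)
    return x, cost

# Hypothetical scalar example: the open loop x_{t+1} = 1.2 x_t is unstable,
# but the linear policy mu(x) = -0.9 x gives the closed loop x_{t+1} = 0.3 x_t,
# which is asymptotically stable with a finite cost-to-go.
x_final, J = rollout(2.0,
                     f_tilde=lambda x: 1.2 * x,
                     F=lambda x: 1.0,
                     mu=lambda x: -0.9 * x)
```

Here the policy is admissible in the sense of the note: the state contracts geometrically and the cost series converges.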
- 6.
See Constraints (1.15) for the case of continuous-time optimal control problems: the equalities \(\varvec{g}_t (\varvec{u}_t) = \varvec{0}\) can be converted into the inequalities \(\varvec{g}_t (\varvec{u}_t) \ge \varvec{0}\) and \(-\varvec{g}_t (\varvec{u}_t) \ge \varvec{0}\).
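In code, the conversion described in this note is mechanical. A minimal sketch (our illustration, with a hypothetical scalar constraint) for solvers that only accept inequality constraints of the form \(g(u) \ge 0\):

```python
def split_equalities(g):
    """Convert the equality constraint g(u) = 0 into the equivalent
    pair of inequalities g(u) >= 0 and -g(u) >= 0."""
    return [g, lambda u: -g(u)]

# Hypothetical constraint u - 3 = 0: feasible iff both inequalities hold.
ineqs = split_equalities(lambda u: u - 3.0)
```

A point is feasible for the pair exactly when both values are nonnegative, which forces \(g(u) = 0\).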
References
Archibald TW, McKinnon KIM, Thomas LC (1997) An aggregate stochastic dynamic programming model of multireservoir systems. Water Resour Res 33:333–340
Atkeson C, Stephens B (2008) Random sampling of states in dynamic programming. IEEE Trans Syst Man Cybern Part B Cybern 38:924–929
Baglietto M, Cervellera C, Sanguineti M, Zoppoli R (2010) Management of water resources systems in the presence of uncertainties by nonlinear approximators and deterministic sampling. Comput Optim Appl 47:349–376
Bellman R (1957) Dynamic programming. Princeton University Press
Bellman R, Dreyfus S (1959) Functional approximations and dynamic programming. Math Tables Aids Comput 13:247–251
Bellman R, Dreyfus SE (1962) Applied dynamic programming. Princeton University Press
Bellman R, Kalaba R, Kotkin B (1963) Polynomial approximation: a new computational technique in dynamic programming. Math Comput 17:155–161
Benveniste LM, Scheinkman JA (1979) On the differentiability of the value function in dynamic models of economics. Econometrica 47:727–732
Bertsekas DP (1999) Nonlinear programming, 2nd edn. Athena Scientific
Bertsekas DP (2000) Dynamic programming and optimal control, vol 2. Athena Scientific
Bertsekas DP (2005) Dynamic programming and optimal control, vol 1. Athena Scientific
Bertsekas DP (2005) Dynamic programming and suboptimal control: a survey from ADP to MPC. Eur J Control 11:310–334
Bertsekas DP, Borkar VS, Nedic A (2004) Improved temporal difference methods with linear function approximation. In: Si J, Barto AG, Powell WB, Wunsch D (eds) Handbook of learning and approximate dynamic programming. IEEE Press, pp 233–257
Bertsekas DP, Shreve SE (1978) Stochastic optimal control: the discrete-time case. Academic Press
Bertsekas DP, Tsitsiklis JN (1996) Neuro-dynamic programming. Athena Scientific
Busoniu L, Ernst D, De Schutter B, Babuska R (2010) Approximate dynamic programming with a fuzzy parameterization. Automatica 46:804–814
Busoniu L, Babuska R, De Schutter B, Ernst D (2010) Reinforcement learning and dynamic programming using function approximators. CRC Press
Bottou L (2012) Stochastic gradient descent tricks. In: Montavon G, Orr G, Müller K-R (eds) Neural networks: tricks of the trade. Lecture notes in computer science, vol 7700. Springer, pp 421–436
Cervellera C, Chen VCP, Wen A (2006) Optimization of a large-scale water reservoir network by stochastic dynamic programming with efficient state space discretization. Eur J Oper Res 171:1139–1151
Cervellera C, Gaggero M, Macciò D (2012) Efficient kernel models for learning and approximate minimization problems. Neurocomputing 97:74–85
Cervellera C, Gaggero M, Macciò D (2014) Low-discrepancy sampling for approximate dynamic programming with local approximators. Comput Oper Res 43:108–115
Cervellera C, Gaggero M, Macciò D, Marcialis R (2013) Quasi-random sampling for approximate dynamic programming. In: Proceedings of the international joint conference on neural networks, pp 2567–2574
Cervellera C, Gaggero M, Macciò D, Marcialis R (2015) Efficient use of Nadaraya-Watson models and low-discrepancy sequences for approximate dynamic programming. In: Proceedings of the international joint conference on neural networks
Cervellera C, Macciò D (2011) A comparison of global and semi-local approximation in \(T\)-stage stochastic optimization. Eur J Oper Res 208:109–118
Cervellera C, Muselli M (2007) Efficient sampling in approximate dynamic programming. Comput Optim Appl 38:417–443
Cervellera C, Wen A, Chen VCP (2007) Neural network and regression spline value function approximations for stochastic dynamic programming. Comput Oper Res 34:70–90
Chaudhary SK (2005) American options and the LSM algorithm: quasi-random sequences and Brownian bridges. J Comput Financ 8:101–115
Chen VCP, Ruppert D, Shoemaker CA (1999) Applying experimental design and regression splines to high-dimensional continuous-state stochastic dynamic programming. Oper Res 47:38–53
Chen VCP, Tsui KL, Barton RR, Allen JK (2003) A review of design and modeling in computer experiments. In: Khattree R, Rao CR (eds) Handbook in statistics: statistics in industry, vol 22. Elsevier, pp 231–261
Dion M, L'Ecuyer P (2010) American option pricing with randomized quasi-Monte Carlo simulations. In: Johansson B, Jain S, Montoya-Torres J, Hugan J, Yücesan E (eds) Proceedings of the 2010 winter simulation conference, pp 2705–2720
Ernst D, Geurts P, Wehenkel L (2005) Tree-based batch mode reinforcement learning. J Mach Learn Res 6:503–556
Farahmand AM, Ghavamzadeh M, Szepesvári C, Mannor S (2009) Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. In: Proceedings of the American control conference, pp 725–730
Farahmand AM, Munos R, Szepesvári C (2010) Error propagation for approximate policy and value iteration. In: Lafferty J, Williams CKI, Shawe-Taylor J, Zemel RS, Culotta A (eds) Advances in neural information processing systems 23. MIT Press, pp 568–576
Foufoula-Georgiou E, Kitanidis PK (1988) Gradient dynamic programming for stochastic optimal control of multidimensional water resources systems. Water Resour Res 24:1345–1359
Gaggero M, Gnecco G, Sanguineti M (2013) Dynamic programming and value-function approximation in sequential decision problems: error analysis and numerical results. J Optim Theory Appl 156:380–416
Gaggero M, Gnecco G, Sanguineti M (2014) Approximate dynamic programming for stochastic \(N\)-stage optimization with application to optimal consumption under uncertainty. Comput Optim Appl 58:31–85
Gaggero M, Gnecco G, Sanguineti M (2014) Suboptimal policies for stochastic \(N\)-stage optimization problems: accuracy analysis and a case study from optimal consumption. In: El Ouardighi F, Kogan K (eds) Models and methods in economics and management. Springer, pp 27–50
Gal S (1979) Optimal management of a multireservoir water supply system. Water Resour Res 15:737–749
Gale D (1967) On optimal development in a multi-sector economy. Rev Econ Stud 34:1–18
Giuliani M, Quinn JD, Herman JD, Castelletti A, Reed PM (2018) Scalable multiobjective control for large scale water resources systems under uncertainty. IEEE Trans Control Syst Technol 26:1492–1499
Gnecco G, Sanguineti M (2010) Suboptimal solutions to dynamic optimization problems via approximations of the policy functions. J Optim Theory Appl 146:764–794
Gnecco G, Sanguineti M (2016) Neural approximations of the solutions to a class of stochastic optimal control problems. J Neurotechnol 1:1–16
Gnecco G, Sanguineti M (2018) Neural approximations in discounted infinite-horizon stochastic optimal control problems. Eng Appl Artif Intell 74:294–302
Guo W, Si J, Liu F, Mei S (2018) Policy approximation in policy iteration approximate dynamic programming for discrete-time nonlinear systems. IEEE Trans Neural Netw Learn Syst 29:2794–2807
Haykin S (2008) Neural networks and learning machines. Pearson Prentice-Hall
Johnson SA, Stedinger JR, Shoemaker C, Li Y, Tejada-Guibert JA (1993) Numerical solution of continuous-state dynamic programs using linear and spline interpolation. Oper Res 41:484–500
Judd K (1998) Numerical methods in economics. MIT Press
Kamalapurkar R, Walters P, Rosenfeld J, Dixon W (2018) Reinforcement learning for optimal feedback control. Springer
Kitanidis PK (1986) Hermite interpolation on an \(n\)-dimensional rectangular grid. Technical report, St. Anthony Falls Hydraulics Laboratory, University of Minnesota
Kiumarsi B, Vamvoudakis KG, Modares H, Lewis FL (2018) Optimal and autonomous control using reinforcement learning: a survey. IEEE Trans Neural Netw Learn Syst 29:2042–2062
Kleinman DL, Athans M (1968) The design of suboptimal linear time-varying systems. IEEE Trans Autom Control 13:150–159
Kwakernaak H, Sivan R (1972) Linear optimal control systems. Wiley
Larson RE (1968) State increment dynamic programming. Elsevier
Lewis FL, Vrabie D, Vamvoudakis KG (2012) Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers. IEEE Control Syst Mag 32:76–105
Montrucchio L (1987) Lipschitz continuous policy functions for strongly concave optimization problems. J Math Econ 16:259–273
Montrucchio L (1998) Thompson metric, contraction property and differentiability of policy functions. J Econ Behav Organ 33:449–466
Montrucchio L, Boldrin M (1986) On the indeterminacy of capital accumulation paths. J Econ Theory 40:26–39
Munos R, Moore A (2002) Variable resolution discretization in optimal control. Mach Learn 49:291–323
Munos R, Szepesvári C (2008) Finite-time bounds for fitted value iteration. J Mach Learn Res 9:815–857
Nguyen DH, Widrow B (1990) Neural networks for self-learning control systems. IEEE Control Syst Mag 10:18–23
Ormoneit D, Sen S (2002) Kernel-based reinforcement learning. Mach Learn 49:161–178
Parisini T, Zoppoli R (1994) Neural networks for feedback feedforward nonlinear control systems. IEEE Trans Neural Netw 5:436–449
Philbrick CR Jr, Kitanidis PK (2001) Improved dynamic programming methods for optimal control of lumped-parameter stochastic systems. Oper Res 49:398–412
Powell WB (2011) Approximate dynamic programming: solving the curse of dimensionality, 2nd edn. Wiley
Raiffa H, Schlaifer R (1961) Applied statistical decision theory. Harvard University Press
Riedmiller M (2005) Neural fitted Q iteration. First experiences with a data efficient neural reinforcement learning method. In: Proceedings of the 16th European conference on machine learning, pp 317–328
Salas JD, Tabios GQ III, Bartolini P (1985) Approaches to multivariate modeling of water resources time series. Water Resour Bull 21:683–708
Samuel AL (1959) Some studies in machine learning using the game of checkers. IBM J Res Dev 3:211–229
Schweitzer PJ, Seidmann A (1985) Generalized polynomial approximations in Markovian decision processes. J Math Anal Appl 110:568–582
Shapiro A, Dentcheva D, Ruszczynski A (2009) Lectures on stochastic programming: modeling and theory. MOS-SIAM series on optimization
Si J, Barto AG, Powell WB, Wunsch D (eds) (2004) Handbook of learning and approximate dynamic programming. IEEE Press
Smith JE, McCardle KF (2002) Structural properties of stochastic dynamic programs. Oper Res 50:796–809
Sokolov Y, Kozma R, Werbos LD, Werbos PJ (2015) Complete stability analysis of a heuristic approximate dynamic programming control design. Automatica 59:9–18
Stokey NL, Lucas RE, Prescott EC (1989) Recursive methods in economic dynamics. Harvard University Press
Sutton RS (1988) Learning to predict by the method of temporal differences. Mach Learn 3:9–44
Sutton RS, Barto AG (1998) Reinforcement learning. MIT Press
Ten Hagen S, Kröse B (2003) Neural Q-learning. Neural Comput Appl 12:81–88
Teytaud O, Gelly S, Mary J (2007) Active learning in regression, with application to stochastic dynamic programming. In: Proceedings of the 4th international conference on informatics in control, automation, and robotics, intelligent control systems in optimization, pp 198–205
Thorndike E (1911) Animal intelligence: experimental studies. Macmillan
Tsai JCC, Chen VCP, Beck MB, Chen J (2004) Stochastic dynamic programming formulation for a wastewater treatment decision-making framework. Ann Oper Res 132:207–221
Tsitsiklis JN, Van Roy B (1996) Feature-based methods for large scale dynamic programming. Mach Learn 22:59–94
Turgeon A (1981) A decomposition method for the long-term scheduling of reservoirs in series. Water Resour Res 17:1565–1570
Wan EA, Beaufays F (1996) Diagrammatic derivation of gradient algorithms for neural networks. Neural Comput 8:182–201
Wang D, He H, Liu D (2017) Adaptive critic nonlinear robust control: a survey. IEEE Trans Cybern 47:3429–3451
Watkins C (1989) Learning from delayed rewards. PhD thesis, Cambridge University, Cambridge, UK
Watkins C, Dayan P (1992) Q-learning. Mach Learn 8:279–292
Werbos PJ (1989) Neural networks for control and system identification. In: Proceedings of the IEEE conference on decision and control, pp 260–265
Werbos PJ (1991) A menu of designs for reinforcement learning over time. In: Miller WTI, Sutton RS, Werbos PJ (eds) Neural networks for control. MIT Press, pp 67–95
Werbos PJ (1992) Approximate dynamic programming for real-time control and neural modeling. In: White DA, Sofge DA (eds) Handbook of intelligent control. Von Nostrand Reinhold
Yakowitz S (1982) Dynamic programming applications in water resources. Water Resour Res 18:673–696
© 2020 Springer Nature Switzerland AG
Cite this chapter
Zoppoli, R., Sanguineti, M., Gnecco, G., Parisini, T. (2020). Stochastic Optimal Control with Perfect State Information over a Finite Horizon. In: Neural Approximations for Optimal Control and Decision. Communications and Control Engineering. Springer, Cham. https://doi.org/10.1007/978-3-030-29693-3_7
Print ISBN: 978-3-030-29691-9
Online ISBN: 978-3-030-29693-3
eBook Packages: Intelligent Technologies and Robotics