Abstract
We consider transient or absorbing discrete-time Markov decision processes with expected total rewards. We prove inequalities to estimate the stability of optimal control policies with respect to the total variation norm and the Prokhorov metric. Some application examples are given.
References
D.P. Bertsekas, J.N. Tsitsiklis, An analysis of stochastic shortest path problems. Math. Oper. Res. 16(3), 580–595 (1991)
A.A. Borovkov, S.G. Foss, Stochastically recursive sequences and their generalization. Sib. Adv. Math. 2, 16–81 (1992)
R.M. Dudley, Real Analysis and Probability. Volume 74 of Cambridge Studies in Advanced Mathematics (Cambridge University Press, Cambridge, 2002). Revised reprint of the 1989 original
E.B. Dynkin, A.A. Yushkevich, Controlled Markov Processes (Springer, New York, 1979)
E.I. Gordienko, E. Lemus-Rodriguez, R. Montes-de Oca, Discounted cost optimality problem: stability with respect to weak metrics. Math. Methods Oper. Res. 68, 77–96 (2008)
E.I. Gordienko, E. Lemus-Rodriguez, R. Montes-de Oca, Average cost Markov control processes: stability with respect to the Kantorovich metric. Math. Methods Oper. Res. 70, 13–33 (2009)
E.I. Gordienko, A. Novikov, Characterization of optimal policies in a general stopping problem and stability estimating. Probab. Eng. Inf. Sci. 28(3), 335–352 (2014)
E.I. Gordienko, F. Salem, Estimates of stability of Markov control processes with unbounded cost. Kybernetika 36, 195–210 (2000)
O. Hernández-Lerma, J.B. Lasserre, Further Topics on Discrete-Time Markov Control Processes (Springer, New York, 1999)
O. Hernández-Lerma, G. Carrasco, R. Pérez-Hernández, Markov control processes with the expected total cost criterion: optimality, stability, and transient models. Acta Appl. Math. 59(3), 229–269 (1999)
K. Hinderer, K.H. Waldmann, Algorithms for countable state Markov decision models with an absorbing set. SIAM J. Control Optim. 43(6), 2109–2131 (electronic) (2005)
A. Hordijk, Dynamic Programming and Markov Potential Theory. Volume No. 51 of Mathematical Centre Tracts (Mathematisch Centrum, Amsterdam, 1974)
H.W. James, E.J. Collins, An analysis of transient Markov decision processes. J. Appl. Probab. 43(3), 603–621 (2006)
L.C.M. Kallenberg, Linear Programming and Finite Markovian Control Problems. Volume 148 of Mathematical Centre Tracts (Mathematisch Centrum, Amsterdam, 1983)
S.P. Meyn, R.L. Tweedie, Markov Chains and Stochastic Stability. Communications and Control Engineering Series (Springer, London, 1993)
S.R. Pliska, On the transient case for Markov decision chains with general state spaces, in Dynamic Programming and Its Applications (Proc. Conf., University of British Columbia, Vancouver, 1977) (Academic, New York/London, 1978), pp. 335–349
S.M. Ross, Applied Probability Models with Optimization Applications (Dover Publications, New York, 1992). Reprint of the 1970 original
A.N. Shiryayev, Optimal Stopping Rules (Springer, New York/Heidelberg, 1978). Translated from the Russian by A.B. Aries, Applications of Mathematics, vol. 8
A.F. Veinott, Discrete dynamic programming with sensitive discount optimality criteria. Ann. Math. Stat. 40, 1635–1660 (1969)
E. Zaitseva, Stability estimating in optimal stopping problem. Kybernetika (Prague) 44(3), 400–415 (2008)
E. Zaitseva, Robustness estimating of optimal stopping problem with unbounded revenue and cost functions. Int. J. Pure Appl. Math. 59(3), 291–306 (2010)
Appendix: Proofs of the Results
The following lemma establishes a connection between Assumption 2.1 and the definitions of transient policies given in [13, 16].
Lemma 1
Let \(\Theta \) be as in Assumption 2.1, \(X_{0} = X\setminus \Theta \), and for \(f \in \mathbb{F}\) let \(Q_{f}\) be the restriction of the kernel \(P_{f}\) in (2.1) to \((X_{0},\mathfrak{B}_{X_{0}})\). If (2.3) holds, then
In (1), \(\left \|\cdot \right \|_{0}\) denotes the operator norm corresponding to the supremum norm \(\left \|\cdot \right \|\) on \(\mathbb{B}\).
Proof
Since \(Q_{f}^{t}\) is a monotone operator,
For every t ≥ 0
because \(\Theta \) is an absorbing set for \(P_{f}\).
□
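The role of the restricted kernel \(Q_{f}\) can be sketched numerically. The following is a hedged illustration, not from the paper: a small chain (all transition probabilities invented for the example) in which the last state plays the role of the absorbing set \(\Theta \); the restriction \(Q_{f}\) of \(P_{f}\) to \(X_{0}\) is then substochastic, and for a transient policy the operator norms \(\left \|Q_{f}^{n}\right \|_{0}\) decay to zero.

```python
# Hedged numerical sketch (illustrative numbers, not from the paper):
# state 3 is the absorbing set Theta; Q_f is the restriction of P_f to
# X_0 = {0, 1, 2}.  For a transient policy the operator norms ||Q_f^n||
# (induced by the supremum norm, i.e. the maximum row sum) decay to 0.

def mat_mul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def sup_norm(A):
    # Operator norm corresponding to the supremum norm: max absolute row sum.
    return max(sum(abs(v) for v in row) for row in A)

# P_f on X = {0, 1, 2, 3}; Theta = {3} is absorbing (last row = delta_3).
P = [[0.5, 0.3, 0.0, 0.2],
     [0.1, 0.4, 0.3, 0.2],
     [0.0, 0.2, 0.5, 0.3],
     [0.0, 0.0, 0.0, 1.0]]

# Restriction Q_f to X_0 = {0, 1, 2}: substochastic (row sums < 1).
Q = [row[:3] for row in P[:3]]

Qn = [row[:] for row in Q]
norms = []
for n in range(1, 21):
    norms.append(sup_norm(Qn))
    Qn = mat_mul(Qn, Q)

# Submultiplicativity gives ||Q^(n+1)|| <= ||Q|| * ||Q^n||, so the norms
# decrease geometrically, consistent with transience.
assert all(norms[i + 1] <= norms[i] + 1e-12 for i in range(len(norms) - 1))
print(norms[0], norms[-1])
```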
1. Proof of Theorem 3.2
Proof
Let f ∗, \(\tilde{f}_{{\ast}}\) be the optimal stationary policies introduced in Proposition 2.3, and \(F_{{\ast}}:= \left \{f_{{\ast}},\tilde{f}_{{\ast}}\right \}\).
Under Assumptions 2.1 and 3.1, for every \(f \in F_{{\ast}}\) the corresponding rewards \(V_{f}(x) \equiv V (x,f)\), \(\tilde{V }_{f}(x) \equiv \tilde{ V }(x,f)\) are bounded functions and can be rewritten as
From Proposition 2.3 and Assumption 3.1, the following operators \(G_{f}\), \(\tilde{G}_{f}\) (\(f \in F_{{\ast}}\))
act from \(\mathbb{B}\) to \(\mathbb{B}\).
Using (4), (5) and standard arguments (the Markov property [17]), we find that, for \(f \in F_{{\ast}}\),
For the stability index in (1.8) we have [8, 20]:
First, let f = f ∗ (omitting the subscript ∗). Then, by (8), for every n ≥ 1,
From Proposition 2.3 (b), the policy f = f ∗ satisfies (2.3). Therefore, by Lemma 1 and the corresponding result in [13], there exists an integer n ≥ 1 such that the operator \(G_{f}^{n}\) is contractive on \(\mathbb{B}\) with some modulus β < 1. Thus, from (10),
Taking into account (6), (7) and applying the arguments used, for example, in [20], we obtain
The last term on the right-hand side of (12) is less than \(\frac{1} {2}\mathbb{V}(\xi,\tilde{\xi })\).
On the other hand, since r ≡ 0 on \(\Theta \), from Assumption 3.1 we have:
where \(b = \sup _{k \in \mathbb{K}}\left \vert r(k)\right \vert \) and M is the constant from Assumption 3.1.
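The total variation bound behind estimates of this kind can be sketched numerically. The following is a hedged illustration with invented distributions (not the paper's data): for discrete disturbance distributions and any bounded test function, the gap between expectations is controlled by the supremum norm of the test function times the total variation norm of the difference of distributions. Note that conventions for the total variation norm differ by a factor 1/2; the sketch uses the plain sum of absolute differences.

```python
# Hedged sketch (illustrative numbers, not from the paper): for discrete
# distributions p, q on a common support and any bounded phi,
#   |E phi(xi) - E phi(xi~)| <= ||phi|| * sum_s |p(s) - q(s)|,
# the right-hand side being the total variation norm of the difference.

support = [0, 1, 2, 3]
p = [0.40, 0.30, 0.20, 0.10]   # distribution of xi   (assumed example)
q = [0.35, 0.25, 0.30, 0.10]   # distribution of xi~  (perturbed model)

tv = sum(abs(a - b) for a, b in zip(p, q))   # total variation norm

def expect(phi, dist):
    return sum(phi(s) * w for s, w in zip(support, dist))

phi = lambda s: (-1) ** s * 0.7              # an arbitrary bounded test function
phi_norm = max(abs(phi(s)) for s in support)

gap = abs(expect(phi, p) - expect(phi, q))
assert gap <= phi_norm * tv + 1e-12
print(gap, phi_norm * tv)
```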
Second, in (9) let \(f =\tilde{ f}_{{\ast}}\). Now inequality (10) holds with \(f =\tilde{ f}_{{\ast}}\). Let m ≥ 1 and γ < 1 be the constants from Assumption 3.1. From (3) and (3.2) in Assumption 3.1, \(\left \|Q_{f}^{m}\right \| \leq \gamma\).
Since the set \(\Theta \) is absorbing under \(P_{f} \equiv P_{\tilde{f}_{{\ast}}}\) (see Assumption 3.1), iterating (6) gives, for each \(u,v \in \mathbb{B}\), \(\left \|Q_{f}^{m}u - Q_{f}^{m}v\right \| \leq \gamma \left \|u-v\right \|.\) Thus, from (10) it follows that
Proceeding as in (12) and (13) (with \(f \equiv \tilde{ f}_{{\ast}}\) rather than f = f ∗) and applying Assumption 3.1 (b), we get that, for some constant \(\tilde{K}\):
To conclude the proof of (3.3), it suffices to combine inequalities (9) and (11)–(14). □
2. Proof of Theorem 4.3
Proof
Let \(f_{{\ast}},\tilde{f}_{{\ast}}\in \mathbb{F}\) be the stationary policies optimal for MDPs (1.1), (1.2), respectively, and let \(V _{{\ast}} = V _{f_{{\ast}}}\), \(\tilde{V }_{{\ast}} = V _{\tilde{f}_{{\ast}}}\) be the corresponding value functions. The existence of f ∗ and \(\tilde{f}_{{\ast}}\) was ensured in Proposition 2.3. From Assumption 4.1 (a), for every \(f \in \mathbb{F}\) the corresponding rewards V f and \(\tilde{V }_{f}\) (see (2.6)–(2.9)) are zero on \(\Theta \). In particular, \(V _{{\ast}}(x) =\tilde{ V }_{{\ast}}(x) = 0\) for \(x \in \Theta \). Hence, we can consider all functions V f and \(\tilde{V }_{f}\), \(f \in \mathbb{F}\), as elements of the space \(\mathbb{B}\) (they are bounded by Assumption 4.1).
In the usual manner, we introduce the dynamic programming operators \(T,\tilde{T}: \mathbb{B} \rightarrow \mathbb{B}\):
From Assumption 4.2 (a), (b), it follows that for each \(u \in \mathbb{B}\) there exists a stationary policy (selector) \(f_{u}\) such that
Thus, for \(x \in \Theta \), \(Tu(x) = 0\) by Assumption 4.1 (a), and \(T\mathbb{B} \subseteq \mathbb{B}\). (Similarly, \(\tilde{T}\mathbb{B} \subseteq \mathbb{B}\).)
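The action of the dynamic programming operator T on an absorbing model can be sketched as follows. This is a hedged illustration with an invented two-action MDP (all rewards and transition probabilities are assumptions for the example, not the paper's model): T keeps the value at zero on \(\Theta \), maps bounded functions to bounded functions, and iterating T converges to the fixed point \(V _{{\ast}} = TV _{{\ast}}\).

```python
# Hedged sketch of the dynamic programming operator T for a tiny
# absorbing MDP (illustrative numbers, not from the paper): states
# X = {0, 1, 2}, state 2 plays the role of Theta (absorbing, r = 0),
# two actions.  T u(x) = max_a [ r(x,a) + sum_y p(y|x,a) u(y) ], with
# T u = 0 on Theta, so T maps B into B; iterating T yields V* = T V*.

# p[x][a] = transition row, r[x][a] = one-step reward; state 2 = Theta.
p = {0: {0: [0.2, 0.5, 0.3], 1: [0.6, 0.1, 0.3]},
     1: {0: [0.3, 0.3, 0.4], 1: [0.0, 0.5, 0.5]}}
r = {0: {0: 1.0, 1: 2.0},
     1: {0: 0.5, 1: 1.5}}

def T(u):
    Tu = [0.0, 0.0, 0.0]          # value stays 0 on Theta (state 2)
    for x in (0, 1):
        Tu[x] = max(r[x][a] + sum(pp * uu for pp, uu in zip(p[x][a], u))
                    for a in (0, 1))
    return Tu

u = [0.0, 0.0, 0.0]
for _ in range(200):              # value iteration
    u = T(u)

# Each step reaches Theta with probability >= 0.3, so the restricted
# kernel is substochastic and T is a sup-norm contraction here.
v = T(u)
assert max(abs(a - b) for a, b in zip(u, v)) < 1e-10   # fixed point V* = T V*
print(u)
```

The design mirrors the structure of (15): the maximization over actions is taken only on \(X_{0}\), while the convention \(Tu \equiv 0\) on \(\Theta \) enforces \(T\mathbb{B} \subseteq \mathbb{B}\).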
As proven in [13, 16], Assumptions 4.1 and 4.2 suffice for the validity of the following assertions.
Proposition 2
- (a) \(V _{{\ast}} = TV _{{\ast}}\), \(\tilde{V }_{{\ast}} =\tilde{ T}\tilde{V }_{{\ast}}\).
- (b) The optimal policy f ∗ is a selector on the right-hand side of (15) with u = V ∗, and the optimal policy \(\tilde{f}_{{\ast}}\) is a selector on the right-hand side of (16) with \(u =\tilde{ V }_{{\ast}}\).
- (c) There exists an integer m ≥ 1 such that the operator T m is contractive on \(\mathbb{B}_{0}\) with some modulus β < 1.
For any \((x,a) \in \mathbb{K}\) let
To simplify notation, let \(f =\tilde{ f}_{{\ast}}\). As in [5], let \(\Gamma _{t} =\{ x,a_{1},x_{1},a_{2},\ldots,x_{t-1},a_{t}\}\) (t ≥ 1) be the part of a trajectory of process (1.1) under the control policy \(f = \left \{f,f,\ldots \right \}\) (with the initial state \(x \in X_{0}\)). By the Markov property, we have
By (15), (17) and Proposition 2 (a) we obtain:
where
Summing the last equality over t ∈ [1, n], we obtain that
Since \(r,V _{{\ast}}\in \mathbb{B}\), under Assumption 4.1 (b), as n → ∞, \(E_{x}^{f}\sum _{t=1}^{n}r(x_{t-1},a_{t}) \rightarrow E_{x}^{f}\sum _{t=1}^{\infty }r(x_{t-1},a_{t}) = V (x,f)\) (see (2.6)), and \(E_{x}^{f}V _{{\ast}}(x_{n}) = [Q_{f}^{n}V _{{\ast}}](x) \rightarrow 0\), where Q f is the kernel defined in Lemma 1.
Thus we can pass to the limit in (22) to find
Arguing as in Lemma 1, one proves that (4.1) yields
On the other hand, in [13] it was shown that
Thus, from (24) we see that \(\left \|V _{{\ast}}\right \|\leq M\left \|r\right \|\) and, similarly, \(\left \|\tilde{V }_{{\ast}}\right \|\leq M\left \|r\right \|\). From the first of these inequalities it follows (see (17), (18)) that in (23) \(\Lambda _{t}\) is a function of x t−1 (a state under the policy f) bounded by
From Proposition 2 (a), (17) and (21):
and by Assumption 4.1, if \(x_{t-1} \in \Theta \), then \(x_{t} \in \Theta \); therefore (since r and V ∗ are zero on \(\Theta \)) \(\Lambda _{t}(x_{t-1}) = 0\) when \(x_{t-1} \in \Theta \). Hence \(\Lambda _{t} = \Lambda (x_{t-1})\), where \(\Lambda \) is a function in \(\mathbb{B}\).
In [16] it was proven that, under Assumption 4.1, there exist constants c < ∞ and α < 1 such that for every \(f \in \mathbb{F}\),
On the other hand, in view of the above properties of \(\Lambda \), the right-hand side of (23) can be rewritten as follows. Let N ≥ 1 be an arbitrary fixed integer (to be chosen later). Then,
From (26),
Combining (23), (27) and (28), we obtain the inequality
Let us bound \(\Lambda _{t}\) in the last inequality. From the definition of \(\Lambda _{t}\) in (21), and from (16)–(18) and Proposition 2 (a), we have:
where expectations are interpreted as conditional expectations with x t−1 being fixed.
From (30) we get:
From Proposition 2 (c), there exist an integer m ≥ 1 and a constant β < 1 such that the operator T m is contractive with modulus β. Thus, again using Proposition 2,
or
Now, since T is a nonexpansive operator, by induction we have
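The induction step above can be illustrated numerically. The following is a hedged abstract example, not the paper's operator: an affine map T with \(\left \|A\right \| = 1\) (nonexpansive) but \(\left \|A^{2}\right \| = 1/2\), so m = 2 and β = 1/2, and the iterates satisfy \(\left \|T^{n}u - T^{n}v\right \| \leq \beta ^{\lfloor n/m\rfloor }\left \|u - v\right \|\).

```python
# Hedged numerical illustration (abstract example, not the paper's T):
# if T is nonexpansive and T^m is a beta-contraction, induction gives
#   ||T^n u - T^n v|| <= beta^(n // m) * ||u - v||.
# Here T(u) = A u + c is affine with ||A|| = 1 but ||A^2|| = 1/2 in the
# supremum norm, so m = 2 and beta = 1/2.

A = [[0.0, 1.0], [0.0, 0.5]]
c = [1.0, -1.0]

def T(u):
    return [sum(a * x for a, x in zip(row, u)) + ci
            for row, ci in zip(A, c)]

def sup(u):
    return max(abs(x) for x in u)

u, v = [3.0, -2.0], [-1.0, 4.0]
d0 = sup([a - b for a, b in zip(u, v)])

beta, m = 0.5, 2
for n in range(1, 13):
    u, v = T(u), T(v)
    d = sup([a - b for a, b in zip(u, v)])
    # The affine shift c cancels, so the distance evolves through A alone.
    assert d <= beta ** (n // m) * d0 + 1e-12
print(d)
```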
From (16) and Proposition 2 (a),
Since \(\tilde{V }_{{\ast}}\) is bounded by \(M\left \|r\right \|\), it follows from Assumption 4.2 (a) and (c) that the function under the supremum in (34) is Lipschitz with respect to k = (x, a). As shown in [6], this fact together with Assumption 4.2 (b) proves that the function \(\tilde{V }_{{\ast}}\) in (34) is Lipschitz. Therefore, applying (4.6) in Assumption 4.2 (d) to the function \(s \rightarrow \tilde{ V }_{{\ast}}[F(k,s)]\) in (33), we obtain that this function satisfies the Lipschitz condition with a constant not depending on k.
In the same way (using Assumption 4.2 (c)) one verifies that the function \(s \rightarrow V _{{\ast}}[F(k,s)]\) is Lipschitz.
Finally, combining inequalities (31)–(33), \(\Lambda _{t}\) in (29) is bounded by \(\sup \left \vert E\varphi (\xi ) - E\varphi (\tilde{\xi })\right \vert \) over a certain class of functions \(\varphi\) that are bounded by the same constant \(\bar{b}\) and satisfy the Lipschitz condition with the same constant \(\bar{L}\) (these constants depend only on m, α, and the constants involved in Assumptions 4.1 and 4.2).
Therefore,
where \(Dud(\xi,\tilde{\xi })\) denotes the Dudley distance between the distributions of random vectors ξ and \(\tilde{\xi }\). (See [3] for the definition of the Dudley metric, and the inequality between Dudley and Prokhorov metrics.)
If \(\tilde{b} = 2(\bar{b} +\bar{ L})\), then from (35) and (29)
Finally, the desired inequality (4.7) follows from (36) if we choose
□
© 2015 Springer International Publishing Switzerland
Gordienko, E., Martinez, J., Ruiz de Chávez, J. (2015). Stability Estimation of Transient Markov Decision Processes. In: Mena, R., Pardo, J., Rivero, V., Uribe Bravo, G. (eds) XI Symposium on Probability and Stochastic Processes. Progress in Probability, vol 69. Birkhäuser, Cham. https://doi.org/10.1007/978-3-319-13984-5_8
Print ISBN: 978-3-319-13983-8
Online ISBN: 978-3-319-13984-5