Abstract
We consider transient or absorbing discrete-time Markov decision processes with expected total rewards. We prove inequalities to estimate the stability of optimal control policies with respect to the total variation norm and the Prokhorov metric. Some application examples are given.
References
D.P. Bertsekas, J.N. Tsitsiklis, An analysis of stochastic shortest path problems. Math. Oper. Res. 16(3), 580–595 (1991)
A.A. Borovkov, S.G. Foss, Stochastically recursive sequences and their generalization. Sib. Adv. Math. 2, 16–81 (1992)
R.M. Dudley, Real Analysis and Probability. Volume 74 of Cambridge Studies in Advanced Mathematics (Cambridge University Press, Cambridge, 2002). Revised reprint of the 1989 original
E.B. Dynkin, A.A. Yushkevich, Controlled Markov Processes (Springer, New York, 1979)
E.I. Gordienko, E. Lemus-Rodriguez, R. Montes-de Oca, Discounted cost optimality problem: stability with respect to weak metrics. Math. Methods Oper. Res. 68, 77–96 (2008)
E.I. Gordienko, E. Lemus-Rodriguez, R. Montes-de Oca, Average cost Markov control processes: stability with respect to the Kantorovich metric. Math. Methods Oper. Res. 70, 13–33 (2009)
E.I. Gordienko, A. Novikov, Characterization of optimal policies in a general stopping problem and stability estimating. Probab. Eng. Inf. Sci. 28(3), 335–352 (2014)
E.I. Gordienko, F. Salem, Estimates of stability of Markov control processes with unbounded cost. Kybernetika 36, 195–210 (2000)
O. Hernández-Lerma, J.B. Lasserre, Further Topics on Discrete-Time Markov Control Processes (Springer, New York, 1999)
O. Hernández-Lerma, G. Carrasco, R. Pérez-Hernández, Markov control processes with the expected total cost criterion: optimality, stability, and transient models. Acta Appl. Math. 59(3), 229–269 (1999)
K. Hinderer, K.H. Waldmann, Algorithms for countable state Markov decision models with an absorbing set. SIAM J. Control Optim. 43(6), 2109–2131 (electronic) (2005)
A. Hordijk, Dynamic Programming and Markov Potential Theory. Volume No. 51 of Mathematical Centre Tracts (Mathematisch Centrum, Amsterdam, 1974)
H.W. James, E.J. Collins, An analysis of transient Markov decision processes. J. Appl. Probab. 43(3), 603–621 (2006)
L.C.M. Kallenberg, Linear Programming and Finite Markovian Control Problems. Volume 148 of Mathematical Centre Tracts (Mathematisch Centrum, Amsterdam, 1983)
S.P. Meyn, R.L. Tweedie, Markov Chains and Stochastic Stability. Communications and Control Engineering Series (Springer, London, 1993)
S.R. Pliska, On the transient case for Markov decision chains with general state spaces, in Dynamic Programming and Its Applications (Proc. Conf., University of British Columbia, Vancouver, 1977) (Academic, New York/London, 1978), pp. 335–349
S.M. Ross, Applied Probability Models with Optimization Applications (Dover Publications, New York, 1992). Reprint of the 1970 original
A.N. Shiryayev, Optimal Stopping Rules (Springer, New York/Heidelberg, 1978). Translated from the Russian by A.B. Aries, Applications of Mathematics, vol. 8
A.F. Veinott, Discrete dynamic programming with sensitive discount optimality criteria. Ann. Math. Stat. 40, 1635–1660 (1969)
E. Zaitseva, Stability estimating in optimal stopping problem. Kybernetika (Prague) 44(3), 400–415 (2008)
E. Zaitseva, Robustness estimating of optimal stopping problem with unbounded revenue and cost functions. Int. J. Pure Appl. Math. 59(3), 291–306 (2010)
Appendix: Proofs of the Results
The following lemma establishes a connection between Assumption 2.1 and the definitions of transient policies given in [13, 16].
Lemma 1
Let \(\Theta \) be as in Assumption 2.1, \(X_{0} = X\setminus \Theta \), and for \(f \in \mathbb{F}\) let \(Q_{f}\) be the restriction of the kernel \(P_{f}\) in (2.1) to \((X_{0},\mathfrak{B}_{X_{0}})\). If (2.3) holds, then
In (1), \(\left \|\cdot \right \|_{0}\) denotes the operator norm corresponding to the supremum norm \(\left \|\cdot \right \|\) on \(\mathbb{B}\).
Proof
Since \(Q_{f}^{t}\) is a monotone operator,
For every t ≥ 0
because \(\Theta \) is an absorbing set for \(P_{f}\).
□
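The role of the restricted kernel \(Q_{f}\) can be sketched numerically. The following is a hedged illustration, not from the paper: a small chain (all transition probabilities invented for the example) in which the last state plays the role of the absorbing set \(\Theta \); the restriction \(Q_{f}\) of \(P_{f}\) to \(X_{0}\) is then substochastic, and for a transient policy the operator norms \(\left \|Q_{f}^{n}\right \|_{0}\) decay to zero.

```python
# Hedged numerical sketch (illustrative numbers, not from the paper):
# state 3 is the absorbing set Theta; Q_f is the restriction of P_f to
# X_0 = {0, 1, 2}.  For a transient policy the operator norms ||Q_f^n||
# (induced by the supremum norm, i.e. the maximum row sum) decay to 0.

def mat_mul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def sup_norm(A):
    # Operator norm corresponding to the supremum norm: max absolute row sum.
    return max(sum(abs(v) for v in row) for row in A)

# P_f on X = {0, 1, 2, 3}; Theta = {3} is absorbing (last row = delta_3).
P = [[0.5, 0.3, 0.0, 0.2],
     [0.1, 0.4, 0.3, 0.2],
     [0.0, 0.2, 0.5, 0.3],
     [0.0, 0.0, 0.0, 1.0]]

# Restriction Q_f to X_0 = {0, 1, 2}: substochastic (row sums < 1).
Q = [row[:3] for row in P[:3]]

Qn = [row[:] for row in Q]
norms = []
for n in range(1, 21):
    norms.append(sup_norm(Qn))
    Qn = mat_mul(Qn, Q)

# Submultiplicativity gives ||Q^(n+1)|| <= ||Q|| * ||Q^n||, so the norms
# decrease geometrically, consistent with transience.
assert all(norms[i + 1] <= norms[i] + 1e-12 for i in range(len(norms) - 1))
print(norms[0], norms[-1])
```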
1. Proof of Theorem 3.2
Proof
Let f ∗, \(\tilde{f}_{{\ast}}\) be the optimal stationary policies introduced in Proposition 2.3, and \(F_{{\ast}}:= \left \{f_{{\ast}},\tilde{f}_{{\ast}}\right \}\).
Under Assumptions 2.1 and 3.1, for every \(f \in F_{{\ast}}\) the corresponding rewards \(V_{f}(x) \equiv V (x,f)\), \(\tilde{V }_{f}(x) \equiv \tilde{ V }(x,f)\) are bounded functions and can be rewritten as
From Proposition 2.3 and Assumption 3.1, the following operators \(G_{f}\), \(\tilde{G}_{f}\) (\(f \in F_{{\ast}}\))
act from \(\mathbb{B}\) to \(\mathbb{B}\).
Using (4), (5) and standard arguments (the Markov property [17]), we find that, for \(f \in F_{{\ast}}\),
For the stability index in (1.8) we have [8, 20]:
First, let f = f ∗ (omitting the subscript ∗). Then, by (8), for every n ≥ 1,
From Proposition 2.3 (b), the policy f = f ∗ satisfies (2.3). Therefore, by Lemma 1 and the corresponding result in [13], there exists an integer n ≥ 1 such that the operator \(G_{f}^{n}\) is contractive on \(\mathbb{B}\) with some modulus β < 1. Thus, from (10),
Taking into account (6), (7) and applying the arguments used, for example, in [20], we obtain
The last term on the right-hand side of (12) is less than \(\frac{1} {2}\mathbb{V}(\xi,\tilde{\xi })\).
On the other hand, since r ≡ 0 on \(\Theta \), from Assumption 3.1 we have:
where \(b = \sup _{k \in \mathbb{K}}\left \vert r(k)\right \vert \) and M is the constant from Assumption 3.1.
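The total variation bound behind estimates of this kind can be sketched numerically. The following is a hedged illustration with invented distributions (not the paper's data): for discrete disturbance distributions and any bounded test function, the gap between expectations is controlled by the supremum norm of the test function times the total variation norm of the difference of distributions. Note that conventions for the total variation norm differ by a factor 1/2; the sketch uses the plain sum of absolute differences.

```python
# Hedged sketch (illustrative numbers, not from the paper): for discrete
# distributions p, q on a common support and any bounded phi,
#   |E phi(xi) - E phi(xi~)| <= ||phi|| * sum_s |p(s) - q(s)|,
# the right-hand side being the total variation norm of the difference.

support = [0, 1, 2, 3]
p = [0.40, 0.30, 0.20, 0.10]   # distribution of xi   (assumed example)
q = [0.35, 0.25, 0.30, 0.10]   # distribution of xi~  (perturbed model)

tv = sum(abs(a - b) for a, b in zip(p, q))   # total variation norm

def expect(phi, dist):
    return sum(phi(s) * w for s, w in zip(support, dist))

phi = lambda s: (-1) ** s * 0.7              # an arbitrary bounded test function
phi_norm = max(abs(phi(s)) for s in support)

gap = abs(expect(phi, p) - expect(phi, q))
assert gap <= phi_norm * tv + 1e-12
print(gap, phi_norm * tv)
```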
Second, in (9) let \(f =\tilde{ f}_{{\ast}}\). Now inequality (10) holds with \(f =\tilde{ f}_{{\ast}}\). Let m ≥ 1 and γ < 1 be the constants from Assumption 3.1. From (3) and (3.2) in Assumption 3.1, \(\left \|Q_{f}^{m}\right \| \leq \gamma\).
Since the set \(\Theta \) is absorbing under \(P_{f} \equiv P_{\tilde{f}_{{\ast}}}\) (see Assumption 3.1), iterating (6) gives, for each \(u,v \in \mathbb{B}\), \(\left \|Q_{f}^{m}u - Q_{f}^{m}v\right \| \leq \gamma \left \|u-v\right \|.\) Thus, from (10) it follows that
Proceeding as in (12) and (13) (with \(f \equiv \tilde{ f}_{{\ast}}\) rather than f = f ∗) and applying Assumption 3.1 (b), we get that, for some constant \(\tilde{K}\):
To conclude the proof of (3.3), it suffices to combine inequalities (9) and (11)–(14). □
2. Proof of Theorem 4.3
Proof
Let \(f_{{\ast}},\tilde{f}_{{\ast}}\in \mathbb{F}\) be the stationary policies optimal for MDPs (1.1), (1.2), respectively, and let \(V _{{\ast}} = V _{f_{{\ast}}}\), \(\tilde{V }_{{\ast}} = V _{\tilde{f}_{{\ast}}}\) be the corresponding value functions. The existence of f ∗ and \(\tilde{f}_{{\ast}}\) was ensured in Proposition 2.3. From Assumption 4.1 (a), for every \(f \in \mathbb{F}\) the corresponding rewards V f and \(\tilde{V }_{f}\) (see (2.6)–(2.9)) are zero on \(\Theta \). In particular, \(V _{{\ast}}(x) =\tilde{ V }_{{\ast}}(x) = 0\) for \(x \in \Theta \). Hence, we can consider all functions V f and \(\tilde{V }_{f}\), \(f \in \mathbb{F}\), as elements of the space \(\mathbb{B}\) (they are bounded by Assumption 4.1).
In the usual manner, we introduce the dynamic programming operators \(T,\tilde{T}: \mathbb{B} \rightarrow \mathbb{B}\):
From Assumption 4.2 (a), (b), it follows that for each \(u \in \mathbb{B}\) there exists a stationary policy (selector) \(f_{u}\) such that
Thus, for \(x \in \Theta \), \(Tu(x) = 0\) by Assumption 4.1 (a), and \(T\mathbb{B} \subseteq \mathbb{B}\). (Similarly, \(\tilde{T}\mathbb{B} \subseteq \mathbb{B}\).)
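The action of the dynamic programming operator T on an absorbing model can be sketched as follows. This is a hedged illustration with an invented two-action MDP (all rewards and transition probabilities are assumptions for the example, not the paper's model): T keeps the value at zero on \(\Theta \), maps bounded functions to bounded functions, and iterating T converges to the fixed point \(V _{{\ast}} = TV _{{\ast}}\).

```python
# Hedged sketch of the dynamic programming operator T for a tiny
# absorbing MDP (illustrative numbers, not from the paper): states
# X = {0, 1, 2}, state 2 plays the role of Theta (absorbing, r = 0),
# two actions.  T u(x) = max_a [ r(x,a) + sum_y p(y|x,a) u(y) ], with
# T u = 0 on Theta, so T maps B into B; iterating T yields V* = T V*.

# p[x][a] = transition row, r[x][a] = one-step reward; state 2 = Theta.
p = {0: {0: [0.2, 0.5, 0.3], 1: [0.6, 0.1, 0.3]},
     1: {0: [0.3, 0.3, 0.4], 1: [0.0, 0.5, 0.5]}}
r = {0: {0: 1.0, 1: 2.0},
     1: {0: 0.5, 1: 1.5}}

def T(u):
    Tu = [0.0, 0.0, 0.0]          # value stays 0 on Theta (state 2)
    for x in (0, 1):
        Tu[x] = max(r[x][a] + sum(pp * uu for pp, uu in zip(p[x][a], u))
                    for a in (0, 1))
    return Tu

u = [0.0, 0.0, 0.0]
for _ in range(200):              # value iteration
    u = T(u)

# Each step reaches Theta with probability >= 0.3, so the restricted
# kernel is substochastic and T is a sup-norm contraction here.
v = T(u)
assert max(abs(a - b) for a, b in zip(u, v)) < 1e-10   # fixed point V* = T V*
print(u)
```

The design mirrors the structure of (15): the maximization over actions is taken only on \(X_{0}\), while the convention \(Tu \equiv 0\) on \(\Theta \) enforces \(T\mathbb{B} \subseteq \mathbb{B}\).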
As proven in [13, 16], Assumptions 4.1 and 4.2 suffice for the validity of the following assertions.
Proposition 2
- (a) \(V _{{\ast}} = TV _{{\ast}}\), \(\tilde{V }_{{\ast}} =\tilde{ T}\tilde{V }_{{\ast}}\).
- (b) The optimal policy f ∗ is a selector on the right-hand side of (15) with u = V ∗, and the optimal policy \(\tilde{f}_{{\ast}}\) is a selector on the right-hand side of (16) with \(u =\tilde{ V }_{{\ast}}\).
- (c) There exists an integer m ≥ 1 such that the operator T m is contractive on \(\mathbb{B}_{0}\) with some modulus β < 1.
For any \((x,a) \in \mathbb{K}\) let
To simplify notation, let \(f =\tilde{ f}_{{\ast}}\). As in [5], let \(\Gamma _{t} =\{ x,a_{1},x_{1},a_{2},\ldots,x_{t-1},a_{t}\}\) (t ≥ 1) be the part of a trajectory of process (1.1) under the control policy \(f = \left \{f,f,\ldots \right \}\) (with the initial state \(x \in X_{0}\)). By the Markov property, we have
By (15), (17) and Proposition 2 (a) we obtain:
where
Summing the last equality over t ∈ [1, n], we obtain that
Since \(r,V _{{\ast}}\in \mathbb{B}\), under Assumption 4.1 (b), as n → ∞, \(E_{x}^{f}\sum _{t=1}^{n}r(x_{t-1},a_{t}) \rightarrow E_{x}^{f}\sum _{t=1}^{\infty }r(x_{t-1},a_{t}) = V (x,f)\) (see (2.6)), and \(E_{x}^{f}V _{{\ast}}(x_{n}) = [Q_{f}^{n}V _{{\ast}}](x) \rightarrow 0\), where Q f is the kernel defined in Lemma 1.
Thus we can pass to the limit in (22) to find
Arguing as in Lemma 1, one proves that (4.1) yields
On the other hand, in [13] it was shown that
Thus, from (24) we see that \(\left \|V _{{\ast}}\right \|\leq M\left \|r\right \|\) and, similarly, \(\left \|\tilde{V }_{{\ast}}\right \|\leq M\left \|r\right \|\). From the first of these inequalities it follows (see (17), (18)) that in (23) \(\Lambda _{t}\) is a function of x t−1 (a state under the policy f) bounded by
From Proposition 2 (a), (17) and (21):
and by Assumption 4.1, if \(x_{t-1} \in \Theta \), then \(x_{t} \in \Theta \); therefore (since r and V ∗ are zero on \(\Theta \)) \(\Lambda _{t}(x_{t-1}) = 0\) when \(x_{t-1} \in \Theta \). Hence \(\Lambda _{t} = \Lambda (x_{t-1})\), where \(\Lambda \) is a function in \(\mathbb{B}\).
In [16] it was proven that, under Assumption 4.1, there exist constants c < ∞ and α < 1 such that for every \(f \in \mathbb{F}\),
On the other hand, in view of the above properties of \(\Lambda \), the right-hand side of (23) can be rewritten as follows. Let N ≥ 1 be an arbitrary fixed integer (to be chosen later). Then,
From (26),
Combining (23), (27) and (28), we obtain the inequality
Let us bound \(\Lambda _{t}\) in the last inequality. From the definition of \(\Lambda _{t}\) in (21), and from (16)–(18) and Proposition 2 (a), we have:
where expectations are interpreted as conditional expectations with x t−1 being fixed.
From (30) we get:
From Proposition 2 (c), there exist an integer m ≥ 1 and a constant β < 1 such that the operator T m is contractive with modulus β. Thus, again using Proposition 2,
or
Now, since T is a nonexpansive operator, by induction we have
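The induction step above can be illustrated numerically. The following is a hedged abstract example, not the paper's operator: an affine map T with \(\left \|A\right \| = 1\) (nonexpansive) but \(\left \|A^{2}\right \| = 1/2\), so m = 2 and β = 1/2, and the iterates satisfy \(\left \|T^{n}u - T^{n}v\right \| \leq \beta ^{\lfloor n/m\rfloor }\left \|u - v\right \|\).

```python
# Hedged numerical illustration (abstract example, not the paper's T):
# if T is nonexpansive and T^m is a beta-contraction, induction gives
#   ||T^n u - T^n v|| <= beta^(n // m) * ||u - v||.
# Here T(u) = A u + c is affine with ||A|| = 1 but ||A^2|| = 1/2 in the
# supremum norm, so m = 2 and beta = 1/2.

A = [[0.0, 1.0], [0.0, 0.5]]
c = [1.0, -1.0]

def T(u):
    return [sum(a * x for a, x in zip(row, u)) + ci
            for row, ci in zip(A, c)]

def sup(u):
    return max(abs(x) for x in u)

u, v = [3.0, -2.0], [-1.0, 4.0]
d0 = sup([a - b for a, b in zip(u, v)])

beta, m = 0.5, 2
for n in range(1, 13):
    u, v = T(u), T(v)
    d = sup([a - b for a, b in zip(u, v)])
    # The affine shift c cancels, so the distance evolves through A alone.
    assert d <= beta ** (n // m) * d0 + 1e-12
print(d)
```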
From (16) and Proposition 2 (a),
Since \(\tilde{V }_{{\ast}}\) is bounded by \(M\left \|r\right \|\), it follows from Assumption 4.2 (a) and (c) that the function under the supremum in (34) is Lipschitz with respect to k = (x, a). As shown in [6], this fact together with Assumption 4.2 (b) proves that the function \(\tilde{V }_{{\ast}}\) in (34) is Lipschitz. Therefore, applying (4.6) in Assumption 4.2 (d) to the function \(s \rightarrow \tilde{ V }_{{\ast}}[F(k,s)]\) in (33), we obtain that this function satisfies the Lipschitz condition with a constant not depending on k.
In the same way (using Assumption 4.2 (c)) one verifies that the function \(s \rightarrow V _{{\ast}}[F(k,s)]\) is Lipschitz.
Finally, combining inequalities (31)–(33), \(\Lambda _{t}\) in (29) is bounded by \(\sup \left \vert E\varphi (\xi ) - E\varphi (\tilde{\xi })\right \vert \) over a certain class of functions \(\varphi\) that are bounded by the same constant \(\bar{b}\) and satisfy the Lipschitz condition with the same constant \(\bar{L}\) (these constants depend only on m, α, and the constants involved in Assumptions 4.1 and 4.2).
Therefore,
where \(Dud(\xi,\tilde{\xi })\) denotes the Dudley distance between the distributions of random vectors ξ and \(\tilde{\xi }\). (See [3] for the definition of the Dudley metric, and the inequality between Dudley and Prokhorov metrics.)
If \(\tilde{b} = 2(\bar{b} +\bar{ L})\), then from (35) and (29)
Finally, the desired inequality (4.7) follows from (36) if we choose
□
© 2015 Springer International Publishing Switzerland
Gordienko, E., Martinez, J., Ruiz de Chávez, J. (2015). Stability Estimation of Transient Markov Decision Processes. In: Mena, R., Pardo, J., Rivero, V., Uribe Bravo, G. (eds) XI Symposium on Probability and Stochastic Processes. Progress in Probability, vol 69. Birkhäuser, Cham. https://doi.org/10.1007/978-3-319-13984-5_8
Print ISBN: 978-3-319-13983-8
Online ISBN: 978-3-319-13984-5