
Stability Estimation of Transient Markov Decision Processes

Conference paper
XI Symposium on Probability and Stochastic Processes

Part of the book series: Progress in Probability, vol. 69


Abstract

We consider transient or absorbing discrete-time Markov decision processes with expected total rewards. We prove inequalities to estimate the stability of optimal control policies with respect to the total variation norm and the Prokhorov metric. Some application examples are given.


References

  1. D.P. Bertsekas, J.N. Tsitsiklis, An analysis of stochastic shortest path problems. Math. Oper. Res. 16(3), 580–595 (1991)


  2. A.A. Borovkov, S.G. Foss, Stochastically recursive sequences and their generalization. Sib. Adv. Math. 2, 16–81 (1992)


  3. R.M. Dudley, Real Analysis and Probability. Volume 74 of Cambridge Studies in Advanced Mathematics (Cambridge University Press, Cambridge, 2002). Revised reprint of the 1989 original


  4. E.B. Dynkin, A.A. Yushkevich, Controlled Markov Processes (Springer, New York, 1979)


  5. E.I. Gordienko, E. Lemus-Rodriguez, R. Montes-de Oca, Discounted cost optimality problem: stability with respect to weak metrics. Math. Methods Oper. Res. 68, 77–96 (2008)


  6. E.I. Gordienko, E. Lemus-Rodriguez, R. Montes-de Oca, Average cost Markov control processes: stability with respect to the Kantorovich metric. Math. Methods Oper. Res. 70, 13–33 (2009)


  7. E.I. Gordienko, A. Novikov, Characterization of optimal policies in a general stopping problem and stability estimating. Probab. Eng. Inf. Sci. 28(3), 335–352 (2014)


  8. E.I. Gordienko, F. Salem, Estimates of stability of Markov control processes with unbounded cost. Kybernetika 36, 195–210 (2000)


  9. O. Hernández-Lerma, J.B. Lasserre, Further Topics on Discrete-Time Markov Control Processes (Springer, New York, 1999)


  10. O. Hernández-Lerma, G. Carrasco, R. Pérez-Hernández, Markov control processes with the expected total cost criterion: optimality, stability, and transient models. Acta Appl. Math. 59(3), 229–269 (1999)


  11. K. Hinderer, K.H. Waldmann, Algorithms for countable state Markov decision models with an absorbing set. SIAM J. Control Optim. 43(6), 2109–2131 (electronic) (2005)


  12. A. Hordijk, Dynamic Programming and Markov Potential Theory. Volume 51 of Mathematical Centre Tracts (Mathematisch Centrum, Amsterdam, 1974)


  13. H.W. James, E.J. Collins, An analysis of transient Markov decision processes. J. Appl. Probab. 43(3), 603–621 (2006)


  14. L.C.M. Kallenberg, Linear Programming and Finite Markovian Control Problems. Volume 148 of Mathematical Centre Tracts (Mathematisch Centrum, Amsterdam, 1983)


  15. S.P. Meyn, R.L. Tweedie, Markov Chains and Stochastic Stability. Communications and Control Engineering Series (Springer, London, 1993)


  16. S.R. Pliska, On the transient case for Markov decision chains with general state spaces, in Dynamic Programming and Its Applications (Proc. Conf., University of British Columbia, Vancouver, 1977) (Academic, New York/London, 1978), pp. 335–349


  17. S.M. Ross, Applied Probability Models with Optimization Applications (Dover Publications, New York, 1992). Reprint of the 1970 original


  18. A.N. Shiryayev, Optimal Stopping Rules (Springer, New York/Heidelberg, 1978). Translated from the Russian by A.B. Aries, Applications of Mathematics, vol. 8


  19. A.F. Veinott, Discrete dynamic programming with sensitive discount optimality criteria. Ann. Math. Stat. 40, 1635–1660 (1969)


  20. E. Zaitseva, Stability estimating in optimal stopping problem. Kybernetika (Prague) 44(3), 400–415 (2008)


  21. E. Zaitseva, Robustness estimating of optimal stopping problem with unbounded revenue and cost functions. Int. J. Pure Appl. Math. 59(3), 291–306 (2010)



Author information

Corresponding author: Juan Ruiz de Chávez.

Appendix: Proofs of the Results

The following lemma establishes a connection between Assumption 2.1 and the definitions of transient policies given in [13, 16].

Lemma 1

Let \(\Theta \) be as in Assumption 2.1, \(X_{0} = X\setminus \Theta \), and for \(f \in \mathbb{F}\) let \(Q_{f}\) be the restriction of the kernel \(P_{f}\) in ( 2.1 ) to \((X_{0},\mathfrak{B}_{X_{0}})\). If ( 2.3 ) holds, then

$$\displaystyle{ \left \|\sum _{t=0}^{\infty }Q_{ f}^{t}\right \|_{ 0} \leq M_{f}. }$$
(1)

In (1) \(\left \|\cdot \right \|_{0}\) is the operator norm corresponding to the supremum norm \(\left \|\cdot \right \|\) in \(\mathbb{B}\).

Proof

Since \(Q_{f}^{t}\) is a monotone operator,

$$\displaystyle{ \left \|\sum _{t=0}^{\infty }Q_{ f}^{t}\right \|_{ 0} = \left \|\sum _{t=0}^{\infty }Q_{ f}^{t}\text{I}\right \| =\sup _{ x\in X_{0}}\left \vert \sum _{t=0}^{\infty }Q_{ f}^{t}\text{I}(x)\right \vert. }$$
(2)

For every t ≥ 0

$$\displaystyle{ Q_{f}^{t}\text{I}(x) = P_{ x}^{f}(x_{ t} \in X_{0}) = P_{x}^{f}(\tau _{ x,f}(\Theta ) > t) }$$
(3)

because \(\Theta \) is an absorbing set for \(P_{f}\).

Thus, from (2), (3) we find

$$\displaystyle{ \left \|\sum _{t=0}^{\infty }Q_{ f}^{t}\right \|_{ 0} =\sup _{x\in X_{0}}\sum _{t=0}^{\infty }P_{ x}^{f}(\tau _{ x,f}(\Theta ) > t) \leq M_{f}. }$$

 □ 
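The identity used in this proof — that \(\left \|\sum _{t}Q_{f}^{t}\right \|_{0}\) is the worst-case expected time to absorption — can be checked numerically. A minimal sketch, assuming a hypothetical three-state substochastic kernel Q (not taken from the paper):

```python
import numpy as np

# Hypothetical substochastic kernel Q on X0 = {0, 1, 2}: each row sums to
# less than 1, the deficit being the probability of jumping into the
# absorbing set Theta in one step.
Q = np.array([
    [0.5, 0.2, 0.1],
    [0.1, 0.4, 0.2],
    [0.2, 0.1, 0.4],
])

# When the spectral radius of Q is < 1, sum_{t>=0} Q^t = (I - Q)^{-1}.
N = np.linalg.inv(np.eye(3) - Q)

# The operator norm || . ||_0 induced by the sup norm is the largest row
# sum.  Row x of N sums to sum_t P_x(tau > t) = E_x[tau], the expected
# time to absorption, so the norm equals sup_x E_x[tau], matching (1).
op_norm = N.sum(axis=1).max()
print(op_norm)
```

Truncating the series \(\sum _{t=0}^{T}Q^{t}\) at a large T reproduces \((I - Q)^{-1}\) up to a geometrically small error.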

1.1 Proof of Theorem 3.2

Proof

Let \(f_{{\ast}}\), \(\tilde{f}_{{\ast}}\) be the optimal stationary policies introduced in Proposition 2.3, and \(\mathbb{F}_{{\ast}}:= \left \{f_{{\ast}},\tilde{f}_{{\ast}}\right \}\).

Under Assumptions 2.1 and 3.1, for every \(f \in \mathbb{F}_{{\ast}}\) the corresponding rewards \(V_{f}(x) \equiv V(x,f)\), \(\tilde{V }_{f}(x) \equiv \tilde{ V }(x,f)\) are bounded functions and can be rewritten as

$$\displaystyle{ V _{f}(x) = E_{x}^{f}\left [\sum _{ t=1}^{\infty }r(x_{ t-1},f(x_{t-1}))\right ], }$$
(4)
$$\displaystyle{ \tilde{V }_{f}(x) =\tilde{ E}_{x}^{f}\left [\sum _{ t=1}^{\infty }r(\tilde{x}_{ t-1},f(\tilde{x}_{t-1}))\right ]. }$$
(5)

From Proposition 2.3 and Assumption 3.1, the following operators \(G_{f}\), \(\tilde{G}_{f}\) (\(f \in \mathbb{F}\))

$$\displaystyle{ G_{f}u(x):= r(x,f(x)) + Eu[F(x,f(x),\xi )], }$$
(6)
$$\displaystyle{ \tilde{G}_{f}u(x):= r(x,f(x)) + Eu(F(x,f(x),\tilde{\xi })) }$$
(7)

act from \(\mathbb{B}\) to \(\mathbb{B}\).

Using (4), (5) and standard arguments (the Markov property [17]), we find that, for \(f \in \mathbb{F}_{{\ast}}\),

$$\displaystyle{ V _{f} = G_{f}V _{f}\quad \text{and}\quad \tilde{V }_{f} =\tilde{ G}_{f}\tilde{V }_{f}. }$$
(8)

For the stability index in (1.8) we have [8, 20]:

$$\displaystyle{ \Delta (x) \leq 2\max _{f\in \mathbb{F}_{{\ast}}}\left \vert V (x,f) -\tilde{ V }(x,f)\right \vert. }$$
(9)

First, let \(f = f_{{\ast}}\) (omitting the subindex ∗). Then, by (8), for every n ≥ 1,

$$\displaystyle\begin{array}{rcl} \left \vert V (x,f) -\tilde{ V }(x,f)\right \vert & \leq & \left \|V _{f} -\tilde{ V }_{f}\right \| = \left \|G_{f}^{n}V _{ f} -\tilde{ G}_{f}^{n}\tilde{V }_{ f}\right \| \\ & \leq & \left \|G_{f}^{n}V _{ f} - G_{f}^{n}\tilde{V }_{ f}\right \| + \left \|G_{f}^{n}\tilde{V }_{ f} -\tilde{ G}_{f}^{n}\tilde{V }_{ f}\right \|.{}\end{array}$$
(10)

From Proposition 2.3 (b), the policy \(f = f_{{\ast}}\) satisfies (2.3). Therefore, by Lemma 1 and the corresponding result in [13], there exists an integer n ≥ 1 such that the operator \(G_{f}^{n}\) is contractive in \(\mathbb{B}\) with some modulus β < 1. Thus, from (10),

$$\displaystyle{ \left \|V _{f} -\tilde{ V }_{f}\right \| \leq \frac{1} {\left (1-\beta \right )}\left \|G_{f}^{n}\tilde{V }_{ f} -\tilde{ G}_{f}^{n}\tilde{V }_{ f}\right \|. }$$
(11)

Taking into account (6), (7) and applying the arguments used, for example, in [20], we obtain

$$\displaystyle\begin{array}{rcl} & & \left \|G_{f}^{n}\tilde{V }_{ f} -\tilde{ G}_{f}^{n}\tilde{V }_{ f}\right \| \leq \\ & & n\left \|\tilde{V }_{f}\right \|\sup _{x\in X_{0},a\in A\left (x\right )}\sup _{B\in B_{X_{ 0}}}\left \vert P(F(x,a,\xi ) \in B) - P(F(x,a,\tilde{\xi }) \in B)\right \vert.{}\end{array}$$
(12)

The last term on the right-hand side of (12) is less than \(\frac{1} {2}\mathbb{V}(\xi,\tilde{\xi })\).
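The factor \(\frac{1} {2}\) is the standard identity relating the supremum over Borel sets to the total variation norm; for discrete laws it is easy to verify directly. A small sketch with hypothetical distributions p, q (taking a measurable image F(x,a,·) can only shrink the left-hand side):

```python
import numpy as np

# Hypothetical discrete laws of xi and xi~ on {0, ..., 4}.
p = np.array([0.4, 0.3, 0.2, 0.1, 0.0])
q = np.array([0.35, 0.25, 0.2, 0.1, 0.1])

# Total variation norm V(xi, xi~) = sum_i |p_i - q_i|.
V = np.abs(p - q).sum()

# sup_B |P(xi in B) - P(xi~ in B)| is attained at B* = {i : p_i > q_i}
# and equals (1/2) V.
sup_B = np.maximum(p - q, 0.0).sum()
print(V, sup_B)
```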

On the other hand, since r ≡ 0 on \(\Theta \), from Assumption 3.1 we have:

$$\displaystyle{ \left \|\tilde{V }_{f}\right \| \equiv \left \|\tilde{V }_{f_{{\ast}}}\right \| =\sup _{x\in X_{0}}\left \vert \tilde{E}_{x}^{f}\sum _{ t=1}^{\infty }r(\tilde{x}_{ t-1},f_{{\ast}}(\tilde{x}_{t-1}))\text{I}_{\left \{\tilde{\tau }_{x,f_{{\ast}}}(\Theta )>t-1\right \}}\right \vert \leq bM, }$$
(13)

where \(b =\mathop{\sup }\limits_{ k \in \mathbb{K}}\left \vert r(k)\right \vert \), and M is the constant from Assumption 3.1.

Second, in (9) let \(f =\tilde{ f}_{{\ast}}\). Now we have the inequality (10) with \(f =\tilde{ f}_{{\ast}}\). Let m ≥ 1 and γ < 1 be the constants from Assumption 3.1. From (3) and (3.2) in Assumption 3.1, \(\left \|Q_{f}^{m}\right \| \leq \gamma\).

Since the set \(\Theta \) is absorbing under \(P_{f} \equiv P_{\tilde{f}_{{\ast}}}\) (see Assumption 3.1), iterating (6) gives, for each \(u,\upsilon \in \mathbb{B}\), \(\left \|G_{f}^{m}u - G_{f}^{m}\upsilon \right \| \leq \gamma \left \|u-\upsilon \right \|\), i.e. \(G_{f}^{m}\) is contractive. Thus, from (10) it follows that

$$\displaystyle{ \left \|V _{f} -\tilde{ V }_{f}\right \| \leq \frac{1} {(1-\gamma )}\left \|G_{f}^{m}\tilde{V }_{ f} -\tilde{ G}_{f}^{m}\tilde{V }_{ f}\right \|. }$$

Proceeding as in (12) and (13) (with \(f \equiv \tilde{ f}_{{\ast}}\) rather than \(f = f_{{\ast}}\)), and applying Assumption 3.1 (b), we get that for a given constant \(\tilde{K}\):

$$\displaystyle{ \left \|V _{\tilde{f}_{{\ast}}}-\tilde{ V }_{\tilde{f}_{{\ast}}}\right \| \leq \tilde{ K}\mathbb{V}(\xi,\tilde{\xi }). }$$
(14)

To conclude the proof of (3.3) it suffices to gather the inequalities (9) and (11)–(14). □ 
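The structure of the proof — value functions as fixed points of the operators in (6), (7), and a perturbation measured in total variation — can be illustrated numerically for a fixed policy. A toy sketch (a hypothetical two-state system whose noise directly names the next state; the constant 20 in the comments below is illustrative, not the constant of Theorem 3.2):

```python
import numpy as np

# Toy system: X0 = {0, 1}, state 2 plays the role of the absorbing set
# Theta (reward 0 there).  The noise xi takes values in {0, 1, 2} and
# F(x, s) = s, so both states share the same transition row -- a
# hypothetical example, not one from the paper.
r = np.array([1.0, 2.0])                # one-stage rewards on X0

def value(noise):
    """V = (I - Q)^{-1} r, the fixed point of G_f as in (8), where
    Q[x, y] = P(F(x, xi) = y) restricted to X0."""
    Q = np.tile(noise[:2], (2, 1))
    return np.linalg.solve(np.eye(2) - Q, r)

p = np.array([0.30, 0.30, 0.40])        # law of xi   (0.40 = absorption mass)
q = np.array([0.32, 0.27, 0.41])        # law of xi~  (a small perturbation)

V, V_tilde = value(p), value(q)
tv = np.abs(p - q).sum()                # total variation norm V(xi, xi~)
gap = np.abs(V - V_tilde).max()         # sup-norm gap of the value functions
print(gap, tv)
```

For this example the gap is a small multiple of the total variation distance, in line with the flavor of the bound (3.3); shrinking the perturbation shrinks the gap proportionally.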

1.2 Proof of Theorem 4.3

Proof

Let \(f_{{\ast}},\tilde{f}_{{\ast}}\in \mathbb{F}\) be the stationary policies optimal for MDPs (1.1), (1.2), respectively, and let \(V _{{\ast}} = V _{f_{{\ast}}}\), \(\tilde{V }_{{\ast}} = V _{\tilde{f}_{{\ast}}}\) be the corresponding value functions. The existence of \(f_{{\ast}}\) and \(\tilde{f}_{{\ast}}\) was ensured in Proposition 2.3. From Assumption 4.1 (a), for every \(f \in \mathbb{F}\) the corresponding rewards \(V_{f}\) and \(\tilde{V }_{f}\) (see (2.6)–(2.9)) are zero on \(\Theta \). In particular, \(V _{{\ast}}(x) =\tilde{ V }_{{\ast}}(x) = 0\) for \(x \in \Theta \). Hence we can consider all the functions \(V_{f}\) and \(\tilde{V }_{f}\), \(f \in \mathbb{F}\), as elements of the space \(\mathbb{B}\) (taking into account their boundedness, which follows from Assumption 4.1).

In the usual manner, we introduce the dynamic programming operators \(T,\tilde{T}: \mathbb{B} \rightarrow \mathbb{B}\):

$$\displaystyle{ Tu(x):=\sup _{a\in A(x)}\left \{r(x,a) + Eu(F(x,a,\xi ))\right \},\:x \in X, }$$
(15)
$$\displaystyle{ \tilde{T}u(x):=\sup _{a\in A(x)}\left \{r(x,a) + Eu(F(x,a,\tilde{\xi }))\right \},\:x \in X. }$$
(16)

From Assumption 4.2 (a), (b), it follows that for each \(u \in \mathbb{B}\) there exists a stationary policy (selector) \(f_{u}\) such that

$$\displaystyle\begin{array}{rcl} \sup _{a\in A(x)}\left \{r(x,a) + Eu[F(x,a,\xi )]\right \}& =& r(x,f_{u}(x)) + Eu[F(x,f_{u}(x),\xi )] {}\\ & =& r(x,f_{u}(x)) + E_{x}^{f_{u} }u(x_{1}),\:\,x \in X. {}\\ \end{array}$$

Thus, for \(x \in \Theta \), by Assumption 4.1 (a), Tu(x) = 0, and hence \(T\mathbb{B} \subseteq \mathbb{B}\). (Similarly, \(\tilde{T}\mathbb{B} \subseteq \mathbb{B}\).)

As proven in [13, 16], Assumptions 4.1 and 4.2 suffice for the validity of the following assertions.

Proposition 2

  1. (a)

    \(V _{{\ast}} = TV _{{\ast}}\) ,  \(\tilde{V }_{{\ast}} =\tilde{ T}\tilde{V }_{{\ast}}\) .

  2. (b)

    The optimal policy \(f_{{\ast}}\) is a selector in the right-hand side of ( 15 ) with \(u = V_{{\ast}}\); the optimal policy \(\tilde{f}_{{\ast}}\) is a selector in the right-hand side of ( 16 ) with \(u =\tilde{ V }_{{\ast}}\).

  3. (c)

    There exists an integer m ≥ 1 such that the operator \(T^{m}\) is contractive in \(\mathbb{B}_{0}\) with some modulus β < 1.

For any \((x,a) \in \mathbb{K}\) let

$$\displaystyle{ H(x,a):= r(x,a) + EV _{{\ast}}[F(x,a,\xi )], }$$
(17)
$$\displaystyle{ \tilde{H}(x,a):= r(x,a) + E\tilde{V }_{{\ast}}[F(x,a,\tilde{\xi })]. }$$
(18)

To simplify notation, let \(f =\tilde{ f}_{{\ast}}\). Similarly to [5], let \(\Gamma _{t} =\{ x,a_{1},x_{1},a_{2},\ldots,x_{t-1},a_{t}\}\) (t ≥ 1) be the part of a trajectory of process (1.1) under the control policy \(f = \left \{f,f,\ldots \right \}\) (with the initial state \(x \in X_{0}\)). By the Markov property, we have

$$\displaystyle\begin{array}{rcl} \zeta _{t}&:=& E^{f}[V _{ {\ast}}(x_{t})\vert \Gamma _{t}] \\ & =& H(x_{t-1},a_{t}) - r(x_{t-1},a_{t}) -\sup _{a\in A(x_{t-1})}H(x_{t-1},a) \\ & & +\sup _{a\in A(x_{t-1})}H(x_{t-1},a). {}\end{array}$$
(19)

By (15), (17) and Proposition 2 (a) we obtain:

$$\displaystyle\begin{array}{rcl} \zeta _{t}& =& H(x_{t-1},a_{t}) -\sup _{a\in A(x_{t-1})}H(x_{t-1},a) - r(x_{t-1},a_{t}) + V _{{\ast}}(x_{t-1}) \\ & =& \Lambda _{t} - r(x_{t-1},a_{t}) + V _{{\ast}}(x_{t-1}), {}\end{array}$$
(20)

where

$$\displaystyle{ \Lambda _{t}:=\mathop{\sup }\limits_{ a \in A(x_{t-1})}H(x_{t-1},a) - H(x_{t-1},a_{t}) \geq 0. }$$
(21)

From (19) and (20) we get:

$$\displaystyle{ E_{x}^{f}V _{ {\ast}}(x_{t}) = E_{x}^{f}\Lambda _{ t} - E_{x}^{f}r(x_{ t-1},a_{t}) + E_{x}^{f}V _{ {\ast}}(x_{t-1}). }$$

Summing the last equality over t ∈ [1, n], we obtain that

$$\displaystyle{ E_{x}^{f}\sum _{ t=1}^{n}r(x_{ t-1},a_{t}) = V _{{\ast}}(x) - E_{x}^{f}V _{ {\ast}}(x_{n}) -\sum _{t=1}^{n}E_{ x}^{f}\Lambda _{ t}. }$$
(22)

Since \(r,V _{{\ast}}\in \mathbb{B}\), under Assumption 4.1 (b), as n → ∞, \(E_{x}^{f}\sum _{t=1}^{n}r(x_{t-1},a_{t}) \rightarrow E_{x}^{f}\sum _{t=1}^{\infty }r(x_{t-1},a_{t}) = V (x,f)\) (see (2.6)), and \(E_{x}^{f}V _{{\ast}}(x_{n}) = [Q_{f}^{n}V _{{\ast}}](x) \rightarrow 0\), where Q f is the kernel defined in Lemma 1.

Thus we can pass to the limit in (22) to find

$$\displaystyle{ \Delta (x) = V _{{\ast}}(x) - V (x,f) \leq \limsup _{n\rightarrow \infty }\sum _{t=1}^{n}E_{ x}^{f}\Lambda _{ t}. }$$
(23)

Similarly to Lemma 1, one can prove that (4.1) yields

$$\displaystyle{ \left \|\sum _{t=0}^{\infty }Q_{ f}^{t}\right \|_{ 0} \leq M,\quad \text{for every }f \in \mathbb{F}. }$$
(24)

On the other hand, in [13] it was shown that

$$\displaystyle{ V _{{\ast}}(x) = V (x,f_{{\ast}}) = \left [\sum _{t=0}^{\infty }Q_{ f_{{\ast}}}^{t}r\right ](x). }$$

Thus, from (24) we see that \(\left \|V _{{\ast}}\right \|\leq M\left \|r\right \|\) and, similarly, \(\left \|\tilde{V }_{{\ast}}\right \|\leq M\left \|r\right \|\). From the first of these inequalities it follows (see (17), (18)) that in (23) \(\Lambda _{t}\) is a function of \(x_{t-1}\) (a state under the policy f) bounded by

$$\displaystyle{ 2\left \|r\right \|\left (1 + M\right ) =: b. }$$
(25)

From Proposition 2 (a), (17) and (21):

$$\displaystyle{ \Lambda _{t}(x_{t-1}) = V _{{\ast}}(x_{t-1}) - r(x_{t-1},f(x_{t-1})) - EV _{{\ast}}(x_{t}), }$$

and by Assumption 4.1, if \(x_{t-1} \in \Theta \), then \(x_{t} \in \Theta \); therefore (since r and \(V_{{\ast}}\) are zero on \(\Theta \)) \(\Lambda _{t}(x_{t-1}) = 0\) when \(x_{t-1} \in \Theta \). Hence \(\Lambda _{t} = \Lambda (x_{t-1})\), where \(\Lambda \) is a function from \(\mathbb{B}\).

In [16] it was proven that under Assumption 4.1 there exist constants c < ∞ and α < 1 such that for every \(f \in \mathbb{F}\),

$$\displaystyle{ \left \|Q_{f}^{n}\right \|_{ 0} \leq c\alpha ^{n},\quad n = 1,2,\ldots. }$$
(26)
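The geometric decay (26) is easy to observe numerically: \(\left \|Q_{f}^{n}\right \|_{0}\) is the largest row sum of \(Q_{f}^{n}\), and the decay rate is governed by the spectral radius. A sketch with a hypothetical substochastic kernel:

```python
import numpy as np

# Hypothetical substochastic kernel Q on X0 = {0, 1}; the deficit mass in
# each row is absorbed into Theta.
Q = np.array([
    [0.5, 0.2],
    [0.3, 0.4],
])

# ||Q^n||_0 = largest row sum of Q^n (operator norm for the sup norm).
norms = [np.linalg.matrix_power(Q, n).sum(axis=1).max() for n in range(1, 11)]

# The rate alpha in (26) can be taken as the spectral radius of Q; here
# every row of Q sums to 0.7, so ||Q^n||_0 = 0.7^n exactly (c = 1).
alpha = max(abs(np.linalg.eigvals(Q)))
print(alpha, norms[:3])
```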

On the other hand, in view of the above properties of \(\Lambda \), the right-hand side of (23) can be rewritten as follows. Let N ≥ 1 be an arbitrary (for now) fixed integer. Then,

$$\displaystyle\begin{array}{rcl} I(x):=\limsup _{n\rightarrow \infty }\sum _{t=1}^{n}E_{ x}^{f}\Lambda _{ t}& =& \sum _{t=1}^{\infty }E_{ x}^{f}\Lambda _{ t} \\ & =& \sum _{t=1}^{N}E_{ x}^{f}\Lambda _{ t} +\sum _{t>N}E_{x}^{f}\Lambda _{ t},{}\end{array}$$
(27)

and from (26),

$$\displaystyle{ \sup _{x\in X_{0}}\left \vert \sum _{t>N}E_{x}^{f}\Lambda _{ t}\right \vert = \left \|\sum _{t>N}Q_{f}^{t}\Lambda _{ t}\right \| \leq \sum _{t>N}\left \|Q_{f}^{t}\Lambda _{ t}\right \| \leq \frac{bc} {1-\alpha }\alpha ^{N+1}. }$$
(28)

Combining (23), (27) and (28), we obtain the inequality

$$\displaystyle{ \Delta (x) \leq \sum _{t=1}^{N}E_{ x}^{f}\Lambda _{ t} + \frac{bc} {1-\alpha }\alpha ^{N+1}. }$$
(29)

Let us bound \(\Lambda _{t}\) in the last inequality. From the definition of \(\Lambda _{t}\) in (21), together with (16)–(18) and Proposition 2 (a), we have:

$$\displaystyle\begin{array}{rcl} \Lambda _{t}& =& \sup _{a\in A(x_{t-1})}H(x_{t-1},a) -\sup _{a\in A(x_{t-1})}\tilde{H}(x_{t-1},a) +\tilde{ H}(x_{t-1},a_{t}) - H(x_{t-1},a_{t}) \\ & \leq & 2\sup _{a\in A(x_{t-1})}\left \vert H(x_{t-1},a) -\tilde{ H}(x_{t-1},a)\right \vert \\ &\leq & 2\sup _{a\in A(x_{t-1})}\left \vert EV _{{\ast}}[F(x_{t-1},a,\xi )] - E\tilde{V }_{{\ast}}[F(x_{t-1},a,\tilde{\xi })]\right \vert, {}\end{array}$$
(30)

where the expectations are interpreted as conditional expectations with \(x_{t-1}\) being fixed.

From (30) we get:

$$\displaystyle\begin{array}{rcl} \Lambda _{t}& \leq & 2\sup _{a\in A(x_{t-1})}\left \vert EV _{{\ast}}[F(x_{t-1},a,\xi )] - EV _{{\ast}}[F(x_{t-1},a,\tilde{\xi })]\right \vert \\ & & +2\sup _{a\in A(x_{t-1})}\left \vert EV _{{\ast}}[F(x_{t-1},a,\tilde{\xi })] - E\tilde{V }_{{\ast}}[F(x_{t-1},a,\tilde{\xi })]\right \vert \\ &\leq & 2\sup _{k\in \mathbb{K}}\left \vert EV _{{\ast}}[F(k,\xi )] - EV _{{\ast}}[F(k,\tilde{\xi })]\right \vert + 2\left \|V _{{\ast}}-\tilde{ V }_{{\ast}}\right \|. {}\end{array}$$
(31)

From Proposition 2 (c), there exist an integer m ≥ 1 and a constant β < 1 such that the operator \(T^{m}\) is contractive with modulus β. Thus, again using Proposition 2,

$$\displaystyle{ \left \|V _{{\ast}}-\tilde{ V }_{{\ast}}\right \| = \left \|T^{m}V _{ {\ast}}-\tilde{ T}^{m}\tilde{V }_{ {\ast}}\right \|\leq \left \|T^{m}V _{ {\ast}}- T^{m}\tilde{V }_{ {\ast}}\right \| + \left \|T^{m}\tilde{V }_{ {\ast}}-\tilde{ T}^{m}\tilde{V }_{ {\ast}}\right \|, }$$

or

$$\displaystyle{ \left \|V _{{\ast}}-\tilde{ V }_{{\ast}}\right \|\leq \frac{1} {1-\beta }\left \|T^{m}\tilde{V }_{ {\ast}}-\tilde{ T}^{m}\tilde{V }_{ {\ast}}\right \|. }$$
(32)

Now, since T is a nonexpansive operator, by induction we have

$$\displaystyle\begin{array}{rcl} \left \|T^{m}\tilde{V }_{ {\ast}}-\tilde{ T}^{m}\tilde{V }_{ {\ast}}\right \|& \leq & \left \|TT^{m-1}\tilde{V }_{ {\ast}}- T\tilde{T}^{m-1}\tilde{V }_{ {\ast}}\right \| + \left \|T\tilde{T}^{m-1}\tilde{V }_{ {\ast}}-\tilde{ T}\tilde{T}^{m-1}\tilde{V }_{ {\ast}}\right \| \\ &\leq & \left \|T^{m-1}\tilde{V }_{ {\ast}}-\tilde{ T}^{m-1}\tilde{V }_{ {\ast}}\right \| + \left \|T\tilde{V }_{{\ast}}-\tilde{ T}\tilde{V }_{{\ast}}\right \| \\ &\leq & m\left \|T\tilde{V }_{{\ast}}-\tilde{ T}\tilde{V }_{{\ast}}\right \| \\ &\leq & m\sup _{k\in K}\left \vert E\tilde{V }_{{\ast}}[F(k,\xi )] - E\tilde{V }_{{\ast}}[F(k,\tilde{\xi })]\right \vert. {}\end{array}$$
(33)

From (16) and Proposition 2 (a),

$$\displaystyle{ \tilde{V }_{{\ast}}(x) =\sup _{a\in A(x)}\left \{r(x,a) + E\tilde{V }_{{\ast}}[F(x,a,\tilde{\xi })]\right \}. }$$
(34)

Since \(\tilde{V }_{{\ast}}\) is bounded by \(M\left \|r\right \|\), from Assumption 4.2 (a) and (c), the function under the supremum in (34) is Lipschitzian with respect to k = (x, a). Then, as shown in [6], this fact together with Assumption 4.2 (b) implies that the function \(\tilde{V }_{{\ast}}\) in (34) is Lipschitzian. Therefore, applying (4.6) in Assumption 4.2 (d) to the function \(s \rightarrow \tilde{ V }_{{\ast}}[F(k,s)]\) in (33), we obtain that this function satisfies the Lipschitz condition with a constant not depending on k.

In the same way (using Assumption 4.2 (c)) one confirms that the function \(s \rightarrow V _{{\ast}}[F(k,s)]\) in (31) is Lipschitzian.

Finally, combining inequalities (31)–(33), \(\Lambda _{t}\) in (29) is bounded by \(\sup \left \vert E\varphi (\xi ) - E\varphi (\tilde{\xi })\right \vert \) over a certain class of functions \(\varphi\), which are bounded by the same constant \(\bar{b}\) and satisfy the Lipschitz condition with the same constant \(\bar{L}\) (these constants depend only on m, α, and the constants involved in Assumptions 4.1 and 4.2).

Therefore,

$$\displaystyle\begin{array}{rcl} \Lambda _{t}& \leq & (\bar{b} +\bar{ L})Dud(\xi,\tilde{\xi }) \\ & \leq & 2(\bar{b} +\bar{ L})\pi _{r}(\xi,\tilde{\xi }),{}\end{array}$$
(35)

where \(Dud(\xi,\tilde{\xi })\) denotes the Dudley distance between the distributions of random vectors ξ and \(\tilde{\xi }\). (See [3] for the definition of the Dudley metric, and the inequality between Dudley and Prokhorov metrics.)

If \(\tilde{b} = 2(\bar{b} +\bar{ L})\), then from (35) and (29)

$$\displaystyle{ \sup _{x\in X_{0}}\Delta (x) \leq N\tilde{b}\pi _{r}(\xi,\tilde{\xi }) + \frac{bc} {1-\alpha }\alpha ^{N+1} \equiv N\tilde{b}\pi _{ r}(\mu,\tilde{\mu }) + \frac{bc} {1-\alpha }\alpha ^{N+1}. }$$
(36)

Finally, the desired inequality (4.7) follows from (36) if we choose

$$\displaystyle{ N = \left [\max \left \{1,\log _{\alpha }\left ( \frac{1} {\pi _{r}(\mu,\tilde{\mu })}\right )\right \}\right ] + 1. }$$

 □ 
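The final optimization over N can be traced numerically: the bound (36) is a sum of a term growing linearly in N and a geometrically decaying tail, and the stated choice of N (interpreted so that \(\alpha ^{N} \lesssim \pi _{r}\)) yields a bound of order \(\pi _{r}\log (1/\pi _{r})\). A sketch with hypothetical constants standing in for \(\tilde{b}\), b, c, α:

```python
import math

# Illustrative constants (hypothetical, not derived from the assumptions).
b_tilde, b, c, alpha = 1.0, 1.0, 2.0, 0.5

def bound(pi, N):
    """Right-hand side of (36) for a truncation level N."""
    return N * b_tilde * pi + (b * c / (1 - alpha)) * alpha ** (N + 1)

def N_choice(pi):
    """An integer N with alpha^N <= pi, so the tail term becomes O(pi):
    this is the balancing choice made at the end of the proof."""
    return int(max(1.0, math.log(pi) / math.log(alpha))) + 1

for pi in (1e-1, 1e-2, 1e-3):
    N = N_choice(pi)
    print(pi, N, bound(pi, N))   # decays on the order of pi * log(1/pi)
```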


Copyright information

© 2015 Springer International Publishing Switzerland

Cite this paper

Gordienko, E., Martinez, J., Ruiz de Chávez, J. (2015). Stability Estimation of Transient Markov Decision Processes. In: Mena, R., Pardo, J., Rivero, V., Uribe Bravo, G. (eds) XI Symposium on Probability and Stochastic Processes. Progress in Probability, vol 69. Birkhäuser, Cham. https://doi.org/10.1007/978-3-319-13984-5_8
