
Uniformly constrained reinforcement learning

Autonomous Agents and Multi-Agent Systems

Abstract

We propose new multi-objective reinforcement learning algorithms that aim to find a globally Pareto-optimal deterministic policy that uniformly (in all states) maximizes a reward subject to a uniform probabilistic constraint over reaching forbidden states of a Markov decision process. Our requirements arise naturally in the context of safety-critical systems, but pose a significant unmet challenge. This class of learning problem is known to be hard and there are no off-the-shelf solutions that fully address the combined requirements of determinism and uniform optimality. Having formalized our requirements and highlighted the specific challenge of learning instability, using a simple counterexample, we define from first principles a stable Bellman operator that we prove partially respects our requirements. This operator is therefore a partial solution to our problem, but produces conservative policies in comparison to our previous approach, which was not designed to satisfy the same requirements. We thus propose a relaxation of the stable operator, using adaptive hysteresis, that forms the basis of a heuristic approach that is stable w.r.t. our counterexample and learns policies that are less conservative than those of the stable operator and our previous algorithm. In comparison to our previous approach, the policies of our adaptive hysteresis algorithm demonstrate improved monotonicity with increasing constraint probabilities, which is one of the characteristics we desire. We demonstrate that adaptive hysteresis works well with dynamic programming and reinforcement learning, and can be adapted to function approximation.


Notes

  1. We slightly abuse the term ‘Bellman operator’ to mean a Bellman-like operator that acts to improve policies.

  2. We only consider deterministic policies by definition.

  3. Source code available at github.com/jyounglee/uniformly-constrained-rl

References

  1. Achiam, J., Held, D., Tamar, A., & Abbeel, P. (2017). Constrained policy optimization. In D. Precup & Y. W. Teh (Eds.), 34th international conference on machine learning. Proceedings of machine learning research (Vol. 70, pp. 22–31). MLR Press.

  2. Alshiekh, M., Bloem, R., Ehlers, R., et al. (2018). Safe reinforcement learning via shielding. In 32nd AAAI conference on artificial intelligence (Vol. 32, No. 1, pp. 2669–2678). https://doi.org/10.1609/aaai.v32i1.11797. https://ojs.aaai.org/index.php/AAAI/article/view/11797.

  3. Altman, E. (1999). Constrained Markov decision processes. Boca Raton: CRC Press.

  4. Boutilier, C., & Lu, T. (2016). Budget allocation using weakly coupled, constrained Markov decision processes. In Proceedings of the 32nd conference on uncertainty in artificial intelligence (pp. 52–61).

  5. Carrara, N., Leurent, E., Laroche, R., et al. (2019). Budgeted reinforcement learning in continuous state space. In Advances in neural information processing systems (Vol. 32). Curran Associates, Inc.

  6. Chen, R. C., & Feinberg, E. A. (2007). Non-randomized policies for constrained Markov decision processes. In Mathematical methods of operations research (Vol. 66, pp. 165–179). https://doi.org/10.1007/s00186-006-0133-x.

  7. Chow, Y., Nachum, O., Duenez-Guzman, E., & Ghavamzadeh, M. (2018). A Lyapunov based approach to safe reinforcement learning. In Advances in neural information processing systems (NeurIPS) (Vol. 31, pp. 8103–8112). OpenReview.

  8. Dolgov, D., & Durfee, E. (2005). Stationary deterministic policies for constrained MDPs with multiple rewards, costs, and discount factors. In Proceedings of the 19th international joint conference on artificial intelligence. IJCAI (pp. 1326–1331). Morgan Kaufmann.

  9. Feinberg, E. A. (2000). Constrained discounted Markov decision processes and Hamiltonian cycles. Mathematics of Operations Research, 25(1), 130–140.

  10. Feinberg, E. A., & Shwartz, A. (1999). Constrained dynamic programming with two discount factors: applications and an algorithm. IEEE Transactions on Automatic Control, 44(3), 628–631.

  11. Forejt, V., Kwiatkowska, M., Norman, G., Parker, D., & Qu, H. (2011). Quantitative multi-objective verification for probabilistic systems. In P. A. Abdulla & K. R. M. Leino (Eds.), Tools and algorithms for the construction and analysis of systems (pp. 112–127). Berlin: Springer.

  12. Forejt, V., Kwiatkowska, M., & Parker, D. (2012). Pareto curves for probabilistic model checking. In S. Chakraborty & M. Mukund (Eds.), Automated technology for verification and analysis (pp. 317–332). Berlin: Springer.

  13. García, J., & Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(42), 1437–1480.

  14. Geibel, P., & Wysotzki, F. (2005). Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research, 24, 81–108.

  15. Hahn, J., & Zoubir, A. M. (2016). Risk-sensitive decision making via constrained expected returns. In IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 2569–2573). https://doi.org/10.1109/ICASSP.2016.7472141.

  16. Kalweit, G., Huegle, M., Werling, M., & Boedecker, J. (2020). Deep constrained Q-learning. arXiv:2003.09398.

  17. Lee, J., Balakrishnan, A., Gaurav, A., Czarnecki, K., & Sedwards, S. (2019). WiseMove: A framework to investigate safe deep reinforcement learning for autonomous driving. In D. Parker & V. Wolf (Eds.), Quantitative evaluation of systems (pp. 350–354). Cham: Springer.

  18. Lee, J., Sedwards, S., & Czarnecki, K. (2021). Recursive constraints to prevent instability in constrained reinforcement learning. In Hayes, Mannion, & Vamplew (Eds.), Proceedings of 1st multi-objective decision making workshop (MODeM 2021). http://modem2021.cs.nuigalway.ie. arXiv:2201.07958.

  19. Lee, S. Y., Sungik, C., & Chung, S.-Y. (2019). Sample-efficient deep reinforcement learning via episodic backward update. In Advances in neural information processing systems (Vol. 32). Curran Associates, Inc.

  20. Miettinen, K. (1999). Nonlinear multiobjective optimization (Vol. 12). Berlin: Springer.

  21. Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. New York: Wiley.

  22. Shalev-Shwartz, S., Shammah, S., & Shashua, A. (2016). Safe, multi-agent, reinforcement learning for autonomous driving. arXiv:1610.03295. https://doi.org/10.48550/ARXIV.1610.03295.

  23. Singh, S. P., & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1), 123–158. https://doi.org/10.1023/A:1018012322525

  24. Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. Cambridge: MIT Press.

  25. Tessler, C., Mankowitz, D. J., & Mannor, S. (2019). Reward constrained policy optimization. In International conference on learning representations (ICLR). OpenReview.

  26. Undurti, A., Geramifard, A., & How, J. P. (2011). Function approximation for continuous constrained MDPs. Technical report, MIT. https://people.csail.mit.edu/agf/Files/11ICRA-LinearCMDP.pdf.


Acknowledgements

The authors gratefully acknowledge the support of the Japan Science and Technology Agency (JST) ERATO project JPMJER1603: HASUO Metamathematics for Systems Design.

Author information

Corresponding author

Correspondence to Sean Sedwards.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Recursive constraints

The recursive constraints approach was first proposed in our previous work [18]. Here we present an updated, efficient sequential version of it. The simple intuition behind recursive constraints is that for each action \(a \in \mathcal{A}(s)\) in state s, we superimpose, by recursion, all the constraints imposed on a up to the current iteration, and use them to judge whether a is a safe action in s.

To illustrate the idea, we revisit the naive policy iteration example of Sect. 3, using the same conditions but now with recursive constraints. We illustrate this in Table 5, where action \(a \in \{\textsf{L}, \textsf{R}\}\) is constrained at each iteration by \(\mathcal{C}_a = 1\). Denoting by \(\mathcal{C}_a^i \in \{0, 1\}\) the recursive constraint satisfaction variable for action a at iteration i (in state \(s^1\)), by construction we have at each iteration in Table 5:

$$\begin{aligned}&\mathcal{C}_\textsf{L}^1 = {\textbf {1}}(\mathcal{P}_\textsf{LR}\le \theta ) \\&{\mathcal{C}_\textsf{L}^2} = {\textbf {1}}({\mathcal{P}_\textsf{LL}\leq \theta }\text { and } \mathcal{C}_\textsf{L}^1 = 1) = {\textbf {1}}({\mathcal{P}_\textsf{LL}\leq \theta }\text { and } \mathcal{P}_\textsf{LR}\le \theta ) \\&{\mathcal{C}_\textsf{L}^3} = {\textbf {1}}(\mathcal{P}_\textsf{LR}\le \theta \text { and } {\mathcal{C}_\textsf{L}^2 = 1}) = {\textbf {1}}(\mathcal{P}_\textsf{LR}\le \theta \text { and } {\mathcal{P}_\textsf{LL}\leq \theta }) \\&{\mathcal{C}_\textsf{L}^4} = {\textbf {1}}(\mathcal{P}_\textsf{LR}\le \theta \text { and } {\mathcal{C}_\textsf{L}^3 = 1}) = {\textbf {1}}(\mathcal{P}_\textsf{LR}\le \theta \text { and } {\mathcal{P}_\textsf{LL}\leq \theta }) \\ &\, {\vdots } \qquad \qquad \vdots \qquad \,\qquad {\vdots } \qquad \quad \qquad \vdots \qquad \qquad \,\,\,\, {\vdots } \end{aligned}$$

From this, we observe that the constraint \(\mathcal{C}_\textsf{L}\) on the action \(\textsf{L}\) in \(s^1\) stabilizes and thereby yields the same safe policy \(\pi _\textsf{R}\) from iteration \(i = 3\), as also shown in Table 5. In addition, the recursive constraint approach chooses \(\pi _\textsf{L}\) at the first iteration, which makes it possible to investigate at the next iteration (\(i = 2\)) whether the action \(\textsf{L}\), and hence the policy \(\pi _\textsf{L}\), is safe in \(s^1\), whereas the process with the stable operator \(\mathcal{T}\) does not permit this (by its definition), as illustrated in Table 3.
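To make the recursion above concrete, the following minimal Python sketch (an illustrative script of our own, not the released implementation) replays it using the closed-form reachability values of Appendix B for \(p = 0.7\) and \(\theta = 0.85\); the sequence of values checked for action \(\textsf{L}\) follows the alternation \(\mathcal{P}_\textsf{LR}, \mathcal{P}_\textsf{LL}, \mathcal{P}_\textsf{LR}, \dots\) displayed above.

```python
# Minimal sketch (illustrative only): replay the recursive constraint on action L in s^1,
# using the closed-form reachability values of Appendix B with p = 0.7 and theta = 0.85.
p, theta = 0.7, 0.85

P_LL = p / (p**2 - p + 1)              # P^{pi_L}(s^1, L), approx. 0.886
P_R_s1 = 1 / (p + 1)                   # P^{pi_R}(s^1),    approx. 0.588
P_R_s2 = p * P_R_s1                    # P^{pi_R}(s^2),    approx. 0.412
P_LR = (1 - p) * P_R_s2 + p * 1.0      # P^{pi_R}(s^1, L), approx. 0.824

# Values checked for action L at iterations i = 1, 2, 3, ..., as in the recursion above.
checked = [P_LR, P_LL, P_LR, P_LR, P_LR]

C_L = 1
for i, value in enumerate(checked, start=1):
    C_L = int(value <= theta and C_L == 1)   # superimpose all constraints seen so far
    allowed = "{L, R}" if C_L == 1 else "{R}"
    print(f"i = {i}: checked P = {value:.3f}, C_L^{i} = {C_L}, allowed actions in s^1: {allowed}")
```

Once \(\mathcal{C}_\textsf{L}\) reaches 0 it remains 0, so action \(\textsf{L}\) stays excluded and the policy stabilizes at \(\pi _\textsf{R}\), consistent with Table 5.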

Table 5 Policy iteration using recursive constraints on the MDP in Fig. 1, for \(p=0.7\), \(\gamma =0.95\) and threshold \(\theta = 0.85\)

On the other hand, the idea of recursive constraints still has an issue: apart from policy iteration, a dynamic programming or reinforcement learning method typically does not wait until its value function (i.e., \(\mathcal{P}^{\pi} (s, a)\)) is accurately estimated. In Table 5, for example, if \(\mathcal{P}_\textsf{LL}\) and/or \(\mathcal{P}_\textsf{LR}\) is not correctly estimated at the previous iterations, then the constraint \(\mathcal{C}_\textsf{L}\) at the current iteration can deteriorate, since it is constructed from all the previous constraints, including those based on inaccurate estimates at the early stages. In particular, the initial values of \(\mathcal{P}_\textsf{LL}\) and \(\mathcal{P}_\textsf{LR}\) are typically random and carry no information, which introduces arbitrary constraints that propagate to all subsequent iterations. Hence, the recursive constraints at and around the initial and previous stages must also be stabilized, in order for the approach to generalize to a broad class of reinforcement learning methods.

In order to resolve this stability issue, we (i) replace the iteration axis in Table 5 with the axis of the horizon window \({n = 1,2,\dots , N}\) and (ii) replace the constraint at stage n with

$$\begin{aligned} \max _{m \in [1..n]} \overline{ \mathcal{P}}\vphantom{P}_m(s, a) \le \theta \end{aligned},$$

where \(\overline{ \mathcal{P}}\vphantom{P}_n\) is an (over-)approximation of the n-bounded probabilistic reachability

$$\begin{aligned} \mathcal{P}^{\pi} (s, a \,;\,n )&:= \mathbb {P}^{\pi} ( s_{\min (T, n)} \in \mathcal{F}_\perp \,|\,s_0 a_0 = s a) \end{aligned}$$

w.r.t. the policy \(\pi ^{n-1}\) obtained at the previous iteration (\(n - 1\)). Here, an over-approximation means that \(0 \le \mathcal{P}^{\pi} (s, a \,;n) \le \overline{ \mathcal{P}}\vphantom{P}_n(s, a) \le 1\) for all \(sa \in {\mathscr{S}}^{+} \times \mathcal{A}(s)\).

We also define the action-independent version of \(\mathcal{P}^{\pi}\)

$$\begin{aligned} P^\pi (s \,;\,n)&:= \mathbb {P}^{\pi} ( s_{\min (T, n)} \in \mathcal{F}_\perp \,|\,s_0 = s ), \end{aligned}$$

which satisfies \(P^\pi (s \,;\, n) = \mathcal{P}^{\pi} (s, \pi (s) \,;\, n)\) for any \((s, n) \in {\mathscr{S}}^{+} \times \mathbb {N}\) and any policy \(\pi\). Also note that both n-bounded P-values are consistent with the respective unbounded ones in the limit \(n \rightarrow \infty\). That is, for any policy \(\pi\) and any \(sa \in {\mathscr{S}}^{+} \times \mathcal{A}(s)\),

$$\begin{aligned} P^\pi (s) = \lim _{n \rightarrow \infty } P^\pi (s \,;\, n) \,\text { and }\,\, \mathcal{P}^{\pi} (s, a) = \lim _{n \rightarrow \infty } \mathcal{P}^{\pi} (s, a \,;\, n). \end{aligned}$$

For policy \(\pi ^m\) indexed by \(m \in [1..N]\), we write \(P^{m}\) for \(P^{\pi ^m}\) and \(\mathcal{P}^{m}\) for \(\mathcal{P}^{\pi ^m}\), for both bounded and unbounded P-values.

To describe the sequential process of the recursive constraint approach, let \(\mathcal{R}\) be the recursive constraint operator defined as

$$\begin{aligned} \mathcal{R}( \overline{ \mathcal{P}}\vphantom{P}, \mathcal{C}) := ( \overline{ \mathcal{P}}\vphantom{P}\,', \mathcal{C}\,'), \end{aligned}$$

given \(\overline{ \mathcal{P}}\) and \(\mathcal{C}\) that map each \((s, a) \in {\mathscr{S}}^{+} \times \mathcal{A}(s)\) to \(\overline{ \mathcal{P}}(s, a) \in [0, 1]\) and \(\mathcal{C}(s, a) \in \{0, 1\}\), respectively. \(\mathcal{C}\,'\) represents the next constraint satisfaction, recursively defined as

$$\begin{aligned} \mathcal{C}\,'(s, a) := {\textbf {1}}\big ( \overline{ \mathcal{P}}\vphantom{P}\,'(s, a) \le \theta \text { and } \mathcal{C}(s, a) = 1 \big ) \end{aligned}$$

for the next (over-)approximated bounded probabilistic reachability \(\overline{ \mathcal{P}}\vphantom{P}\,'\) given by

$$\begin{aligned} \overline{ \mathcal{P}}\vphantom{P}\,'(s, a) := \mathbb {E}\big [ \overline{ \mathcal{P}}(s_1, \pi ^\star (s_1)) \, \vert \, s_0a_0 = sa \big ], \end{aligned}$$

where \(\pi ^\star\) is an optimal policy w.r.t. the recursively-constrained action space

$$\begin{aligned} \mathcal{A}_\mathcal{C}(s) := \big \{ a \in \mathcal{A}(s) \,|\,\mathcal{C}(s, a) = 1 \big \}, \end{aligned}$$

explicitly given by

$$\begin{aligned}&\pi ^\star (s) \in \! {\left\{ \begin{array}{ll} \displaystyle \mathop {\mathrm {arg\,min}}\limits _{a \in \mathfrak {A}_\mathcal{C}^\star (s)} \;\overline{ \mathcal{P}}\,(s, a) \,\text { for }\, \mathfrak {A}_\mathcal{C}^\star (s) := \mathop {\mathrm {arg\,max}}\limits _{a \in \mathcal{A}_\mathcal{C}(s)} Q^\star (s, a) \;\;\text { if } \mathcal{A}_\mathcal{C}(s) \ne \varnothing \\\displaystyle \mathop {\mathrm {arg\,max}}\limits _{a \in \overline{ \mathfrak{A}}(s)} Q^\star (s, a) \,\text { for }\, \overline{ \mathfrak{A}}(s) := \mathop {\mathrm {arg\,min}}\limits _{a \in \mathcal{A}(s)}\; \overline{ \mathcal{P}}\,(s, a) \;\;\;\,\text { otherwise.} \end{array}\right. } \end{aligned}$$

We write its value functions \(V^{\pi ^\star }\) and \(Q^{\pi ^\star }\) as \(V^\star\) and \(Q^\star\). Hence, \(\pi ^\star\) is optimal in the sense that \(V^\pi (s) \le V^\star (s)\) for any policy \(\pi\) s.t. \(\pi (s) \in \mathcal{A}_\mathcal{C}(s) \text { if } \mathcal{A}_\mathcal{C}(s) \ne \varnothing\) and \(\pi (s) \in \overline{ \mathfrak{A}}(s)\) otherwise. We denote a mapping from \(( \overline{ \mathcal{P}}, \mathcal{C})\) to such an optimal policy \(\pi ^\star\) by \(\mathrm {\Phi }\).
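For concreteness, the following Python sketch (our illustrative rendering, assuming tabular maps stored as dictionaries keyed by state–action pairs, with \(Q\) supplied as a fixed estimate) implements the two cases above: constrained-greedy selection w.r.t. \(Q\) with \(\overline{ \mathcal{P}}\) as tie-breaker when \(\mathcal{A}_\mathcal{C}(s) \ne \varnothing\), and the least-unsafe fallback otherwise.

```python
# Sketch of the selection rule Phi (illustrative; P_bar, C and Q are dicts over (s, a)).
def phi(s, actions, P_bar, C, Q):
    safe = [a for a in actions(s) if C[(s, a)] == 1]         # A_C(s)
    if safe:
        best_q = max(Q[(s, a)] for a in safe)
        greedy = [a for a in safe if Q[(s, a)] == best_q]    # arg max_{a in A_C(s)} Q(s, a)
        return min(greedy, key=lambda a: P_bar[(s, a)])      # then arg min P_bar(s, a)
    # A_C(s) is empty: fall back to the least-unsafe actions, then maximize Q among them.
    least_p = min(P_bar[(s, a)] for a in actions(s))
    least_unsafe = [a for a in actions(s) if P_bar[(s, a)] == least_p]
    return max(least_unsafe, key=lambda a: Q[(s, a)])
```

Applying phi state by state yields the policy \(\pi ^\star = \mathrm {\Phi }( \overline{ \mathcal{P}}, \mathcal{C})\) used in the next application of \(\mathcal{R}\).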

We now give a step-by-step description of the recursive constraint method, where the statements and equations are valid for all \(sa \in {\mathscr{S}}^{+} \times \mathcal{A}(s)\). In short, it is a process of applying the recursive constraint operator \(\mathcal{R}(\cdot )\) N times to the pair of initial maps \(( \overline{ \mathcal{P}}\vphantom{P}_1, \mathcal{C}_1)\) in Step 1 below; a code sketch of the whole procedure follows the step list.

  1. Step 1. Set the initial maps \(\overline{ \mathcal{P}}\vphantom{P}_1\) and \(\mathcal{C}_1\) by

    $$\begin{aligned} \overline{ \mathcal{P}}\vphantom{P}_1(s, a) := \underbrace{\mathbb {P}(s_1 \in \mathcal{F}_\perp | \, s_0 a_0 = s a )}_{\mathcal{P}^{\pi} (s, a \,;\,1)} \quad \text { and } \quad \mathcal{C}_1(s, a) := {\textbf {1}}( \overline{ \mathcal{P}}\vphantom{P}_1(s, a) \le \theta ) \end{aligned}$$

    where the probabilistic reachability \(\mathcal{P}^{\pi} (s ,a \,;\,1)\) (and thus \(P^\pi (s \,;\, 1)\)) bounded at horizon 1 is now stable, since, as shown above, it does not depend on the policy \(\pi\).

  2. Step 2. Apply the recursive constraint operator to obtain \(( \overline{ \mathcal{P}}\vphantom{P}_2, \mathcal{C}_2) = \mathcal{R}( \overline{ \mathcal{P}}\vphantom{P}_1, \mathcal{C}_1)\), where \(\overline{ \mathcal{P}}\vphantom{P}_2\) corresponds to the probabilistic reachability \(\mathcal{P}^1( \,\cdot \, \,;\, 2)\) bounded at horizon 2, w.r.t. policy \(\pi ^1 = \mathrm {\Phi }( \overline{ \mathcal{P}}\vphantom{P}_1, \mathcal{C}_1)\), as shown below:

    $$\begin{aligned} \overline{ \mathcal{P}}\vphantom{P}_2(s, a) = \mathbb {E}\big [ \overline{ \mathcal{P}}\vphantom{P}_1(s_1, \pi ^1(s_1)) \, \big \vert \, s_0a_0 = sa \big ]&= \mathbb {E}\big [ \mathcal{P}^1(s_1, \pi ^1(s_1) \,;\, 1 ) \, \big \vert \, s_0a_0 = sa \big ] \\&= \mathbb {E}\big [ P^1(s_1 \,;\, 1 ) \, \big \vert \, s_0a_0 = sa \big ] = \mathcal{P}^1(s, a \,;\, 2 ) \end{aligned}$$

    Here, the last equality is the Bellman equation. The constraint satisfaction map \(\mathcal{C}_2\) can be expressed explicitly as \(\mathcal{C}_2(s, a) = {\textbf {1}}\big ( \overline{ \mathcal{P}}\vphantom{P}_1(s, a) \le \theta \,\text { and }\, \overline{ \mathcal{P}}\vphantom{P}_2(s, a) \le \theta \big )\).

  3. Steps \(n = 3, 4, 5, \dots , N\). At each step n, we apply the recursive constraint operator to \(( \overline{ \mathcal{P}}\vphantom{P}_{n-1}, \mathcal{C}_{n-1})\) to obtain \(( \overline{ \mathcal{P}}\vphantom{P}_n, \mathcal{C}_n) = \mathcal{R}( \overline{ \mathcal{P}}\vphantom{P}_{n-1}, \mathcal{C}_{n-1})\), hence

    $$\begin{aligned}& \overline{ \mathcal{P}}\vphantom{P}_n(s, a) = \mathbb {E}\big [ \overline{ \mathcal{P}}\vphantom{P}_{n-1}(s_1, \pi ^{n-1}(s_1)) \, \big \vert \, s_0a_0 = sa \big ] \nonumber \\ &\mathcal{C}_n(s, a) = {\textbf {1}}\big ( \overline{ \mathcal{P}}\vphantom{P}_n(s, a) \le \theta \,\text { and }\, \mathcal{C}_{n-1}(s, a) = 1 \big ) \end{aligned}$$
    (21)

    where policy \(\pi ^{n-1} = \mathrm {\Phi }( \overline{ \mathcal{P}}\vphantom{P}_{n-1}, \mathcal{C}_{n-1})\). Constraint satisfaction map \(\mathcal{C}_n\) satisfies

    $$\begin{aligned} \mathcal{C}_n (s, a)&= {\textbf {1}}\big ( \overline{ \mathcal{P}}\vphantom{P}_n(s, a) \le \theta \,\text { and }\, \mathcal{C}_{n-1}(s, a) = 1 \big ) \\&= {\textbf {1}}\big ( \overline{ \mathcal{P}}\vphantom{P}_n(s, a) \le \theta \,\text { and }\, \overline{ \mathcal{P}}\vphantom{P}_{n-1}(s, a) \le \theta \,\text { and }\, \mathcal{C}_{n-2}(s, a) = 1 \big ) \\&\,\vdots \\ {}&= {\textbf {1}}\big ( \overline{ \mathcal{P}}\vphantom{P}_n(s, a) \le \theta \,\text { and } \, \overline{ \mathcal{P}}\vphantom{P}_{n-1}(s, a) \le \theta \,\text { and }\, \cdots \,\text { and }\,\overline{ \mathcal{P}}\vphantom{P}_1(s, a) \le \theta \big ) \\&= {\textbf {1}} \big ( \textstyle \bigwedge _{m \in [1..n]} \; ( \, \overline{ \mathcal{P}}\vphantom{P}_m(s, a) \le \theta \, ) \, \big ) \end{aligned}$$

    At the last step \(n = N\), the final policy \(\pi ^N\) is obtained by \(\pi ^N = \mathrm {\Phi }( \overline{ \mathcal{P}}\vphantom{P}_N, \mathcal{C}_N)\).
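The following tabular dynamic-programming sketch (our own Python rendering under stated assumptions: `T[s][a]` is a dictionary of successor probabilities, `F` the set of forbidden terminal states, `TERMINAL` the set of all terminal states, and `Q` a fixed performance estimate, whereas the full method would re-estimate the \(Q\)-values per horizon) strings Steps 1 to N together, using a compact selection rule equivalent to the \(\mathrm {\Phi }\) sketched above.

```python
# Illustrative tabular sketch of the N-step recursive constraints procedure.
def phi(s, actions, P_bar, C, Q):
    """Selection rule Phi (equivalent to the earlier sketch): constrained-greedy w.r.t. Q,
    tie-broken by P_bar, with a least-unsafe fallback when no action satisfies C."""
    safe = [a for a in actions(s) if C[(s, a)] == 1]
    pool = safe if safe else actions(s)
    if safe:   # maximize Q first, break ties by minimizing P_bar
        return min(pool, key=lambda a: (-Q[(s, a)], P_bar[(s, a)]))
    return min(pool, key=lambda a: (P_bar[(s, a)], -Q[(s, a)]))

def recursive_constraints(states, actions, T, F, TERMINAL, Q, theta, N):
    # Step 1: one-step reachability is policy independent, hence stable.
    P_bar = {(s, a): sum(pr for s1, pr in T[s][a].items() if s1 in F)
             for s in states for a in actions(s)}
    C = {sa: int(p1 <= theta) for sa, p1 in P_bar.items()}

    # Steps 2..N: apply the recursive constraint operator R.
    for n in range(2, N + 1):
        pi = {s: phi(s, actions, P_bar, C, Q) for s in states}
        P_next = {}
        for s in states:
            for a in actions(s):
                P_next[(s, a)] = sum(
                    pr * (1.0 if s1 in F else
                          0.0 if s1 in TERMINAL else P_bar[(s1, pi[s1])])
                    for s1, pr in T[s][a].items())
        C = {sa: int(P_next[sa] <= theta and C[sa] == 1) for sa in P_next}
        P_bar = P_next

    return {s: phi(s, actions, P_bar, C, Q) for s in states}   # final policy pi^N
```

At terminal successors the bootstrap value is \({\textbf {1}}(s_1 \in \mathcal{F}_\perp )\), matching the definition of the bounded probabilistic reachability.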

The n-bounded probabilistic reachability w.r.t. the policy \(\pi ^{n-1}\) given at the previous step \(n-1\) satisfies the Bellman equation:

$$\begin{aligned} \mathcal{P}^{n-1}(s, a \,;\, n) = \mathbb {E}\big [ P^{n-1}(s_1 \,;\, n-1) \, \vert \, s_0a_0 = sa \big ] \qquad \forall sa \in {\mathscr{S}}^{+} \times \mathcal{A}(s) \end{aligned}$$

However, in order to obtain \(P^{n-1}(\, \cdot \, \,;\, n-1)\), we would need to calculate \(P^{n-1}(\, \cdot \, \,;\,m)\) and use it in the backward induction for \(P^{n-1}(\, \cdot \, \,;\,m+1)\), all the way through \(m = 1, 2, 3, \dots , n-2\). The longer the horizon n, the more space and time this procedure requires. Instead, as shown in (21), our design choice at each step \(n \ge 3\) is to use a substitute \(\overline{ \mathcal{P}}\vphantom{P}_{n-1}(\,\cdot \,, \pi ^{n-1}(\cdot ))\) obtained at the previous stage \(n-1\). Here, \(\overline{ \mathcal{P}}\vphantom{P}_{n-1}(s, \pi ^{n-1}(s))\) typically over-approximates its target \(P^{n-1}(s \,;\, n-1)\), since \(\pi ^{n-1}\) is typically at least as conservative as \(\pi ^{n-2}\) (\(\because\) \(\mathcal{C}_{n-1} \le \mathcal{C}_{n-2}\)), and

$$\begin{aligned} \begin{cases} P^{n-1}(s \,;\, n-1) \,\, = \mathcal{P}^{n-1}(s, \pi ^{n-1}(s) \,;\, n-1) \\ \overline{ \mathcal{P}}\vphantom{P}_{n-1}(s, \pi ^{n-1}(s)) \ge \mathcal{P}^{n-2}(s, \pi ^{n-1}(s) \,;\, n - 1) \end{cases} \end{aligned}$$

(for the inequality, provided that at the previous step \(n-2\), \(\overline{ \mathcal{P}}\vphantom{P}_{n-2}(s, \pi ^{n-2}(s))\) over-approximates its target \(P^{n-2}(s \,;\, n-2)\) for all \(s \in {\mathscr{S}}^{+}\)).

Finally, we provide the policy \(\pi ^N\) at the last horizon N as the receding-horizon solution, which (i) is potentially conservative, but less so than the solution \(\pi ^*\) defined as a fixed point of the operator \(\mathcal{T}\) in Sect. 5.1, and thus (ii) typically performs better, uniformly over all states, subject to the N-bounded probabilistic reachability constraint imposed on every state. To address the instability issue, the final policy \(\pi ^N\) is subject to the recursive constraints \(\mathcal{C}_N(s, a) = 1\), which contain all constraints w.r.t. shorter horizons, i.e.,

$$\begin{aligned} \mathcal{C}_N (s, a) = 1 \Longleftrightarrow \bigwedge _{n \in [1..N]} ( \overline{ \mathcal{P}}\vphantom{P}_n(s, a) \le \theta ) \Longleftrightarrow \max _{n\in [1..N]} \overline{ \mathcal{P}}\vphantom{P}_n(s, a) \le \theta , \end{aligned}$$

where each \(\overline{ \mathcal{P}}\vphantom{P}_n(\cdot )\) is recursively defined from the initial map \(\overline{ \mathcal{P}}\vphantom{P}_{1}(\cdot )\), which is independent of any policy and can hence be stably obtained. Also note the following.

  1. Thanks to the incremental finite horizon formulation, \(( \overline{ \mathcal{P}}\vphantom{P}_n, \mathcal{C}_n)\) does not depend on the policy \(\pi ^n = \mathrm {\Phi }( \overline{ \mathcal{P}}\vphantom{P}_n, \mathcal{C}_n)\) obtained from it (i.e., there is no self-feedback learning loop), which further contributes to the stability.

  2. The shorter the horizon, the less constrained the actions typically are, since

    Proposition 8

    For any policy \(\pi\),

    $$\begin{aligned}\mathcal{P}^{\pi} (s, a \,;\, n) \le \mathcal{P}^{\pi} (s, a \,;\, n + 1) \quad \forall sa \in {\mathscr{S}}^{+} \times \mathcal{A}(s) \quad \forall n \in \mathbb {N}. \end{aligned}$$

    Proof

    See Appendix C.11\(\square\)

    Hence, the actions at the first and early horizons are less constrained than those at later horizons. This makes sense and helps relax the conservativeness of the final policy \(\pi ^N\), since any action constrained at the initial or early horizons remains constrained for all later horizons by the recursive constraints, even if it would eventually be safe at horizon N. This is an additional benefit of stepping forward over the horizon axis rather than the iteration axis.

  3. Lastly, the recursive constraint satisfaction map \(\mathcal{C}_N\), and thus \(\mathcal{A}_{\mathcal{C}_N}\), are monotonically decreasing as N increases. Since the spaces \(\mathscr{S}^{+}\) and \({\mathcal{A}}^{+}\) are finite, the constrained action set \(\mathcal{A}_{\mathcal{C}_N}\) therefore stabilizes within a finite horizon.

The difference between the recursive constraints approach presented in this section and that of our previous work [18] is that here we update nothing at the later horizons before we have completed the task \(( \overline{ \mathcal{P}}\vphantom{P}_{n+1}, \mathcal{C}_{n+1}) = \mathcal{R}( \overline{ \mathcal{P}}\vphantom{P}_n, \mathcal{C}_n)\) at the current horizon n. This reduces the space complexity, since we only need to store the data \(( \overline{ \mathcal{P}}\vphantom{P}_n, \mathcal{C}_n)\) at the current step to obtain the next pair \(( \overline{ \mathcal{P}}\vphantom{P}_{n+1}, \mathcal{C}_{n+1})\); it also removes unnecessary computation, since any change at an earlier horizon \(n < N\) (e.g., to policy \(\pi ^n\)) would otherwise alter the solutions all the way up to the last horizon N (e.g., \(\pi ^{n+1}, \pi ^{n+2}, \dots , \pi ^{N}\)). At each horizon n, since \(( \overline{ \mathcal{P}}\vphantom{P}_n, \mathcal{C}_n)\) is fixed a priori, a standard single-objective reinforcement learning method can be employed to solve the task of finding \(( \overline{ \mathcal{P}}\vphantom{P}_{n+1}, \mathcal{C}_{n+1})\) and an associated optimal policy \(\pi ^n\).
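As a hedged illustration of the last point, a sample-based variant could estimate \(\overline{ \mathcal{P}}\vphantom{P}_{n+1}\) with a TD-style update while \(( \overline{ \mathcal{P}}\vphantom{P}_n, \mathcal{C}_n)\) and \(\pi ^n\) remain frozen; the names and parameters below are our own assumptions, not the released implementation.

```python
# Sketch of a sample-based update for one frozen horizon (assumptions: transitions are
# (s, a, s1, terminal) tuples, F is the forbidden set, P_bar_n and pi_n are fixed maps).
def td_update_next_horizon(P_bar_next, P_bar_n, pi_n, transitions, F, alpha=0.1):
    for s, a, s1, terminal in transitions:
        if terminal:
            target = 1.0 if s1 in F else 0.0     # bounded reachability at terminal states
        else:
            target = P_bar_n[(s1, pi_n[s1])]     # bootstrap from the frozen horizon-n map
        old = P_bar_next.get((s, a), 1.0)        # conservative (over-approximating) default
        P_bar_next[(s, a)] = (1 - alpha) * old + alpha * target
    return P_bar_next
```

Only once this estimate has converged would \(\mathcal{C}_{n+1}(s, a) = {\textbf {1}}\big ( \overline{ \mathcal{P}}\vphantom{P}_{n+1}(s, a) \le \theta \,\text { and }\, \mathcal{C}_n(s, a) = 1 \big )\) be formed, exactly as in (21), before the process advances to horizon \(n+1\).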

B Analysis of the counter-MDP

The counter-MDP of Fig. 1 (reproduced in Fig. 16) is reduced to the Markov chains shown in Figs. 17 and 18 by policies \(\pi _\textsf{L}\) and \(\pi _\textsf{R}\), respectively. Closed form expressions of the \(\mathcal{P}\)- and \(Q\)-functions of the counter-MDP can be constructed by solving simultaneous equations for the probabilistic reachability function P and value function V of the two Markov chains. The discount factor \(\gamma \in [0, 1)\) and step reward r are constant. By definition:

$$\begin{aligned}&P^{\pi _\textsf{L}}(\textsf{X})=P^{\pi _\textsf{R}}(\textsf{X}) := 1\qquad P^{\pi _\textsf{L}}(\textsf{G})=P^{\pi _\textsf{R}}(\textsf{G}) := 0 \\&V^{\pi _\textsf{L}}(\textsf{X})=V^{\pi _\textsf{R}}(\textsf{X}) := 0\qquad V^{\pi _\textsf{L}}(\textsf{G})=V^{\pi _\textsf{R}}(\textsf{G}) := 0 \end{aligned}$$

Fig. 17 induces the following equations for \(\pi _\textsf{L}\):

$$\begin{aligned}&P^{\pi _\mathsf{L}}(s^1)=p P^{\pi _\mathsf{L}}(\mathsf{X})+(1-p)P^{\pi _\mathsf{L}}(s^2) \\&P^{\pi _\mathsf{L}}(s^2)=p P^{\pi _\mathsf{L}}(s^1)+(1-p)P^{\pi _\mathsf{L}}(\mathsf{G}) \\&V^{\pi _\mathsf{L}}(s^1)=r+\gamma (p V^{\pi _\mathsf{L}}(\mathsf{X})+(1-p)V^{\pi _\mathsf{L}}(s^2)) \\&V^{\pi _\mathsf{L}}(s^2)=r+\gamma (p V^{\pi _\mathsf{L}}(s^1)+(1-p)V^{\pi _\mathsf{L}}(\mathsf{G})) \end{aligned}$$

The above can then be solved to give \(P^{\pi _\textsf{L}}\) and \(V^{\pi _\textsf{L}}\) in terms of r, p and \(\gamma\). For instance, \(P^{\pi _\textsf{L}}(s^1) = p/(p^2-p+1)\) and \(V^{\pi _\textsf{L}}(s^1)=r(1+\gamma (1-p))/(1 - \gamma ^2p(1-p))\). \(P^{\pi _\textsf{L}}\) and \(V^{\pi _\textsf{L}}\) can then be used to define closed form expressions for \(\mathcal{P}\) and \(Q\):

$$\begin{aligned}&\mathcal{P}^{\pi _\mathsf{L}}(\mathsf{X},\,\cdot \,)=P^{\pi _\mathsf{L}}(\mathsf{X})\quad \,\mathcal{P}^{\pi _\mathsf{L}}(\mathsf{G},\,\cdot \,)=P^{\pi _\mathsf{L}}(\mathsf{G})\;\!\quad \mathcal{P}^{\pi _\mathsf{L}}(s^2,\mathsf{R})=P^{\pi _\mathsf{L}}(s^2)\\&Q^{\pi _\mathsf{L}}(\mathsf{X}, \,\cdot \,)=V^{\pi _\mathsf{L}}(\mathsf{X})\quad Q^{\pi _\mathsf{L}}(\mathsf{G},\,\cdot \,)=V^{\pi _\mathsf{L}}(\mathsf{G})\quad Q^{\pi _\mathsf{L}}(s^2,\mathsf{R})=V^{\pi _\mathsf{L}}(s^2)\\&\mathcal{P}_{\mathsf{L}\mathsf{L}}\,:=\mathcal{P}^{\pi _\mathsf{L}}(s^1,\mathsf{L})=P^{\pi _\mathsf{L}}(s^1)\quad \mathcal{P}_{\mathsf{R}\mathsf{L}}\;\!:=\mathcal{P}^{\pi _\mathsf{L}}(s^1,\mathsf{R})=p P^{\pi _\mathsf{L}}(s^2) + (1-p)P^{\pi _\mathsf{L}}(\mathsf{X}) \\&Q_{\mathsf{L}\mathsf{L}}:=Q^{\pi _\mathsf{L}}(s^1,\mathsf{L})=V^{\pi _\mathsf{L}}(s^1)\quad Q_{\mathsf{R}\mathsf{L}}:=Q^{\pi _\mathsf{L}}(s^1,\mathsf{R})=r+\gamma (p V^{\pi _\mathsf{L}}(s^2) + (1\!-\!p)V^{\pi _\mathsf{L}}(\mathsf{X})) \end{aligned}$$

Similarly, Fig. 18 induces the following equations for \(\pi _\textsf{R}\):

$$\begin{aligned}&P^{\pi _\mathsf{R}}(s^1)=(1-p)P^{\pi _\mathsf{R}}(\mathsf{X})+p P^{\pi _\mathsf{R}}(s^2) \\&P^{\pi _\mathsf{R}}(s^2)=p P^{\pi _\mathsf{R}}(s^1)+(1-p)P^{\pi _\mathsf{R}}(\mathsf{G}) \\&V^{\pi _\mathsf{R}}(s^1)=r+\gamma ((1-p)V^{\pi _\mathsf{R}}(\mathsf{X})+p V^{\pi _\mathsf{R}}(s^2)) \\&V^{\pi _\mathsf{R}}(s^2)=r+\gamma (p V^{\pi _\mathsf{R}}(s^1)+(1-p)V^{\pi _\mathsf{R}}(\mathsf{G})) \end{aligned}$$

Solving \(P^{\pi _\textsf{R}}\) and \(V^{\pi _\textsf{R}}\) in terms of r, p and \(\gamma\), for instance, \(P^{\pi _\textsf{R}}(s^1)=1/(p+1)\) and \(V^{\pi _\textsf{R}}(s^1)=r/(1-\gamma p)\), leads to the following closed form expressions for \(\mathcal{P}\) and \(Q\):

$$\begin{aligned}&\mathcal{P}^{\pi _\mathsf{R}}(\mathsf{X}, \,\cdot \,) = P^{\pi _\mathsf{R}}(\mathsf{X})\;\!\quad \mathcal{P}^{\pi _\mathsf{R}}(\mathsf{G}, \,\cdot \,)=P^{\pi _\mathsf{R}}(\mathsf{G}) \;\!\quad \mathcal{P}^{\pi _\mathsf{R}}(s^2,\mathsf{R})=P^{\pi _\mathsf{R}}(s^2)\\&Q^{\pi _\mathsf{R}}(\mathsf{X}, \,\cdot \, )=V^{\pi _\mathsf{R}}(\mathsf{X})\quad Q^{\pi _\mathsf{R}}(\mathsf{G},\,\cdot \,)=V^{\pi _\mathsf{R}}(\mathsf{G})\quad Q^{\pi _\mathsf{R}}(s^2,\mathsf{R})=V^{\pi _\mathsf{R}}(s^2)\\&\mathcal{P}_{\mathsf{R}\mathsf{R}}\,:=\mathcal{P}^{\pi _\mathsf{R}}(s^1,\mathsf{R})=P^{\pi _\mathsf{R}}(s^1) \;\!\quad \mathcal{P}_{\mathsf{L}\mathsf{R}}:= \mathcal{P}^{\pi _\mathsf{R}}(s^1,\mathsf{L})\,=(1-p)P^{\pi _\mathsf{R}}(s^2) + p P^{\pi _\mathsf{R}}(\mathsf{X}) \\&Q_{\mathsf{R}\mathsf{R}}:=Q^{\pi _\mathsf{R}}(s^1,\mathsf{R})=V^{\pi _\mathsf{R}}(s^1)\quad Q_{\mathsf{L}\mathsf{R}}:=Q^{\pi _\mathsf{R}}(s^1,\mathsf{L})=r +\!\gamma ((1\!-\!p)V^{\pi _\mathsf{R}}(s^2) \!+\! p V^{\pi _\mathsf{R}}(\mathsf{X})) \end{aligned}$$
Fig. 16 Counter-MDP of Fig. 1

Fig. 17 Markov chain induced by \(\pi _\textsf{L}\)

Fig. 18 Markov chain induced by \(\pi _\textsf{R}\)
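As a quick numerical cross-check of these closed forms (an illustrative script using the paper's values \(p = 0.7\), \(\gamma = 0.95\), \(r = -1\); not part of the published artifact), one can solve the two pairs of simultaneous equations directly:

```python
import numpy as np

p, gamma, r = 0.7, 0.95, -1.0
q = 1 - p

# Bellman systems of Figs. 17 and 18 in the unknowns (value at s^1, value at s^2);
# the terminal states X and G contribute the constant terms on the right-hand side.
P_L = np.linalg.solve([[1, -(1 - p)], [-p, 1]], [p, 0])            # P^{pi_L}
V_L = np.linalg.solve([[1, -gamma * (1 - p)], [-gamma * p, 1]], [r, r])
P_R = np.linalg.solve([[1, -p], [-p, 1]], [1 - p, 0])              # P^{pi_R}
V_R = np.linalg.solve([[1, -gamma * p], [-gamma * p, 1]], [r, r])

# Compare with the closed forms quoted above and the values used in Appendix C.2.
assert np.isclose(P_L[0], p / (p**2 - p + 1))                      # P^{pi_L}(s^1)
assert np.isclose(V_L[0], r * (1 + gamma * q) / (1 - gamma**2 * p * q))
assert np.isclose(P_R[0], 1 / (p + 1))                             # P^{pi_R}(s^1)
assert np.isclose(V_R[0], r / (1 - gamma * p))
print(P_L, V_L, P_R, V_R)   # approx. [0.886 0.620], [-1.585 -2.054], [0.588 0.412], [-2.985 -2.985]
```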

C Proofs

This appendix provides the proofs of all propositions and theorems presented in the body of the paper. Some proofs, including those of the following lemmas, which are essential for the other proofs, assume that the reader is familiar with the Bellman equations and the principle of optimality.

Lemma 1

Let \(\Pi\) be a subset of all policies, state \(s \in {\mathscr{S}}^{+}\), policy \({\pi ^*}\in \Pi\) and

$$\begin{aligned} \mathcal{A}_\Pi (s) := \big \{ a \in \mathcal{A}(s) : \exists \pi \in \Pi \text { s.t. } a = \pi (s) \big \}. \end{aligned}$$
a.:

\(V^\pi (s) \le V^*(s)\) \(\;\forall \pi \in \Pi\) \(\;\implies \;\) \({\pi ^*}(s) \in \displaystyle \text{arg\,max}_{a \in \mathcal{A}_\Pi (s)} Q^*(s, a)\)

b.:

\(P^*(s) \le P^\pi (s)\) \(\;\forall \pi \in \Pi\) \(\;\implies \;\) \({\pi ^*}(s) \in \displaystyle \text{arg\,min}_{a \in \mathcal{A}_\Pi (s)} \mathcal{P}^{*}(s, a)\)

Proof

The first and second preconditions mean that \({\pi ^*}\) is optimal over \(\Pi\) w.r.t. maximizing \(V(s)\) and minimizing \(P(s)\), respectively. \(\mathcal{A}_\Pi (s)\) is the set of all actions a s.t. \(a = \pi (s)\) in state s for some policy \(\pi \in \Pi\). Therefore, by the principle of optimality,

$$\begin{aligned}&V^*(s) = \max _{a \in \mathcal{A}_\Pi (s)} \mathbb {E}\big [r_0 + \gamma \cdot V^*(s_1) \,|\,s_0a_0 = sa \big ] = \max _{a \in \mathcal{A}_\Pi (s)} Q^*(s, a) \\&P^*(s) = \min _{a \in \mathcal{A}_\Pi (s)} \mathbb {E}\big [P^*(s_1) \,|\,s_0a_0 = sa \big ] = \min _{a \in \mathcal{A}_\Pi (s)} \mathcal{P}^{*}(s, a) \end{aligned}$$

for non-terminal \(s \in \mathscr{S}\). For terminal \(s \in \mathscr{S}_\perp\),

$$\begin{aligned}&{V^*(s) = \max _{a \in \mathcal{A}_\Pi (s)} \mathbb {E}\big [r_0 \,|\,s_0a_0 = sa \big ] = \max _{a \in \mathcal{A}_\Pi (s)} Q^*(s, a)} \\&{P^*(s) = \min _{a \in \mathcal{A}_\Pi (s)} \mathcal{P}^{*}(s, a) = \textbf{1}(s \in \mathcal{F}_\perp )} \end{aligned}$$

and thus the proof is completed since \({\pi ^*}\in \Pi\) and \({\left\{ \begin{array}{ll} V^*(s) = Q^*(s, {\pi ^*}(s)) \\ P^*(s) = \,\mathcal{P}^{*}(s, {\pi ^*}(s)) \end{array}\right. }\)\(\square\)

Lemma 2

Let \(\{\mathscr{S}_n^{+}\}_{n=1,2}\) be a partition of \({\mathscr{S}}^{+}\), i.e., \({\mathscr{S}}\,_{1}^{+} \cap {\mathscr{S}}\,_{2}^{+} = \varnothing\) and \({\mathscr{S}}\,_{1}^{+} \cup {\mathscr{S}}\,_{2}^{+} = {\mathscr{S}}^{+}\). For any policies \(\pi\) and \(\hat{\pi }\):

a.:

\(V^{\hat{\pi }} \ge V^\pi\) if \(V^{\hat{\pi }} \ge V^\pi\) over \({\mathscr{S}}\,_1^{+}\) and for each \(s \in {\mathscr{S}}\,_2^{+}\), either

$$\begin{aligned} V^{\hat{\pi }}(s) \ge Q^{\hat{\pi }}(s, \pi (s)) \;\text { or }\; V^\pi (s) \le Q^\pi (s, \hat{\pi }(s)) \end{aligned}$$
b.:

\(P^{\hat{\pi }} \le P^\pi\) if \(P^{\hat{\pi }} \le P^{\pi }\) over \({\mathscr{S}}^{+}_1\) and for each \(s \in {\mathscr{S}}^{+}_2\), either

$$\begin{aligned} P^{\hat{\pi }}(s) \le \mathcal{P}^{\hat{\pi }}(s, \pi (s)) \;\text { or }\; P^\pi (s) \ge \mathcal{P}^{\pi} (s, \hat{\pi }(s)) \end{aligned}$$

Proof

Let \(\smash {\overline{\mathscr{S}}}\,_1^+ := {\mathscr{S}}\,_{1}^{+} \cup \mathscr{S}_\perp\) and \(\mathscr{S}_2 := {\mathscr{S}}_{2}^{+} \setminus \mathscr{S}_\perp\). Then, the sets are disjoint and the latter does not contain any terminal state. Moreover, since \(\mathcal{P}\)- and \(Q\)-values in each terminal state are equal to their \(P\)- and \(V\)-values, i.e., \(\mathcal{P}(s, a) = P(s) = \textbf{1}(s \in \mathcal{F}_\perp )\) and \(Q(s, \pi (s)) = V^\pi (s) = R_\perp (s, \pi (s))\), the above preconditions over \({\mathscr{S}}\,_{1}^{+}\) and \({\mathscr{S}}\,_{2}^{+}\) can be replaced by those over the refined regions \(\smash {\overline{\mathscr{S}}}\,_1^{+}\) and \(\mathscr{S}_2\), respectively. Since there is nothing to prove if initial state \(s_0 \in \smash {\overline{\mathscr{S}}}\,_1^{+}\), without loss of generality we suppose \(s_0 \in \mathscr{S}_2\). Note that all the formulas and statements w.r.t. \(s_0\) in this proof are true for all \(s_0 \in \mathscr{S}_2\). For a sequence of states \(s_0s_1s_2 \cdots\), we write \(s_{1:\tau } \in \mathscr{S}_2\) iff \(s_t \in \mathscr{S}_2\) for all \(t \in [1..\tau ]\). For brevity, we also write \(\mathscr{T}^{\pi} (s_{0:\tau }) := \prod _{t=0}^{\tau -1} \mathscr{T}\,(s_t, \pi (s_t))(s_{t+1})\), e.g., \(\mathscr{T}^{\pi} (s_{0:1}) := \mathscr{T}\,\,(s_0, \pi (s_0))(s_{1})\).

First, we prove a. \(V^{\hat{\pi }} \ge V^\pi\). If \(V^{\hat{\pi }}(s_0) \ge Q^{\hat{\pi }}(s_0, \pi (s_0))\), then we obtain

$$\begin{aligned} V^{\hat{\pi }}(s_0) - V^\pi (s_0) \ge Q^{\hat{\pi }}(s_0, \pi (s_0)) - V^\pi (s_0) = \gamma \cdot \mathbb {E}^{\pi } \big [ V^{\hat{\pi }}(s_1) - V^{\pi }(s_1) \,|\,s_0 \big ] \end{aligned}$$

where the equality comes by substituting the Bellman equations:

$$\begin{aligned} Q^{\hat{\pi }}(s_0, \pi (s_0)) = \mathbb {E}^{\pi } \big [ r_0 + \gamma V^{\hat{\pi }}(s_1) \,|\,s_0 \big ] \text { and } V^{\pi }(s_0) = \mathbb {E}^{\pi } \big [ r_0 + \gamma V^{\pi }(s_1) \,|\,s_0 \big ] \end{aligned}$$

Similarly, if \(V^{\pi }(s_0) \le Q^{\pi }(s_0, \hat{\pi }(s_0))\), then by the Bellman equations:

$$\begin{aligned} Q^{\pi }(s_0, \hat{\pi }(s_0)) = \mathbb {E}^{\hat{\pi }} \big [ r_0 + \gamma V^{\pi }(s_1) \,|\,s_0 \big ] \text { and } V^{\hat{\pi }}(s_0) = \mathbb {E}^{\hat{\pi }} \big [ r_0 + \gamma V^{\hat{\pi }}(s_1) \,|\,s_0 \big ] \end{aligned}$$

the following holds:

$$\begin{aligned} V^{\hat{\pi }}(s_0) - V^\pi (s_0) \ge V^{\hat{\pi }}(s_0) - Q^{\pi }(s_0, \hat{\pi }(s_0)) = \gamma \cdot \mathbb {E}^{\hat{\pi }} \big [ V^{\hat{\pi }}(s_1) - V^{\pi }(s_1) \,|\,s_0 \big ] \end{aligned}$$

Let \(\mu\) be a policy such that for each \(s \in \mathscr{S}_2\), \(\mu (s) = \pi (s)\) if \(V^{\hat{\pi }}(s) \ge Q^{\hat{\pi }}(s, \pi (s))\) and \(\mu (s) = \hat{\pi }(s)\) otherwise. Then, \(\mu\) is well-defined and the proof of the first part a. \(V^{\hat{\pi }} \ge V^\pi\) is completed by the following claim, since \(\mu\) satisfies (22).

Claim. \(V^{\hat{\pi }} \ge V^\pi\) if \(V^{\hat{\pi }} \ge V^\pi\) over \(\smash {\overline{\mathscr{S}}}\,_1^{+}\) and there exists policy \(\mu\) s.t.

$$\begin{aligned} V^{\hat{\pi }}(s_0) - V^\pi (s_0) \ge \gamma \cdot \mathbb {E}^{\mu } \big [ V^{\hat{\pi }}(s_1) - V^{\pi }(s_1) \,|\,s_0 \big ] \qquad \forall s_0 \in \mathscr{S}_2 \end{aligned}$$
(22)

Proof of Claim. Since \(V^{\hat{\pi }} \ge V^\pi\) over \(\smash {\overline{\mathscr{S}}}\,_1^{+}\), it is obvious that for all \(s_0a_0 \in \mathscr{S}_2 \times \mathcal{A}(s_0)\),

$$\begin{aligned} \textstyle \sum _{s_1 \in \smash {\overline{\mathscr{S}}}\,_1^{+}} \mathscr{T}\,\,(s_0, a_0)(s_1) \cdot \big ( V^{\hat{\pi }}(s_1) - V^{\pi }(s_1) \big ) \ge 0 \end{aligned}$$

and therefore,

$$\begin{aligned} \mathbb {E} \big [ V^{\hat{\pi }}(s_1) - V^{\pi }(s_1) \,|\,s_0 a_0 \big ] \ge \textstyle \sum _{s_1 \in \mathscr{S}_2} \mathscr{T}\,\,(s_0, a_0)(s_1) \cdot \big ( V^{\hat{\pi }}(s_1) - V^{\pi }(s_1) \big ) \end{aligned}$$

Applying the inequality to (22), we obtain the following:

$$\begin{aligned} V^{\hat{\pi }}(s_0) - V^\pi (s_0)&\ge \gamma \cdot \mathbb {E}^{\mu } \big [ V^{\hat{\pi }}(s_1) - V^{\pi }(s_1) \,|\,s_0 \big ] \\&\ge \gamma \textstyle \sum _{s_1 \in \mathscr{S} _2} \mathscr{T}^{\mu }(s_{0:1}) \cdot \big ( V^{\hat{\pi }}(s_1) - V^{\pi }(s_1) \big ) \end{aligned}$$

Since \(s_1 \in \mathscr{S} _2\), we recursively apply the same inequality to the right hand sides:

$$\begin{aligned}&\ge \gamma ^2 \textstyle \sum _{s_{1:2} \in \mathscr{S}_2} \mathscr{T}^{\mu }(s_{0:2}) \cdot \big ( V^{\hat{\pi }}(s_2) - V^{\pi }(s_2) \big ) \\&\;\;\vdots \\&\ge \gamma ^n \textstyle \sum _{s_{1:n} \in \mathscr{S} _2} \mathscr{T}^{\mu} (s_{0:n}) \cdot \big ( V^{\hat{\pi }}(s_n) - V^{\pi }(s_n) \big ) \end{aligned}$$

where the last formula absolutely converges to zero as \(n \rightarrow \infty\), thanks to \(\gamma ^n \rightarrow 0\) and the boundedness of the rest. Since \(s_0 \in \mathscr{S}_2\) is arbitrary, we therefore conclude that \(V^{\hat{\pi }} \ge V^\pi\) over \(\mathscr{S}_2\) and thus over \({\mathscr{S}}^{+} = \smash {\overline{\mathscr{S}}}^{+}_1 \cup \mathscr{S}_2\). \(\square\)

The proof of the second part b. \(P^{\hat{\pi }} \le P^\pi\) can be done in a similar manner. If \(P^{\hat{\pi }}(s_0) \le \mathcal{P}^{\hat{\pi }}(s_0, \pi (s_0))\), then we obtain

$$\begin{aligned} P^{\hat{\pi }}(s_0) - P^\pi (s_0) \le \mathcal{P}^{\hat{\pi }}(s_0, \pi (s_0)) - P^\pi (s_0) = \mathbb {E}^{\pi } \big [ P^{\hat{\pi }}(s_1) - P^{\pi }(s_1) \,|\,s_0 \big ] \end{aligned}$$

where the equality comes by the substitutions of the Bellman equations:

$$\begin{aligned} \mathcal{P}^{\hat{\pi }}(s_0, \pi (s_0)) = \mathbb {E}^{\pi } \big [ P^{\hat{\pi }}(s_1) \,|\,s_0 \big ] \text { and } P^{\pi }(s_0) = \mathbb {E}^{\pi } \big [ P^{\pi }(s_1) \,|\,s_0 \big ] \end{aligned}$$

Similarly, if \(P^{\pi }(s_0) \ge \mathcal{P}^{\pi }(s_0, \hat{\pi }(s_0))\), then by the Bellman equations

$$\begin{aligned} \mathcal{P}^{\pi }(s_0, \hat{\pi }(s_0)) = \mathbb {E}^{\hat{\pi }} \big [ P^{\pi }(s_1) \,|\,s_0 \big ] \text { and } P^{\hat{\pi }}(s_0) = \mathbb {E}^{\hat{\pi }} \big [ P^{\hat{\pi }}(s_1) \,|\,s_0 \big ] \end{aligned}$$

we obtain the following:

$$\begin{aligned} P^{\hat{\pi }}(s_0) - P^\pi (s_0) \le P^{\hat{\pi }}(s_0) - \mathcal{P}^{\pi }(s_0, \hat{\pi }(s_0)) = \mathbb {E}^{\hat{\pi }} \big [ P^{\hat{\pi }}(s_1) - P^{\pi }(s_1) \,|\,s_0 \big ] \end{aligned}$$

Let \(\mu\) be a policy such that for each \(s \in \mathscr{S}_2\), \(\mu (s) = \pi (s)\) if \(P^{\hat{\pi }}(s) \le \mathcal{P}^{\hat{\pi }}(s, \pi (s))\) and \(\mu (s) = \hat{\pi }(s)\) otherwise. Then, \(\mu\) is well-defined and the proof is completed by the following claim as \(\mu\) satisfies (23).

Claim. \(P^{\hat{\pi }} \le P^\pi\) if \(P^{\hat{\pi }} \le P^\pi\) over \(\smash {\overline{\mathscr{S}}}\,_1^{+}\) and there exists policy \(\mu\) s.t.

$$\begin{aligned} P^{\hat{\pi }}(s_0) - P^\pi (s_0) \le \mathbb {E}^{\mu } \big [ P^{\hat{\pi }}(s_1) - P^{\pi }(s_1) \,|\,s_0 \big ] \qquad \forall s_0 \in \mathscr{S}_2 \end{aligned}$$
(23)

Proof of Claim. Since \(P^{\hat{\pi }} \le P^\pi\) over \(\smash {\overline{\mathscr{S}}}\,_1^{+}\), it is obvious that for all \(s_0a_0 \in \mathscr{S}_2 \times \mathcal{A}(s_0)\),

$$\begin{aligned} \textstyle \sum _{s_1 \in \smash {\overline{\mathscr{S}}}\,_1^{+}} \mathscr{T}\,(s_0, a_0)(s_1) \cdot \big ( P^{\hat{\pi }}(s_1) - P^{\pi }(s_1) \big ) \le 0, \end{aligned}$$

and therefore

$$\begin{aligned} \mathbb {E} \big [ P^{\hat{\pi }}(s_1) - P^{\pi }(s_1) \,|\,s_0 a_0 \big ] \le \textstyle \sum _{s_1 \in \mathscr{S}_2} \mathscr{T}\,(s_0, a_0)(s_1) \cdot \big ( P^{\hat{\pi }}(s_1) - P^{\pi }(s_1) \big ) \end{aligned}$$

Applying the inequality to (23), we obtain the following:

$$\begin{aligned} P^{\hat{\pi }}(s_0) - P^\pi (s_0)&\le \mathbb {E}^{\mu } \big [ P^{\hat{\pi }}(s_1) - P^{\pi }(s_1) \,|\,s_0 \big ] \nonumber \\&\le \textstyle \sum _{s_1 \in \mathscr{S}_2} \mathscr{T}^{\mu }(s_{0:1}) \cdot \big ( P^{\hat{\pi }}(s_1) - P^{\pi }(s_1) \big ) \end{aligned}$$

Since \(s_1 \in \mathscr{S}_2\), we recursively apply the same inequality to the right hand sides:

$$\begin{aligned}&\le \textstyle \sum _{s_{1:2} \in \mathscr{S}_2} \mathscr{T}^{\mu }(s_{0:2}) \cdot \big ( P^{\hat{\pi }}(s_2) - P^{\pi }(s_2) \big ) \nonumber \\&\;\;\vdots \nonumber \\&\le \textstyle \sum _{s_{1:n} \in \mathscr{S}_2} \mathscr{T}^{\mu }(s_{0:n}) \cdot \big ( P^{\hat{\pi }}(s_n) - P^{\pi }(s_n) \big ) \nonumber \\&\le \mathbb {P}^\mu ( s_{1:n} \in \mathscr{S}_2 \,\vert \, s_0 ) \cdot \max _{s \in \mathscr{S}_2} \big ( P^{\hat{\pi }}(s) - P^{\pi }(s) \big ) \end{aligned}$$
(24)

where \(\mathbb {P}^\mu ( s_{1:n} \in \mathscr{S}_2 \,\vert \, s_0 ) = \sum _{s_{1:n} \in \mathscr{S}_2} \mathscr{T}^{\mu }(s_{0:n})\). Since \(\mathscr{S}_2\) contains no terminal state by its construction, we have

$$\begin{aligned} \mathbb {P}^\mu ( s_{1:n} \in \mathscr{S}_2 \,\vert \, s_0 )&= \mathbb {P}^\mu ( s_{1:n} \not \in \smash {\overline{\mathscr{S}}}_1^{+} \,\vert \, s_0 ) \\&\le \mathbb {P}^\mu ( s_{1:n} \not \in \mathscr{S}_\perp \,\vert \, s_0 ) = \mathbb {P}^\mu ( T > n \,\vert \, s_0 ) \rightarrow 0 \text { as } n \rightarrow \infty \end{aligned}$$

where the convergence is due to the assumption that terminal index T is finite with probability 1, that is, \(\mathbb {P}^\mu ( T < \infty \,\vert \, s_0 ) = 1\). Therefore, we obtain \(P^{\hat{\pi }} (s_0) - P^\pi\,(s_0) \le 0\) from (24) by taking the limit \(n \rightarrow \infty\). As \(s_0 \in \mathscr{S}_2\) is arbitrary, we conclude \(P^{\hat{\pi }} \le P^\pi\) over \(\mathscr{S}_2\), hence over \(\smash {\overline{\mathscr{S}}}_1^{+} \cup \mathscr{S}_2 = {\mathscr{S}}^{+}\).

\(\square\)

1.1 C.1 Proof of Proposition 1

The proof can be done by modifying and extending that of [20, Theorem 3.2.1] to the constrained optimization over MDP \(\mathcal{M}\): given state \(s \in {\mathscr{S}}^{+}\) and \(\theta \in [0, 1)\),

$$\begin{aligned} \hbox {} \mathop {\textrm{maximize}}\limits _\pi \; V^\pi (s) \hbox { subject to } P^\pi (s) \le \theta \end{aligned}$$
(2)

Suppose a solution \(\pi ^\star\) to the problem (2) is not Pareto optimal w.r.t. performance in s. Then, by definition, there exists a policy \(\pi \ne \pi ^\star\) s.t. \(P^\pi (s) \le P^{\star }(s)\) and \(V^\pi (s) > V^{\star }(s)\). The former implies that \(P^\pi (s) \le P^{\star }(s) \le \theta\), meaning that \(\pi\) is feasible, i.e., satisfies the constraint in (2). Hence, the latter \(V^\pi (s) > V^{\star }(s)\) contradicts the fact that \(\pi ^\star\) is a maximizing solution to (2). Therefore, \(\pi ^\star\) is Pareto optimal w.r.t. performance in s.

1.2 C.2 Proof of Proposition 2

Consider the counter-MDP shown in Fig. 1, with \((p, \gamma , \theta ) = (0.7, 0.95, 0.85)\). Then, from the equations in Sect. 3 and Appendix B, all the values \(V^\pi (s)\) in states \(s^1\) and \(s^2\) under \(\pi _\textsf{L}\) and \(\pi _\textsf{R}\) can be evaluated as

$$\begin{aligned}&V^{\pi _\textsf{L}}(s^1) = -\frac{1 + \gamma q}{1 - \gamma ^2 pq} = -1.585 \cdots \quad \,\qquad V^{\pi _\textsf{R}}(s^1) = -\frac{1}{1 - \gamma p} = -2.985 \cdots \\&V^{\pi _\textsf{L}}(s^2) = -1 + \gamma p V^{\pi _\textsf{L}}(s^1) = -2.054 \cdots \quad V^{\pi _\textsf{R}}(s^2) = -1 + \gamma p V^{\pi _\textsf{R}}(s^1) = -2.985 \cdots , \end{aligned}$$

where \(q := 1 - p\) and the constant step reward \(r = -1\) is substituted. Hence, \(\pi _\textsf{L}\) is optimal in state \(s \in \{s^1, s^2\}\) if it satisfies the constraint \(P^{\pi _\textsf{L}}(s) \le \theta\) (\(=0.85\)). However, calculating the probabilistic reachability values as

$$\begin{aligned}&P^{\pi _\textsf{L}}(s^1) = \frac{p}{1 - pq} = 0.886 \cdots \quad \,\qquad P^{\pi _\textsf{R}}(s^1) = \frac{1}{1 + p} = 0.588 \cdots \\&P^{\pi _\textsf{L}}(s^2) = p P^{\pi _\textsf{L}}(s^1) = 0.620 \cdots \qquad \, P^{\pi _\textsf{R}}(s^2) = p P^{\pi _\textsf{R}}(s^1) = 0.411 \cdots , \end{aligned}$$

we can see that \(\pi _\textsf{L}\) satisfies the constraint \(P^{\pi _\textsf{L}}(s) \le \theta\) in \(s = s^2\), but not in state \(s = s^1\). On the other hand, \(\pi _\textsf{R}\) satisfies \(P^{\pi _\textsf{R}}(s) \le \theta\) for state \(s \in \{s^1, s^2\}\). Therefore, we can conclude that \(\pi _\textsf{L}\) is the optimal policy for the constrained MDP problem (2) for state \(s = s^2\) but not for state \(s = s^1\). Similarly, \(\pi _\textsf{R}\) is optimal in \(s = s^1\) but not for \(s = s^2\).

In summary, we have found a counterexample, where there exists a policy (e.g., \(\pi _\textsf{L}\) or \(\pi _\textsf{R}\)) that is optimal in one state (i.e., solving the constrained problem (2) for the state) but not in the other state. Therefore, the proof is completed.

1.3 C.3 Proof of Corollaries 1, 4

For notational simplicity, we write \(P_s^\pi\) for \(P^\pi (s)\), \(P^*_{s}\) for \(P^*(s)\), \(V_s^\pi\) for \(V^\pi (s)\) and \(V^*_{s}\) for \(V^*(s)\). Then, Corollary 1 can be easily proven by logical inference. For example, we obtain the following: for each safe state \(s \in \mathcal{S}^{*}\),

$$\begin{aligned} \underbrace{\forall \pi \big ( P_s^\pi \le P^*_{s} \, \implies \, V_s^\pi \le V^*_{s} \big )}_{\text {from Property P1}}&= \forall \pi \big ( P_s^\pi> P^*_{s} \text { or } V_s^\pi \le V^*_{s} \big ) \nonumber \\&= \forall \pi \big [ \lnot \big ( P_s^\pi \le P^*_{s} \text { and } V_s^\pi> V^*_{s} \big ) \big ]\nonumber \\&= \not \exists \pi \big ( P_s^\pi \le P^*_{s} \text { and } V_s^\pi > V^*_{s} \big ). \end{aligned}$$
(25)

Therefore, Property P1 is true for any policy \(\pi\) iff \({\pi ^*}\) is uniformly Pareto optimal w.r.t. performance over \(\mathcal{S}^{*}\). It can be similarly proven that Property P2 is true iff \({\pi ^*}\) is uniformly Pareto optimal w.r.t. safety over \({\mathcal{F}}^{*}\), which completes the proof of Corollary 1.

To prove Corollary 4, note that for each safe state \(s \in \mathcal{S}^{*}\),

$$\begin{aligned} \underbrace{\forall \pi \big ( V^*_{s} = V_s^\pi \, \implies \, P^*_{s} \le P_s^\pi \big )}_{\text{from Property P5}} =\forall \pi \big ( V^*_{s} \ne V_s^\pi \text { or } P^*_{s} \le P_s^\pi \big ). \end{aligned}$$
(26)

Hence, we have the following logical inference: for each \(s \in \mathcal{S}^{*}\),

$$\begin{aligned} \forall \pi&( P_s^\pi \le P^*_{s} \, \implies \, V_s^\pi \le V^*_{s} ) \text { and }\quad \qquad \quad \qquad \qquad \qquad \qquad \qquad \text{(from Property P1)} \\ \forall \pi&(V^*_{s} = V_s^\pi \, \implies \, P^*_{s} \le P_s^\pi ) \quad \qquad \,\, \quad \qquad \qquad \qquad \qquad \qquad \qquad \text{(from Property P5)} \\ {}&=\forall \pi \big [ ( P_s^\pi> P^*_{s} \text { or } V_s^\pi \le V^*_{s} ) \text { and } ( V^*_{s} \ne V_s^\pi \text { or } P^*_{s} \le P_s^\pi ) \big ] \qquad \,\,\, \text{(by (25) and (26))} \\ {}&=\forall \pi \big ( \lnot \big [ ( P_s^\pi \le P^*_{s} \text { and } V_s^\pi> V^*_{s} ) \text { or } ( V^*_{s} = V_s^\pi \text { and } P^*_{s}> P_s^\pi ) \big ] \big ) \nonumber \\ {}&=\forall \pi \big ( \lnot \big [ ( P_s^\pi = P^*_{s} \text { and } V_s^\pi> V^*_{s} ) \text { or } ( P_s^\pi < P^*_{s} \text { and } V_s^\pi> V^*_{s} ) \\& \quad \qquad \qquad \quad \quad \qquad \qquad \quad \quad \text {or } ( V^*_{s} = V_s^\pi \text { and } P^*_{s}> P_s^\pi ) \big ] \big ) \nonumber \\ {}&=\forall \pi \big ( \lnot \big [ ( P_s^\pi \le P^*_{s} \text { and } V_s^\pi> V^*_{s} ) \text { or } ( V^*_{s} \ge V_s^\pi \text { and } P^*_{s}> P_s^\pi ) \big ] \big ) \nonumber \\ {}&= \forall \pi \big [ \lnot ( P_s^\pi \le P^*_{s} \text { and } V_s^\pi> V^*_{s} ) \big ] \text { and } \forall \pi \big [ \lnot ( V^*_{s} \ge V_s^\pi \text { and } P^*_{s}> P_s^\pi ) \big ] \nonumber \\ {}&= \not \exists \pi ( P_s^\pi \le P^*_{s} \text { and } V_s^\pi> V^*_{s} ) \text { and } \not \exists \pi ( V^*_{s} \ge V_s^\pi \text { and } P^*_{s} > P_s^\pi ). \end{aligned}$$

Therefore, Properties P1 and P5 are true for any policy \(\pi\) iff \({\pi ^*}\) is uniformly Pareto optimal over \(\mathcal{S}^{*}\). In a similar manner, we can also prove that Properties P2 and P6 hold for any policy \(\pi\) iff \({\pi ^*}\) is uniformly Pareto optimal over \({\mathcal{F}}^{*}\). Combining these results, we conclude that Requirements 1 and 4 are true iff \({\pi ^*}\) is globally Pareto optimal (i.e., uniformly Pareto optimal over \(\mathscr{S}^{+} = \mathcal{S}^{*}\cup {\mathcal{F}}^{*}\)), which completes the proof. \(\square\)

1.4 C.4 Proof of Corollary 2

Let Requirement 2 be true and \(\mathrm {\Pi }\) be the set of all policies \(\pi\) s.t. \(P^*(s) \le P^\pi (s)\) \(\forall s \in \mathcal{S}^{*}\). Then, \(\pi \in \mathrm {\Pi }\) implies \(P^*(s) \le P^\pi (s)\) \(\forall s \in {\mathcal{F}}^{*}\) by Requirement 2, hence

$$\begin{aligned} {\pi ^*}(s) \in \mathop {\text {arg min}}\limits _{a \in \mathcal{A}_\mathrm {\Pi }(s)} \,\mathcal{P}^{*}(s, a) \qquad \forall s \in {\mathcal{F}}^{*}\end{aligned}$$

by Lemma 1b where \(\mathcal{A}_\mathrm {\Pi }(s) := \big \{ a \in \mathcal{A}(s) : \exists \pi \in \mathrm {\Pi }\text { s.t. } a = \pi (s) \big \}\). To show \(\mathcal{A}_\mathrm {\Pi }= \mathcal{A}\) over \({\mathcal{F}}^{*}\), given \({\bar{s}} {\bar{a}} \in {\mathcal{F}}^{*}\times \mathcal{A}({\bar{s}})\), construct policy \({\bar{\pi }}\) that is a one-point modification of \({\pi ^*}\) s.t. \({\bar{\pi }} = {\pi ^*}\) everywhere except for state \({\bar{s}}\) at which \({\bar{\pi }}({\bar{s}}) = {\bar{a}}\). Then,

$$\begin{aligned}&P^*({\bar{s}}) = \mathcal{P}^{*}({\bar{s}}, {\pi ^*}({\bar{s}})) = \min _{\!\!a \in \mathcal{A}_\mathrm {\Pi }({\bar{s}})\!\!} \mathcal{P}^{*}({\bar{s}}, a) \le \mathcal{P}^{*}({\bar{s}}, {\bar{\pi }}({\bar{s}})) \\&P^*(s) = \mathcal{P}^{*}(s, {\pi ^*}(s)) = \mathcal{P}^{*}(s, {\bar{\pi }}(s)) \text { for any other state } s \ne {\bar{s}}. \end{aligned}$$

Therefore, the application of Lemma 2b (with \(\mathscr{S}\,_1^{+} = \varnothing\)) shows that \(P^* \le P^{{\bar{\pi }}}\) and thereby, \({\bar{\pi }} \in \mathrm {\Pi }\). Since \({\bar{s}} \in {\mathcal{F}}^{*}\) and \({\bar{a}} \in \mathcal{A}({\bar{s}})\) are arbitrary, we conclude that \(\mathcal{A}_\mathrm {\Pi }(s) = \mathcal{A}(s)\) \(\forall s \in {\mathcal{F}}^{*}\), which proves one direction.

To prove the other direction, assume \(P^*(s) \le P^\pi (s)\) \(\forall s \in \mathcal{S}^{*}\) for policy \(\pi\) and

$$\begin{aligned} {\pi ^*}(s) \in {\mathop {\mathrm {arg\,min}}\limits _{a \in \mathcal{A}(s)}} \, \mathcal{P}^{*}(s, a) \quad \forall s \in {\mathcal{F}}^{*}. \end{aligned}$$

Then, since we have

$$\begin{aligned} P^*(s) = \mathcal{P}^{*}(s, {\pi ^*}(s)) = \min _{a \in \mathcal{A}(s)} \mathcal{P}^{*}(s, a) \le \mathcal{P}^{*}(s, \pi (s)) \qquad \forall s \in {\mathcal{F}}^{*}, \end{aligned}$$

the application of Lemma 2b with \(\mathscr{S}\,_1^+ = \mathcal{S}^{*}\) and \(\mathscr{S}\,_2^+ = {\mathcal{F}}^{*}\) concludes \(P^* \le P^\pi\), and thereby Requirement 2 is satisfied. \(\square\)

1.5 C.5 Proof of Corollary 3

For action space \(\mathscr{A}\subseteq \mathcal{A}\), consider a modified MDP \(\hat{\mathcal{M}}\) where we replace for each \(s \in \mathcal{S}^{*}\) the set of all available actions \(\mathcal{A}\,(s)\) by \(\mathscr{A}\,(s)\). Then, \({\pi ^*}\) is simply a policy on \(\hat{\mathcal{M}}\), and Requirement 3 can be restated as: Property P4 holds for any policy \(\pi\) on \(\hat{\mathcal{M}}\). Here, we say that \(\pi\) is a policy on \(\hat{\mathcal{M}}\) iff it is a policy satisfying \(\pi (s) \in \mathscr{A}\,(s)\) \(\forall s \in \mathcal{S}^{*}\).

(\(\Longrightarrow\)) Let Requirement 3 be true and \(\mathrm {\Pi }\) be the set of all policies \(\pi\) on \(\hat{\mathcal{M}}\) s.t. \(V^\pi (s) \le V^*(s)\) \(\forall s \in {\mathcal{F}}^{*}\). Then, clearly we have \({\pi ^*}\in \mathrm {\Pi }\) and thus \(\pi \in \mathrm {\Pi }\) implies \(V^\pi (s) \le V^*(s)\) \(\forall s \in \mathcal{S}^{*}\) by Requirement 3. Therefore, the application of Lemma 1a to the modified MDP \(\hat{\mathcal{M}}\) results in

$$\begin{aligned} {\pi ^*}(s) \in \mathop {\mathrm {arg\,max}}\limits _{a \in \mathscr{A}_\mathrm {\Pi }(s)} \,Q^*(s, a) \qquad \forall s \in \mathcal{S}^{*}, \end{aligned}$$

where \(\mathscr{A}_\mathrm {\Pi }(s) := \big \{ a \in \mathscr{A}\,(s) : \exists \pi \in \mathrm {\Pi }\text { s.t. } a = \pi (s) \big \}\). To show \(\mathscr{A}_\mathrm {\Pi }= \mathscr{A}\) over \(\mathcal{S}^{*}\), given \({\bar{s}} {\bar{a}} \in \mathcal{S}^{*}\times \mathscr{A}\,({\bar{s}})\), construct policy \({\bar{\pi }}\) that is a one-point modification of \({\pi ^*}\) s.t. \({\bar{\pi }} = {\pi ^*}\) everywhere except for state \({\bar{s}}\) at which \({\bar{\pi }}({\bar{s}}) = {\bar{a}}\). Then,

$$\begin{aligned}&V^*({\bar{s}}) = Q^*({\bar{s}}, {\pi ^*}({\bar{s}})) = \max _{\!\!a \in \mathscr{A}_\mathrm {\Pi }({\bar{s}})\!\!} Q^*({\bar{s}}, a) \ge Q^*({\bar{s}}, {\bar{\pi }}({\bar{s}})) \\&V^*(s) = Q^*(s, {\pi ^*}(s)) = Q^*(s, {\bar{\pi }}(s)) \text { for any other state } s \ne {\bar{s}}. \end{aligned}$$

Therefore, the application of Lemma 2a to the modified MDP \(\hat{\mathcal{M}}\), with \(\mathscr{S}\,_1^+ = \varnothing\), shows that \(V^{{\bar{\pi }}} \le V^*\) and thereby, \({\bar{\pi }} \in \mathrm {\Pi }\). Since \({\bar{s}} \in {\mathcal{S}}^{*}\) and \({\bar{a}} \in \mathscr{A}\,({\bar{s}})\) are arbitrary, we conclude that \(\mathscr{A}_\mathrm {\Pi }(s) = \mathscr{A}\,(s)\) \(\forall s \in \mathcal{S}^{*}\), which proves one direction.

(\(\Longleftarrow\)) Suppose \(V^\pi (s) \le V^*(s)\) \(\forall s \in {\mathcal{F}}^{*}\) for policy \(\pi\) on \(\hat{\mathcal{M}}\) and

$$\begin{aligned} {\pi ^*}(s) \in {\mathop {\mathrm {arg\,max}}\limits _{a \in \mathscr{A}\,(s)}} \;Q^*(s, a) \quad \forall s \in \mathcal{S}^{*}. \end{aligned}$$

Then, since we have

$$\begin{aligned} V^*(s) = Q^*(s, {\pi ^*}(s)) = \max _{a \in \mathscr{A}\,(s)} Q^*(s, a) \ge Q^*(s, \pi (s)) \qquad \forall s \in \mathcal{S}^{*}, \end{aligned}$$

the application of Lemma 2a to the modified MDP \(\hat{\mathcal{M}}\), with \(\mathscr{S}\,_1^{+} = {\mathcal{F}}^{*}\) and \(\mathscr{S}\,_2^{+} = \mathcal{S}^{*}\), concludes \(V^\pi \le V^*\), and thereby Requirement 3 is satisfied. \(\square\)

1.6 C.6 Proof of Proposition 3

Given state-action \(sa \in {\mathscr{S}}^{+} \times \mathcal{A}(s)\), let \(\pi ^{sa}\) be a one-point modification of \({\pi ^*}\) s.t. \(\pi ^{sa} = {\pi ^*}\) everywhere except for state s at which \(\pi ^{sa}(s) = a\). Then, since \({\pi ^*}\) satisfies Requirement 1,

$$\begin{aligned}&P^{sa}(s) \le P^*(s) \, \implies \, V^{sa}(s) \le V^*(s) \qquad \forall sa \in \mathcal{S}^{*}\times \mathcal{A}(s) \end{aligned}$$
(27)
$$\begin{aligned}&V^*(s) \le V^{sa}(s) \, \implies \, P^*(s) \le P^{sa}(s) \qquad \forall sa \in {\mathcal{F}}^{*}\times \mathcal{A}(s), \end{aligned}$$
(28)

where \(P^{sa}(s) := P^{\pi ^{sa}}(s)\) and \(V^{sa}(s) := V^{\pi ^{sa}}(s)\).

  1.

    Choose arbitrary sa in \(\mathcal{S}^{*}\times \mathcal{A}_\text {p}^*(s)\). Then, by the definition of \(\pi ^{sa}\)

    $$\begin{aligned}&P^*(s) \ge \mathcal{P}^{*}(s, \pi ^{sa}(s)) \quad \qquad \qquad \quad\;\; \text{(by the definition of ${\mathcal{A}^{*}_\mathrm{p}}$)} \\&P^*({\hat{s}}) = \mathcal{P}^{*}({\hat{s}}, {\pi ^*}({\hat{s}})) = \mathcal{P}^{*}({\hat{s}}, \pi ^{sa}({\hat{s}})) \quad \text { for any other state } {\hat{s}} \ne s \end{aligned}$$

    Hence, \(P^{sa}(s) \le P^*(s)\) by Lemma 2b and thus \(V^{sa}(s) \le V^*(s)\) by (27). That is, since \(a \in \mathcal{A}_\text {p}^*(s)\) is arbitrary, \({\pi ^*}\) is optimal in state s, over \(\mathrm {\Pi }^s_\text {p} := \{ \pi ^{sa} : a \in \mathcal{A}_\text {p}^*(s)\}\). Therefore, the application of Lemma 1a with \(\mathrm {\Pi }= \mathrm {\Pi }^s_\text {p}\) results in

    $$\begin{aligned} {\pi ^*}(s) \in \mathop {\mathrm {arg\,max}}\limits _{a \in \mathcal{A}_\text {p}^*(s)} Q^*(s, a) \end{aligned}$$
  2.

    Choose arbitrary sa in \({\mathcal{F}}^{*}\times \mathcal{A}_\text {v}^*(s)\). Then, by the definition of \(\pi ^{sa}\),

    $$\begin{aligned}&V^*(s) \le Q^*(s, \pi ^{sa}(s)) \quad \qquad \qquad \quad\;\;\, \text{(by the definition of ${\mathcal{A}^{*}_\mathrm{v}}$)} \\&V^*({\hat{s}}) = Q^*({\hat{s}}, {\pi ^*}({\hat{s}})) = Q^*({\hat{s}}, \pi ^{sa}({\hat{s}})) \quad \text { for any other state } {\hat{s}} \ne s. \end{aligned}$$

    Hence, \(V^*(s) \le V^{sa}(s)\) by Lemma 2a and thus \(P^*(s) \le P^{sa}(s)\) by (28). That is, \({\pi ^*}\) is least unsafe in state s, over \(\mathrm {\Pi }_\text {v}^s := \{ \pi ^{sa} : a \in \mathcal{A}_\text {v}^*(s)\}\). Therefore,

    $$\begin{aligned} {\pi ^*}(s) \in \mathop {\mathrm {arg\,min}}\limits _{a \in \mathcal{A}_\text {v}^*(s)} \mathcal{P}^{*}(s, a) \end{aligned}$$

    by Lemma 1b with \(\mathrm {\Pi }= \mathrm {\Pi }_\text {v}^s\).

As state s is arbitrary in \(\mathcal{S}^{*}\) and \({\mathcal{F}}^{*}\) for the respective cases above, the proof is completed. \(\square\)

C.7 Proof of Propositions 4, 5

We can directly prove Proposition 5, by noting that for any policy \(\pi\),

$$\begin{aligned} s \in \mathcal{S}^{\pi} \Longleftrightarrow P^\pi (s) \le \theta \Longleftrightarrow \mathcal{P}^{\pi} (s, \pi (s)) \le \theta \end{aligned}$$
(29)

and \(\mathcal{P}^{\pi}(s, \pi (s)) \le \theta \Longleftrightarrow \pi (s) \in \mathcal{A}_\text {c}^\pi (s) \!\implies \! \mathcal{A}_\text {c}^\pi (s) \ne \varnothing\). With \(\pi = {\pi ^*}\), this also proves one direction of Proposition 4: \(s \in \mathcal{S}^{*}\implies \mathcal{A}_\text {c}^*(s) \ne \varnothing\).

To prove the opposite direction of Proposition 4, note that by a contraposition of (29) for \(\pi = {\pi ^*}\) and then by \({\pi ^*}= \mathfrak {T}{\pi ^*}\),

$$\begin{aligned} s \in {\mathcal{F}}^{*}\Longleftrightarrow \theta< \mathcal{P}^{*}(s, {\pi ^*}(s)) \Longleftrightarrow \theta < \mathcal{P}^{*}(s, \mathfrak {T}{\pi ^*}(s)). \end{aligned}$$

Hence, we have \(\mathfrak {T}{\pi ^*}(s) \not \in \mathcal{A}_\text {c}^*(s)\) if (and only if) \(s \in {\mathcal{F}}^{*}\).

Now, suppose that \(s \in {\mathcal{F}}^{*}\) but \(\mathcal{A}_\text {c}^*(s) \ne \varnothing\). Then, by the definition of \(\mathfrak {T}\), we must have \(\mathfrak {T}{\pi ^*}(s) \in \mathcal{A}_\text {c}^*(s)\), which contradicts \(\mathfrak {T}{\pi ^*}(s) \not \in \mathcal{A}_\text {c}^*(s)\). Therefore, we must have \(s \in {\mathcal{F}}^{*}\) \(\implies\) \(\mathcal{A}_\text {c}^*(s) = \varnothing\), whose contrapositive is \(\mathcal{A}_\text {c}^*(s) \ne \varnothing\) \(\implies\) \(s \in \mathcal{S}^{*}\). \(\square\)

C.8 Proof of Proposition 6

For any state \(s \in {\mathscr{S}}^{+}\), it is obvious that (i) \({\pi ^*}(s) \in \mathcal{A}_\text {v}^*(s)\) by \(V^*(s) = Q^*(s, {\pi ^*}(s))\) and since \(\mathcal{A}_\text {v}^*(s) \subseteq \mathcal{A}(s)\), (ii) \(\min _{a \in \mathcal{A}_\text {v}^*(s)}\mathcal{P}^{*}(s, a) \ge \min _{a \in \mathcal{A}(s)}\, \mathcal{P}^{*}(s, a)\). Hence,

$$\begin{aligned} \forall s \in {\mathscr{S}}^{+}\!: \, {\pi ^*}(s) \in \mathop {\mathrm {arg\,min}}\limits _{a \in \mathcal{A}(s)}\mathcal{P}^{*}(s, a) \implies {\pi ^*}(s) \in \mathop {\mathrm {arg\,min}}\limits _{a \in \mathcal{A}_\text {v}^*(s)} \mathcal{P}^{*}(s, a) \end{aligned}$$

Similarly, we have (i) \({\pi ^*}(s) \in \mathcal{A}_\text {p}^*(s)\) by \(P^*(s) = \mathcal{P}^{*}(s, {\pi ^*}(s))\) and for \(s \in \mathcal{S}^{*}\), since \(\mathcal{A}_\text {p}^*(s) \subseteq \mathcal{A}_\text {c}^*(s)\) by (9), (ii) \(\max _{a \in \mathcal{A}_\text {p}^*(s)}Q^*(s, a) \le \max _{a \in \mathcal{A}_\text {c}^*(s)} Q^*(s, a)\). Hence,

$$\begin{aligned} \forall s \in \mathcal{S}^{*}\!: \, {\pi ^*}(s) \in \mathop {\mathrm {arg\,max}}\limits _{a \in \mathcal{A}_\text {c}^*(s)}Q^*(s, a) \implies {\pi ^*}(s) \in \mathop {\mathrm {arg\,max}}\limits _{a \in \mathcal{A}_\text {p}^*(s)}Q^*(s, a) \end{aligned}$$

Therefore, we can conclude that if \({\pi ^*}\) satisfies (8), then it also satisfies (7). Moreover, Requirement 2 is obviously true under (8) by Corollary 2.

Conversely, if \({\pi ^*}\) satisfies Requirement 2, it is true by Corollary 2 again that

$$\begin{aligned} {\pi ^*}(s) \in \mathop {\mathrm {arg\,min}}\limits _{a \in \mathcal{A}(s)} \mathcal{P}^{*}(s, a) \qquad \forall s \in {\mathcal{F}}^{*}. \end{aligned}$$

Moreover, it is obvious that \({\pi ^*}(s) \in \mathop {\mathrm {arg\,max}}\limits _{a \in \mathcal{A}_\text {c}^*(s)}Q^*(s, a)\) \(\forall s \in \mathcal{S}^{*}\) if \({\pi ^*}\) satisfies (7) and (10). Therefore, \({\pi ^*}\) satisfies (8) if it satisfies (7), (10) and Requirement 2. \(\square\)

C.9 Proof of Theorem 1

Choose arbitrary \(i \in \mathbb {N}\) and note that \(\pi ^i = \mathcal{T}\pi ^{i-1}\). Let \(\mathcal{A}_\text {p}^i\) denote \(\mathcal{A}_\text {p}^{\pi ^{i}}\), \(\mathfrak {A}_\text {p}^i\) denote \(\mathfrak {A}_\text {p}^{\pi ^{i}}\) and \(\mathfrak {A}^i\) denote \(\mathfrak {A}^{\pi ^{i}}\). Then, by the definitions of \(\mathcal{T}\) and \(\mathcal{A}_\text {p}^{i-1}\), we have

$$\begin{aligned}&Q^{i-1}(s, \pi ^i(s)) = \max _{\!\!\!a \in \mathcal{A}_\text {p}^{i-1}(s)\!\!\!} Q^{i-1}(s, a) \ge Q^{i-1}(s, \pi ^{i-1}(s)) = V^{i-1}(s) \quad \forall s \in \mathcal{S}^{i-1} \end{aligned}$$
(30)

where for the inequality, we have used \(\pi ^{i-1}(s) \in \mathcal{A}_\text {p}^{i-1}(s)\), which is always true by \(\mathcal{P}^{i-1}(s, \pi ^{i-1}(s)) = P^{i-1}(s)\). Moreover,

$$\begin{aligned}&\mathcal{P}^{i-1}(s, \pi ^i(s)) \le P^{i-1}(s) \qquad \forall s \in \mathcal{S}^{i-1} \end{aligned}$$
(31)
$$\begin{aligned}&\mathcal{P}^{i-1}(s, \pi ^i(s)) = \min _{a \in \mathcal{A}(s)} \mathcal{P}^{i-1}(s, a) \le \mathcal{P}^{i-1}(s, \pi ^{i-1}(s)) = P^{i-1}(s) \quad \forall s \in \mathcal{F}^{i-1} \end{aligned}$$
(32)

by the definition of \(\mathcal{T}\). Therefore, the application of Lemma 2b to (31) and (32) yields \(P^i \le P^{i-1}\), which directly proves \(\mathcal{S}^{i} \supseteq \mathcal{S}^{i-1}\) and \(\mathcal{F}^{i} \subseteq \mathcal{F}^{i-1}\) as well. Moreover, applying Lemma 2a to (30), with \(\mathscr{S}\,_1^{+} = \mathcal{F}^{i-1}\) and \(\mathscr{S}\,_2^+ = \mathcal{S}^{i-1}\), we conclude:

$$\begin{aligned} V^{i-1}(s) \le V^i(s) \;\; \forall s \in \mathcal{F}^{i-1} \,\,\implies \,\, V^{i-1} \le V^i \end{aligned}$$

which completes the proof of monotonicity.

Since both \({\mathscr{S}}^{+}\) and \({\mathcal{A}}^+\) are finite, there exist only finitely many policies. Hence, for each \(s \in {\mathscr{S}}^{+}\), the monotonicity \(0 \le P^i(s) \le P^{i-1}(s)\) implies that the sequence \((P^{i-1}(s) )_{i\in \mathbb {N}}\) converges within a finite number of iterations, say \(i^*_s\). Letting \(i^* := \max _{s \in {\mathscr{S}}^{+}} i^*_s\), then given \(sa \in {\mathscr{S}}^{+} \times \mathcal{A}(s)\), we have by convergence that

$$\begin{aligned} P^{i}(s) = P^{j}(s) \text { and thus } \mathcal{P}^{i}(s, a) = \mathcal{P}^{j}(s, a) \qquad \forall i,j \ge i^* \end{aligned}$$
(33)

However, it is possible that \(\pi ^i \ne \pi ^j\) for some \(i, j \ge i^*\), \(i \ne j\).

Now, take arbitrary \(i > i^*\) and note that by the definition of \(\mathcal{T}\),

$$\begin{aligned} \pi ^i(s)&\in \mathfrak {A}^{i-1}(s) \quad \forall s \in \mathcal{F}^{i-1} \\ \pi ^{i+1}(s)&\in \displaystyle \mathop {\mathrm {arg\,max}}\limits _{a \in \mathfrak {A}^i(s)} Q^i(s, a) \quad \forall s \in \mathcal{F}^{i} \end{aligned}$$

where we have \(\mathfrak {A}^{i-1}(s) = \mathop {\mathrm {arg\,min}}\nolimits _{a \in \mathcal{A}(s)} \mathcal{P}^{i-1}(s, a) = \mathop {\mathrm {arg\,min}}\nolimits _{a \in \mathcal{A}(s)} \mathcal{P}^{i}(s, a) = \mathfrak {A}^i(s)\) and \(\mathcal{F}^{i-1} = \mathcal{F}^{i}\) by definitions and (33). Therefore, \(\pi ^i(s) \in \mathfrak {A}^i(s)\) for all \(s \in \mathcal{F}^{i}\) and following a process similar to (30) results in

$$\begin{aligned} Q^i(s, \pi ^{i+1}(s)) = \max _{a \in \mathfrak {A}^i(s)} Q^i(s, a) \ge Q^i(s, \pi ^i(s)) = V^i(s) \qquad \forall s \in \mathcal{F}^{i}. \end{aligned}$$
(34)

The application of Lemma 2a to (30) and (34) then yields \(V^i \le V^{i+1}\), the monotonicity of V. Since the monotonicity holds \(\forall i > i^*\) and there are only finitely many policies, the sequence \((V^{j - 1}(s) )_{j\in \mathbb {N}}\) for each \(s \in {\mathscr{S}}^{+}\) converges within a finite number of iterations, say \(j^*_s > i^*\). Letting \(j^*\) denote \(\max _{s \in {\mathscr{S}}^{+}} j^*_s\), then given \(sa \in {\mathscr{S}}^{+} \times \mathcal{A}(s)\), we have

$$\begin{aligned} \forall j \ge j^*: {\left\{ \begin{array}{ll} V^{j}(s) = V^{j+1}(s) \quad \text { (by convergence)} \\ P^{j}(s) = P^{j+1}(s)\quad \text{ (by } (33) \text{ and } j^* > i^*), \end{array}\right. } \end{aligned}$$

meaning that \(\pi ^{j} \simeq \pi ^{j+1}\) and thus \(\pi ^{j} = \mathcal{T}\pi ^{j}\) for all \(j \ge j^*\), by the definition of \(\mathcal{T}\). That is, the sequence \((\pi ^{i})_{i\in \mathbb {N}}\) converges to a fixed point of \(\mathcal{T}\) within a finite number of iterations \(j^*\), and the proof is completed. \(\square\)

C.10 Proof of Proposition 7

The proof is divided into the following two parallel parts.

  1.

    Consider Property P5′. Suppose that \({\pi ^*}\) satisfies

    $$\begin{aligned} &V^*(s)= V^\pi (s) \qquad \forall s \in \overline{ \mathcal{S}}^*_{\mskip-2.0\thinmuskip\pi } \end{aligned}$$
    (35)
    $$\begin{aligned} &P^*(s)\le P^\pi (s) \qquad \forall s \in {\mathcal{F}}^{*} \end{aligned}$$
    (36)

    where (35) implies \(Q^*(s, \pi ^*(s)) = Q^*(s, \pi (s))\) \(\forall s \in \mathcal{S}^{*}\), since by (35) and the Bellman equation

    $$\begin{aligned} {\underbrace{Q^*(s, {\pi ^*}(s))}_{V^*(s)} = \underbrace{Q^\pi (s, \pi (s))}_{V^\pi (s)}} &=\mathbb {E}^\pi \big [ r_0 + V^\pi (s_1) \,|\,s_0 = s \big ] \\ &=\mathbb {E}^\pi \big [ r_0 + V^*(s_1) \,|\,s_0 = s \big ] = Q^*(s, \pi (s)) \quad \forall s \in \mathcal{S}^{*} \setminus \mathscr{S}_{\!\perp} \end{aligned}$$

    and for all \(s \in \mathcal{S}^{*} \cap \mathscr{S}_{\!\perp}\), \(Q^*(s, \pi^*(s)) = Q^\pi(s, \pi(s)) = \mathcal{R}_{\!\perp}(s, \pi(s)) = Q^*(s, \pi(s))\). Thus, by the fixed point property \({\pi ^*}= \mathcal{T}{\pi ^*}\) and the definition of \(\mathcal{T}\), we obtain

    $$\begin{aligned} \max _{a \in \mathcal{A}_\text {p}^*(s)} Q^*(s, a) = Q^*(s, {\pi ^*}(s)) = Q^*(s, \pi (s)) \end{aligned}$$

    i.e., \(\pi (s) \in \mathfrak {A}_\text {p}^*(s)\) (\(= \mathop {\mathrm {arg\,max}}_{a \in \mathcal{A}_\text {p}^*(s)} Q^*(s, a)\)), and therefore

    $$\begin{aligned} P^*(s) = \mathcal{P}^{*}(s, {\pi ^*}(s))&= \mathcal{P}^{*}(s, \mathcal{T}{\pi ^*}(s)) \nonumber \\&= \min _{\!\!a \in \mathfrak {A}_\text {p}^*(s)\!\!} \mathcal{P}^{*}(s, a) \le \mathcal{P}^{*}(s, \pi (s)) \quad \forall s \in \mathcal{S}^{*}. \end{aligned}$$
    (37)

    Finally, the application of Lemma 2b to (36) and (37), with \({\mathscr{S}}\,_1^{+} = {\mathcal{F}}^{*}\) and \({\mathscr{S}}\,_2^{+} = \mathcal{S}^{*}\), results in \(P^* \le P^\pi\), implying Property P5′.

  2.

    For Property P6′, assume that \({\pi ^*}\) satisfies

    $$\begin{aligned}&P^\pi (s) = P^*(s) \quad \forall s \in \overline{ \mathcal{F}}^{*}_{\!\!\!\;\pi } \end{aligned}$$
    (38)
    $$\begin{aligned} &V^\pi (s) \le V^*(s) \quad \forall s \in \mathcal{S}^{*}, \end{aligned}$$
    (39)

    where (38) implies \(\mathcal{P}^{*}(s, \pi ^*(s)) = \mathcal{P}^{*} (s, \pi (s))\) \(\forall s \in {\mathcal{F}}^{*}\) since it is trivially true for all terminal \(s \in \mathscr{S}_\perp\) and we have that for all \(s \in \overline{ \mathcal{F}}^{*}_{\!\!\!\;\pi } \setminus \mathscr{S}_\perp\),

    $$\begin{aligned} {\underbrace{\mathcal{P}^{*}(s, {\pi ^*}(s))}_{P^*(s)} = \underbrace{\mathcal{P}^{\pi} (s, \pi (s))}_{P^\pi (s)}} &=\mathbb {E}^\pi \big [ P^\pi (s_1) \,|\,s_0 = s \big ] \\ &=\mathbb {E}^\pi \big [ P^*(s_1) \,|\,s_0 = s \big ] = \mathcal{P}^{*}(s, \pi (s)). \end{aligned}$$

    Hence, by the fixed point property and the definition of \(\mathcal{T}\), we obtain

    $$\begin{aligned} \min _{a \in \mathcal{A}(s)} \mathcal{P}^{*}(s, a) = \mathcal{P}^{*}(s, {\pi ^*}(s)) = \mathcal{P}^{*}(s, \pi (s)), \end{aligned}$$

    i.e., \(\pi (s) \in \mathfrak {A}^*(s)\) (\(= \mathop {\mathrm {arg\,min}}_{a \in \mathcal{A}(s)} \mathcal{P}^{*}(s, a)\)), and therefore

    $$\begin{aligned} \!V^*(s) = Q^*(s, {\pi ^*}(s))&= Q^*(s, \mathcal{T}{\pi ^*}(s)) \nonumber \\&= \max _{\!\!\!a \in \mathfrak {A}^*(s)\!\!\!} Q^*(s, a) \ge Q^*(s, \pi (s)) \quad \, \forall s \in {\mathcal{F}}^{*}. \end{aligned}$$
    (40)

    Finally, the application of Lemma 2a to (39) and (40), with \({\mathscr{S}}^{+}_1 = \mathcal{S}^{*}\) and \({\mathscr{S}}^{+}_2 = {\mathcal{F}}^{*}\), results in \(V^\pi \le V^*\), implying Property P6′. \(\square\)

C.11 Proof of Proposition 8

Let \(P^\pi (s \,;\, 0) := {\textbf {1}}(s \in \mathcal{F}_\perp )\) and \(\pi\) be a fixed policy. We will prove that for all \(s \in {\mathscr{S}}^{+}\) and for any horizon \(n \in \mathbb {N}\),

$$\begin{aligned} P^\pi (s \,;\, n - 1) \le P^\pi (s \,;\, n) \end{aligned}$$
(41)

which obviously implies \(\mathcal{P}^{\pi} (s_0, a_0 \,;\, n) \le \mathcal{P}^{\pi} (s_0, a_0 \,;\, n+1)\) for all initial state-action \(s_0a_0 \in {\mathscr{S}}^{+} \times \mathcal{A}(s_0)\) by the Bellman equation: for \(m = n, n+1\),

$$\begin{aligned} \mathcal{P}^{\pi} (s_0, a_0 \,;\, m) = \mathbb {E}\big [ P^\pi (s_1 \,;\, m-1 ) \,\vert\, s_0a_0 \big ] \qquad \forall s_0a_0 \in \mathscr{S}\times \mathcal{A}(s_0) \end{aligned}$$

and \(\mathcal{P}^{\pi} (s_0, a_0 \,;\, m) = {\textbf {1}}(s_0 \in \mathcal{F}_\perp )\) \(\forall s_0a_0 \in \mathscr{S}_\perp \times \mathcal{A}(s_0)\). Also note that for any \(s \in \mathscr{S}_\perp\) and any \(n \in \mathbb {N}\), \(P^\pi (s \,;\, n - 1) = P^\pi (s \,;\, n) = {\textbf {1}}(s \in \mathcal{F}_\perp )\), so given \(n \in \mathbb {N}\), we prove (41) only for non-terminal states \(s \in \mathscr{S}\), without loss of generality.

For \(n = 1\), (41) is obviously true for all \(s \in \mathscr{S}\) since

$$\begin{aligned} P^\pi (s_0 \,;\, 1) = \mathbb {P}^{\pi} (s_1 \in \mathcal{F}_\perp ) \ge 0 = \underbrace{{\textbf {1}}(s_0 \in \mathcal{F}_\perp )}_{P^\pi(s_0\,;\,0)} \qquad \forall s_0 \in \mathscr{S}\end{aligned}$$

where the last equality comes from \(\mathscr{S}\cap \mathcal{F}_\perp = \varnothing\), thus \(s_0 \in \mathscr{S}\) \(\implies\) \(s_0 \not \in \mathcal{F}_\perp\).

Given \(n \in \mathbb {N}\), suppose that (41) is true \(\forall s \in {\mathscr{S}}^{+}\). Then, by the Bellman equation,

$$\begin{aligned} P^\pi (s_0 \,;\,n + 1) = \mathbb {E}^\pi \big [ P^\pi (s_1 \,;\,n) \,\vert \, s_0 \big ] \ge \mathbb {E}^\pi \big [ P^\pi (s_1 \,;\,n - 1) \,\vert \, s_0 \big ] = P^\pi (s_0 \,;\,n) \quad \forall s_0 \in \mathscr{S}\end{aligned}$$

Therefore, (41) is true for all \(n \in \mathbb {N}\) and all \(s \in \mathscr{S}\) by mathematical induction, and thereby, the proof is completed. \(\square\)
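For intuition, the backward recursion above can be checked numerically on any small tabular MDP. The following Python sketch is illustrative only and is not taken from the paper's implementation; the transition tensor `T`, the policy array and the boolean masks `forbidden` and `terminal` are assumed data structures.

```python
import numpy as np

def reach_probabilities(T, policy, forbidden, terminal, n_max):
    """Finite-horizon probabilities P^pi(s; n) of eventually entering a
    forbidden terminal state, computed by the backward recursion used in
    the proof of Proposition 8.

    T[s, a, s'] -- transition probabilities of the MDP
    policy[s]   -- deterministic action taken in state s
    forbidden   -- boolean mask over states (the set F_perp)
    terminal    -- boolean mask over states (the set S_perp), F_perp within S_perp
    """
    S = T.shape[0]
    P = forbidden.astype(float)               # P^pi(s; 0) = 1(s in F_perp)
    history = [P.copy()]
    for _ in range(n_max):
        P_next = np.empty(S)
        for s in range(S):
            if terminal[s]:
                P_next[s] = float(forbidden[s])   # terminal states keep their value
            else:
                P_next[s] = T[s, policy[s]] @ P   # Bellman backup E[P^pi(s_1; n-1) | s]
        history.append(P_next)
        P = P_next
    return np.stack(history)                  # history[n, s] = P^pi(s; n)
```

For every state \(s\), the column `history[:, s]` is then non-decreasing in the horizon \(n\), which is exactly (41).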

D Algorithms

In this appendix, we provide pseudocode for the (asynchronous) value iterations presented in the body of the work. Since the algorithms differ only in the policy update, we present each value iteration in a unified manner, using the value iteration template shown in Algorithm 2 and describing the subroutine Greedy for each case. The main loop of Algorithm 2 (lines 4–9) updates, for each state (line 5), the values (lines 6–8) and the policy (line 9). The template also extends easily to policy iteration: policy evaluation can be implemented by re-initializing and then repeatedly updating the values via lines 5–8, without executing the policy update in line 9, until convergence; policy improvement is then obtained by removing the state-wise policy update in line 9 and instead inserting the state-uniform update (lines 2–3) at the level of the outermost loop. In this manner, we implicitly provide policy iteration pseudocode in addition to the value iterations.
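To make the structure concrete, the following Python sketch mirrors the template just described, with the Greedy policy update supplied as a pluggable subroutine. It is an illustration under assumed tabular data structures (a transition tensor, a reward table and boolean state masks); terminal rewards are assumed to be zero, the state values are recomputed once per sweep rather than fully asynchronously, and `naive_greedy` is only a rough stand-in for the \(\mathfrak{T}\)-style update of Subroutine 1. The authoritative versions are the pseudocode of Algorithm 2 and Subroutines 1–3.

```python
import numpy as np

def value_iteration(T, R, forbidden, terminal, theta, greedy, n_sweeps=1000, gamma=1.0):
    """Sketch of the value iteration template (cf. Algorithm 2).

    T[s, a, s'] -- transition probabilities; R[s, a] -- rewards;
    forbidden, terminal -- boolean state masks (F_perp within S_perp);
    greedy(Q_s, P_s, theta) -- pluggable policy-update subroutine.
    """
    S, A, _ = T.shape
    Q = np.zeros((S, A))          # reward action-values Q(s, a)
    P = np.zeros((S, A))          # constraint action-values P(s, a)
    pi = np.zeros(S, dtype=int)   # deterministic policy

    for _ in range(n_sweeps):
        # state values induced by the current policy, with terminal overrides
        V_s = np.where(terminal, 0.0, Q[np.arange(S), pi])
        P_s = np.where(terminal, forbidden.astype(float), P[np.arange(S), pi])
        for s in range(S):                         # line 5: for each state
            if terminal[s]:
                continue
            for a in range(A):                     # lines 6-8: value updates
                Q[s, a] = R[s, a] + gamma * (T[s, a] @ V_s)
                P[s, a] = T[s, a] @ P_s
            pi[s] = greedy(Q[s], P[s], theta)      # line 9: policy update
    return pi, Q, P


def naive_greedy(Q_s, P_s, theta):
    """Illustrative constrained update: maximise Q over actions whose constraint
    value stays below theta; fall back to the least-unsafe action otherwise."""
    feasible = np.flatnonzero(P_s <= theta)        # A_c(s)
    if feasible.size > 0:
        return int(feasible[np.argmax(Q_s[feasible])])
    return int(np.argmin(P_s))
```

Replacing `naive_greedy` with a different subroutine changes the algorithm without touching the main loop, which is the point of the template.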

The Greedy subroutines for each value iteration can be summarized as follows.

  1.

    Subroutine 1: \(\mathfrak {T}\)-value iteration, also called naive VI (Sect. 3)

  2.

    Subroutine 2: \(\mathcal{T}\)-value iteration, also called stable operator VI (Sect. 7)

  3.

    Subroutine 3: value iteration with adaptive hysteresis (Sect. 6)

The corresponding policy iterations are presented in Sects. 5 and 6.


E Cliff-Mountain-Car experiment

In this appendix, we provide details of the experiment with Adaptive Hysteresis Q-learning with function approximation, applied to the Cliff-Mountain-Car environment. The experimental results are shown and discussed in Sect. 8.

E.1 Cliff-Mountain-Car environment

The state space is \({\mathscr{S}}^{+} = \mathcal{X}\times \mathcal{V}\subset \mathbb {R}^2\), with the sets of positions \(\mathcal{X}= [-1.2, 0.5]\) and velocities \(\mathcal{V}= [-0.07, 0.07]\) of the car; the action space is \({\mathcal{A}}^{+} = \mathcal{A}(\cdot ) = \{-1, 0, 1\}\). An episode starts from the initial state \((-0.5, 0.0)\) and ends when the car reaches the right-most point, i.e., a state \((0.5, v)\) with any velocity \(v \in \mathcal{V}\), so that \(\mathscr{S}_\perp = \{ (0.5, v)\}_{v \in \mathcal{V}}\). Transitions from a state \(s = (x, v) \in \mathscr{S}\) to the next state \(s' = (x', v') \in {\mathscr{S}}^{+}\) are made according to the dynamics:

$$\begin{aligned} \begin{bmatrix} x' \\ v' \end{bmatrix} = \textrm{bound}\left( \begin{bmatrix} x + v \\ v + 0.001 a - 0.0025 \cos (3x) + w \end{bmatrix} \right) \end{aligned}$$
(42)

when action \(a \in \mathcal{A}(s) = \{-1, 0, 1\}\) is taken, where \(w\) is a one-dimensional zero-mean Gaussian random variable with standard deviation \(\sigma = 0.0005\), and the \(\textrm{bound}(\cdot )\) operation enforces \(x' \in \mathcal{X}\) and \(v' \in \mathcal{V}\), i.e., \(s' \in {\mathscr{S}}^{+}\). A reward of \(-1\) is given at every step until the agent reaches a terminal state in \(\mathscr{S}_\perp\).

Note that (42) reduces to the conventional Mountain-Car dynamics [24, Chap. 10] if \(w = 0\). However, we suppose that there is a hypothetical cliff beyond the right-most point \(x = 0.5\), over which the agent falls if it arrives with velocity \(v > 0.035\). I.e., the forbidden states are given by \(\mathcal{F}_\perp = \{ (0.5, v) \in \mathscr{S}_\perp \,|\,0.035 < v \}\). The remaining terminal states at \(x = 0.5\) are goal states that the agent tries to reach.
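For concreteness, the environment might be implemented along the following lines; the class and method names are illustrative choices rather than the released code, and the \(\textrm{bound}(\cdot)\) operation is realized by simple clipping.

```python
import numpy as np

class CliffMountainCar:
    """Sketch of the Cliff-Mountain-Car dynamics (42) with Gaussian velocity noise."""
    X_MIN, X_MAX = -1.2, 0.5
    V_MIN, V_MAX = -0.07, 0.07
    SIGMA = 0.0005          # std of the zero-mean noise w
    V_CLIFF = 0.035         # arriving at x = 0.5 faster than this is forbidden

    def reset(self):
        self.x, self.v = -0.5, 0.0
        return self.x, self.v

    def step(self, a):      # a in {-1, 0, 1}
        w = np.random.normal(0.0, self.SIGMA)
        x_new = np.clip(self.x + self.v, self.X_MIN, self.X_MAX)
        v_new = np.clip(self.v + 0.001 * a - 0.0025 * np.cos(3 * self.x) + w,
                        self.V_MIN, self.V_MAX)
        self.x, self.v = x_new, v_new
        terminal = self.x >= self.X_MAX              # any state (0.5, v) ends the episode
        forbidden = terminal and self.v > self.V_CLIFF
        return (self.x, self.v), -1.0, terminal, forbidden
```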

E.2 Adaptive hysteresis Q-learning with function approximation

To apply it to Cliff-Mountain-Car, we extended Adaptive Hysteresis Q-learning (Algorithm 1) with the \(\mathcal{P}\)-, \(Q\)- and \(\mathcal{H}\)-functions approximated by linear networks

$$\begin{aligned} \mathcal{P}_{sa} = {{\textbf {w}}}_a^\mathcal{P}\cdot \varvec{\phi }_s \quad Q_{sa} = {{\textbf {w}}}_a^Q\cdot \varvec{\phi }_s \quad \mathcal{H}_{sa} = {{\textbf {w}}}_a^\mathcal{H}\cdot \varvec{\phi }_s \end{aligned}$$

where \({{\textbf {w}}}_a^\mathcal{P}\), \({{\textbf {w}}}_a^Q\) and \({{\textbf {w}}}_a^\mathcal{H}\) are weight vectors in \(\mathbb {R}^N\) for each function w.r.t. action a, \(\varvec{\phi }_s \in \mathbb {R}^N\) is the feature vector at state s, and \({\textbf {x}} \cdot {\textbf {y}}\) denotes the inner product of \({\textbf {x}}\) and \({\textbf {y}}\). For simplicity, we write \(\mathcal{P}_{sa}\), \(Q_{sa}\) and \(\mathcal{H}_{sa}\) for \(\mathcal{P}(s, a)\), \(Q(s, a)\) and \(\mathcal{H}(s, a)\), respectively.

The weight vectors were initialized to zeros and updated by the following update rules that generalize those in Algorithm 1 (lines 10 to 12):

$$\begin{aligned}&{{\textbf {w}}}_a^\mathcal{P}\leftarrow {{\textbf {w}}}_a^\mathcal{P}+ \alpha _n \cdot \big ( \mathcal{P}_{s'}^{ \pi } - \mathcal{P}_{sa} \big ) \cdot \varvec{\phi }_s \\&{{\textbf {w}}}_a^Q\leftarrow {{\textbf {w}}}_a^Q+ \beta _n \cdot \big ( r + \gamma \, Q_{s'}^{ \pi } - Q_{sa} \big ) \cdot \varvec{\phi }_s \\&{{\textbf {w}}}_a^\mathcal{H}\leftarrow {{\textbf {w}}}_a^\mathcal{H}+ \eta _n \cdot \big ( {\textbf {1}}( \mathcal{P}_{sa} \le \vartheta _s ) - \mathcal{H}_{sa} \big ) \cdot \varvec{\phi }_s \end{aligned}$$

where we write \(\mathcal{P}_{s'}^{\pi }\) for \(\mathcal{P}(s', \pi (s'))\) and \(Q_{s'}^{\pi }\) for \(Q(s', \pi (s'))\); \(\alpha _n\), \(\beta _n\) and \(\eta _n\) are the learning rates in episode \(n\); the threshold \(\vartheta _s\) in the hysteresis update is given by \(\vartheta _s = \theta\) if \(\mathcal{H}_{sa} > 0.5\) and \(\vartheta _s = \min (\mathcal{P}_s^{\pi }, \theta )\) if \(\mathcal{H}_{sa} \le 0.5\). Here, we chose the decision boundary \(\mathcal{H}_{sa} = 0.5\) since the hysteresis variable \(\mathcal{H}_{sa}\) is no longer binary but changes smoothly between 0 and 1. Likewise, the policy \(\pi\) is updated in Algorithm 1 (lines 2 and 13) by (18), but with the hysteresis action space \(\mathcal{A}_\text {h}(\cdot )\) redefined as

$$\begin{aligned} \mathcal{A}_\text {h}(s) := \big \{ a \in \mathcal{A}(s) \,|\,\mathcal{H}_{sa} > 0.5 \big \}, \end{aligned}$$

using the same decision boundary \(\mathcal{H}_{sa} = 0.5\), rather than \(\mathcal{H}_{sa} = 1\) in (14). The other aspects are the same as Algorithm 1.

In the experiment, we used tile coding with 64 tilings. Each tiling consists of \(8 \times 8\) tiles, so that a total of \(N = 4096\) binary features in \(\varvec{\phi }\) cover the state space \({\mathscr{S}}^{+}\). The learning rates \(\alpha _n\), \(\beta _n\) and \(\eta _n\) were scheduled from 0.0025 in the first episode (\(n = 1\)) to 0.00025 in the last episode (\(n = 10^6\)), with decay rate \(1/n\). The discount rate was \(\gamma = 1.0\) and the constraint threshold was \(\theta = 0.3\). Episodes were generated by \(\varepsilon\)-greedy exploration with \(\varepsilon = 0.1\) and were timed out at \(t \ge 10^4\) steps.
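Assuming a tile-coding routine that produces the feature vector \(\varvec{\phi }_s\) of a state, a single learning update could be sketched as follows. The function and argument names are illustrative, and the treatment of terminal transitions (bootstrapping the constraint target to the forbidden-state indicator and the reward target to the immediate reward) is an assumption consistent with the definitions above, not a transcription of the authors' code.

```python
import numpy as np

def ah_fa_update(w_P, w_Q, w_H, phi_s, a, r, phi_s2, a2, a_pi, theta,
                 alpha, beta, eta, gamma=1.0, terminal=False, forbidden=False):
    """One semi-gradient update of the linear P-, Q- and H-functions for a
    transition (s, a, r, s'), with a2 = pi(s') and a_pi = pi(s).

    w_P, w_Q, w_H -- arrays of shape (num_actions, N), one weight vector per action
    phi_s, phi_s2 -- tile-coded feature vectors of s and s' (length N)
    """
    P_sa, Q_sa, H_sa = w_P[a] @ phi_s, w_Q[a] @ phi_s, w_H[a] @ phi_s

    # bootstrapped targets; terminal transitions use the indicator / immediate reward
    P_next = float(forbidden) if terminal else w_P[a2] @ phi_s2
    Q_next = r if terminal else r + gamma * (w_Q[a2] @ phi_s2)

    # adaptive-hysteresis threshold with the smooth 0.5 decision boundary
    P_s_pi = w_P[a_pi] @ phi_s
    v_s = theta if H_sa > 0.5 else min(P_s_pi, theta)

    # semi-gradient updates generalising Algorithm 1, lines 10-12
    w_P[a] += alpha * (P_next - P_sa) * phi_s
    w_Q[a] += beta * (Q_next - Q_sa) * phi_s
    w_H[a] += eta * (float(P_sa <= v_s) - H_sa) * phi_s
    return w_P, w_Q, w_H
```

The episode loop, the \(\varepsilon\)-greedy action selection over \(\mathcal{A}_\text {h}(s)\) and the learning-rate schedule then follow the description above.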
