Actor-critic multi-objective reinforcement learning for non-linear utility functions

Abstract

We propose a novel multi-objective reinforcement learning algorithm that successfully learns the optimal policy even for non-linear utility functions. Non-linear utility functions pose a challenge for state-of-the-art approaches, both in terms of learning efficiency and the solution concept. A key insight is that, by proposing a critic that learns a multi-variate distribution over the returns, which is then combined with the accumulated rewards, we can directly optimize the utility function, even if it is non-linear. This vastly increases the range of problems that can be solved compared to single-objective methods or multi-objective methods that require linear utility functions, while avoiding the need to learn the full Pareto front. We demonstrate our method on multiple multi-objective benchmarks, and show that it learns effectively where baseline approaches fail.

Notes

  1. We note that in our preliminary workshop paper [1] we referred to this algorithm as expected utility policy gradient (EUPG).

References

  1. Roijers, D. M., Steckelmacher, D., & Nowé, A. (2018). Multi-objective reinforcement learning for the expected utility of the return. In Proceedings of the adaptive and learning agents workshop at FAIM.

  2. Reymond, M., Hayes, C., Roijers, D. M., Steckelmacher, D., & Nowé, A. (2021). Actor-critic multi-objective reinforcement learning for non-linear utility functions. In Multi-objective decision making workshop (MODeM 2021).

  3. Castelletti, A., Pianosi, F., & Restelli, M. (2013). A multiobjective reinforcement learning approach to water resources systems operation: Pareto frontier approximation in a single run. Water Resources Research, 49(6), 3476–3486.

  4. Jalalimanesh, A., Haghighi, H. S., Ahmadi, A., Hejazian, H., & Soltani, M. (2017). Multi-objective optimization of radiotherapy: distributed Q-learning and agent-based simulation. Journal of Experimental & Theoretical Artificial Intelligence, 29(5), 1071–1086.

  5. Roijers, D. M., Vamplew, P., Whiteson, S., & Dazeley, R. (2013). A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48, 67–113.

  6. Hayes, C. F., Rădulescu, R., Bargiacchi, E., Källström, J., Macfarlane, M., Reymond, M., Verstraeten, T., Zintgraf, L. M., Dazeley, R., & Heintz, F. (2022). A practical guide to multi-objective reinforcement learning and planning. Autonomous Agents and Multi-Agent Systems, 36(1), 1–59.

  7. Van Moffaert, K., Drugan, M. M., & Nowé, A. (2013). Scalarized multi-objective reinforcement learning: Novel design techniques. In 2013 IEEE symposium on adaptive dynamic programming and reinforcement learning (ADPRL) (pp. 191–199). IEEE.

  8. Rădulescu, R., Mannion, P., Roijers, D. M., & Nowé, A. (2020). Multi-objective multi-agent decision making: A utility-based analysis and survey. Autonomous Agents and Multi-Agent Systems, 34(1), 10.

  9. Roijers, D. M., & Whiteson, S. (2017). Multi-objective decision making. Synthesis Lectures on Artificial Intelligence and Machine Learning, 11(1), 1–129.

  10. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), 229–256.

  11. Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International conference on machine learning (pp. 1928–1937).

  12. Bellemare, M. G., Dabney, W., & Munos, R. (2017). A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887

  13. Vamplew, P., Dazeley, R., Berry, A., Issabekov, R., & Dekker, E. (2011). Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning, 84(1–2), 51–80.

  14. Abels, A., Roijers, D.M., Lenaerts, T., Nowé, A., & Steckelmacher, D. (2019). Dynamic weights in multi-objective deep reinforcement learning. In Proceedings of the 36th international conference on machine learning. Proceedings of machine learning research (Vol. 97, pp. 11–20). PMLR.

  15. Burda, Y., Edwards, H., Storkey, A., & Klimov, O. (2019). Exploration by random network distillation. In International conference on learning representations.

  16. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., & Ostrovski, G. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529.

  17. Perez, E., Strub, F., De Vries, H., Dumoulin, V., & Courville, A. (2018). FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence (Vol. 32).

  18. Roijers, D. M., Whiteson, S., & Oliehoek, F. A. (2015). Computing convex coverage sets for faster multi-objective coordination. Journal of Artificial Intelligence Research, 52, 399–443.

  19. Mossalam, H., Assael, Y. M., Roijers, D. M., & Whiteson, S. (2016). Multi-objective deep reinforcement learning. CoRR. arXiv:1610.02707

  20. Barrett, L., & Narayanan, S. (2008). Learning all optimal policies with multiple criteria. In Proceedings of the 25th international conference on machine learning (pp. 41–47). ACM.

  21. Hiraoka, K., Yoshida, M., & Mishima, T. (2009). Parallel reinforcement learning for weighted multi-criteria model with adaptive margin. Cognitive Neurodynamics, 3(1), 17–24.

  22. Castelletti, A., Pianosi, F., & Restelli, M. (2012). Tree-based fitted Q-iteration for multi-objective Markov decision problems. In The 2012 international joint conference on neural networks (IJCNN) (pp. 1–8). IEEE.

  23. Yang, R., Sun, X., & Narasimhan, K. (2019). A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Proceedings of the 33rd international conference on neural information processing systems. Red Hook, NY, USA: Curran Associates Inc.

  24. Abdolmaleki, A., Huang, S., Hasenclever, L., Neunert, M., Song, F., Zambelli, M., Martins, M., Heess, N., Hadsell, R., & Riedmiller, M. (2020). A distributional view on multi-objective policy optimization. In International conference on machine learning (pp. 11–22). PMLR.

  25. Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., & Riedmiller, M. (2018). Maximum a posteriori policy optimisation. In International conference on learning representations. https://openreview.net/forum?id=S1ANxQW0b.

  26. Xu, J., Tian, Y., Ma, P., Rus, D., Sueda, S., & Matusik, W. (2020). Prediction-guided multi-objective reinforcement learning for continuous robot control. In International conference on machine learning (pp. 10607–10616). PMLR.

  27. Vamplew, P., Dazeley, R., Barker, E., & Kelarev, A. (2009). Constructing stochastic mixture policies for episodic multiobjective reinforcement learning tasks. In Australasian joint conference on artificial intelligence (pp. 340–349). Springer.

  28. Tesauro, G., Das, R., Chan, H., Kephart, J., Levine, D., Rawson, F., & Lefurgy, C. (2008). Managing power consumption and performance of computing systems using reinforcement learning. In Advances in neural information processing systems (pp. 1497–1504).

  29. Neil, D., Segler, M., Guasch, L., Ahmed, M., Plumbley, D., Sellwood, M., & Brown, N. (2018). Exploring deep recurrent models with reinforcement learning for molecule design. In 6th International conference on learning representations (ICLR), workshop track.

  30. Roijers, D. M., Zintgraf, L. M., Libin, P., & Nowé, A. (2018). Interactive multi-objective reinforcement learning in multi-armed bandits for any utility function. In ALA workshop at FAIM (Vol. 8).

  31. Hayes, C. F., Reymond, M., Roijers, D. M., Howley, E., & Mannion, P. (2021). Distributional Monte Carlo tree search for risk-aware and multi-objective reinforcement learning. In Proceedings of the 20th international conference on autonomous agents and multiagent systems (pp. 1530–1532).

  32. Van Moffaert, K., & Nowé, A. (2014). Multi-objective reinforcement learning using sets of Pareto dominating policies. The Journal of Machine Learning Research, 15(1), 3483–3512.

  33. Parisi, S., Pirotta, M., & Restelli, M. (2016). Multi-objective reinforcement learning through continuous Pareto manifold approximation. Journal of Artificial Intelligence Research, 57, 187–227.

  34. Reymond, M., & Nowé, A. (2019). Pareto-DQN: Approximating the Pareto front in complex multi-objective decision problems. In Proceedings of the adaptive and learning agents workshop (ALA-19) at AAMAS.

  35. Reymond, M., Bargiacchi, E., & Nowé, A. (2022). Pareto conditioned networks. In Proceedings of the 21st international conference on autonomous agents and multiagent systems (pp. 1110–1118).

  36. de Oliveira, T. H. F., de Souza Medeiros, L. P., Neto, A. D. D., & Melo, J. D. (2021). Q-managed: A new algorithm for a multiobjective reinforcement learning. Expert Systems with Applications, 168, 114228.

  37. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2020). Generative adversarial networks. Communications of the ACM, 63(11), 139–144.

  38. Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114

  39. Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using real NVP. arXiv preprint arXiv:1605.08803

  40. Zintgraf, L. M., Roijers, D. M., Linders, S., Jonker, C. M., & Nowé, A. (2018). Ordered preference elicitation strategies for supporting multi-objective decision making. In Proceedings of the 17th international conference on autonomous agents and multiagent systems (pp. 1477–1485). International Foundation for Autonomous Agents and Multiagent Systems.

  41. Roijers, D. M., Zintgraf, L. M., Libin, P., Reymond, M., Bargiacchi, E., & Nowé, A. (2020). Interactive multi-objective reinforcement learning in multi-armed bandits with gaussian process utility models. In Joint European conference on machine learning and knowledge discovery in databases (pp. 463–478). Springer.

  42. Hayes, C. F., Verstraeten, T., Roijers, D. M., Howley, E., & Mannion, P. (2022). Expected scalarised returns dominance: A new solution concept for multi-objective decision making. Neural Computing and Applications, 1–21.

Acknowledgements

Conor F. Hayes is funded by the University of Galway Hardiman Scholarship. This research was supported by funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” program.

Author information

Corresponding author

Correspondence to Mathieu Reymond.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A preliminary non-archival version of this work was presented at the Adaptive and Learning Agents Workshop 2018 [1] and at the Multi-Objective Decision Making Workshop 2021 [2]. This article extends our previous work with additional theoretical analysis and new empirical results.

Appendices

Appendix A: MO policy gradient proofs

MOCAC is inspired by single-objective actor-critic methods, which rely on the Policy Gradient theorem to provide convergence guarantees towards a local optimum via gradient ascent. This theorem assumes that the policy maximizes the expected sum of rewards, which no longer holds when using non-linear utility functions. In this section, we follow the derivation of the original Policy Gradient theorem to provide convergence guarantees for MOCAC.

Given a performance measure \(J(\pi _{\theta })\), with \(\theta\) the parameters of the policy \(\pi\), we would like to use its gradient \(\nabla _\theta J(\pi _{\theta })\) to update the policy using gradient ascent:

$$\begin{aligned} \theta _{k+1} = \theta _k + \alpha \nabla _\theta J(\pi _{\theta _k}) \end{aligned}$$

where \(\alpha\) is the learning rate.
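
As a concrete illustration of this update rule, here is a minimal sketch of a single gradient-ascent step; `estimate_policy_gradient` is a hypothetical stand-in for whatever estimator of \(\nabla _\theta J(\pi _{\theta })\) is available.

```python
import numpy as np

def gradient_ascent_step(theta, estimate_policy_gradient, alpha=1e-3):
    """One update theta_{k+1} = theta_k + alpha * grad_theta J(pi_theta_k).

    `estimate_policy_gradient` is a hypothetical callable returning an
    estimate of the policy gradient at the current parameters `theta`.
    """
    theta = np.asarray(theta, dtype=float)
    grad = np.asarray(estimate_policy_gradient(theta), dtype=float)
    return theta + alpha * grad
```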

We now derive this gradient for MOAC, the baseline that does not use a distributional critic in the ablation study, as well as for MOCAC, our main contribution.

A.1 Proof under SER

In Sect. 4.1 we perform an ablation study to show that a distributional critic is necessary to optimize under the ESR criterion. The ablated variant, which we call MOAC, does not use a distributional critic and thus optimizes under the SER criterion, as the utility function is applied to the V-value. We prove that MOAC indeed converges under SER. We define the performance measure we aim to maximize as:

$$\begin{aligned} J(\pi _{\theta })&\doteq u({{\,\mathrm{{\mathbb {E}}}\,}}_{\tau \sim \pi _\theta } \left[ \vec {R}(\tau ) \right] ) \nonumber \\ &= u(V^{\pi _\theta }(s_0)) \end{aligned}$$
(A1)

where \(\tau\) are trajectories sampled by following \(\pi\). The gradient of the policy performance, i.e. the policy gradient, is:

$$\begin{aligned} \nabla _\theta J(\pi _{\theta }) &= \nabla _\theta u(V^{\pi _\theta }(s_0)) \nonumber \\ &= \sum _{i=0}^n \frac{\partial u}{\partial V^{\pi _\theta }_i}\frac{\partial V^{\pi _\theta }_i}{\partial \theta } \nonumber \\ &= \sum _{i=0}^n \frac{\partial u}{\partial V^{\pi _\theta }_i}\nabla _\theta V^{\pi _\theta }_i(s_0) \end{aligned}$$
(A2)

with \(V^{\pi _\theta }_i\) the V-value for the i-th objective. Since this is a scalar, \(\nabla _\theta V^{\pi _\theta }_i(s_0)\) follows the original Policy Gradient proof (in our case with baseline):

$$\begin{aligned} \nabla _\theta V^{\pi _\theta }_i(s_0) = {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau \sim \pi _\theta } \left[ \sum _{t=0}^T \nabla _\theta \log \pi _\theta (a_t \mid s_t) (Q^{\pi _\theta }_i(s_t, a_t) - V^{\pi _\theta }_i(s_t)) \right] \end{aligned}$$
(A3)

Intuitively, the MOAC policy gradient is a weighted sum of the individual single-objective policy gradients, where each weight reflects the importance of its objective as measured by the utility function. Concretely, it is the sum of the single-objective policy gradients, each multiplied by the corresponding partial derivative of the utility function evaluated at \(V^{\pi _\theta }(s_0)\). Since, in our setting, u is known, we assume we can compute its derivative:

$$\begin{aligned} \nabla _\theta J(\pi _{\theta }) = \sum _{i=0}^n \frac{\partial u}{\partial V^{\pi _\theta }_i} {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau \sim \pi _\theta } \left[ \sum _{t=0}^T \nabla _\theta \log \pi _\theta (a_t \mid s_t) (Q^{\pi _\theta }_i(s_t, a_t) - V^{\pi _\theta }_i(s_t)) \right] \end{aligned}$$
(A4)

We use this gradient to update the policy of MOAC with gradient ascent.
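
To make Eq. (A4) concrete, the following is a minimal PyTorch sketch of a surrogate loss whose gradient matches the MOAC policy gradient for a single sampled episode. It assumes a differentiable utility function and treats the critic's estimates as constants; the function and argument names (`moac_policy_loss`, `utility_fn`, etc.) are our own and not part of the original implementation.

```python
import torch

def moac_policy_loss(log_probs, advantages, v_s0, utility_fn):
    """Surrogate loss whose gradient matches Eq. (A4), for one episode.

    log_probs:  tensor [T], log pi_theta(a_t | s_t), differentiable w.r.t. theta
    advantages: tensor [T, n], per-objective advantages Q_i(s_t, a_t) - V_i(s_t)
    v_s0:       tensor [n], multi-objective V-value estimate for the start state
    utility_fn: differentiable utility u: R^n -> R, built from torch operations
    """
    # Weights du/dV_i evaluated at V(s_0); detached so only the actor is updated here.
    v0 = v_s0.detach().requires_grad_(True)
    (du_dv,) = torch.autograd.grad(utility_fn(v0), v0)
    # Per-objective policy-gradient terms: sum_t log pi(a_t | s_t) * (Q_i - V_i).
    per_objective = (log_probs.unsqueeze(-1) * advantages.detach()).sum(dim=0)  # [n]
    # Weighted sum over objectives; the minus sign turns minimization into ascent on J.
    return -(du_dv * per_objective).sum()
```

Minimizing this loss with any stochastic optimizer performs the gradient-ascent update of Eq. (A4) on the actor parameters; the critic providing \(Q_i\) and \(V_i\) is trained separately.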

A.2 Proof under ESR

While MOAC optimizes under SER, we are interested in the ESR setting, under which MOCAC optimizes. In this case, the performance measure we aim to maximize is:

$$\begin{aligned} J(\pi _\theta ) \doteq {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau \sim \pi _\theta } \left[ u(\vec {R}(\tau )) \right] \end{aligned}$$
(A5)

To derive its gradient, we use the log-derivative trick:

$$\begin{aligned} \frac{\textrm{d}}{\textrm{d}x}\log f(x) &= \frac{1}{f(x)}\frac{\textrm{d}}{\textrm{d}x} f(x) \nonumber \\ \frac{\textrm{d}}{\textrm{d}x} f(x) &= f(x)\frac{\textrm{d}}{\textrm{d}x}\log f(x) \end{aligned}$$
(A6)

We derive the gradient \(\nabla _\theta J(\pi _\theta )\) of \(J(\pi _\theta )\) as follows:

$$\begin{aligned} \nabla _\theta J(\pi _\theta ) &= \nabla _\theta {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau \sim \pi _\theta } \left[ u(\vec {R}(\tau )) \right] \nonumber \\ &= \nabla _\theta \int _\tau P(\tau \mid \pi _\theta )\, u(\vec {R}(\tau )) \nonumber \\ &= \int _\tau \nabla _\theta P(\tau \mid \pi _\theta )\, u(\vec {R}(\tau )) \nonumber \\ &\mathop {=}\limits ^{(\mathrm {A}6)} \int _\tau P(\tau \mid \pi _\theta )\, \nabla _\theta \log P(\tau \mid \pi _\theta )\, u(\vec {R}(\tau )) \nonumber \\ &= {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau \sim \pi _\theta } \left[ \nabla _\theta \log P(\tau \mid \pi _\theta )\, u(\vec {R}(\tau )) \right] \nonumber \\ &= {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau \sim \pi _\theta } \left[ \sum _{t=0}^T \nabla _\theta \log \pi _\theta (a_t \mid s_t)\, u(\vec {R}(\tau )) \right] \end{aligned}$$
(A7)

Using the law of iterated expectations, we split the trajectory \(\tau\) into two parts: \(\tau _{:t}\) and \(\tau _{t:}\), the parts of the trajectory before timestep t and from t onwards, respectively. We also use our definition of accrued rewards (Sect. 3.1) to split the episodic return \(\vec {R}(\tau )\) into the accrued return \(\vec {R}^{-}_t\) and the future return \(\vec {R}_t\).

$$\begin{aligned} \nabla _\theta J(\pi _\theta ) &= {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau \sim \pi _\theta } \left[ \sum _{t=0}^T \nabla _\theta \log \pi _\theta (a_t \mid s_t)\, u(\vec {R}(\tau )) \right] \nonumber \\ &= \sum _{t=0}^T {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t)\, u(\vec {R}(\tau )) \right] \nonumber \\ &= \sum _{t=0}^T {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{:t} \sim \pi _\theta } \left[ {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{t:} \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t)\, u(\vec {R}(\tau )) \mid \tau _{:t} \right] \right] \nonumber \\ &= \sum _{t=0}^T {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{:t} \sim \pi _\theta } \left[ {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{t:} \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t)\, u(\vec {R}^{-}_t + \vec {R}_t) \mid \tau _{:t} \right] \right] \end{aligned}$$
(A8)

Since \(\nabla _\theta \log \pi _\theta (a_t \mid s_t)\) is constant with respect to the inner expectation, we can take it out:

$$\begin{aligned} \nabla _\theta J(\pi _\theta ) = \sum _{t=0}^T {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{:t} \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t)\ {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{t:} \sim \pi _\theta } \left[ u(\vec {R}^{-}_t + \vec {R}_t) \mid \tau _{:t} \right] \right] \end{aligned}$$
(A9)

Because we augment the state space with the accrued reward, \(\vec {R}^{-}_t\) depends only on \(s_t\). Moreover, due to the Markov property of (MO)MDPs, the future return \(\vec {R}_t\) depends only on \(s_t, a_t\). Hence, \(u(\vec {R}^{-}_t + \vec {R}_t)\) does not depend on the past:

$$\begin{aligned} {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{t:} \sim \pi _\theta } \left[ u(\vec {R}^{-}_t + \vec {R}_t) \mid \tau _{:t} \right] = {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{t:} \sim \pi _\theta } \left[ u(\vec {R}^{-}_t + \vec {R}_t) \mid s_t, a_t \right] \end{aligned}$$
(A10)

Since the MOMDPs we consider have finitely many states and actions, there are only finitely many possible returns \(\vec {R}\). Thus, expanding the expectation in the previous equation results in:

$$\begin{aligned} {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{t:} \sim \pi _\theta } \left[ u(\vec {R}^{-}_t + \vec {R}_t) \mid s_t, a_t \right] = \sum _{\vec {R}} u(\vec {R}^{-}_t + \vec {R}) Pr\{\vec {R} \mid s_t, a_t \} \end{aligned}$$
(A11)

where \(Pr\{\vec {R} \mid s_t, a_t \}\) is the probability of obtaining future return \(\vec {R}\) from \(s_t, a_t\). The MOCAC policy gradient is thus:

$$\begin{aligned} \nabla _\theta J(\pi _\theta ) = \sum _{t=0}^T {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{:t} \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t) \sum _{\vec {R}} u(\vec {R}^{-}_t + \vec {R}) Pr\{\vec {R} \mid s_t, a_t \} \right] \end{aligned}$$
(A12)

MOCAC approximates the distribution over future returns as a categorical distribution \(Z_\psi\). Replacing \(Pr\{\vec {R} \mid s_t, a_t \}\) with this approximation yields Eq. (13) but without a baseline \(b(s_t)\). However, since the baseline only depends on \(s_t\) (and not on \(a_t\)), we have the property that:

$$\begin{aligned} {{\,\mathrm{{\mathbb {E}}}\,}}_{a_t \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t)\, b(s_t) \right] &= {{\,\mathrm{{\mathbb {E}}}\,}}_{a_t \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t) \right] b(s_t) \nonumber \\ &= \sum _{a \in {\mathcal {A}}} \pi _\theta (a \mid s_t)\, \nabla _\theta \log \pi _\theta (a \mid s_t)\, b(s_t) \nonumber \\ &\mathop {=}\limits ^{(\mathrm {A}6)} \sum _{a \in {\mathcal {A}}} \nabla _\theta \pi _\theta (a \mid s_t)\, b(s_t) \nonumber \\ &= b(s_t)\, \nabla _\theta \sum _{a \in {\mathcal {A}}} \pi _\theta (a \mid s_t) = b(s_t)\, \nabla _\theta 1 = 0 \end{aligned}$$
(A13)

We can now apply the same argument as in the proof of the original policy gradient theorem with baseline:

$$\begin{aligned} \nabla _\theta J(\pi _\theta ) &= \sum _{t=0}^T {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{:t} \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t)\ {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{t:} \sim \pi _\theta } \left[ u(\vec {R}^{-}_t + \vec {R}_t) - b(s_t) \mid s_t, a_t \right] \right] \nonumber \\ &= \sum _{t=0}^T {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{:t} \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t)\ {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{t:} \sim \pi _\theta } \left[ u(\vec {R}^{-}_t + \vec {R}_t) \mid s_t, a_t \right] \right. \nonumber \\ &\quad \left. -\, \nabla _\theta \log \pi _\theta (a_t \mid s_t)\ {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{t:} \sim \pi _\theta } \left[ b(s_t) \mid s_t, a_t \right] \right] \nonumber \\ &= \sum _{t=0}^T {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{:t} \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t)\ {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{t:} \sim \pi _\theta } \left[ u(\vec {R}^{-}_t + \vec {R}_t) \mid s_t, a_t \right] - 0 \right] \end{aligned}$$
(A14)

Thus, we can use a baseline that depends only on \(s_t\), as it does not affect the expectation \({{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{:t} \sim \pi _\theta }\). In the case of MOCAC, we use the expected utility for state \(s_t\), i.e. \(b(s_t) = \sum _{\vec {R}} u(\vec {R}^{-}_t + \vec {R}) Pr\{\vec {R} \mid s_t \}\). Plugging this into Eq. (A12) results in:

$$\begin{aligned} \nabla _\theta J(\pi _\theta ) = \sum _{t=0}^T {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{:t} \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t) A(s_t, a_t) \right] \end{aligned}$$
(A15)

with

$$\begin{aligned} A(s_t, a_t) = \sum _{\vec {R}} u(\vec {R}^{-}_t + \vec {R}) Pr\{\vec {R} \mid s_t, a_t \} - \sum _{\vec {R}} u(\vec {R}^{-}_t + \vec {R}) Pr\{\vec {R} \mid s_t \} \end{aligned}$$
(A16)

We use this gradient to update the policy of MOCAC with gradient ascent.
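
To illustrate how Eqs. (A12), (A15) and (A16) can be computed with a categorical critic, the sketch below derives the per-action advantage from a discrete set of return atoms on which \(Z_\psi\) places probability mass. This is a sketch under our own assumptions about the atom representation and tensor shapes; the names (`mocac_advantage`, `probs_sa`, `atoms`, etc.) are hypothetical and not the authors' implementation.

```python
import torch

def mocac_advantage(accrued, atoms, probs_sa, policy_probs, utility_fn):
    """Per-action advantage A(s_t, a) of Eq. (A16) with a categorical critic.

    accrued:      tensor [n], accrued reward R^-_t (part of the augmented state)
    atoms:        tensor [K, n], the K multi-objective return atoms of Z_psi
    probs_sa:     tensor [A, K], Pr{R | s_t, a} for every action a
    policy_probs: tensor [A], pi_theta(a | s_t)
    utility_fn:   utility u: R^n -> R (applied atom-wise, may be non-linear)
    """
    # u(R^-_t + R) for every return atom R.
    utilities = torch.stack([utility_fn(accrued + r) for r in atoms])  # [K]
    # Expected utility per action: sum_R u(R^-_t + R) Pr{R | s_t, a}.
    expected_u_sa = probs_sa @ utilities                               # [A]
    # Baseline b(s_t): marginalise over actions, Pr{R | s_t} = sum_a pi(a|s_t) Pr{R | s_t, a}.
    baseline = (policy_probs * expected_u_sa).sum()
    return expected_u_sa - baseline
```

The advantage of the action actually taken, indexed as `mocac_advantage(...)[a_t]`, then multiplies \(\nabla _\theta \log \pi _\theta (a_t \mid s_t)\) exactly as in Eq. (A15).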

Appendix B: Environment details

B.1 Split

In the Split environment, the agent chooses between two hallways of length 11 to traverse. Including the start- and end-states, there are 26 states. The states are one-hot encoded, resulting in a 26-sized vector that is given as input to any of the actor and critic estimators.

Table 1 Hyperparameters for the Split, Deep Sea Treasure and Minecart environments

B.2 Deep sea treasure

Deep sea treasure is a grid-world environment in which a submarine moves on an \(11 \times 12\) grid, resulting in 132 different states. The state index is one-hot encoded, resulting in a 132-sized vector that is given as input to any of the actor and critic estimators. The rewards provided by the treasures are chosen such that every optimal \(\texttt {treasure}\times \texttt {fuel}\) combination is evenly spread over the convex coverage set. Treasure values are displayed in Fig. 8.
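
For concreteness, the one-hot encoding of a grid position could look as follows; the row-major flattening and the function name are our own assumptions.

```python
import numpy as np

def one_hot_state(row, col, n_rows=11, n_cols=12):
    """One-hot encoding of a Deep Sea Treasure grid position.

    Flattens the (row, col) coordinate of the 11 x 12 grid into a single index
    (row-major order is an assumption) and returns the 132-dimensional vector
    that is fed to the actor and critic estimators.
    """
    state = np.zeros(n_rows * n_cols, dtype=np.float32)
    state[row * n_cols + col] = 1.0
    return state
```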

Fig. 8 The Deep Sea Treasure environment. The agent starts in the top-left corner and tries to reach any of the treasures. Treasures that are further away are worth more

B.3 Minecart

The state is a top-down visual representation of the mining area. It displays the different mine locations, the cart and the base station. A frame of the environment can be seen in Fig. 9. The frames are pre-processed as follows: they are rescaled to \(42\times 42\), converted to grayscale and normalized. Moreover, we keep a history of 2 frames as observation. This is similar to the frame-preprocessing done by [16] on the Arcade Learning Environment suite (Table 1).
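
A possible implementation of this preprocessing pipeline is sketched below; the use of OpenCV, the RGB channel order of the raw frame, and dividing by 255 for normalization are assumptions on our part.

```python
import numpy as np
import cv2

def preprocess_frame(frame):
    """Rescale a raw Minecart frame to 42x42, convert to grayscale and normalize to [0, 1]."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (42, 42), interpolation=cv2.INTER_AREA)
    return small.astype(np.float32) / 255.0

def make_observation(previous_frame, current_frame):
    """Stack the 2 most recent pre-processed frames into an observation of shape (2, 42, 42)."""
    return np.stack([previous_frame, current_frame], axis=0)
```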

Fig. 9 The Minecart environment. The base is located at the top-left corner, while the 5 mines are spread around the environment

Appendix C: Hyperparameters

All hyperparameters used in these experiments, including the neural network architectures, are listed in Table 1. The hyperparameters used for the ablation study are listed in Table 2. Figure 11 shows results for Deep Sea Treasure using a linear utility function with different weight parameters. Figure 12 shows results for diverse utility functions on Fishwood, with an additional single-objective baseline.

Fig. 10 The base of each actor and critic for the convolutional neural network used for the Minecart environment

Table 2 Hyperparameters for the experiments on MiniRandom and Fishwood. Since the input size of the neural networks depends on the conditioning, the rows listing the number of neurons are variable
Fig. 11 Results for Deep Sea Treasure using a linear utility function with diverse sets of weights

Fig. 12 Results for diverse utility functions as in Fig. 4, but with the added single-objective A2C (terminal) baseline

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Reymond, M., Hayes, C.F., Steckelmacher, D. et al. Actor-critic multi-objective reinforcement learning for non-linear utility functions. Auton Agent Multi-Agent Syst 37, 23 (2023). https://doi.org/10.1007/s10458-023-09604-x
