Actor-critic multi-objective reinforcement learning for non-linear utility functions

Abstract

We propose a novel multi-objective reinforcement learning algorithm that successfully learns the optimal policy even for non-linear utility functions. Non-linear utility functions pose a challenge for state-of-the-art approaches, both in terms of learning efficiency and the solution concept. A key insight is that, by proposing a critic that learns a multi-variate distribution over the returns, which is then combined with the accumulated rewards, we can directly optimize the utility function, even if it is non-linear. This vastly increases the range of problems that can be solved compared to single-objective methods or multi-objective methods that require linear utility functions, while avoiding the need to learn the full Pareto front. We demonstrate our method on multiple multi-objective benchmarks, and show that it learns effectively where baseline approaches fail.

Notes

  1. We note that in our preliminary workshop paper [1] we referred to this algorithm as expected utility policy gradient (EUPG).

References

  1. Roijers, D. M., Steckelmacher, D., & Nowé, A. (2018). Multi-objective reinforcement learning for the expected utility of the return. In Proceedings of the adaptive and learning agents workshop at FAIM.

  2. Reymond, M., Hayes, C., Roijers, D. M., Steckelmacher, D., & Nowé, A. (2021). Actor-critic multi-objective reinforcement learning for non-linear utility functions. In Multi-objective decision making workshop (MODeM 2021).

  3. Castelletti, A., Pianosi, F., & Restelli, M. (2013). A multiobjective reinforcement learning approach to water resources systems operation: Pareto frontier approximation in a single run. Water Resources Research, 49(6), 3476–3486.

  4. Jalalimanesh, A., Haghighi, H. S., Ahmadi, A., Hejazian, H., & Soltani, M. (2017). Multi-objective optimization of radiotherapy: distributed Q-learning and agent-based simulation. Journal of Experimental & Theoretical Artificial Intelligence, 29(5), 1071–1086.

  5. Roijers, D. M., Vamplew, P., Whiteson, S., & Dazeley, R. (2013). A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48, 67–113.

  6. Hayes, C. F., Rădulescu, R., Bargiacchi, E., Källström, J., Macfarlane, M., Reymond, M., Verstraeten, T., Zintgraf, L. M., Dazeley, R., & Heintz, F. (2022). A practical guide to multi-objective reinforcement learning and planning. Autonomous Agents and Multi-Agent Systems, 36(1), 1–59.

  7. Van Moffaert, K., Drugan, M. M., & Nowé, A. (2013). Scalarized multi-objective reinforcement learning: Novel design techniques. In 2013 IEEE symposium on adaptive dynamic programming and reinforcement learning (ADPRL) (pp. 191–199). IEEE.

  8. Rădulescu, R., Mannion, P., Roijers, D. M., & Nowé, A. (2020). Multi-objective multi-agent decision making: A utility-based analysis and survey. Autonomous Agents and Multi-Agent Systems, 34(1), 10.

  9. Roijers, D. M., & Whiteson, S. (2017). Multi-objective decision making. Synthesis Lectures on Artificial Intelligence and Machine Learning, 11(1), 1–129.

  10. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), 229–256.

  11. Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International conference on machine learning (pp. 1928–1937).

  12. Bellemare, M. G., Dabney, W., & Munos, R. (2017). A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887

  13. Vamplew, P., Dazeley, R., Berry, A., Issabekov, R., & Dekker, E. (2011). Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning, 84(1–2), 51–80.

  14. Abels, A., Roijers, D.M., Lenaerts, T., Nowé, A., & Steckelmacher, D. (2019). Dynamic weights in multi-objective deep reinforcement learning. In Proceedings of the 36th international conference on machine learning. Proceedings of machine learning research (Vol. 97, pp. 11–20). PMLR.

  15. Burda, Y., Edwards, H., Storkey, A., & Klimov, O. (2019). Exploration by random network distillation. In International conference on learning representations.

  16. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., & Ostrovski, G. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529.

  17. Perez, E., Strub, F., De Vries, H., Dumoulin, V., & Courville, A. (2018). FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence (Vol. 32).

  18. Roijers, D. M., Whiteson, S., & Oliehoek, F. A. (2015). Computing convex coverage sets for faster multi-objective coordination. Journal of Artificial Intelligence Research, 52, 399–443.

  19. Mossalam, H., Assael, Y. M., Roijers, D. M., & Whiteson, S. (2016). Multi-objective deep reinforcement learning. CoRR. arXiv:1610.02707

  20. Barrett, L., & Narayanan, S. (2008). Learning all optimal policies with multiple criteria. In Proceedings of the 25th international conference on machine learning (pp. 41–47). ACM.

  21. Hiraoka, K., Yoshida, M., & Mishima, T. (2009). Parallel reinforcement learning for weighted multi-criteria model with adaptive margin. Cognitive Neurodynamics, 3(1), 17–24.

  22. Castelletti, A., Pianosi, F., & Restelli, M. (2012). Tree-based fitted Q-iteration for multi-objective Markov decision problems. In The 2012 international joint conference on neural networks (IJCNN) (pp. 1–8). IEEE.

  23. Yang, R., Sun, X., & Narasimhan, K. (2019). A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Proceedings of the 33rd international conference on neural information processing systems. Red Hook, NY, USA: Curran Associates Inc.

  24. Abdolmaleki, A., Huang, S., Hasenclever, L., Neunert, M., Song, F., Zambelli, M., Martins, M., Heess, N., Hadsell, R., & Riedmiller, M. (2020). A distributional view on multi-objective policy optimization. In International conference on machine learning (pp. 11–22). PMLR.

  25. Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., & Riedmiller, M. (2018). Maximum a posteriori policy optimisation. In International conference on learning representations. https://openreview.net/forum?id=S1ANxQW0b.

  26. Xu, J., Tian, Y., Ma, P., Rus, D., Sueda, S., & Matusik, W. (2020). Prediction-guided multi-objective reinforcement learning for continuous robot control. In International conference on machine learning (pp. 10607–10616). PMLR.

  27. Vamplew, P., Dazeley, R., Barker, E., & Kelarev, A. (2009). Constructing stochastic mixture policies for episodic multiobjective reinforcement learning tasks. In Australasian joint conference on artificial intelligence (pp. 340–349). Springer.

  28. Tesauro, G., Das, R., Chan, H., Kephart, J., Levine, D., Rawson, F., & Lefurgy, C. (2008). Managing power consumption and performance of computing systems using reinforcement learning. In Advances in neural information processing systems (pp. 1497–1504).

  29. Neil, D., Segler, M., Guasch, L., Ahmed, M., Plumbley, D., Sellwood, M., & Brown, N. (2018). Exploring deep recurrent models with reinforcement learning for molecule design. In 6th International conference on learning representations (ICLR), workshop track.

  30. Roijers, D. M., Zintgraf, L. M., Libin, P., & Nowé, A. (2018). Interactive multi-objective reinforcement learning in multi-armed bandits for any utility function. In ALA workshop at FAIM (Vol. 8).

  31. Hayes, C. F., Reymond, M., Roijers, D. M., Howley, E., & Mannion, P. (2021). Distributional Monte Carlo tree search for risk-aware and multi-objective reinforcement learning. In Proceedings of the 20th international conference on autonomous agents and multiagent systems (pp. 1530–1532).

  32. Van Moffaert, K., & Nowé, A. (2014). Multi-objective reinforcement learning using sets of Pareto dominating policies. The Journal of Machine Learning Research, 15(1), 3483–3512.

  33. Parisi, S., Pirotta, M., & Restelli, M. (2016). Multi-objective reinforcement learning through continuous Pareto manifold approximation. Journal of Artificial Intelligence Research, 57, 187–227.

  34. Reymond, M., & Nowé, A. (2019). Pareto-DQN: Approximating the Pareto front in complex multi-objective decision problems. In Proceedings of the adaptive and learning agents workshop (ALA-19) at AAMAS.

  35. Reymond, M., Bargiacchi, E., & Nowé, A. (2022). Pareto conditioned networks. In Proceedings of the 21st international conference on autonomous agents and multiagent systems (pp. 1110–1118).

  36. de Oliveira, T. H. F., de Souza Medeiros, L. P., Neto, A. D. D., & Melo, J. D. (2021). Q-managed: A new algorithm for a multiobjective reinforcement learning. Expert Systems with Applications, 168, 114228.

  37. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2020). Generative adversarial networks. Communications of the ACM, 63(11), 139–144.

  38. Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114

  39. Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using real NVP. arXiv preprint arXiv:1605.08803

  40. Zintgraf, L. M., Roijers, D. M., Linders, S., Jonker, C. M., & Nowé, A. (2018). Ordered preference elicitation strategies for supporting multi-objective decision making. In Proceedings of the 17th international conference on autonomous agents and multiagent systems (pp. 1477–1485). International Foundation for Autonomous Agents and Multiagent Systems.

  41. Roijers, D. M., Zintgraf, L. M., Libin, P., Reymond, M., Bargiacchi, E., & Nowé, A. (2020). Interactive multi-objective reinforcement learning in multi-armed bandits with gaussian process utility models. In Joint European conference on machine learning and knowledge discovery in databases (pp. 463–478). Springer.

  42. Hayes, C. F., Verstraeten, T., Roijers, D. M., Howley, E., & Mannion, P. (2022). Expected scalarised returns dominance: A new solution concept for multi-objective decision making. Neural Computing and Applications, 1–21.

Acknowledgements

Conor F. Hayes is funded by the University of Galway Hardiman Scholarship. This research was supported by funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” program.

Author information

Corresponding author

Correspondence to Mathieu Reymond.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A preliminary non-archival version of this work was presented at the Adaptive and Learning Agents Workshop 2018 [1] and at the Multi-Objective Decision Making Workshop 2021 [2]. This article extends our previous work with additional theoretical analysis and new empirical results.

Appendices

Appendix A: MO policy gradient proofs

MOCAC is inspired by single-objective actor-critic methods, which rely on the Policy Gradient theorem to provide convergence guarantees towards a local optimum via gradient ascent. This theorem assumes that the policy maximizes the expected sum of rewards, which no longer holds when using non-linear utility functions. In this section, we follow the derivation of the original Policy Gradient theorem to provide convergence guarantees for MOCAC.

Given a performance measure \(J(\pi _{\theta })\), with \(\theta\) the parameters of the policy \(\pi\), we would like to use its gradient \(\nabla _\theta J(\pi _{\theta })\) to update the policy using gradient ascent:

$$\begin{aligned} \theta _{k+1} = \theta _k + \alpha \nabla _\theta J(\pi _{\theta _k}) \end{aligned}$$

where \(\alpha\) is the learning rate.
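
As a concrete illustration of this update rule, here is a minimal sketch of a single gradient-ascent step; `estimate_policy_gradient` is a hypothetical stand-in for whatever estimator of \(\nabla _\theta J(\pi _{\theta })\) is available.

```python
import numpy as np

def gradient_ascent_step(theta, estimate_policy_gradient, alpha=1e-3):
    """One update theta_{k+1} = theta_k + alpha * grad_theta J(pi_theta_k).

    `estimate_policy_gradient` is a hypothetical callable returning an
    estimate of the policy gradient at the current parameters `theta`.
    """
    theta = np.asarray(theta, dtype=float)
    grad = np.asarray(estimate_policy_gradient(theta), dtype=float)
    return theta + alpha * grad
```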

We now derive this gradient for MOAC, the baseline that does not use a distributional critic in the ablation study, as well as for MOCAC, our main contribution.

A.1 Proof under SER

In Sect. 4.1 we perform an ablation study to show that a distributional critic is necessary to optimize under the ESR criterion. The ablated variant, which we call MOAC, does not use a distributional critic and thus optimizes under the SER criterion, as the utility function is applied to the V-value. We prove that MOAC indeed converges under SER. We define the performance measure we aim to maximize as:

$$\begin{aligned} J(\pi _{\theta })&\doteq u({{\,\mathrm{{\mathbb {E}}}\,}}_{\tau \sim \pi _\theta } \left[ \vec {R}(\tau ) \right] ) \nonumber \\ &= u(V^{\pi _\theta }(s_0)) \end{aligned}$$
(A1)

where \(\tau\) are trajectories sampled by following \(\pi\). The gradient of the policy performance, i.e. the policy gradient, is:

$$\begin{aligned} \nabla _\theta J(\pi _{\theta }) &= \nabla _\theta u(V^{\pi _\theta }(s_0)) \nonumber \\ &= \sum _{i=0}^n \frac{\partial u}{\partial V^{\pi _\theta }_i}\frac{\partial V^{\pi _\theta }_i}{\partial \theta } \nonumber \\ &= \sum _{i=0}^n \frac{\partial u}{\partial V^{\pi _\theta }_i}\nabla _\theta V^{\pi _\theta }_i(s_0) \end{aligned}$$
(A2)

with \(V^{\pi _\theta }_i\) the V-value for the i-th objective. Since this is a scalar, \(\nabla _\theta V^{\pi _\theta }_i(s_0)\) follows the original Policy Gradient proof (in our case with baseline):

$$\begin{aligned} \nabla _\theta V^{\pi _\theta }_i(s_0) = {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau \sim \pi _\theta } \left[ \sum _{t=0}^T \nabla _\theta \log \pi _\theta (a_t \mid s_t) (Q^{\pi _\theta }_i(s_t, a_t) - V^{\pi _\theta }_i(s_t)) \right] \end{aligned}$$
(A3)

Intuitively, the MOAC policy gradient is a weighted sum of the individual single-objective policy gradients, where each weight reflects the importance of its objective as measured by the utility function. Concretely, it is the sum of the single-objective policy gradients, each multiplied by the corresponding partial derivative of the utility function evaluated at \(V^{\pi _\theta }(s_0)\). Since, in our setting, u is known, we assume we can compute its derivative:

$$\begin{aligned} \nabla _\theta J(\pi _{\theta }) = \sum _{i=0}^n \frac{\partial u}{\partial V^{\pi _\theta }_i} {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau \sim \pi _\theta } \left[ \sum _{t=0}^T \nabla _\theta \log \pi _\theta (a_t \mid s_t) (Q^{\pi _\theta }_i(s_t, a_t) - V^{\pi _\theta }_i(s_t)) \right] \end{aligned}$$
(A4)

We use this gradient to update the policy of MOAC with gradient ascent.
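
To make Eq. (A4) concrete, the following is a minimal PyTorch sketch of a surrogate loss whose gradient matches the MOAC policy gradient for a single sampled episode. It assumes a differentiable utility function and treats the critic's estimates as constants; the function and argument names (`moac_policy_loss`, `utility_fn`, etc.) are our own and not part of the original implementation.

```python
import torch

def moac_policy_loss(log_probs, advantages, v_s0, utility_fn):
    """Surrogate loss whose gradient matches Eq. (A4), for one episode.

    log_probs:  tensor [T], log pi_theta(a_t | s_t), differentiable w.r.t. theta
    advantages: tensor [T, n], per-objective advantages Q_i(s_t, a_t) - V_i(s_t)
    v_s0:       tensor [n], multi-objective V-value estimate for the start state
    utility_fn: differentiable utility u: R^n -> R, built from torch operations
    """
    # Weights du/dV_i evaluated at V(s_0); detached so only the actor is updated here.
    v0 = v_s0.detach().requires_grad_(True)
    (du_dv,) = torch.autograd.grad(utility_fn(v0), v0)
    # Per-objective policy-gradient terms: sum_t log pi(a_t | s_t) * (Q_i - V_i).
    per_objective = (log_probs.unsqueeze(-1) * advantages.detach()).sum(dim=0)  # [n]
    # Weighted sum over objectives; the minus sign turns minimization into ascent on J.
    return -(du_dv * per_objective).sum()
```

Minimizing this loss with any stochastic optimizer performs the gradient-ascent update of Eq. (A4) on the actor parameters; the critic providing \(Q_i\) and \(V_i\) is trained separately.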

A.2 Proof under ESR

While MOAC optimizes under SER, we are interested in the ESR setting, under which MOCAC optimizes. In this case, the performance measure we aim to maximize is:

$$\begin{aligned} J(\pi _\theta ) \doteq {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau \sim \pi _\theta } \left[ u(\vec {R}(\tau )) \right] \end{aligned}$$
(A5)

To derive its gradient, we use the log-derivative trick:

$$\begin{aligned} \frac{\textrm{d}}{\textrm{d}x}\log f(x) &= \frac{1}{f(x)}\frac{\textrm{d}}{\textrm{d}x} f(x) \nonumber \\ \frac{\textrm{d}}{\textrm{d}x} f(x) &= f(x)\frac{\textrm{d}}{\textrm{d}x}\log f(x) \end{aligned}$$
(A6)

We derive the gradient \(\nabla _\theta J(\pi _\theta )\) of \(J(\pi _\theta )\) as follows:

$$\begin{aligned} \nabla _\theta J(\pi _\theta ) &= \nabla _\theta {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau \sim \pi _\theta } \left[ u(\vec {R}(\tau )) \right] \nonumber \\ &= \nabla _\theta \int _\tau P(\tau \mid \pi _\theta )\, u(\vec {R}(\tau )) \nonumber \\ &= \int _\tau \nabla _\theta P(\tau \mid \pi _\theta )\, u(\vec {R}(\tau )) \nonumber \\ &\mathop {=}\limits ^{(\mathrm {A}6)} \int _\tau P(\tau \mid \pi _\theta )\, \nabla _\theta \log P(\tau \mid \pi _\theta )\, u(\vec {R}(\tau )) \nonumber \\ &= {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau \sim \pi _\theta } \left[ \nabla _\theta \log P(\tau \mid \pi _\theta )\, u(\vec {R}(\tau )) \right] \nonumber \\ &= {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau \sim \pi _\theta } \left[ \sum _{t=0}^T \nabla _\theta \log \pi _\theta (a_t \mid s_t)\, u(\vec {R}(\tau )) \right] \end{aligned}$$
(A7)

Using the law of iterated expectations, we split the trajectory \(\tau\) into two parts: \(\tau _{:t}\) and \(\tau _{t:}\), the parts of the trajectory before timestep t and from t onwards, respectively. We also use our definition of accrued rewards (Sect. 3.1) to split the episodic return \(\vec {R}(\tau )\) into the accrued return \(\vec {R}^{-}_t\) and the future return \(\vec {R}_t\).

$$\begin{aligned} \nabla _\theta J(\pi _\theta ) &= {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau \sim \pi _\theta } \left[ \sum _{t=0}^T \nabla _\theta \log \pi _\theta (a_t \mid s_t)\, u(\vec {R}(\tau )) \right] \nonumber \\ &= \sum _{t=0}^T {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t)\, u(\vec {R}(\tau )) \right] \nonumber \\ &= \sum _{t=0}^T {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{:t} \sim \pi _\theta } \left[ {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{t:} \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t)\, u(\vec {R}(\tau )) \mid \tau _{:t} \right] \right] \nonumber \\ &= \sum _{t=0}^T {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{:t} \sim \pi _\theta } \left[ {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{t:} \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t)\, u(\vec {R}^{-}_t + \vec {R}_t) \mid \tau _{:t} \right] \right] \end{aligned}$$
(A8)

Since \(\nabla _\theta \log \pi _\theta (a_t \mid s_t)\) is constant with respect to the inner expectation, we can take it out:

$$\begin{aligned} \nabla _\theta J(\pi _\theta ) = \sum _{t=0}^T {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{:t} \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t)\ {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{t:} \sim \pi _\theta } \left[ u(\vec {R}^{-}_t + \vec {R}_t) \mid \tau _{:t} \right] \right] \end{aligned}$$
(A9)

Because we augment the state space with the accrued reward, \(\vec {R}^{-}_t\) depends only on \(s_t\). Moreover, due to the Markov property of (MO)MDPs, the future return \(\vec {R}_t\) depends only on \(s_t, a_t\). Hence, \(u(\vec {R}^{-}_t + \vec {R}_t)\) does not depend on the past:

$$\begin{aligned} {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{t:} \sim \pi _\theta } \left[ u(\vec {R}^{-}_t + \vec {R}_t) \mid \tau _{:t} \right] = {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{t:} \sim \pi _\theta } \left[ u(\vec {R}^{-}_t + \vec {R}_t) \mid s_t, a_t \right] \end{aligned}$$
(A10)

Since the MOMDPs we consider have finitely many states and actions, there are only finitely many possible returns \(\vec {R}\). Thus, expanding the expectation in the previous equation results in:

$$\begin{aligned} {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{t:} \sim \pi _\theta } \left[ u(\vec {R}^{-}_t + \vec {R}_t) \mid s_t, a_t \right] = \sum _{\vec {R}} u(\vec {R}^{-}_t + \vec {R}) Pr\{\vec {R} \mid s_t, a_t \} \end{aligned}$$
(A11)

where \(Pr\{\vec {R} \mid s_t, a_t \}\) is the probability of obtaining future return \(\vec {R}\) from \(s_t, a_t\). The MOCAC policy gradient is thus:

$$\begin{aligned} \nabla _\theta J(\pi _\theta ) = \sum _{t=0}^T {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{:t} \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t) \sum _{\vec {R}} u(\vec {R}^{-}_t + \vec {R}) Pr\{\vec {R} \mid s_t, a_t \} \right] \end{aligned}$$
(A12)

MOCAC approximates the distribution over future returns as a categorical distribution \(Z_\psi\). Replacing \(Pr\{\vec {R} \mid s_t, a_t \}\) with this approximation yields Eq. (13) but without a baseline \(b(s_t)\). However, since the baseline only depends on \(s_t\) (and not on \(a_t\)), we have the property that:

$$\begin{aligned} {{\,\mathrm{{\mathbb {E}}}\,}}_{a_t \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t)\, b(s_t) \right] &= {{\,\mathrm{{\mathbb {E}}}\,}}_{a_t \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t) \right] b(s_t) \nonumber \\ &= \sum _{a \in {\mathcal {A}}} \pi _\theta (a \mid s_t)\, \nabla _\theta \log \pi _\theta (a \mid s_t)\, b(s_t) \nonumber \\ &\mathop {=}\limits ^{(\mathrm {A}6)} \sum _{a \in {\mathcal {A}}} \nabla _\theta \pi _\theta (a \mid s_t)\, b(s_t) \nonumber \\ &= b(s_t)\, \nabla _\theta \sum _{a \in {\mathcal {A}}} \pi _\theta (a \mid s_t) = b(s_t)\, \nabla _\theta 1 = 0 \end{aligned}$$
(A13)

We can now apply the same argument as in the proof of the original policy gradient theorem with baseline:

$$\begin{aligned} \nabla _\theta J(\pi _\theta ) &= \sum _{t=0}^T {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{:t} \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t)\ {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{t:} \sim \pi _\theta } \left[ u(\vec {R}^{-}_t + \vec {R}_t) - b(s_t) \mid s_t, a_t \right] \right] \nonumber \\ &= \sum _{t=0}^T {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{:t} \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t)\ {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{t:} \sim \pi _\theta } \left[ u(\vec {R}^{-}_t + \vec {R}_t) \mid s_t, a_t \right] \right. \nonumber \\ &\quad \left. -\, \nabla _\theta \log \pi _\theta (a_t \mid s_t)\ {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{t:} \sim \pi _\theta } \left[ b(s_t) \mid s_t, a_t \right] \right] \nonumber \\ &= \sum _{t=0}^T {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{:t} \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t)\ {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{t:} \sim \pi _\theta } \left[ u(\vec {R}^{-}_t + \vec {R}_t) \mid s_t, a_t \right] - 0 \right] \end{aligned}$$
(A14)

Thus, we can use a baseline that depends only on \(s_t\), as it does not affect the expectation \({{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{:t} \sim \pi _\theta }\). In the case of MOCAC, we use the expected utility for state \(s_t\), i.e. \(b(s_t) = \sum _{\vec {R}} u(\vec {R}^{-}_t + \vec {R}) Pr\{\vec {R} \mid s_t \}\). Plugging this into Eq. (A12) results in:

$$\begin{aligned} \nabla _\theta J(\pi _\theta ) = \sum _{t=0}^T {{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{:t} \sim \pi _\theta } \left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t) A(s_t, a_t) \right] \end{aligned}$$
(A15)

with

$$\begin{aligned} A(s_t, a_t) = \sum _{\vec {R}} u(\vec {R}^{-}_t + \vec {R}) Pr\{\vec {R} \mid s_t, a_t \} - \sum _{\vec {R}} u(\vec {R}^{-}_t + \vec {R}) Pr\{\vec {R} \mid s_t \} \end{aligned}$$
(A16)

We use this gradient to update the policy of MOCAC with gradient ascent.
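
To illustrate how Eqs. (A12), (A15) and (A16) can be computed with a categorical critic, the sketch below derives the per-action advantage from a discrete set of return atoms on which \(Z_\psi\) places probability mass. This is a sketch under our own assumptions about the atom representation and tensor shapes; the names (`mocac_advantage`, `probs_sa`, `atoms`, etc.) are hypothetical and not the authors' implementation.

```python
import torch

def mocac_advantage(accrued, atoms, probs_sa, policy_probs, utility_fn):
    """Per-action advantage A(s_t, a) of Eq. (A16) with a categorical critic.

    accrued:      tensor [n], accrued reward R^-_t (part of the augmented state)
    atoms:        tensor [K, n], the K multi-objective return atoms of Z_psi
    probs_sa:     tensor [A, K], Pr{R | s_t, a} for every action a
    policy_probs: tensor [A], pi_theta(a | s_t)
    utility_fn:   utility u: R^n -> R (applied atom-wise, may be non-linear)
    """
    # u(R^-_t + R) for every return atom R.
    utilities = torch.stack([utility_fn(accrued + r) for r in atoms])  # [K]
    # Expected utility per action: sum_R u(R^-_t + R) Pr{R | s_t, a}.
    expected_u_sa = probs_sa @ utilities                               # [A]
    # Baseline b(s_t): marginalise over actions, Pr{R | s_t} = sum_a pi(a|s_t) Pr{R | s_t, a}.
    baseline = (policy_probs * expected_u_sa).sum()
    return expected_u_sa - baseline
```

The advantage of the action actually taken, indexed as `mocac_advantage(...)[a_t]`, then multiplies \(\nabla _\theta \log \pi _\theta (a_t \mid s_t)\) exactly as in Eq. (A15).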

Appendix B: Environment details

B.1 Split

In the Split environment, the agent chooses between two hallways of length 11 to traverse. Including the start- and end-states, there are 26 states. The states are one-hot encoded, resulting in a 26-sized vector that is given as input to any of the actor and critic estimators.

Table 1 Hyperparameters for the Split, Deep Sea Treasure and Minecart environments

B.2 Deep sea treasure

Deep sea treasure is a grid-world environment in which a submarine moves on an \(11 \times 12\) grid, resulting in 132 different states. The state index is one-hot encoded, resulting in a 132-sized vector that is given as input to any of the actor and critic estimators. The rewards provided by the treasures are chosen such that every optimal \(\texttt {treasure}\times \texttt {fuel}\) combination is evenly spread over the convex coverage set. Treasure values are displayed in Fig. 8.
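
For concreteness, the one-hot encoding of a grid position could look as follows; the row-major flattening and the function name are our own assumptions.

```python
import numpy as np

def one_hot_state(row, col, n_rows=11, n_cols=12):
    """One-hot encoding of a Deep Sea Treasure grid position.

    Flattens the (row, col) coordinate of the 11 x 12 grid into a single index
    (row-major order is an assumption) and returns the 132-dimensional vector
    that is fed to the actor and critic estimators.
    """
    state = np.zeros(n_rows * n_cols, dtype=np.float32)
    state[row * n_cols + col] = 1.0
    return state
```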

Fig. 8 The Deep Sea Treasure environment. The agent starts in the top-left corner and tries to reach any of the treasures. Treasures that are further away are worth more

B.3 Minecart

The state is a top-down visual representation of the mining area. It displays the different mine locations, the cart and the base station. A frame of the environment can be seen in Fig. 9. The frames are pre-processed as follows: they are rescaled to \(42\times 42\), converted to grayscale and normalized. Moreover, we keep a history of 2 frames as observation. This is similar to the frame-preprocessing done by [16] on the Arcade Learning Environment suite (Table 1).
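
A possible implementation of this preprocessing pipeline is sketched below; the use of OpenCV, the RGB channel order of the raw frame, and dividing by 255 for normalization are assumptions on our part.

```python
import numpy as np
import cv2

def preprocess_frame(frame):
    """Rescale a raw Minecart frame to 42x42, convert to grayscale and normalize to [0, 1]."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (42, 42), interpolation=cv2.INTER_AREA)
    return small.astype(np.float32) / 255.0

def make_observation(previous_frame, current_frame):
    """Stack the 2 most recent pre-processed frames into an observation of shape (2, 42, 42)."""
    return np.stack([previous_frame, current_frame], axis=0)
```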

Fig. 9 The Minecart environment. The base is located at the top-left corner, while the 5 mines are spread around the environment

Appendix C: Hyperparameters

All hyperparameters used in these experiments, including the neural network architectures, are listed in Table 1. The hyperparameters used for the ablation study are listed in Table 2. Figure 11 shows results for Deep Sea Treasure using a linear utility function with different weight parameters. Figure 12 shows results for diverse utility functions on Fishwood, with an additional single-objective baseline.

Fig. 10 The base of each actor and critic for the convolutional neural network used for the Minecart environment

Table 2 Hyperparameters for the experiments on MiniRandom and Fishwood. Since the input size of the neural networks depends on the conditioning, the rows listing the number of neurons are variable
Fig. 11 Results for Deep Sea Treasure using a linear utility function with diverse sets of weights

Fig. 12 Results for diverse utility functions as in Fig. 4, but with the added single-objective A2C (terminal) baseline

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Reymond, M., Hayes, C.F., Steckelmacher, D. et al. Actor-critic multi-objective reinforcement learning for non-linear utility functions. Auton Agent Multi-Agent Syst 37, 23 (2023). https://doi.org/10.1007/s10458-023-09604-x
