Abstract
We propose a novel multi-objective reinforcement learning algorithm that successfully learns the optimal policy even for non-linear utility functions. Non-linear utility functions pose a challenge for state-of-the-art approaches, both in terms of learning efficiency and in terms of the solution concept. A key insight is that, by using a critic that learns a multi-variate distribution over the returns, which is then combined with the accumulated rewards, we can directly optimize the utility function, even if it is non-linear. This vastly increases the range of problems that can be solved compared to single-objective methods or multi-objective methods requiring linear utility functions, while avoiding the need to learn the full Pareto front. We demonstrate our method on multiple multi-objective benchmarks, and show that it learns effectively where baseline approaches fail.
Notes
We note that in our preliminary workshop paper [1] we referred to this algorithm as expected utility policy gradient (EUPG).
References
Roijers, D. M., Steckelmacher, D., & Nowé, A. (2018). Multi-objective reinforcement learning for the expected utility of the return. In Proceedings of the adaptive and learning agents workshop at FAIM.
Reymond, M., Hayes, C., Roijers, D. M., Steckelmacher, D., & Nowé, A. (2021). Actor-critic multi-objective reinforcement learning for non-linear utility functions. In Multi-objective decision making workshop (MODeM 2021).
Castelletti, A., Pianosi, F., & Restelli, M. (2013). A multiobjective reinforcement learning approach to water resources systems operation: Pareto frontier approximation in a single run. Water Resources Research, 49(6), 3476–3486.
Jalalimanesh, A., Haghighi, H. S., Ahmadi, A., Hejazian, H., & Soltani, M. (2017). Multi-objective optimization of radiotherapy: distributed q-learning and agent-based simulation. Journal of Experimental & Theoretical Artificial Intelligence, 29(5), 1071–1086.
Roijers, D. M., Vamplew, P., Whiteson, S., & Dazeley, R. (2013). A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48, 67–113.
Hayes, C. F., Rădulescu, R., Bargiacchi, E., Källström, J., Macfarlane, M., Reymond, M., Verstraeten, T., Zintgraf, L. M., Dazeley, R., & Heintz, F. (2022). A practical guide to multi-objective reinforcement learning and planning. Autonomous Agents and Multi-Agent Systems, 36(1), 1–59.
Van Moffaert, K., Drugan, M. M., & Nowé, A. (2013). Scalarized multi-objective reinforcement learning: Novel design techniques. In 2013 IEEE symposium on adaptive dynamic programming and reinforcement learning (ADPRL) (pp. 191–199). IEEE.
Rădulescu, R., Mannion, P., Roijers, D. M., & Nowé, A. (2020). Multi-objective multi-agent decision making: A utility-based analysis and survey. Autonomous Agents and Multi-Agent Systems, 34(1), 10.
Roijers, D. M., & Whiteson, S. (2017). Multi-objective decision making. Synthesis Lectures on Artificial Intelligence and Machine Learning, 11(1), 1–129.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), 229–256.
Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International conference on machine learning (pp. 1928–1937).
Bellemare, M. G., Dabney, W., & Munos, R. (2017). A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887
Vamplew, P., Dazeley, R., Berry, A., Issabekov, R., & Dekker, E. (2011). Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning, 84(1–2), 51–80.
Abels, A., Roijers, D.M., Lenaerts, T., Nowé, A., & Steckelmacher, D. (2019). Dynamic weights in multi-objective deep reinforcement learning. In Proceedings of the 36th international conference on machine learning. Proceedings of machine learning research (Vol. 97, pp. 11–20). PMLR.
Burda, Y., Edwards, H., Storkey, A., & Klimov, O. (2019). Exploration by random network distillation. In International conference on learning representations.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., & Ostrovski, G. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529.
Perez, E., Strub, F., De Vries, H., Dumoulin, V., & Courville, A. (2018). FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence (Vol. 32).
Roijers, D. M., Whiteson, S., & Oliehoek, F. A. (2015). Computing convex coverage sets for faster multi-objective coordination. Journal of Artificial Intelligence Research, 52, 399–443.
Mossalam, H., Assael, Y. M., Roijers, D. M., & Whiteson, S. (2016). Multi-objective deep reinforcement learning. CoRR. arXiv:1610.02707
Barrett, L., & Narayanan, S. (2008). Learning all optimal policies with multiple criteria. In Proceedings of the 25th international conference on machine learning (pp. 41–47). ACM.
Hiraoka, K., Yoshida, M., & Mishima, T. (2009). Parallel reinforcement learning for weighted multi-criteria model with adaptive margin. Cognitive Neurodynamics, 3(1), 17–24.
Castelletti, A., Pianosi, F., & Restelli, M. (2012). Tree-based fitted q-iteration for multi-objective Markov decision problems. In The 2012 international joint conference on neural Networks (IJCNN) (pp. 1–8). IEEE.
Yang, R., Sun, X., & Narasimhan, K. (2019). A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Proceedings of the 33rd international conference on neural information processing systems. Red Hook, NY, USA: Curran Associates Inc.
Abdolmaleki, A., Huang, S., Hasenclever, L., Neunert, M., Song, F., Zambelli, M., Martins, M., Heess, N., Hadsell, R., & Riedmiller, M. (2020). A distributional view on multi-objective policy optimization. In International conference on machine learning (pp. 11–22). PMLR.
Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., & Riedmiller, M. (2018). Maximum a posteriori policy optimisation. In International conference on learning representations. https://openreview.net/forum?id=S1ANxQW0b.
Xu, J., Tian, Y., Ma, P., Rus, D., Sueda, S., & Matusik, W. (2020). Prediction-guided multi-objective reinforcement learning for continuous robot control. In International conference on machine learning (pp. 10607–10616). PMLR.
Vamplew, P., Dazeley, R., Barker, E., & Kelarev, A. (2009). Constructing stochastic mixture policies for episodic multiobjective reinforcement learning tasks. In Australasian joint conference on artificial intelligence (pp. 340–349). Springer.
Tesauro, G., Das, R., Chan, H., Kephart, J., Levine, D., Rawson, F., & Lefurgy, C. (2008). Managing power consumption and performance of computing systems using reinforcement learning. In Advances in neural information processing systems (pp. 1497–1504).
Neil, D., Segler, M., Guasch, L., Ahmed, M., Plumbley, D., Sellwood, M., & Brown, N. (2018). Exploring deep recurrent models with reinforcement learning for molecule design. In 6th International conference on learning representations (ICLR), workshop track.
Roijers, D. M., Zintgraf, L. M., Libin, P., & Nowé, A. (2018). Interactive multi-objective reinforcement learning in multi-armed bandits for any utility function. In ALA workshop at FAIM (Vol. 8).
Hayes, C. F., Reymond, M., Roijers, D. M., Howley, E., & Mannion, P. (2021). Distributional Monte Carlo tree search for risk-aware and multi-objective reinforcement learning. In Proceedings of the 20th international conference on autonomous agents and multiagent systems (pp. 1530–1532).
Van Moffaert, K., & Nowé, A. (2014). Multi-objective reinforcement learning using sets of pareto dominating policies. The Journal of Machine Learning Research, 15(1), 3483–3512.
Parisi, S., Pirotta, M., & Restelli, M. (2016). Multi-objective reinforcement learning through continuous pareto manifold approximation. Journal of Artificial Intelligence Research, 57, 187–227.
Reymond, M., & Nowé, A. (2019). Pareto-DQN: Approximating the Pareto front in complex multi-objective decision problems. In Proceedings of the adaptive and learning agents workshop (ALA-19) at AAMAS.
Reymond, M., Bargiacchi, E., & Nowé, A. (2022). Pareto conditioned networks. In Proceedings of the 21st international conference on autonomous agents and multiagent systems (pp. 1110–1118).
de Oliveira, T. H. F., de Souza Medeiros, L. P., Neto, A. D. D., & Melo, J. D. (2021). Q-managed: A new algorithm for a multiobjective reinforcement learning. Expert Systems with Applications, 168, 114228.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2020). Generative adversarial networks. Communications of the ACM, 63(11), 139–144.
Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114
Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using real NVP. arXiv preprint arXiv:1605.08803
Zintgraf, L. M., Roijers, D. M., Linders, S., Jonker, C. M., & Nowé, A. (2018). Ordered preference elicitation strategies for supporting multi-objective decision making. In Proceedings of the 17th international conference on autonomous agents and multiagent systems (pp. 1477–1485). International Foundation for Autonomous Agents and Multiagent Systems.
Roijers, D. M., Zintgraf, L. M., Libin, P., Reymond, M., Bargiacchi, E., & Nowé, A. (2020). Interactive multi-objective reinforcement learning in multi-armed bandits with gaussian process utility models. In Joint European conference on machine learning and knowledge discovery in databases (pp. 463–478). Springer.
Hayes, C. F., Verstraeten, T., Roijers, D. M., Howley, E., & Mannion, P. (2022). Expected scalarised returns dominance: A new solution concept for multi-objective decision making. Neural Computing and Applications, 1–21.
Acknowledgements
Conor F. Hayes is funded by the University of Galway Hardiman Scholarship. This research was supported by funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” program.
A preliminary non-archival version of this work was presented at the Adaptive and Learning Agents Workshop 2018 [1] and at the Multi-Objective Decision Making Workshop 2021 [2]. This article extends our previous work with additional theoretical analysis and new empirical results.
Appendices
Appendix A: MO policy gradient proofs
MOCAC is inspired by single-objective actor-critic methods, which make use of the Policy Gradient theorem to provide convergence guarantees towards a local optimum via gradient ascent. This theorem assumes that the policy maximizes the expected sum of rewards, which does not hold when using non-linear utility functions. In this section, we follow the original Policy Gradient proof to provide convergence guarantees for MOCAC.
Given a performance measure \(J(\pi _{\theta })\), with \(\theta\) the parameters of the policy \(\pi\), we would like to use its gradient \(\nabla _\theta J(\pi _{\theta })\) to update the policy using gradient ascent:
$$\theta \leftarrow \theta + \alpha \nabla _\theta J(\pi _{\theta }),$$
where \(\alpha\) is the learning rate.
We now derive this gradient for MOAC, the baseline that does not use a distributional critic in the ablation study, as well as for MOCAC, our main contribution.
A.1 Proof under SER
In Sect. 4.1 we perform an ablation study to show that using a distributional critic is necessary to optimize under the ESR criterion. Our ablation, which we call MOAC, does not use a distributional critic. MOAC thus optimizes under the SER criterion, as the utility function is applied to the V-value. We prove that MOAC indeed converges under SER. We define the performance measure we aim to maximize as:
$$J^{\text {SER}}(\pi _\theta ) = u\left( {\mathbb {E}}_{\tau \sim \pi _\theta }\left[ \vec {R}(\tau )\right] \right) = u\left( V^{\pi _\theta }(s_0)\right) ,$$
where \(\tau\) are trajectories sampled by following \(\pi\). The gradient of the policy performance, i.e. the policy gradient, follows from the chain rule:
$$\nabla _\theta J(\pi _\theta ) = \sum _i \frac{\partial u}{\partial V^{\pi _\theta }_i(s_0)}\, \nabla _\theta V^{\pi _\theta }_i(s_0),$$
with \(V^{\pi _\theta }_i\) the V-value for the i-th objective. Since this is a scalar, \(\nabla _\theta V^{\pi _\theta }_i(s_0)\) follows the original Policy Gradient proof (in our case with baseline):
$$\nabla _\theta V^{\pi _\theta }_i(s_0) = {\mathbb {E}}_{\tau \sim \pi _\theta }\left[ \sum _t \nabla _\theta \log \pi _\theta (a_t \mid s_t)\left( R_{t,i} - b_i(s_t)\right) \right] ,$$
with \(R_{t,i}\) the i-th component of the future return \(\vec {R}_t\).
Intuitively, our MOAC policy gradient is a weighted sum of the individual single-objective policy gradients, where each weight measures the importance of its objective via the utility function. Concretely, the MOAC policy gradient is the sum of the single-objective policy gradients, each multiplied by the corresponding partial derivative of the utility function evaluated at the V-value of \(s_0\). Since, in our setting, u is known, we assume we can compute its derivative:
We use this gradient to update the policy of MOAC with gradient ascent.
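As a numerical illustration of the chain-rule weighting above, the following sketch combines per-objective policy gradients into a single MOAC update direction. The utility function and all numbers are hypothetical placeholders (not from the paper's benchmarks), and the per-objective gradient vectors stand in for \(\nabla _\theta V^{\pi _\theta }_i(s_0)\):

```python
import numpy as np

def u(v):
    """Hypothetical non-linear utility over two objectives (illustrative only)."""
    return v[0] * np.sqrt(v[1])

def utility_gradient(u, v, eps=1e-6):
    """Central finite-difference gradient of u at the V-value vector v."""
    g = np.zeros_like(v)
    for i in range(len(v)):
        dv = np.zeros_like(v)
        dv[i] = eps
        g[i] = (u(v + dv) - u(v - dv)) / (2 * eps)
    return g

# Placeholder per-objective policy gradients, standing in for grad_theta V_i(s_0)
# for a 3-dimensional parameter vector theta:
per_objective_grads = np.array([[0.2, -0.1, 0.05],   # objective 0
                                [0.4,  0.3, -0.2]])  # objective 1

v0 = np.array([1.0, 4.0])            # V-values at the initial state s_0
w = utility_gradient(u, v0)          # chain-rule weights du/dV_i -> [2.0, 0.25]
moac_grad = w @ per_objective_grads  # weighted sum = MOAC policy gradient
```

The finite-difference helper is only a stand-in: since u is known analytically in our setting, its exact derivative can be used instead.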
A.2 Proof under ESR
While MOAC optimizes under SER, we are interested in the ESR setting, which is what MOCAC optimizes under. In this case, the performance measure we aim to maximize is:
$$J^{\text {ESR}}(\pi _\theta ) = {\mathbb {E}}_{\tau \sim \pi _\theta }\left[ u\left( \vec {R}(\tau )\right) \right] .$$
To derive its gradient, we use the log-derivative trick:
$$\nabla _\theta p_\theta (\tau ) = p_\theta (\tau )\, \nabla _\theta \log p_\theta (\tau ).$$
We derive the gradient \(\nabla _\theta J(\pi _\theta )\) of \(J(\pi _\theta )\) as follows:
$$\nabla _\theta J(\pi _\theta ) = {\mathbb {E}}_{\tau \sim \pi _\theta }\left[ u\left( \vec {R}(\tau )\right) \sum _t \nabla _\theta \log \pi _\theta (a_t \mid s_t)\right] .$$
Using the law of iterated expectations, we split the trajectory \(\tau\) into two parts: \(\tau _{:t}\) and \(\tau _{t:}\), the parts of the trajectory before timestep t and from t onwards, respectively. We also use our definition of accrued rewards (Sect. 3.1) to split the episodic return \(\vec {R}(\tau )\) into the accrued return \(\vec {R}^{-}_t\) and the future return \(\vec {R}_t\).
Since \(\nabla _\theta \log \pi _\theta (a_t \mid s_t)\) is constant with respect to the inner expectation, we can take it out:
Because we augment the state-space with the accrued reward, \(\vec {R}^{-}_t\) depends only on \(s_t\). Moreover, due to the Markovian property of (MO)MDPs, the future return \(\vec {R}_t\) depends only on \(s_t, a_t\). Hence, \(u(\vec {R}^{-}_t + \vec {R}_t)\) does not depend on the past:
Since a MOMDP is composed of a finite number of states and actions, there is a finite number of possible returns \(\vec {R}\). Thus, expanding the expectation in the previous equation results in:
where \(Pr\{\vec {R} \mid s_t, a_t \}\) is the probability of obtaining future return \(\vec {R}\) from \(s_t, a_t\). The MOCAC policy gradient is thus:
MOCAC approximates the distribution over future returns as a categorical distribution \(Z_\psi\). Replacing \(Pr\{\vec {R} \mid s_t, a_t \}\) with this approximation yields Eq. (13) but without a baseline \(b(s_t)\). However, since the baseline only depends on \(s_t\) (and not on \(a_t\)), we have the property that:
We can use the exact same proof as for the original policy gradient with baseline theorem:
$${\mathbb {E}}_{a_t \sim \pi _\theta }\left[ \nabla _\theta \log \pi _\theta (a_t \mid s_t)\, b(s_t)\right] = b(s_t)\, \nabla _\theta \sum _{a} \pi _\theta (a \mid s_t) = b(s_t)\, \nabla _\theta 1 = 0.$$
Thus, we can use a baseline that only depends on \(s_t\) as it does not affect the expectation \({{\,\mathrm{{\mathbb {E}}}\,}}_{\tau _{:t} \sim \pi _\theta }\). In the case of MOCAC, we use the expected utility for state \(s_t\), i.e. \(b(s_t) = \sum _{\vec {R}} u(\vec {R}^{-}_t + \vec {R}) Pr\{\vec {R} \mid s_t \}\). Plugging this into Eq. (A12) results in:
with
We use this gradient to update the policy of MOCAC with gradient ascent.
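The per-timestep scalar weight that multiplies \(\nabla _\theta \log \pi _\theta (a_t \mid s_t)\) in this gradient can be sketched as follows. The return atoms, critic probabilities, accrued reward and utility function below are illustrative placeholders, not values from the paper:

```python
import numpy as np

# Small categorical support over future returns for two objectives: each atom is
# a candidate future-return vector R, with probabilities from the critic.
atoms = np.array([[0.0, 1.0],
                  [1.0, 0.0],
                  [1.0, 1.0]])          # candidate future returns R
p_sa = np.array([0.2, 0.5, 0.3])        # Pr{R | s_t, a_t}, from the critic Z_psi
p_s  = np.array([0.3, 0.4, 0.3])        # Pr{R | s_t}, used for the baseline b(s_t)

accrued = np.array([2.0, 0.5])          # accrued reward R^-_t up to timestep t

def u(r):
    """Hypothetical non-linear utility (illustrative only)."""
    return min(r[0], r[1])

# Utility of each possible episodic return, accrued + future:
utilities = np.array([u(accrued + a) for a in atoms])
# Scalar weight on grad_theta log pi(a_t | s_t) at this timestep; subtracting
# Pr{R | s_t} implements the expected-utility baseline b(s_t):
weight = utilities @ (p_sa - p_s)
```

A positive weight reinforces the action taken at timestep t, a negative one suppresses it, exactly as the advantage does in standard actor-critic methods.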
Appendix B: Environment details
B.1 Split
In the Split environment, the agent chooses between two hallways of length 11 to traverse. Including the start and end states, there are 26 states. The states are one-hot encoded, resulting in a 26-dimensional vector that is given as input to the actor and critic estimators.
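A minimal sketch of this one-hot state encoding (the function name is our own; Deep sea treasure uses the same scheme with a 132-dimensional vector):

```python
import numpy as np

N_STATES = 26  # Split: the number of distinct states

def one_hot(state_idx, n=N_STATES):
    """Encode a state index as the one-hot vector fed to the actor and critic."""
    v = np.zeros(n, dtype=np.float32)
    v[state_idx] = 1.0
    return v

obs = one_hot(3)  # 26-dimensional observation for state 3
```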
B.2 Deep sea treasure
Deep sea treasure is a grid-world environment where the submarine moves on an \(11 \times 12\) grid, resulting in 132 different states. The state number is one-hot encoded, resulting in a 132-dimensional vector that is given as input to the actor and critic estimators. The treasure rewards are designed such that every optimal \(\texttt {treasure}\times \texttt {fuel}\) combination is evenly spread out on the convex coverage set. Treasure values are displayed in Fig. 8.
B.3 Minecart
The state is a top-down visual representation of the mining area. It displays the different mine locations, the cart and the base station. A frame of the environment can be seen in Fig. 9. The frames are pre-processed as follows: they are rescaled to \(42\times 42\), converted to grayscale and normalized. Moreover, we keep a history of 2 frames as observation. This is similar to the frame preprocessing done by [16] on the Arcade Learning Environment suite.
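A minimal sketch of such a preprocessing pipeline, assuming frames arrive as RGB arrays whose height and width are multiples of 42; a real implementation would typically use a proper image-resize routine (e.g. from OpenCV) rather than the naive average pooling shown here:

```python
import numpy as np
from collections import deque

def preprocess(frame):
    """Grayscale, downscale to 42x42 by average pooling, scale to [0, 1].

    Assumes an RGB frame whose height and width are multiples of 42.
    """
    gray = frame.mean(axis=2)                 # naive RGB -> grayscale
    h, w = gray.shape
    fh, fw = h // 42, w // 42
    pooled = gray.reshape(42, fh, 42, fw).mean(axis=(1, 3))  # block-average
    return pooled / 255.0

history = deque(maxlen=2)                     # 2-frame observation history

frame = np.random.randint(0, 256, size=(84, 84, 3)).astype(np.float32)
history.append(preprocess(frame))
history.append(preprocess(frame))
obs = np.stack(history)                       # shape (2, 42, 42)
```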
Appendix C: Hyperparameters
All hyperparameters used for these experiments, including the neural network architectures, are listed in Table 1. All hyperparameters used for the ablation study are listed in Table 2. Figure 11 shows results for Deep sea treasure using a linear utility function with different weight parameters. Figure 12 shows results for diverse utility functions on Fishwood, with an additional single-objective baseline.
Cite this article
Reymond, M., Hayes, C.F., Steckelmacher, D. et al. Actor-critic multi-objective reinforcement learning for non-linear utility functions. Auton Agent Multi-Agent Syst 37, 23 (2023). https://doi.org/10.1007/s10458-023-09604-x