Machine Learning, Volume 92, Issue 2–3, pp 403–429

Hypervolume indicator and dominance reward based multi-objective Monte-Carlo Tree Search



Abstract

Concerned with multi-objective reinforcement learning (MORL), this paper presents MOMCTS, an extension of Monte-Carlo Tree Search to multi-objective sequential decision making, embedding two decision rules based respectively on the hypervolume indicator and on the Pareto dominance reward. The MOMCTS approaches are first compared with the MORL state of the art on two artificial problems, the two-objective Deep Sea Treasure problem and the three-objective Resource Gathering problem. The scalability of MOMCTS is then examined on the NP-hard grid scheduling problem, where its performance matches the (non-RL-based) state of the art, albeit at a higher computational cost.


Keywords: Reinforcement learning · Monte-Carlo Tree Search · Multi-objective optimization · Sequential decision making
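The abstract names two multi-objective decision rules, based on the hypervolume indicator and on Pareto dominance, without detailing them. As a purely illustrative aid (not the authors' implementation), the following Python sketch shows the two ingredients such rules typically rely on: a Pareto dominance test against an archive of reward vectors, and the hypervolume improvement of a candidate vector for two maximised objectives with respect to a reference point. All function names, the archive representation and the maximisation convention are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """True if u Pareto-dominates v: no worse in every objective, better in at least one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def dominance_reward(point: Sequence[float], archive: List[Sequence[float]]) -> float:
    """Binary reward: 1 if `point` is not dominated by any vector already in the archive."""
    return 0.0 if any(dominates(a, point) for a in archive) else 1.0


def hypervolume_2d(points: List[Tuple[float, float]],
                   reference_point: Tuple[float, float]) -> float:
    """Area dominated by a set of 2-objective (maximised) points w.r.t. a reference point."""
    rx, ry = reference_point
    volume, best_y = 0.0, ry
    # Sweep points from largest to smallest first objective; every point that raises
    # the best second objective seen so far adds a rectangular strip to the area.
    for x, y in sorted(points, reverse=True):
        if x > rx and y > best_y:
            volume += (x - rx) * (y - best_y)
            best_y = y
    return volume


def hypervolume_reward(point: Tuple[float, float],
                       archive: List[Tuple[float, float]],
                       reference_point: Tuple[float, float]) -> float:
    """Hypervolume improvement obtained by adding `point` to the archive (illustrative)."""
    return (hypervolume_2d(archive + [point], reference_point)
            - hypervolume_2d(archive, reference_point))


if __name__ == "__main__":
    archive = [(3.0, 1.0), (1.0, 3.0)]        # current non-dominated reward vectors
    candidate = (2.0, 2.0)                    # reward vector of a new tree-walk
    print(dominance_reward(candidate, archive))                # 1.0: not dominated
    print(hypervolume_reward(candidate, archive, (0.0, 0.0)))  # 1.0: extends the front
```

In an MCTS setting, such a scalar could be back-propagated along the visited tree path in place of a single-objective reward; the precise way the paper's two decision rules use the hypervolume indicator and the dominance test is given in the full text, and the sketch above only fixes one simplified, plausible form.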



Acknowledgements

We wish to thank Jean-Baptiste Hoock, Dawei Feng, Ilya Loshchilov, Romaric Gaudel, and Julien Perez for many discussions on UCT, MOO and MORL. We are grateful to the anonymous reviewers for their many comments and suggestions on a previous version of the paper.



Copyright information

© The Author(s) 2013

Authors and Affiliations

  1. LRI, CNRS UMR 8623 & INRIA-Saclay, Université Paris-Sud, Orsay Cedex, France
