Hypervolume indicator and dominance reward based multi-objective Monte-Carlo Tree Search
Abstract
Concerned with multi-objective reinforcement learning (MORL), this paper presents MOMCTS, an extension of Monte-Carlo Tree Search to multi-objective sequential decision making, embedding two decision rules respectively based on the hypervolume indicator and the Pareto dominance reward. The MOMCTS approaches are first compared with the MORL state of the art on two artificial problems, the two-objective Deep Sea Treasure problem and the three-objective Resource Gathering problem. The scalability of MOMCTS is also examined in the context of the NP-hard grid scheduling problem, showing that the MOMCTS performance matches the (non-RL based) state of the art, albeit with a higher computational cost.
Keywords
Reinforcement learning · Monte-Carlo Tree Search · Multi-objective optimization · Sequential decision making

1 Introduction
Reinforcement learning (RL) (Sutton and Barto 1998; Szepesvári 2010) addresses sequential decision making in the Markov decision process framework. RL algorithms provide guarantees of finding the optimal policies in the sense of the expected cumulative reward, relying on the thorough exploration of the state and action spaces. The price to pay for these optimality guarantees is the limited scalability of mainstream RL algorithms w.r.t. the size of the state and action spaces.
Recently, Monte-Carlo Tree Search (MCTS), including the famed Upper Confidence Tree algorithm (Kocsis and Szepesvári 2006) and its variants, has been intensively investigated to handle sequential decision problems. MCTS, notably illustrated in the domain of Computer-Go (Gelly and Silver 2007), has been shown to efficiently handle medium-size state and action search spaces through a careful balance between the exploration of the search space and the exploitation of the best results found so far. While providing some consistency guarantees (Berthier et al. 2010), MCTS has demonstrated its merits and wide applicability in the domains of games (Ciancarini and Favini 2009) and planning (Nakhost and Müller 2009), among many others.
This paper is motivated by the fact that many real-world applications, including reinforcement learning problems, are most naturally formulated in terms of multi-objective optimization (MOO). In multi-objective reinforcement learning (MORL), the reward associated with a given state is d-dimensional (e.g. cost, risk, robustness) instead of a single scalar value (e.g. quality). To our knowledge, MORL was first tackled by Gábor et al. (1998); introducing a lexicographic (hence total) order on the policy space, the authors show the convergence of standard RL algorithms under the total order assumption. In practice, multi-objective reinforcement learning is often tackled by applying standard RL algorithms to a scalar aggregation of the objective values (e.g. optimizing their weighted sum; see also Mannor and Shimkin (2004), Tesauro et al. (2007)).
In the general case of antagonistic objectives, however (e.g. simultaneously minimizing the cost and the risk of a manufacturing process), two policies might be incomparable (e.g. the cheapest process for a fixed robustness vs. the most robust process for a fixed cost): solutions are partially ordered, and the set of optimal solutions according to this partial order is referred to as the Pareto front (more in Sect. 2). The goal of the so-called multiple-policy MORL algorithms (Vamplew et al. 2010) is to find several policies on the Pareto front (Natarajan and Tadepalli 2005; Chatterjee 2007; Barrett and Narayanan 2008; Lizotte et al. 2012).
The goal of this paper is to extend MCTS to multi-objective sequential decision making. The proposed scheme, called MOMCTS, basically aims at discovering several Pareto-optimal policies (decision sequences, or solutions) within a single tree. MOMCTS requires one to modify the exploration of the tree to account for the lack of a total order among the nodes, and for the fact that the desired result is a set of Pareto-optimal solutions (as opposed to a single optimal one). A first possibility considers the use of the hypervolume indicator (Zitzler and Thiele 1998), which measures the MOO quality of a solution w.r.t. the current Pareto front. Specifically, taking inspiration from Auger et al. (2009), this indicator is used to define a single optimization objective for the current path being visited in each MCTS tree-walk, conditioned on the other solutions previously discovered. MOMCTS thus handles a single-objective optimization problem in each tree-walk, while eventually discovering several decision sequences pertaining to the Pareto front. This approach, first proposed by Wang and Sebag (2012), suffers from two limitations. Firstly, the hypervolume indicator computation cost increases exponentially with the number of objectives. Secondly, the hypervolume indicator is not invariant under monotonic transformations of the objectives. The invariance property (satisfied for instance by comparison-based optimization algorithms) gives robustness guarantees which are most important w.r.t. ill-conditioned optimization problems (Hansen 2006).
Addressing these limitations, a new MOMCTS approach is proposed in this paper, using Pareto dominance to compute the instant reward of the current path visited by MCTS. Compared to the first approach, referred to as MOMCTS-hv in the remainder of this paper, the latter approach, referred to as MOMCTS-dom, has linear computational complexity w.r.t. the number of objectives, and is invariant w.r.t. monotonic transformations of the objectives.
Both MOMCTS approaches are empirically assessed and compared to the state of the art on three benchmark problems. Firstly, both MOMCTS variants are applied to two artificial benchmark problems, using MOQL (Vamplew et al. 2010) as baseline: the two-objective Deep Sea Treasure (DST) problem (Vamplew et al. 2010) and the three-objective Resource Gathering (RG) problem (Barrett and Narayanan 2008). A stochastic transition model is considered for both DST (originally deterministic) and RG, to assess the robustness of both MOMCTS approaches. Secondly, the real-world NP-hard problem of grid scheduling (Yu et al. 2008) is considered to assess the performance and scalability of the MOMCTS methods comparatively to the (non-RL based) state of the art.
The paper is organized as follows. Section 2 briefly introduces the formal background. Section 3 describes the MOMCTS-hv and MOMCTS-dom algorithms. Section 4 presents the experimental validation of the MOMCTS approaches. Section 5 discusses the strengths and limitations of the MOMCTS approaches w.r.t. the state of the art, and the paper concludes with some research perspectives.
2 Formal background
Assuming the reader’s familiarity with the reinforcement learning setting (Sutton and Barto 1998), this section briefly introduces the main notations and definitions used in the rest of the paper.
A Markov decision process (MDP) is described by its state and action space respectively denoted \(\mathcal{S}\) and \(\mathcal{A}\). The transition function (\(p : \mathcal{S}\times\mathcal{A}\times\mathcal{S}\mapsto[0,1]\)) gives the probability p(s,a,s′) of reaching state s′ by executing action a in state s. The (scalar) reward function is defined from the state × action space onto \(\mathbb{R}\) (\(r: \mathcal{S} \times\mathcal{A} \mapsto\mathbb{R}\)).
2.1 Multiobjective optimization
In multi-objective optimization (MOO), each point x in the search space \(\mathcal{X}\) is associated with a d-dimensional reward vector r_x in \(\mathbb{R}^{d}\), referred to as a vectorial reward in the following. With no loss of generality, it is assumed that each objective is to be maximized.
Definition 1

(Pareto dominance) Let u and v be two vectorial rewards in \(\mathbb{R}^{d}\). u is said to dominate v (noted u ≻ v) iff u_i ≥ v_i for all i = 1, …, d, with strict inequality for at least one i. Given a set A of vectorial rewards, the subset of points of A dominated by no other point in A, noted P_A, is referred to as the Pareto front of A.
Two categories of MOO problems are distinguished depending on whether they correspond to a convex or non-convex Pareto front. A convex Pareto front can be identified by solving a set of single-objective optimization problems defined on weighted sums of the objectives, referred to as linear scalarizations of the MOO problem (as done in MOQL, Sect. 4.2.1). When dealing with non-convex Pareto fronts (for instance, the DST problem, Vamplew et al. 2010, and the ZDT2 and DTLZ2 test benchmarks, Deb et al. 2002), however, the linear scalarization approach fails to discover the non-convex parts of the Pareto front (Deb 2001). Although many MOO problems have a convex Pareto front, especially in the two-objective case, the discovery of non-convex Pareto fronts remains a main challenge for MOO approaches (Deb et al. 2000; Beume et al. 2007).^{1}
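This failure mode can be illustrated with a toy example (hypothetical numbers): whatever the weights, a weighted-sum scalarization only ever returns points on the convex hull of the Pareto front, so a Pareto-optimal point below that hull is never found.

```python
# Toy 2-objective example (hypothetical rewards): (0.4, 0.55) is Pareto-optimal
# but lies below the segment joining the two extreme points (x + y = 1),
# i.e. in the non-convex region of the front.
front = [(0.0, 1.0), (0.4, 0.55), (1.0, 0.0)]

def weighted_sum_argmax(points, w):
    """Return the point maximising the linear scalarization w[0]*r1 + w[1]*r2."""
    return max(points, key=lambda r: w[0] * r[0] + w[1] * r[1])

# Sweep a fine grid of weight settings; the middle point is never selected.
selected = {weighted_sum_argmax(front, (lam, 1.0 - lam))
            for lam in [i / 100 for i in range(101)]}
assert (0.4, 0.55) not in selected          # non-convex point is unreachable
assert {(0.0, 1.0), (1.0, 0.0)} <= selected  # hull points are found
```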
2.2 Monte-Carlo Tree Search
Let us describe the best-known MCTS algorithm, referred to as Upper Confidence Tree (UCT) (Kocsis and Szepesvári 2006), which extends the Upper Confidence Bound algorithm (Auer et al. 2002) to tree-structured spaces. UCT simultaneously explores and builds a search tree, initially restricted to its root node, along N tree-walks a.k.a. simulations. Each tree-walk involves three phases:
The tree-building phase takes place upon arriving at a leaf node s; some action a is (uniformly or heuristically) selected and (s,a) is added as a child node of s. Accordingly, the number of nodes in the tree is the number of tree-walks.
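The UCT tree-walk can be sketched as follows; this is a minimal Python sketch, not the paper's implementation. The `simulate` roll-out and the action set are problem-specific placeholders, and the selection rule is the usual UCB-style criterion (mean reward plus exploration bonus, in the spirit of Eq. (1)).

```python
import math, random

class Node:
    def __init__(self):
        self.children = {}   # action -> Node
        self.visits = 0
        self.value = 0.0     # running mean of roll-out rewards

def ucb_select(node, c=1.0):
    """UCB-style selection: exploitation mean plus exploration bonus."""
    return max(node.children.items(),
               key=lambda kv: kv[1].value
               + c * math.sqrt(math.log(node.visits) / kv[1].visits))

def tree_walk(root, actions, simulate):
    """One UCT iteration: descend, expand one leaf child, back-propagate."""
    path, node = [root], root
    while node.children and all(ch.visits > 0 for ch in node.children.values()):
        node = ucb_select(node)[1]
        path.append(node)
    # tree-building phase: add one (here uniformly selected) child
    a = random.choice(actions)
    child = node.children.setdefault(a, Node())
    path.append(child)
    reward = simulate()            # random-phase roll-out (problem-specific)
    for n in path:                 # update phase: running means
        n.visits += 1
        n.value += (reward - n.value) / n.visits

random.seed(0)
root = Node()
for _ in range(100):
    tree_walk(root, actions=[0, 1, 2], simulate=random.random)
assert root.visits == 100  # the root is visited once per tree-walk
```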
3 Overview of MOMCTS
The main difference between MCTS and MOMCTS regards the node selection step. The challenge is to extend the single-objective node selection criterion (Eq. (1)) to the multi-objective setting. Since there is no total order between points in the multi-dimensional space, as mentioned, the most straightforward way of dealing with multi-objective optimization is to fall back on single-objective optimization by aggregating the objectives into a single one; the price to pay is that this approach yields a single solution on the Pareto front. Two aggregating functions (the hypervolume indicator and the cumulative discounted dominance reward), aimed at recovering a total order among points in the multi-dimensional reward space conditionally to the search archive, will be integrated within the MCTS framework.
The MOMCTS-hv algorithm is presented in Sect. 3.1 and its limitations are discussed in Sect. 3.2. The MOMCTS-dom algorithm, aimed at overcoming these limitations, is introduced in Sect. 3.3.
3.1 MOMCTS-hv
3.1.1 Node selection based on hypervolume indicator
The hypervolume indicator (Zitzler and Thiele 1998) provides a scalar measure of solution sets in the multi-objective space, as follows.
Definition 2

(Hypervolume indicator) Let A be a set of vectorial rewards in \(\mathbb{R}^{d}\) and let z in \(\mathbb{R}^{d}\) be a reference point dominated by every point in A. The hypervolume indicator HV(A; z) is the Lebesgue measure of the union of the hyper-rectangles spanned by z and each point of A, i.e. the volume of the region dominated by A and dominating z.
It is clear that all dominated points in A can be removed without modifying the hypervolume indicator (HV(A;z)=HV(P_A;z)). As shown by Fleischer (2003), the hypervolume indicator is maximized iff the points in P_A belong to the Pareto front of the MOO problem. Auger et al. (2009) show that, for d=2 and a number K of points, the hypervolume indicator maps a multi-objective optimization problem defined on \(\mathbb{R}^{d}\) onto a single-objective optimization problem on \(\mathbb{R}^{d \times K}\), in the sense that there exists at least one set of K points in \(\mathbb{R}^{d}\) that maximizes the hypervolume indicator w.r.t. z.
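For the two-objective case, the hypervolume indicator w.r.t. a reference point z can be computed by a simple sweep. The sketch below assumes both objectives are maximized, and also illustrates the HV(A;z)=HV(P_A;z) property stated above.

```python
def hypervolume_2d(points, z):
    """Area dominated by `points` and dominating the reference point z,
    both objectives maximised.  Dominated points contribute nothing,
    matching HV(A; z) = HV(P_A; z)."""
    hv, y_prev = 0.0, z[1]
    # sweep in decreasing order of the first objective; each point on the
    # non-dominated staircase adds a rectangular slab
    for x, y in sorted(points, reverse=True):
        if x > z[0] and y > y_prev:
            hv += (x - z[0]) * (y - y_prev)
            y_prev = y
    return hv

# union of boxes [0,2]x[0,1] and [0,1]x[0,2] has area 2 + 2 - 1 = 3
assert hypervolume_2d([(1.0, 2.0), (2.0, 1.0)], (0.0, 0.0)) == 3.0
# adding a dominated point leaves the indicator unchanged
assert hypervolume_2d([(1.0, 2.0), (2.0, 1.0), (0.5, 0.5)], (0.0, 0.0)) == 3.0
```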
Let P denote the archive of non-dominated vectorial rewards measured for every terminal state u (Sect. 2.2). It then comes naturally to define the value of any MCTS tree node as follows.
3.1.2 MOMCTS-hv algorithm
The MOMCTS-hv parameters include (i) the total number of tree-walks N; (ii) the parameter b used in the progressive widening heuristic (Sect. 2.2); (iii) the exploration vs exploitation trade-off parameter c_i for every i-th objective; and (iv) the reference point z.
3.2 Discussion
Let B denote the average branching factor in the MOMCTS-hv tree, and let N denote the number of tree-walks. As each tree-walk adds a new node, the number of nodes in the tree is N+1 by construction. The average length of a tree-path thus is in \(\mathcal{O}(\log{N})\). Depending on the number d of objectives, the hypervolume indicator is computed with complexity \(\mathcal{O}(P^{d/2})\) for d>3 (respectively \(\mathcal{O}(P)\) for d=2 and \(\mathcal{O}(P\log{P})\) for d=3) (Beume et al. 2009). The complexity of each tree-walk thus is \(\mathcal{O}(BP^{d/2}\log N)\), where P is at most the number N of tree-walks.
By construction, the hypervolume indicator based selection criterion (Eq. (4)) drives MOMCTS-hv towards the Pareto front and favours the diversity of the Pareto archive. On the negative side, however, the computational cost of W(s,a) is exponential in the number d of objectives. Besides, the hypervolume indicator is not invariant under monotonic transformations of the objective functions, which prevents the approach from enjoying the same robustness as comparison-based optimization approaches (Hansen 2006). Lastly, MOMCTS-hv critically depends on its hyper-parameters. The exploration vs exploitation (EvE) trade-off parameters c_i, i=1,2,…,d (Eq. (1)) of each objective have a significant impact on the performance of MOMCTS-hv (likewise, MCTS applicative results depend on the tuning of the EvE trade-off parameters (Chaslot et al. 2008)). Additionally, the choice of the reference point z also influences the hypervolume indicator values (Auger et al. 2009).
3.3 MOMCTS-dom
This section presents a new MOMCTS approach, based on the Pareto dominance test, aimed at overcoming the above limitations. Notably, this test has linear complexity w.r.t. the number of objectives, and is invariant under monotonic transformations of the objectives. As the dominance reward depends on the Pareto archive, which evolves along the search, the cumulative discounted dominance (CDD) reward mechanism is proposed to handle the search dynamics.
3.3.1 Node selection based on cumulative discounted dominance reward
A discount mechanism is used to moderate the cumulative effects, using the discount factor δ (0≤δ≤1) and taking into account the number Δt of tree-walks since the node was last visited. This discount mechanism is meant to cope with the dynamics of multi-objective search by forgetting old rewards, thus enabling the decision rule to reflect up-to-date information.
Indeed, the CDD process is reminiscent of the discounted cumulative reward defining the value function in reinforcement learning (Sutton and Barto 1998), with the difference that the time step t here corresponds to the tree-walk index, and that the discount mechanism is meant to limit the impact of past (as opposed to future) information.
In a stationary context, \(\hat{r}_{s,a;dom}\) would converge towards \(\frac{1}{1-\delta^{\varDelta t}} \bar{r}\), with Δt the average interval of time between two visits to the node. If the node gets visited exponentially rarely, \(\hat{r}_{s,a;dom}\) goes to \(\bar{r}\). On the contrary, if the node happens to be frequently visited (Δt close to 1), \(\bar{r}\) is multiplied by a large factor (\(\frac{1}{1-\delta}\)), entailing the over-exploitation of the node. However, the over-exploitation is bound to decrease as soon as the Pareto archive moves towards the true Pareto front. While this CDD reward was found to be empirically suited to the MOO setting (see also Maes et al. 2011), further work is required to analyze its properties.
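The convergence claim can be checked numerically, under the assumption that the CDD update has the geometric-discount form r̂ ← δ^{Δt}·r̂ + r (a hedged reading of Eq. (7), which is not reproduced here).

```python
# Assumed CDD update: discount the stored estimate by delta**dt (dt tree-walks
# since the last visit), then add the new dominance reward.
def cdd_update(r_hat, r, delta, dt):
    return delta ** dt * r_hat + r

# A node receiving a constant reward r_bar every dt tree-walks converges to
# the geometric-series fixed point r_bar / (1 - delta**dt), as stated above.
delta, dt, r_bar = 0.99, 5, 1.0
r_hat = 0.0
for _ in range(2000):
    r_hat = cdd_update(r_hat, r_bar, delta, dt)
assert abs(r_hat - r_bar / (1 - delta ** dt)) < 1e-6
```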
3.3.2 MOMCTS-dom algorithm
MOMCTS-dom proceeds as standard MCTS except for the update procedure, where Eq. (2) is replaced by Eq. (7). Keeping the same notations B, N and P as above, as the dominance test at the end of each tree-walk is linear (\(\mathcal{O}(dP)\)), the complexity of each tree-walk in MOMCTS-dom is \(\mathcal{O}(B\log N+dP)\), linear w.r.t. the number d of objectives.
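The dominance test itself is straightforward; the sketch below shows an O(dP) check of the current vectorial reward against the Pareto archive. The binary 0/1 reward is an illustrative assumption, since the paper's exact dominance reward (Eq. (7)) is not reproduced here.

```python
def dominates(u, v):
    """Pareto dominance: u at least as good on every objective, better on one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def dominance_reward(r, archive):
    """Illustrative binary reward: 1 if r is non-dominated w.r.t. the current
    archive, else 0.  Cost O(d * |archive|), linear in the number d of
    objectives."""
    return 0.0 if any(dominates(p, r) for p in archive) else 1.0

archive = [(0.0, 1.0), (1.0, 0.0)]
assert dominance_reward((0.5, 0.6), archive) == 1.0   # non-dominated
assert dominance_reward((0.5, -1.0), archive) == 0.0  # dominated by (1, 0)
```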
Besides the MCTS parameters N and b, MOMCTS-dom involves two additional hyper-parameters: (i) the exploration vs exploitation trade-off parameter c_e; and (ii) the discount factor δ.
4 Experimental validation
This section presents the experimental validation of the MOMCTS-hv and MOMCTS-dom algorithms.
4.1 Goals of experiments
The first goal is to assess the performance of the MOMCTS approaches comparatively to the state of the art in MORL (Vamplew et al. 2010). Two artificial benchmark problems (Deep Sea Treasure and Resource Gathering) with probabilistic transition functions are considered. The Deep Sea Treasure problem has two objectives which define a non-convex Pareto front (Sect. 4.2). The Resource Gathering problem has three objectives and a convex Pareto front (Sect. 4.3). The second goal is to assess the performance and scalability of the MOMCTS approaches in a real-world setting, that of grid scheduling problems (Sect. 4.4).
All reported results are averaged over 11 runs unless stated otherwise.
Indicators of performance
Two indicators are defined to measure the quality of solution sets in the multi-dimensional space. The first is the hypervolume indicator (Sect. 3.1.1). The second, inspired by the notion of regret, is defined as follows. Let P* denote the true Pareto front. The empirical Pareto front P found by a search process is assessed through its generational distance (Van Veldhuizen 1999) and inverted generational distance w.r.t. P*. The generational distance (GD) is defined by \(\mathit{GD}(P) = (\sqrt{\sum_{i=1}^{n} d_{i}^{2}})/n\), where n is the size of P and d_i is the Euclidean distance between the i-th point in P and its nearest point in P*. GD measures the average distance from points in P to the Pareto front. The inverted generational distance (IGD) is likewise defined as the average distance of points in P* to their nearest neighbours in P. For both generational and inverted generational distances, the smaller, the better.
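Both indicators are direct to implement; a minimal sketch follows, using the GD formula above and obtaining IGD by exchanging the roles of P and P*.

```python
import math

def gd(P, P_star):
    """Generational distance: (sqrt(sum_i d_i^2)) / n, with d_i the Euclidean
    distance from the i-th point of P to its nearest point in P*."""
    d = [min(math.dist(p, q) for q in P_star) for p in P]
    return math.sqrt(sum(x * x for x in d)) / len(P)

def igd(P, P_star):
    """Inverted generational distance: the same formula with the roles of
    P and P* exchanged."""
    return gd(P_star, P)

# a front that coincides with the reference front scores 0 on both indicators
ref = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
assert gd(ref, ref) == 0.0 and igd(ref, ref) == 0.0
assert gd([(0.0, 0.0)], [(3.0, 4.0)]) == 5.0  # single pair at distance 5
```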
The algorithms are also assessed w.r.t. their computational cost (measured on a PC with an Intel dual-core 2.66 GHz CPU).
4.2 Deep Sea Treasure
4.2.1 Baseline algorithm
As mentioned in the introduction, the state of the art in MORL considers a scalar aggregation (e.g. a weighted sum) of the rewards associated with all objectives. Several multiple-policy MORL algorithms have been proposed (Natarajan and Tadepalli 2005; Tesauro et al. 2007; Barrett and Narayanan 2008; Lizotte et al. 2012) using the weighted sum of the objectives (with several weight settings) as the scalar reward, which is optimized using standard reinforcement learning algorithms. These algorithms differ in how they share information between different weight settings and in which weight settings they choose to optimize. In the following, MOMCTS-dom is compared to Multi-Objective Q-Learning (MOQL) (Vamplew et al. 2010). The choice of MOQL as baseline is motivated by the fact that it yields all policies found by other linear-scalarization based approaches, provided that a sufficient number of weight settings is considered.
Formally, in the two-objective reinforcement learning case, MOQL independently optimizes m scalar RL problems through Q-learning, where the i-th problem considers the reward r_i=(1−λ_i)×r_a+λ_i×r_b, with 0≤λ_i≤1, i=1,2,…,m defining the m weight settings of MOQL, and r_a (respectively r_b) the first (resp. the second) objective reward. In its simplest version, the overall computational effort is equally divided between the m scalar RL problems. The computational effort allocated to each weight setting is further equally divided into n_tr training phases; after the j-th training phase, the performance of the i-th weight setting is measured by the two-dimensional vectorial reward, noted r_{i,j}, of the current greedy policy. The m vectorial rewards of all weight settings {r_{1,j}, r_{2,j}, …, r_{m,j}} together compose the Pareto front of MOQL at training phase j.
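The MOQL scalarization can be sketched as follows; the weight grid λ_i = (i−1)/(m−1) is the one used in the DST experiments (Sect. 4.2.2).

```python
# m evenly spread weight settings lambda_i = (i-1)/(m-1), i = 1..m
def weight_settings(m):
    return [(i - 1) / (m - 1) for i in range(1, m + 1)]

# scalar reward r_i = (1 - lambda_i) * r_a + lambda_i * r_b fed to Q-learning
def scalar_reward(r_a, r_b, lam):
    return (1 - lam) * r_a + lam * r_b

assert weight_settings(3) == [0.0, 0.5, 1.0]
# e.g. the DST extreme point (-19, 124) under the balanced setting lambda=0.5
assert scalar_reward(r_a=-19.0, r_b=124.0, lam=0.5) == 52.5
```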
4.2.2 Experimental setting

The MOQL settings are as follows:

- ε-greedy exploration is used with ε=0.1.
- The learning rate α is set to 0.1.
- The state-action value table is optimistically initialized (time=0, treasure=124).
- Due to the episodic nature of DST, no discounting is used in MOQL (γ=1).
- The number m of weight settings ranges in {3, 7, 21}, with \(\lambda_{i} = \frac{i-1}{m-1}\), i = 1, 2, …, m.
After a few preliminary experiments, the progressive widening parameter b is set to 2 in both MOMCTS-hv and MOMCTS-dom. In MOMCTS-hv, the exploration vs exploitation (EvE) trade-off parameters for the time cost and treasure value objectives are respectively set to c_time = 20,000 and c_treasure = 150. As the DST problem is concerned with minimizing the search time (maximizing its opposite) and maximizing the treasure value, the reference point used in the hypervolume indicator calculation is set to (−100, 0).
In MOMCTS-dom, the EvE trade-off parameter c_e is set to 1, and the discount factor δ is set to 0.999.
Experiments are carried out in a DST simulator with the noise level η ranging in {0, 1×10^{−3}, 1×10^{−2}, 5×10^{−2}, 0.1}. The training time of MOQL, MOMCTS-hv and MOMCTS-dom is limited to 300,000 time steps (ca. 37,000 tree-walks in MOMCTS-hv and 45,000 tree-walks in MOMCTS-dom). The entire training process is equally divided into n_tr = 150 phases. At the end of each training phase, the MOQL and MOMCTS solution sets are tested in the DST simulator and form the Pareto set P. The performance of the algorithms is reported as the hypervolume indicator of P.
4.2.3 Results
The DST problem: hypervolume indicator results of MOMCTS-hv, MOMCTS-dom and MOQL (m = 3, 7, 21) under different noise levels η, averaged over 11 independent runs. The optimal hypervolume indicator is 10455. For each η, significantly better results are indicated in bold font (significance value p<0.05 for Student's t-test).

|             | η=0       | η=1×10^{−3} | η=0.01    | η=0.05    | η=0.1     |
|-------------|-----------|-------------|-----------|-----------|-----------|
| MOMCTS-hv   | 10416±37  | 10434±31    | 10436±32  | 10205±211 | 9883±1091 |
| MOMCTS-dom  | 10450±4   | 10446±19    | 10389±65  | 9858±1153 | 9982±360  |
| MOQL m=3    | 7099±3926 | 8116±3194   | 6422±4353 | 7333±4411 | 6953±3775 |
| MOQL m=7    | 10078±34  | 10049±94    | 9495±1701 | 8345±2887 | 8924±2663 |
| MOQL m=21   | 10078±17  | 10085±129   | 7806±1933 | 8744±2070 | 6744±2355 |
Deterministic setting
Figure 3(b) shows the influence of m on MOQL. For m=7, MOQL reaches its performance plateau earlier than for m=21 (after circa 8,000 time steps vs 20,000 time steps), albeit with some instability. The instability increases when m is set to 3. The fact that MOQL m=3 fails to reach the MOQL performance plateau is explained as follows: the extreme point (−19,124) can be missed in some runs, as MOQL uses a discount factor of 1 (after Vamplew et al. 2010); the largest treasure, 124, might therefore be discovered later than at time step 19.
The percentage of times out of 11 runs that each non-dominated vectorial reward is discovered during at least one test episode of the training process of MOMCTS-hv, MOMCTS-dom and MOQL (m=21) is displayed in Fig. 4(b). This figure shows that MOQL discovers all strategies (including those lying in the non-convex regions of the Pareto front) during intermediate test episodes. However, these non-convex strategies are eventually discarded as the MOQL solution set gradually converges to the extreme strategies. Quite the contrary, the MOMCTS approaches discover all strategies on the Pareto front, and keep them in the search tree after they have been discovered. The weakness of MOMCTS-hv is that the longest decision sequences, corresponding to the vectorial rewards (−17,74) and (−19,124), need more time to be discovered. MOMCTS-dom successfully discovers all non-dominated vectorial rewards (in 10 out of 11 runs) and reaches an average hypervolume indicator performance slightly higher than that of MOMCTS-hv.
Stochastic setting
In summary, the empirical validation on the artificial DST problem shows both the strengths and the weaknesses of the MOMCTS approaches. On the positive side, the MOMCTS approaches show themselves able to find solutions lying in the non-convex regions of the Pareto front, as opposed to linear scalarization-based methods. Moreover, MOMCTS shows a relatively good robustness w.r.t. noise. On the negative side, the MOMCTS approaches are more computationally expensive than MOQL (for 300,000 time steps, MOMCTS-hv takes 147 secs and MOMCTS-dom takes 49 secs, versus 25 secs for MOQL).
4.3 Resource Gathering

The vectorial reward is defined as:

- (−1,0,0) in case of an enemy attack;
- (0,1,0) for returning home with only gold;
- (0,0,1) for returning home with only gems;
- (0,1,1) for returning home with both gold and gems;
- (0,0,0) in all other cases.
The optimal policies for the Resource Gathering problem:

| #   | Policy description                                      | Vectorial reward |
|-----|---------------------------------------------------------|------------------|
| π_1 | Go directly to gems, avoiding enemies                   | (0, 0, 0.1) |
| π_2 | Go to both gold and gems, avoiding enemies              | (0, 5.556×10^{−2}, 5.556×10^{−2}) |
| π_3 | Go directly to gold, avoiding enemies                   | (0, 8.333×10^{−2}, 0) |
| π_4 | Go to both gold and gems, through enemy1 or enemy2 once | (−7.75×10^{−3}, 6.977×10^{−2}, 6.977×10^{−2}) |
| π_5 | Go directly to gold, through enemy1 once                | (−1.075×10^{−2}, 9.677×10^{−2}, 0) |
| π_6 | Go to both gold and gems, through the enemies twice     | (−1.815×10^{−2}, 7.736×10^{−2}, 7.736×10^{−2}) |
| π_7 | Go directly to gold, through enemy1 twice               | (−2.628×10^{−2}, 1.1203×10^{−1}, 0) |
4.3.1 Experimental setting
In the RG problem, the MOMCTS approaches are assessed comparatively with the MOQL algorithm, which independently optimizes weighted sums of the three objective functions (enemy, gold, gems) under m weight settings. In the three-dimensional reward space, one weight setting is defined by a 2D vector \((\lambda_{i}, \lambda_{j}^{\prime})\), with \(\lambda_{i},\lambda_{j}^{\prime}\in[0,1]\) and \(0 \leq \lambda_{i} + \lambda_{j}^{\prime}\leq 1\). The scalar rewards optimized by MOQL are \(r_{i,j} = (1 - \lambda_{i} - \lambda_{j}^{\prime})\times r_{enemy} + \lambda_{i} \times r_{gold} + \lambda_{j}^{\prime}\times r_{gems}\), where the l weights λ_i (respectively \(\lambda_{j}^{\prime}\)) are evenly distributed in [0,1] for the gold (resp. gems) objective, subject to \(\lambda_{i} + \lambda_{j}^{\prime}\leq 1\); the total number of weight settings thus is \(m = \frac{l(l-1)}{2}\).
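Note that the stated counts m = 6, 15, 45 for l = 4, 6, 10 are recovered only if the constraint is read strictly (λ_i + λ'_j < 1, i.e. the enemy objective always keeps positive weight), an assumption made explicit in this sketch.

```python
from fractions import Fraction  # exact arithmetic avoids float-comparison issues

def rg_weight_settings(l):
    """Enumerate (lambda_i, lambda_j') with l values evenly spread in [0, 1]
    per objective, keeping pairs whose sum is strictly below 1 (assumption:
    strict inequality, which matches m = l(l-1)/2)."""
    vals = [Fraction(k, l - 1) for k in range(l)]
    return [(li, lj) for li in vals for lj in vals if li + lj < 1]

assert [len(rg_weight_settings(l)) for l in (4, 6, 10)] == [6, 15, 45]
```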
The MOQL settings in the RG problem are as follows:

- ε-greedy exploration is used with ε=0.2.
- The learning rate α is set to 0.2.
- The discount factor γ is set to 0.95.
- By taking l=4, 6, 10, the number m of weight settings ranges in {6, 15, 45}.
In MOMCTS-dom, the progressive widening parameter b is set to 1 (no progressive widening). The EvE trade-off parameter c_e is set to 0.1. The discount factor δ is set to 0.99.
The training time of all considered algorithms is 600,000 time steps (ca. 17,200 tree-walks for MOMCTS-hv and 16,700 tree-walks for MOMCTS-dom). As in the DST problem, the training process is equally divided into 150 phases. At the end of each training phase, the MOQL and MOMCTS solution sets are tested in the RG simulator. Each solution (strategy) is launched 100 times and is associated with its average vectorial reward (which might dominate the theoretical optimum due to the limited sample). The vectorial rewards of the solution set provided by each algorithm define its Pareto archive. The algorithm performance is set to the hypervolume indicator of the Pareto archive with reference point z=(−0.33, −1×10^{−3}, −1×10^{−3}). The optimal hypervolume indicator is 2.01×10^{−3}.
4.3.2 Results
The Resource Gathering problem: average hypervolume indicator of MOMCTS-hv, MOMCTS-dom and MOQL (m = 6, 15 and 45) over 11 runs. The optimal hypervolume indicator is 2.01×10^{−3}. Significantly better results are indicated in bold font (significance value p<0.05 for Student's t-test).

|            | HV(×10^{−3}) |                     | HV(×10^{−3}) |
|------------|--------------|---------------------|--------------|
| MOMCTS-hv  | 1.735±0.304  | MOMCTS-dom, δ=0.9   | 1.285±0.351  |
| MOQL, m=6  | 1.933±0.04   | MOMCTS-dom, δ=0.98  | 1.75±0.38    |
| MOQL, m=15 | 2.021±0.033  | MOMCTS-dom, δ=0.99  | 1.836±0.175  |
| MOQL, m=45 | 2.012±0.041  | MOMCTS-dom, δ=0.999 | 1.004±0.26   |
The MOMCTS approaches are outperformed by MOQL; their average hypervolume indicator reaches 1.8×10^{−3} at the end of the training process, which is explained by the fact that the MOMCTS approaches rarely find the risky policies (π_6, π_7) (Fig. 11). For example, policy π_6 visits the enemy square twice; the neighbor nodes of this policy thus get the (−1,0,0) reward (more in Sect. 5).
On the computational cost side, the average execution times of 600,000 training steps in MOMCTS-hv, MOMCTS-dom and MOQL are respectively 944 secs, 47 secs and 43 secs. As the size of the Pareto archive is close to 10 in most tree-walks of MOMCTS-hv and MOMCTS-dom, the fact that the MOMCTS-hv algorithm is 20 times slower than MOMCTS-dom matches their computational complexities.
4.4 Grid scheduling
Pertaining to the domain of autonomic computing (Tesauro et al. 2007), the problem of grid scheduling has been selected to investigate the scalability of the MOMCTS approaches; the reader is referred to Yu et al. (2008) for a comprehensive presentation of the field. Grid scheduling at large is concerned with scheduling the different tasks involved in jobs on different computational resources. As tasks are interdependent and resources are heterogeneous, grid scheduling defines an NP-hard combinatorial optimization problem (Ullman 1975).
Grid scheduling naturally aims at minimizing the so-called makespan, that is, the overall job completion time. But other objectives, such as energy consumption, monetary cost, or allocation fairness w.r.t. the resource providers, are becoming increasingly important. In the rest of Sect. 4.4, two objectives will be considered: the makespan and the cost of the solution.
4.4.1 Baseline algorithms
The state of the art in grid scheduling is achieved by stochastic optimization algorithms (Yu et al. 2008). The two prominent multi-objective variants thereof are NSGA-II (Deb et al. 2000) and SMS-EMOA (Beume et al. 2007).
Both algorithms can be viewed as importance sampling methods. They maintain a population of solutions, initially defined as random execution plans. Iteratively, the solutions with the best Pareto rank and the best crowding distance (a density estimate over neighboring points, in NSGA-II) or hypervolume indicator contribution (in SMS-EMOA) are selected and undergo unary and binary stochastic perturbations.
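For reference, the crowding distance used in NSGA-II's selection can be sketched as follows (the standard definition, not specific to this paper's experiments): boundary points on each objective get infinite distance, interior points accumulate the normalised gap between their two neighbours.

```python
def crowding_distance(points):
    """NSGA-II crowding distance over a set of d-dimensional points."""
    n, d = len(points), len(points[0])
    dist = [0.0] * n
    for k in range(d):
        # indices sorted by the k-th objective
        order = sorted(range(n), key=lambda i: points[i][k])
        lo, hi = points[order[0]][k], points[order[-1]][k]
        dist[order[0]] = dist[order[-1]] = float("inf")
        if hi > lo:
            for pos, i in enumerate(order[1:-1], start=1):
                # normalised gap between the two neighbours along objective k
                dist[i] += (points[order[pos + 1]][k]
                            - points[order[pos - 1]][k]) / (hi - lo)
    return dist

pts = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
d = crowding_distance(pts)
assert d[0] == float("inf") and d[2] == float("inf")  # boundary points
assert abs(d[1] - 2.0) < 1e-12                        # 1.0 per objective
```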
4.4.2 Experimental setting
A simulated grid environment containing 3 resources with different unit-time costs and processing capabilities (cost_1=20, speed_1=10; cost_2=2, speed_2=5; cost_3=1, speed_3=1) is defined. We first compare the performance of the MOMCTS approaches and the baseline algorithms on a realistic bioinformatics workflow, EBI_ClustalW2, which performs a ClustalW multiple sequence alignment using the EBI's WSClustalW2 service.^{3} This workflow contains 21 tasks and 23 precedence pairs (graph density q=12 %), assuming that all workloads are equal. Secondly, the scalability of the MOMCTS approaches is tested through experiments based on artificially generated workflows containing respectively 20, 30 and 40 tasks, with graph density q=15 %.
As evidenced in the literature (Wang and Gelly 2007), MCTS performance heavily depends on the so-called random phase (Sect. 2.2). Preliminary experiments showed that uniform action selection in the random phase was ineffective. A simple heuristic was thus used to devise a better-suited action selection criterion in the random phase, as follows.
The parameters of all algorithms have been selected through preliminary experiments, using the same amount of computational resources for a fair comparison. The progressive widening parameter b is set to 2 in both MOMCTS-hv and MOMCTS-dom. In MOMCTS-hv, the exploration vs. exploitation (EvE) trade-off parameters associated with the makespan and cost objectives, c_time and c_cost, are both set to 5×10^{−3}. In MOMCTS-dom, the EvE trade-off parameter c_e is set to 1, and the discount factor δ is set to 0.99. The parameters used for NSGA-II (respectively SMS-EMOA) involve a population of 200 (resp. 120) individuals, of which 100 are selected and undergo stochastic unary and binary variations (resp. one-point reordering, and resource exchange between two individuals). For all three algorithms, the evaluation budget (the number N of tree-walks in MOMCTS) is set to 10,000. The reference point in each experiment is set to (z_t, z_c), where z_t and z_c respectively denote the maximal makespan and cost.
As the true Pareto front of the considered problems is unknown, it is replaced by a reference Pareto front P^{∗} gathering all non-dominated vectorial rewards obtained over all runs of all three algorithms. The performance indicators are the generational distance (GD) and the inverted generational distance (IGD) between the Pareto front P found in a run and the reference front P^{∗}. In the grid scheduling experiment, the IGD indicator plays a role similar to that of the hypervolume indicator in the DST and RG problems.
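As an illustrative sketch of these two indicators (assuming the usual Euclidean-distance definitions following Van Veldhuizen 1999; this is not the paper's own code), GD averages the distance from each point of the found front P to its nearest neighbor in P^{∗}, while IGD averages the converse:

```python
import math

def _min_dist(point, front):
    # Euclidean distance from `point` to its nearest neighbor in `front`
    return min(math.dist(point, q) for q in front)

def gd(front, reference):
    # Generational distance: average distance from each point of the
    # found front P to the reference front P* (lower is better).
    return sum(_min_dist(p, reference) for p in front) / len(front)

def igd(front, reference):
    # Inverted generational distance: average distance from each point
    # of P* to the found front P (lower is better); unlike GD, it
    # penalizes fronts that cover only part of P*.
    return sum(_min_dist(q, front) for q in reference) / len(reference)
```

Both indicators are to be minimized; a front P identical to P^{∗} yields GD = IGD = 0, while a front concentrated on a single region of P^{∗} may still have a small GD but a large IGD.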
4.4.3 Results
Overall, the main weakness of the MOMCTS approaches is their computational runtime: the computational costs of MOMCTS-hv and MOMCTS-dom are respectively 5 and 2.5 times higher than those of NSGA-II and SMS-EMOA.^{4} This weakness should however be put in perspective, noting that in real-world problems the evaluation cost dominates the search cost by several orders of magnitude.
5 Discussion
As mentioned, the state of the art in MORL is divided into single-policy and multiple-policy algorithms (Vamplew et al. 2010). In the former case, the authors use a set of preferences between objectives, either user-specified or derived from the problem domain (e.g. defining preferred regions (Mannor and Shimkin 2004) or setting weights on the objectives (Tesauro et al. 2007)), to aggregate the multiple objectives into a single one. The strength of the single-policy approach is its simplicity; its long-known limitation is that it cannot discover policies lying in non-convex regions of the Pareto front (Deb 2001).
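This limitation can be illustrated on a toy example; the reward vectors and policy names below are hypothetical, chosen so that policy b is Pareto optimal yet lies in a non-convex region of the front, and is therefore missed by every weighted-sum scalarization:

```python
# Three non-dominated vectorial rewards (maximization); b is Pareto
# optimal but lies below the segment joining a and c, i.e. in a
# non-convex region of the Pareto front.
points = {"a": (0.0, 1.0), "b": (0.4, 0.4), "c": (1.0, 0.0)}

def linear_winner(w):
    # Policy selected when maximizing the scalarized reward w . r
    return max(points, key=lambda k: w[0] * points[k][0] + w[1] * points[k][1])

# Sweeping the weight settings never selects b: for b to win one would
# need 0.4 > 1 - w and 0.4 > w simultaneously, which is impossible.
winners = {linear_winner((w, 1.0 - w)) for w in (i / 100 for i in range(101))}
```

Under every weight setting, either a or c scores at least as high as b, so `winners` contains only a and c: no choice of weights recovers the non-convex Pareto optimal policy.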
In the multiple-policy case, multiple Pareto optimal vectorial rewards can be obtained by optimizing different scalarized RL problems under different weight settings. Natarajan and Tadepalli (2005) show that the efficiency of MOQL can be improved by sharing information between different weight settings. A hot topic in multiple-policy MORL is how to design the weight settings and share information among the different scalarized RL problems. When the Pareto front is known, the design of the weight settings is made easier, provided that the Pareto front is convex. When the Pareto front is unknown, an alternative proposed by Barrett and Narayanan (2008) is to maintain Q-vectors instead of Q-values for each (state, action) pair. Through an adaptive selection of the weight settings corresponding to the vectorial rewards on the boundary of the convex set of the current Q-vectors, this algorithm narrows down the set of selected weight settings, at the expense of a higher complexity of the value iteration in each state: the \(\mathcal{O}(SA)\) complexity of standard Q-learning is multiplied by a factor \(\mathcal{O}(n^{d})\), where n is the number of points on the convex hull of the Q-vectors and d is the number of objectives. While the approach provides optimality guarantees (n converges toward the number of Pareto optimal policies), the number of intermediate solutions can be huge (in the worst case, \(\mathcal{O}(A^{S})\)). Based on the assumption that the convex hull of the Q-vectors is piecewise linear, Lizotte et al. (2012) extend Barrett and Narayanan (2008) by narrowing down the set of points lying on the convex hull, thus keeping the value of n under control.
In the MOMCTS-hv approach, each tree node is associated with its average reward w.r.t. each objective, and the selection rule involves the scalar reward based on the hypervolume indicator (Zitzler and Thiele 1998), with complexity \(\mathcal{O}(BP^{d/2}\log N)\). On the one hand, this complexity is lower than that of a value iteration in Barrett and Narayanan (2008) (considering that the size of the archive P is comparable to the number n of non-dominated Q-vectors). On the other hand, this complexity is higher than that of MOMCTS-dom, where the dominance test only needs to be computed at the end of each tree-walk, with complexity linear in the number of objectives and tree-walks; the MOMCTS-dom complexity thus is \(\mathcal{O}(B\log N+dP)\). The price to pay for the improved scalability of MOMCTS-dom is that the dominance reward might favor the diversity of the Pareto archive less than the hypervolume indicator does: every non-dominated point gets the same dominance reward, whereas the hypervolume indicator is higher for non-dominated points in sparsely populated regions of the Pareto archive.
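The two quantities at stake can be sketched as follows; this is a minimal illustration restricted to the two-objective minimization case with a non-dominated archive, not the paper's implementation (whose \(\mathcal{O}(BP^{d/2}\log N)\) bound concerns the general d-objective computation):

```python
def dominates(u, v):
    # Pareto dominance for minimization: u dominates v iff u is no worse
    # on every objective and strictly better on at least one.
    # Cost is linear in the number of objectives.
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def hypervolume_2d(front, ref):
    # Two-objective hypervolume w.r.t. reference point `ref` (minimization):
    # the area dominated by the front. Assumes `front` is a non-dominated
    # set; sorting on the first objective makes the second one decreasing,
    # so the area decomposes into disjoint horizontal slabs.
    pts = sorted(p for p in front if dominates(p, ref))
    volume, prev_y = 0.0, ref[1]
    for x, y in pts:
        volume += (ref[0] - x) * (prev_y - y)
        prev_y = y
    return volume
```

The dominance test costs \(\mathcal{O}(d)\) per pair of points, while the hypervolume requires sorting (and, beyond two objectives, considerably more work), in line with the complexity gap discussed above.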
As shown in the Resource Gathering problem, the MOMCTS approaches have difficulties in finding “risky” policies, i.e. policies visiting nodes with many low-reward nodes in their neighborhood. A tentative explanation, already noted by Coquelin and Munos (2007), is that the UCT algorithm may require an exponential time to converge to the optimal node when this node is hidden behind nodes with low reward.
6 Conclusion and perspectives
This paper has pioneered the extension of MCTS to multiobjective reinforcement learning, based on two scalar rewards measuring the merits of a policy relatively to the non-dominated policies in the search tree. These rewards, respectively the hypervolume indicator and the dominance reward, have complementary strengths and weaknesses: the hypervolume indicator is computationally expensive, but it explicitly favors the diversity of the MOO policies, enforcing a good coverage of the Pareto front. On the contrary, the dominance test is linear in the number of objectives; it is further invariant under monotonic transformations of the objective functions, a robustness property much appreciated when dealing with ill-posed optimization problems.
These approaches have been validated on three problems: Deep Sea Treasure (DST), Resource Gathering (RG) and grid scheduling.
The experimental results on DST confirm a main merit of the proposed approaches: their ability to discover policies lying in non-convex regions of the Pareto front. To our knowledge,^{5} this feature is unique in the MORL literature.
By contrast, the MOMCTS approaches suffer from two weaknesses. Firstly, as shown on the grid scheduling problem, some domain knowledge is required in complex problems to enforce an efficient exploration in the random phase. Secondly, as evidenced in the Resource Gathering problem, the presented approaches hardly discover “risky” policies lying in an unpromising region (the proverbial needle in the haystack).
These first results nevertheless provide a proof of concept for the MOMCTS approaches, noting that they yield performance comparable to the (non RL-based) state of the art, albeit at the price of a higher computational cost.
This work opens two perspectives for further studies. The main theoretical perspective concerns the properties of the cumulative discounted reward mechanism in the general (single-objective) dynamic optimization context. On the applicative side, we plan to refine the RAVE heuristic used in the grid scheduling problem, e.g. to estimate the reward attached to task allocation pair orderings.
Footnotes
 1.
Notably, the chances for a Pareto front to be convex decrease with the number of objectives.
 2.
Another option is to use a dynamically weighted combination of the reward \(\hat{r}_{s,a}\) and RAVE(a) in Eq. (1).
 3.
The complete description is available at http://www.myexperiment.org/workflows/203.html.
 4.
On the workflow EBI_ClustalW2, the average execution times of MOMCTS-hv, MOMCTS-dom, NSGA-II and SMS-EMOA are respectively 142, 74, 31 and 32 seconds.
 5.
A general polynomial result for MOO has been proposed by Chatterjee (2007), claiming that for every irreducible MDP with multiple long-run average objectives, the Pareto front can be ϵ-approximated in time polynomial in ϵ. However, this claim relies on the assumption that finding some Pareto optimal point can be reduced to optimizing a single objective, namely a convex combination of the objectives using a set of positive weights (p. 2, Chatterjee 2007), which does not hold for non-convex Pareto fronts. Furthermore, the approach relies on the ϵ-approximation of the Pareto front proposed by Papadimitriou and Yannakakis (2000), which assumes the existence of an oracle telling for each vectorial reward whether it is ϵ-Pareto-dominated (Theorem 2, p. 4, Papadimitriou and Yannakakis 2000).
Notes
Acknowledgements
We wish to thank Jean-Baptiste Hoock, Dawei Feng, Ilya Loshchilov, Romaric Gaudel, and Julien Perez for many discussions on UCT, MOO and MORL. We are grateful to the anonymous reviewers for their many comments and suggestions on a previous version of the paper.
References
 Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47(2), 235–256.
 Auger, A., Bader, J., Brockhoff, D., & Zitzler, E. (2009). Theory of the hypervolume indicator: optimal μ-distributions and the choice of the reference point. In FOGA'09 (pp. 87–102). New York: ACM.
 Barrett, L., & Narayanan, S. (2008). Learning all optimal policies with multiple criteria. In W. W. Cohen, A. McCallum, & S. T. Roweis (Eds.), ICML'08 (pp. 41–47). New York: ACM.
 Berthier, V., Doghmen, H., & Teytaud, O. (2010). Consistency modifications for automatically tuned Monte-Carlo Tree Search. In C. Blum & R. Battiti (Eds.), LNCS: Vol. 6073. LION 4 (pp. 111–124). Berlin: Springer.
 Beume, N., Naujoks, B., & Emmerich, M. (2007). SMS-EMOA: multiobjective selection based on dominated hypervolume. European Journal of Operational Research, 181(3), 1653–1669.
 Beume, N., Fonseca, C. M., Lopez-Ibanez, M., Paquete, L., & Vahrenhold, J. (2009). On the complexity of computing the hypervolume indicator. IEEE Transactions on Evolutionary Computation, 13(5), 1075–1082.
 Chaslot, G., Chatriot, L., Fiter, C., Gelly, S., Hoock, J. B., Perez, J., Rimmel, A., & Teytaud, O. (2008). Combining expert, offline, transient and online knowledge in Monte-Carlo exploration (Technical Report). Paris: Lab. Rech. Inform. (LRI).
 Chatterjee, K. (2007). Markov decision processes with multiple long-run average objectives. In FSTTCS 2007: foundations of software technology and theoretical computer science (Vol. 4855, pp. 473–484).
 Ciancarini, P., & Favini, G. P. (2009). Monte-Carlo Tree Search techniques in the game of Kriegspiel. In C. Boutilier (Ed.), IJCAI'09 (pp. 474–479).
 Coquelin, P. A., & Munos, R. (2007). Bandit algorithms for tree search. Preprint arXiv:cs/0703062.
 Coulom, R. (2006). Efficient selectivity and backup operators in Monte-Carlo Tree Search. In Proc. computers and games (pp. 72–83).
 Deb, K. (2001). Multi-objective optimization using evolutionary algorithms (pp. 55–58). Chichester: Wiley.
 Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2000). A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II. In M. Schoenauer et al. (Eds.), LNCS: Vol. 1917. PPSN VI (pp. 849–858). Berlin: Springer.
 Deb, K., Thiele, L., Laumanns, M., & Zitzler, E. (2002). Scalable multi-objective optimization test problems. In Proceedings of the congress on evolutionary computation (CEC-2002) (pp. 825–830). Honolulu, USA.
 Fleischer, M. (2003). The measure of Pareto optima. Applications to multi-objective metaheuristics. In LNCS: Vol. 2632. EMO'03 (pp. 519–533). Berlin: Springer.
 Gábor, Z., Kalmár, Z., & Szepesvári, C. (1998). Multi-criteria reinforcement learning. In ICML'98 (pp. 197–205). San Mateo: Morgan Kaufmann.
 Gelly, S., & Silver, D. (2007). Combining online and offline knowledge in UCT. In Z. Ghahramani (Ed.), ICML'07 (pp. 273–280). New York: ACM.
 Hansen, N. (2006). The CMA evolution strategy: a comparing review. In Towards a new evolutionary computation (pp. 75–102). Berlin: Springer.
 Kocsis, L., & Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In J. Fürnkranz, T. Scheffer, & M. Spiliopoulou (Eds.), ECML'06 (pp. 282–293). Berlin: Springer.
 Lizotte, D. J., Bowling, M., & Murphy, S. A. (2012). Linear fitted-Q iteration with multiple reward functions. Journal of Machine Learning Research, 13, 3253–3295.
 Maes, F., Wehenkel, L., & Ernst, D. (2011). Automatic discovery of ranking formulas for playing with multi-armed bandits. In S. Sanner & M. Hutter (Eds.), LNCS: Vol. 7188. Recent advances in reinforcement learning—9th European workshop, EWRL 2011 (pp. 5–17). Berlin: Springer.
 Mannor, S., & Shimkin, N. (2004). A geometric approach to multi-criterion reinforcement learning. Journal of Machine Learning Research, 5, 325–360.
 Nakhost, H., & Müller, M. (2009). Monte-Carlo exploration for deterministic planning. In C. Boutilier (Ed.), IJCAI'09 (pp. 1766–1771).
 Natarajan, S., & Tadepalli, P. (2005). Dynamic preferences in multi-criteria reinforcement learning. In ICML'05. New York: ACM.
 Papadimitriou, C. H., & Yannakakis, M. (2000). On the approximability of trade-offs and optimal access of web sources. In FOCS (pp. 86–92). Los Alamitos: IEEE Computer Society.
 Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: an introduction. Cambridge: MIT Press.
 Szepesvári, C. (2010). Algorithms for reinforcement learning. San Rafael: Morgan & Claypool.
 Tesauro, G., Das, R., Chan, H., Kephart, J., Levine, D., Rawson, F., & Lefurgy, C. (2007). Managing power consumption and performance of computing systems using reinforcement learning. In J. C. Platt, D. Koller, Y. Singer, & S. T. Roweis (Eds.), NIPS'07 (pp. 1–8).
 Ullman, J. D. (1975). NP-complete scheduling problems. Journal of Computer and System Sciences, 10(3), 384–393.
 Vamplew, P., Dazeley, R., Berry, A., Issabekov, R., & Dekker, E. (2010). Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning, 84, 51–80.
 Van Veldhuizen, D. A. (1999). Multiobjective evolutionary algorithms: classifications, analyses, and new innovations (Technical report). DTIC Document.
 Wang, Y., & Gelly, S. (2007). Modifications of UCT and sequence-like simulations for Monte-Carlo Go. In CIG'07 (pp. 175–182). New York: IEEE Press.
 Wang, W., & Sebag, M. (2012). Multi-objective Monte-Carlo Tree Search. In Asian conference on machine learning.
 Wang, Y., Audibert, J., & Munos, R. (2008). Algorithms for infinitely many-armed bandits. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), NIPS'08 (pp. 1–8).
 Yu, J., Buyya, R., & Ramamohanarao, K. (2008). Workflow scheduling algorithms for grid computing. In Studies in computational intelligence (Vol. 146, pp. 173–214). Berlin: Springer.
 Zitzler, E., & Thiele, L. (1998). Multiobjective optimization using evolutionary algorithms—a comparative case study. In A. E. Eiben, T. Bäck, M. Schoenauer, & H. Schwefel (Eds.), LNCS: Vol. 1498. PPSN V (pp. 292–301). Berlin: Springer.