1 Introduction

The past years have seen significant breakthroughs in agents that can gain abilities through interactions with their environment [23, 24], promising spectacular advances in society and industry. These advances are partly due to single-agent (deep) RL algorithms, i.e., a learning scheme in which the agent describes its world as a Markov decision process (MDP), other agents being part of that world, and the assumptions at the learning and execution phases being identical [31]. In this setting, policy gradient and (natural) actor-critic variants demonstrated impressive results with strong convergence guarantees [1, 8, 17, 32]. These methods search directly in the space of parameterized policies of interest, adjusting the parameters in the direction of the policy gradient. Unfortunately, extensions to cooperative multi-agent systems have restricted attention to either independent learners [28, 35] or multi-agent systems with common knowledge about the world [38], which are essentially single-agent systems.

In this paper, we instead consider cooperative multi-agent settings where learning is accomplished in a centralized manner, but execution must be independent. This paradigm allows us to break the independence assumption of decentralized multi-agent systems during the training phase only, while still preserving the ability to meet it during the execution phase. In many real-world cooperative multi-agent systems, conditions at the training phase do not need to be as strict as those at the execution phase. During rehearsal, for example, actors can read the script, take breaks, or receive feedback from the director, but none of this is possible during the show [19]. To win matches, a soccer coach develops (before the game) tactics that players apply during the game. So, it is natural to wonder whether the policy gradient approach in such a paradigm could be as successful as it is for the single-agent learning paradigm.

The CTDC paradigm has been successfully applied in planning methods for Dec-POMDPs, a framework of choice for sequential decision making by a team of cooperative agents [5, 9, 16, 26, 33]. In the game-theoretic literature, Dec-POMDPs are partially observable stochastic games with identical payoffs. They subsume many other collaborative multi-agent models, including multi-agent MDPs [7] and stochastic games with identical payoffs [30], to cite a few. The critical assumption that makes Dec-POMDPs significantly different from MDPs holds only at the execution phase: agents can neither see the real state of the world nor explicitly communicate their noisy observations with one another. Nonetheless, agents can share their local information at the training phase, as long as they act at the execution phase based solely on their individual experience. Perhaps surprisingly, this insight has been neglected so far, which explains why the formal treatment of CTDC has received little attention from the RL community [19]. When centralized training takes place in a simulator or a laboratory, one can exploit information that may not be available at execution time, e.g., hidden states, local information of the other agents, etc. Recent work in the (deep) multi-agent RL community builds upon this paradigm to design domain-specific methods [14, 15, 22], but the theoretical foundations of decentralized multi-agent RL are still in their infancy.

This paper investigates the theoretical foundations of policy gradient methods within the CTDC paradigm. In this paradigm, among policy gradient algorithms, actor-critic methods can train multiple independent actors (or policies) guided by a centralized critic (Q-value function) [14]. Methods of this family differ only in how they represent and maintain the centralized critic. The primary result of this article generalizes the policy gradient theorem and compatible function approximations from (PO)MDPs to Dec-POMDPs. In particular, these results show that the compatible centralized critic is the sum of individual critics, each of which is linear in the “features” of its corresponding individual policy. Even more interestingly, we derive update rules adjusting individual critics in the direction of the gradient of the centralized critic. Experiments demonstrate that our policy gradient methods compare favorably against techniques from standard RL paradigms on benchmarks from the literature. Proofs of our results are provided in the companion research report [6].

We organized the rest of this paper as follows. Section 2 gives formal definitions of POMDPs and Dec-POMDPs along with useful properties. In Sect. 3, we review the policy gradient methods for POMDPs, then pursue the review for cooperative multi-agent settings in Sect. 4. Section 5 develops the theoretical foundations of policy gradient methods for Dec-POMDPs and derives the algorithms. Finally, we present empirical results in Sect. 6.

2 Background

2.1 Partially Observable Markov Decision Processes

Consider a (centralized coordinator) agent facing the problem of influencing the behavior of a POMDP as it evolves through time. This setting often serves to formalize cooperative multi-agent systems, where all agents can explicitly and instantaneously communicate with one another their noisy observations.

Definition 1

Let \(M_1 \doteq (\mathcal {X},\mathcal {U},\mathcal {Z},p,r,T,s_0,\gamma )\) be a POMDP, where \(X_t\), \(U_t\), \(Z_t\) and \(R_t\) are random variables taking values in \(\mathcal {X}\), \(\mathcal {U}\), \(\mathcal {Z}\) and \(\mathbb {R}\), and representing the state of the environment, the control the agent took, and the observation and reward signal it received at time step \(t = 0,1,\ldots ,T\), respectively. State transition and observation probabilities \(p(x',z'|x,u) \doteq \mathbb {P}(X_{t+1} = x', Z_{t+1} = z' | X_t = x, U_t = u)\) characterize the world dynamics. \(r(x,u) \doteq \mathbb {E}[R_{t+1} | X_t = x, U_t = u]\) is the expected immediate reward. Quantities \(s_0\) and \(\gamma \in [0,1]\) define the initial state distribution and the discount factor.

We call the tth history \(o_t \doteq (o_{t-1},u_{t-1},z_t)\), where \(o_0 \doteq \emptyset \), the sequence of controls and observations the agent experienced up to time step \(t=0,1,\ldots ,T\). We denote by \(\mathcal {O}_t\) the set of histories the agent might experience up to time step t.

Definition 2

The agent selects control \(u_t\) through time using a parametrized policy \(\pi \doteq (a_0,a_1,\ldots ,a_{T})\), where \(a_t(u_t|o_t) \doteq \mathbb {P}_{\theta _t}(u_t | o_t)\) denotes the decision rule at time step \(t=0,1,\ldots ,T\), with parameter vector \(\theta _t\in \mathbb {R}^{\ell _t}\), where \(\ell _t\ll |\mathcal {O}_t|\).

In practice, we represent policies using a deep neural network, a finite-state controller, or a linear approximation architecture, e.g., Gibbs. Such policy representations rely on different (possibly lossy) descriptions of histories, called internal states. It is worth noticing that, when available, one can use p to compute a unique form of internal state, called a belief, which is a sufficient statistic of the history [3]. If we let \(b^o \doteq \mathbb {P}(X_t|O_t=o)\) be the current belief induced by history o, with initial belief \(b^\emptyset \doteq s_0\), then the next belief after taking control \(u\in \mathcal {U}\) and receiving observation \(z'\in \mathcal {Z}\) is:

$$\begin{aligned} b^{o,u,z'}(x')&\doteq \mathbb {P}\big (X_{t+1} = x'|O_{t+1} = (o,u,z')\big ) \propto \sum _{x\in \mathcal {X}} p(x',z'|x,u) b^o(x),&\forall x'\in \mathcal {X}. \end{aligned}$$

Hence, using beliefs instead of histories in the description of policies preserves the ability to act optimally, while significantly reducing the memory requirement. Doing so makes it possible to restrict attention to stationary policies, which are particularly useful for infinite-horizon settings, i.e., \(T = \infty \). Policy \(\pi \) is said to be stationary if \(a_0 = a_1 =\ldots = a\) and \(\theta _0 = \theta _1 = \ldots = \theta \); otherwise, it is non-stationary.
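To make the belief recursion above concrete, here is a minimal sketch in Python/NumPy. The function name and the array layout `p[x, u, x2, z2]` for \(p(x',z'|x,u)\) are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def belief_update(b, u, z_next, p):
    """One step of the belief filter: b^{o,u,z'}(x') is proportional to sum_x p(x',z'|x,u) b(x).

    b      : (|X|,) current belief over hidden states
    u      : index of the control just taken
    z_next : index of the observation just received
    p      : (|X|, |U|, |X|, |Z|) array with p[x, u, x2, z2] = P(x', z' | x, u)
    """
    unnormalized = np.einsum('x,xy->y', b, p[:, u, :, z_next])  # sum over current states x
    norm = unnormalized.sum()
    if norm == 0.0:
        raise ValueError("Observation has zero probability under the current belief.")
    return unnormalized / norm
```

Feeding the resulting belief, instead of the growing history, to the policy yields the stationary formulation discussed next.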

Through interactions with the environment under policy \(\pi \), the agent generates a trajectory of rewards, observations, controls and states \(\omega _{t:T} \doteq (x_{t:T},z_{t:T},u_{t:T})\). Each trajectory produces return \(R(\omega _{t:T}) \doteq \gamma ^0 r(x_t,u_t) + \cdots +\gamma ^{T-t} r(x_T,u_T)\). Policies of interest are those that achieve the highest expected return starting at \(s_0\)

$$\begin{aligned} J(s_0 ; \theta _{0:T}) \doteq \mathbb {E}_{\pi , M_1}[ R(\varOmega _{0:T}) ] = \int \mathbb {P}_{\pi , M_1}(\omega _{0:T}) R(\omega _{0:T}) \mathrm {d}\omega _{0:T} \end{aligned}$$
(1)

where \(\mathbb {P}_{\pi , M_1}(\omega _{0:T})\) denotes the probability of generating trajectory \(\omega _{0:T}\) under \(\pi \). Finding the best way for the agent to influence \(M_1\) consists in finding parameter vector \(\theta ^*_{0:T}\) that satisfies: \(\theta ^*_{0:T} \in \arg \max _{\theta _{0:T}}~J(s_0 ; \theta _{0:T})\).

It will prove useful to break the performance under policy \(\pi \) into pieces to exploit the underlying structure, i.e., the performance of \(\pi \) from time step t onward depends on earlier controls only through the current state and history. To this end, we define the value, Q-value and advantage functions under \(\pi \). The Q-value function under \(\pi \) is given by:

$$\begin{aligned} Q_t^\pi&:(x,o,u) \mapsto \mathbb {E}_{\pi , M_1}[ R(\varOmega _{t:T}) | X_t = x, O_t = o, U_t = u ],&\forall t=0,1,\ldots \end{aligned}$$
(2)

where \(Q_t^\pi (x,o,u)\) denotes the expected return of executing u starting in x and o at time step t and then following policy \(\pi \) from time step \(t+1\) onward. The value function under \(\pi \) is given by:

$$\begin{aligned} V_t^\pi&:(x,o) \mapsto \mathbb {E}_{a_t}[ Q^\pi _t(x,o,U_t)],&\forall t=0,1,\ldots \end{aligned}$$
(3)

where \(V_t^\pi (x,o)\) denotes the expected return of following policy \(\pi \) from time step t onward, starting in x and o. Finally, the advantage function under \(\pi \) is given by:

$$\begin{aligned} A_t^\pi&:(x,o,u) \mapsto Q^\pi _t(x,o,u) - V_t^\pi (x,o),&\forall t=0,1,\ldots \end{aligned}$$
(4)

where \(A_t^\pi (x,o,u)\) denotes the relative advantage of executing u starting in x and o at time step t and then following policy \(\pi \) from time step \(t+1\) onward. A useful property of these functions is that they satisfy certain recursions.

Lemma 3

(Bellman equations [4]).Q-value functions under \(\pi \) satisfy the following recursion: \( \forall t=0,1,\ldots , T\), \(\forall x\in \mathcal {X},o\in \mathcal {O}_t,u\in \mathcal {U}\),

$$\begin{aligned} Q_t^\pi (x,o,u) = r(x,u) + \gamma \mathbb {E}_{a_{t+1}, p}[ Q_{t+1}^\pi (X_{t+1},O_{t+1},U_{t+1}) | X_t = x, O_t = o, U_t = u ] \end{aligned}$$

Lemma 3 ties together \(V^\pi _{0:T}\), \(Q^\pi _{0:T}\) and \(A^\pi _{0:T}\), including the overall performance \(J(s_0;\theta _{0:T}) = \mathbb {E}_{s_0}[ V_0^\pi (X_0, \emptyset ) ]\).
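As a concrete reading of Lemma 3, the sketch below evaluates \(Q^\pi_t(x,o,u)\) for a small finite-horizon POMDP by backward recursion over explicit histories. The tensor layout and helper names are assumptions for illustration, and enumerating histories this way is only tractable for tiny problems.

```python
import numpy as np
from functools import lru_cache

def evaluate_q(p, r, policy, gamma, T):
    """Backward evaluation of Q_t^pi(x, o, u) via the Bellman recursion of Lemma 3.

    p      : (X, U, X, Z) array with p[x, u, x2, z2] = P(x', z' | x, u)
    r      : (X, U) array of expected immediate rewards
    policy : policy(t, o) -> (U,) array of probabilities a_t(. | o), with o a tuple of (u, z) pairs
    Returns a memoized function q(t, x, o, u).
    """
    X, U, _, Z = p.shape

    @lru_cache(maxsize=None)
    def q(t, x, o, u):
        value = float(r[x, u])
        if t == T:                      # no future rewards beyond the horizon
            return value
        for x2 in range(X):
            for z2 in range(Z):
                prob = p[x, u, x2, z2]
                if prob == 0.0:
                    continue
                o2 = o + ((u, z2),)     # extend the history with (control, next observation)
                a_next = policy(t + 1, o2)
                value += gamma * prob * sum(a_next[u2] * q(t + 1, x2, o2, u2) for u2 in range(U))
        return value

    return q
```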

So far we restricted our attention to systems under the control of a single agent. Next, we shall generalize to settings where multiple agents cooperate to control the same system in a decentralized manner.

2.2 Decentralized Partially Observable Markov Decision Processes

Consider a slightly different framework in which n agents cooperate when facing the problem of influencing the behavior of a POMDP, but can neither see the state of the world nor communicate their noisy observations with one another.

Definition 4

A Dec-POMDP \(M_n\doteq (\mathcal {I}_n,\mathcal {X},\mathcal {U},\mathcal {Z},p,R,T,\gamma ,s_0)\) is such that \(i\in \mathcal {I}_n\) indexes the ith agent involved in the process; \(\mathcal {X},\mathcal {U},\mathcal {Z},p,R,T,\gamma \) and \(s_0\) are as in \(M_1\); \(\mathcal {U}^i\) is an individual control set of agent i, such that \(\mathcal {U} = \mathcal {U}^1\times \cdots \times \mathcal {U}^n\) specifies the set of controls \(u = (u^1,\ldots ,u^n)\); \(\mathcal {Z}^i\) is an individual observation set of agent i, where \(\mathcal {Z} = \mathcal {Z}^1\times \cdots \times \mathcal {Z}^n\) defines the set of observations \(z=(z^1,\ldots ,z^n)\).

We call the individual history of agent \(i\in \mathcal {I}_n\), \(o_t^i=(o^i_{t-1},u_{t-1}^i,z^i_t)\) where \(o_0^i = \emptyset \), the sequence of controls and observations up to time step \(t=0,1,\ldots ,T\). We denote \(\mathcal {O}_t^i\), the set of individual histories of agent i at time step t.

Definition 5

Agent \(i\in \mathcal {I}_n\) selects control \(u^i_t\) at the tth time step using a parametrized policy \(\pi ^i \doteq (a^i_0,a^i_1,\ldots ,a^i_T)\), where \(a^i_t(u^i_t|o^i_t) \doteq \mathbb {P}_{\theta ^i_t}(u^i_t|o^i_t)\) is a parametrized decision rule, with parameter vector \(\theta ^i_t\in \mathbb {R}^{\ell ^i_t}\), assuming \(\ell ^i_t \ll |\mathcal {O}_t^i|\).

Similarly to \(M_1\), individual histories grow at every time step, which quickly becomes intractable. The only sufficient statistic for individual histories known so far [9, 11] relies on the occupancy state, given by \(s_t(x,o) \doteq \mathbb {P}_{\theta ^{1:n}_{0:T}, M_n}(x,o)\), for all \(x\in \mathcal {X}\) and \(o\in \mathcal {O}_t\). The individual occupancy state induced by individual history \(o^i\in \mathcal {O}^i_t\) is a conditional probability distribution: \(s^i_t(x,o^{-i}) \doteq \mathbb {P}(x,o^{-i}|o^i,s_t)\), where \(o^{-i}\) is the history of all agents except i. Learning to map individual histories to internal states close to individual occupancy states is hard, which limits the ability to find optimal policies in \(M_n\). One can instead restrict attention to stationary individual policies by mapping the history space into a finite set of possibly lossy representations of individual occupancy states, called internal states \(\varsigma \doteq (\varsigma ^1,\ldots ,\varsigma ^n)\), e.g., nodes of finite-state controllers or hidden states of a recurrent neural network (RNN). We define transition rules prescribing the next internal state given the current internal state, control and next observation as follows: \(\psi :(\varsigma ,u,z') \mapsto (\psi ^1(\varsigma ^1,u^1,z'^1), \ldots , \psi ^n(\varsigma ^n,u^n,z'^n))\), where \(\psi ^i :(\varsigma ^i,u^i,z'^i) \mapsto \varsigma '^i\) is an individual transition rule. In general, \(\psi \) and \(\psi ^{1:n}\) are stochastic transition rules. In the following, we consider these rules fixed a priori.
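The separability of the joint transition rule \(\psi \) reads directly as code; the short sketch below handles the deterministic case, with names chosen purely for illustration.

```python
def joint_transition(psi_individual, varsigma, u, z_next):
    """Joint internal-state transition: apply each agent's individual rule psi^i to its own
    internal state, control and next observation (deterministic case)."""
    return tuple(psi_i(s_i, u_i, z_i)
                 for psi_i, s_i, u_i, z_i in zip(psi_individual, varsigma, u, z_next))
```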

The goal of solving \(M_n\) is to find a joint policy \(\pi \doteq (\pi ^1,\ldots ,\pi ^n)\), i.e., a tuple of individual policies, one for each agent, that achieves the highest expected return starting at initial belief \(s_0\), \(J(s_0;\theta ^{1:n}_{0:T}) \doteq \mathbb {E}_{\pi ,M_n}[ R(\varOmega _{0:T}) ]\), that is, \(\theta ^{*,1:n}_{0:T} \in \arg \max _{\theta ^{1:n}_{0:T}}~J(s_0;\theta ^{1:n}_{0:T})\). \(M_n\) inherits all definitions introduced for \(M_1\), including functions \(V^\pi _{0:T}\), \(Q^\pi _{0:T}\) and \(A^\pi _{0:T}\) for a given joint policy \(\pi \).

3 Policy Gradient for POMDPs

In this section, we review the literature on policy gradient methods for centralized single-agent systems. In this setting, the policy gradient approach consists of a centralized algorithm that searches for the best \(\theta _{0:T}\) in the parameter space. Though we restrict attention to non-stationary policies, the methods discussed here easily extend to stationary policies when \(a_t= a\), i.e., \(\theta _t = \theta \), for all \(t=0,1,\ldots ,T\). Assuming \(\pi \) is differentiable w.r.t. its parameter vector \(\theta _{0:T}\), the centralized algorithm updates \(\theta _{0:T}\) in the direction of the gradient:

$$\begin{aligned} \varDelta \theta _{0:T} = \alpha \frac{\partial J(s_0;\theta _{0:T})}{\partial \theta _{0:T}}, \end{aligned}$$
(5)

where \(\alpha \) is the step-size. By iteratively applying such a centralized update rule with a correct estimate of the gradient, \(\theta _{0:T}\) usually converges towards a local optimum. Unfortunately, a correct estimation of the gradient may not be possible. To overcome this limitation, one can rely on an unbiased estimate of the gradient, turning (5) into a stochastic gradient update: \(\varDelta \theta _{0:T} = \alpha R(\omega _{0:T})\frac{\partial }{\partial \theta _{0:T}} \log \mathbb {P}_{\pi ,M_n}(\omega _{0:T})\). We can compute \(\frac{\partial }{\partial \theta _{0:T}} \log {\mathbb {P}_{\pi ,M_n}(\omega _{0:T})}\) with no knowledge of the trajectory distribution \(\mathbb {P}_{\pi ,M_n}(\omega _{0:T})\). Indeed, \(\mathbb {P}_{\pi , M_n}(\omega _{0:T}) \doteq s_0(x_0) \prod _{t=0}^{T} p(x_{t+1},z_{t+1}|x_t,u_t) a_t(u_t | o_t)\) implies:

$$\begin{aligned} \frac{\partial \log {\mathbb {P}_{\pi ,M_n}(\omega _{0:T})}}{\partial \theta _{0:T}} = \frac{\partial \log a_0(u_0|o_0)}{\partial \theta _0} + \ldots + \frac{\partial \log a_T(u_T|o_T)}{\partial \theta _T}. \end{aligned}$$

3.1 Likelihood Ratio Methods

Likelihood ratio methods, e.g., Reinforce [36], exploit the separability of parameter vectors \(\theta _{0:T}\), which leads to the following update rule:

$$\begin{aligned} \varDelta \theta _t&= \alpha \mathbb {E}_{\mathcal {D}}\left[ R(\omega _{0:T}) \frac{\partial \log {a_t(u_t|o_t)}}{\partial \theta _t}\right] \!\!\!,&\forall t=0,1,\ldots ,T \end{aligned}$$
(6)

where \(\mathbb {E}_{\mathcal {D}}[\cdot ]\) is the average over trajectory samples \(\mathcal {D}\) generated under policy \(\pi \). The primary issue with this centralized update rule is the high variance of \(R(\varOmega _{0:T})\), which can significantly slow down convergence. To mitigate this high variance, one can exploit two observations. First, future actions do not depend on past rewards, i.e., \(\mathbb {E}_{\mathcal {D}}[ R(\omega _{0:t-1})\frac{\partial }{\partial \theta _t} \log {a_t(u_t|o_t)}] = 0\). This insight allows us to use \(R(\omega _{t:T})\) instead of \(R(\omega _{0:T})\) in (6), resulting in a significant reduction in the variance of the policy gradient estimate. Second, the absolute value of \(R(\omega _{t:T})\) is not necessary to obtain an unbiased policy gradient estimate. Instead, we only need the relative value \(R(\omega _{t:T})-\beta _t(x_t,o_t)\), where \(\beta _{0:T}\) can be any arbitrary value function, often referred to as a baseline.
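The sketch below combines the likelihood-ratio update (6) with the two variance-reduction tricks just discussed, reward-to-go and a baseline. The trajectory format, the `grad_log_policy` and `baseline` callables, and the averaging over a batch are illustrative assumptions.

```python
import numpy as np

def reinforce_updates(trajectories, grad_log_policy, baseline, gamma, alpha, T):
    """Reinforce update of Eq. (6) using discounted reward-to-go and a baseline.

    trajectories    : list of episodes, each a list of (x_t, o_t, u_t, r_{t+1}) of length T+1
    grad_log_policy : (t, o_t, u_t) -> gradient of log a_t(u_t|o_t) w.r.t. theta_t (1-D array)
    baseline        : (t, x_t, o_t) -> scalar baseline value beta_t(x_t, o_t)
    Returns a list of updates Delta theta_t, one entry per time step.
    """
    deltas = [None] * (T + 1)
    for episode in trajectories:
        # discounted reward-to-go R(omega_{t:T}) from each time step onward
        returns, running = np.zeros(len(episode)), 0.0
        for t in reversed(range(len(episode))):
            running = episode[t][3] + gamma * running
            returns[t] = running
        for t, (x, o, u, _) in enumerate(episode):
            g = grad_log_policy(t, o, u) * (returns[t] - baseline(t, x, o))
            deltas[t] = g if deltas[t] is None else deltas[t] + g
    return [alpha * d / len(trajectories) if d is not None else None for d in deltas]
```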

3.2 Actor-Critic Methods

To further reduce the variance of the gradient estimate in (6), the policy gradient theorem [32] suggests replacing \(R(\omega _{t:T})\) by \(Q^\mathrm {w}_t(x_t,o_t,u_t)\), i.e., an approximate value of taking control \(u_t\) starting in state \(x_t\) and history \(o_t\) and then following policy \(\pi \) from time step \(t+1\) onward: \(Q^\mathrm {w}_t(x_t,o_t,u_t) \approx Q^\pi _t(x_t,o_t,u_t)\), where \(\mathrm {w}_t\in \mathbb {R}^{l_t}\) is a parameter vector with \(l_t \ll |\mathcal {X}||\mathcal {O}_t||\mathcal {U}|\). Doing so leads to the actor-critic algorithmic scheme, in which a centralized algorithm maintains both parameter vectors \(\theta _{0:T}\) and \(\mathrm {w}_{0:T}\): \(\forall t=0,1,\ldots ,T\),

$$\begin{aligned} \varDelta \mathrm {w}_t&= \alpha \mathbb {E}_{\mathcal {D}}\left[ \delta _t \frac{\partial \log {a_t(u_t|o_t)}}{\partial \theta _t}\right] \end{aligned}$$
(7)
$$\begin{aligned} \varDelta \theta _t&= \alpha \mathbb {E}_{\mathcal {D}}\left[ Q^\mathrm {w}_t(x_t,o_t,u_t) \frac{\partial \log {a_t(u_t|o_t)}}{\partial \theta _t}\right] \end{aligned}$$
(8)

where \(\delta _t \doteq \widehat{Q}^\pi _t(x_t,o_t,u_t) - Q^\mathrm {w}_t(x_t,o_t,u_t)\) and \(\widehat{Q}^\pi _t(x_t,o_t,u_t)\) is an unbiased estimate of the true Q-value \(Q^\pi _t(x_t,o_t,u_t)\).

The choice of parameter vector \(\mathrm {w}_{0:T}\) is critical to ensure the gradient estimation remains unbiased [32]. There is no bias whenever the Q-value functions \(Q^\mathrm {w}_{0:T}\) are compatible with the parametrized policy \(\pi \). Informally, a compatible function approximation \(Q^\mathrm {w}_{0:T}\) of \(Q^\pi _{0:T}\) should be linear in the “features” of policy \(\pi \), and its parameters \(\mathrm {w}_{0:T}\) should be the solution of the linear regression problem that estimates \(Q^\pi _{0:T}\) from these features. In practice, we often relax the second condition and update parameter vector \(\mathrm {w}_{0:T}\) using Monte-Carlo or temporal-difference learning methods.
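A minimal sketch of the update rules (7) and (8) with a compatible linear critic \(Q^{\mathrm w}_t = \phi_t^\top \mathrm w_t\), where \(\phi_t\) is the score \(\partial \log a_t(u_t|o_t)/\partial \theta_t\). The batch format and helper names are assumptions, and a Monte-Carlo return can play the role of the unbiased target \(\widehat{Q}^\pi_t\).

```python
import numpy as np

def actor_critic_step(batch, grad_log_policy, q_target, w, theta, alpha_w, alpha_theta):
    """One application of the update rules (7)-(8) with a compatible critic.

    batch           : list of (t, x, o, u) samples drawn under the current policy
    grad_log_policy : (t, o, u, theta_t) -> compatible feature phi = d log a_t(u|o) / d theta_t
    q_target        : (t, x, o, u) -> unbiased estimate of Q^pi_t, e.g. a Monte-Carlo return
    w, theta        : dicts mapping t to the critic / actor parameter vectors
    """
    dw = {t: np.zeros_like(v) for t, v in w.items()}
    dtheta = {t: np.zeros_like(v) for t, v in theta.items()}
    for (t, x, o, u) in batch:
        phi = grad_log_policy(t, o, u, theta[t])
        q_w = phi @ w[t]                       # compatible critic value
        delta = q_target(t, x, o, u) - q_w     # error delta_t of Eq. (7)
        dw[t] += delta * phi                   # critic update, Eq. (7)
        dtheta[t] += q_w * phi                 # actor update, Eq. (8)
    n = len(batch)
    for t in w:
        w[t] += alpha_w * dw[t] / n
        theta[t] += alpha_theta * dtheta[t] / n
    return w, theta
```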

3.3 Natural Actor-Critic Methods

Following the direction of the gradient might not always be the best option. In contrast, the natural gradient suggests updating the parameter vector \(\theta _{0:T}\) in the steepest ascent direction w.r.t. the Fisher information metric

$$\begin{aligned} \varvec{\varPhi }(\theta _t) \doteq \mathbb {E}_{\mathcal {D}}\left[ \frac{\partial \log {a_t(u_t|o_t)}}{\partial \theta _t} \left( \frac{\partial \log {a_t(u_t|o_t)}}{\partial \theta _t}\right) ^\top \right] . \end{aligned}$$
(9)

This metric is invariant to re-parameterizations of the policy. Combining the policy gradient theorem with compatible function approximations and then taking the steepest ascent direction, \(\mathbb {E}_{\mathcal {D}}[\varvec{\varPhi }(\theta _t)^{-1}\varvec{\varPhi }(\theta _t) w_t]\), results in the natural actor-critic algorithmic scheme, which replaces the update rule (8) by: \(\varDelta \theta _t = \alpha \mathbb {E}_{\mathcal {D}}[w_t]\).
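The following short sketch shows the empirical Fisher metric (9) and the resulting natural actor update; with a compatible critic no matrix inversion is needed, since the natural gradient collapses to the critic weights. Function names are illustrative.

```python
import numpy as np

def empirical_fisher(score_vectors):
    """Empirical estimate of the Fisher information metric (9) from sampled score
    vectors d log a_t(u_t|o_t) / d theta_t, stacked one per row."""
    scores = np.asarray(score_vectors)
    return scores.T @ scores / scores.shape[0]

def natural_actor_update(theta_t, w_t, alpha):
    """Natural actor-critic update: Phi(theta)^{-1} Phi(theta) w = w, so the steepest
    ascent direction w.r.t. the Fisher metric is simply the compatible critic weights."""
    return theta_t + alpha * w_t
```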

4 Policy Gradient for Multi-Agent Systems

In this section, we review extensions of single-agent policy gradient methods to cooperative multi-agent settings. We shall distinguish between three paradigms: centralized training for centralized control (CTCC) vs distributed training for decentralized control (DTDC) vs centralized training for decentralized control (CTDC), illustrated in Fig. 1.

Fig. 1.

Best viewed in color. For each paradigm, (left) CTCC, (center) CTDC, and (right) DTDC, we describe the actor-critic algorithmic scheme. Blue, green and red arrows represent, respectively: the forward control flow; the aggregation of information for the next time step; and the feedback signals back-propagated to update all parameters.

4.1 Centralized Training for Centralized Control (CTCC)

Some cooperative multi-agent applications permit cost-free instantaneous communication. Such applications can be modeled as POMDPs, making it possible to use single-agent policy gradient methods (Sect. 3). In this CTCC paradigm, see Fig. 1 (left), centralized single-agent policy gradient methods use a single critic and a single actor. The major limitation of this paradigm is also its strength: the requirement for instantaneous, free and noiseless communication among all agents until the end of the process, at both the training and execution phases.

4.2 Distributed Training for Decentralized Control (DTDC)

Perhaps surprisingly, the earliest multi-agent policy gradient method aims at learning, in a distributed manner, policies that are to be executed in a decentralized way, e.g., distributed Reinforce [28]. In this DTDC paradigm, see Fig. 1 (right), agents simultaneously but independently learn their individual policies via Reinforce, using multiple critics and multiple actors. The independence of parameter vectors \(\theta _{0:T}^1,\ldots ,\theta _{0:T}^n\) leads to the following distributed update rule:

$$\begin{aligned} \varDelta \theta ^i_t&= \alpha \mathbb {E}_{\mathcal {D}}\left[ R(\omega _{0:T}) \frac{\partial \log {a^i_t(u^i_t|o^i_t)}}{\partial \theta ^i_t}\right] ,&\forall t=0,1,\ldots ,T,\forall i\in I_n \end{aligned}$$
(10)

Interestingly, the sum of individual policy gradient estimates is an unbiased estimate of the joint policy gradient. However, how to exploit insights from actor-critic methods (Sect. 3) to combat the high variance of the joint policy gradient estimate remains an open question. Distributed Reinforce is restricted to the on-policy setting; off-policy methods can instead significantly improve exploration, i.e., learn the target joint policy \(\pi \) while following, and obtaining trajectories from, a behavioral joint policy \(\bar{\pi }\) [8].
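The distributed update rule (10) can be sketched as follows: each agent updates its own parameters using only its local history and control, scaled by the common team return. The data layout and helper names are assumptions for illustration.

```python
def distributed_reinforce_updates(episodes, score, alpha, n_agents, T):
    """Distributed Reinforce update of Eq. (10).

    episodes : list of (joint_histories, joint_controls, team_return) tuples, where
               joint_histories[t][i] and joint_controls[t][i] are agent i's local data at step t
    score    : (i, t, o_i, u_i) -> gradient of log a^i_t(u^i|o^i) w.r.t. theta^i_t (1-D array)
    Returns a dict mapping (agent, time step) to the update Delta theta^i_t.
    """
    deltas = {}
    for histories, controls, team_return in episodes:
        for t in range(min(T + 1, len(controls))):
            for i in range(n_agents):
                g = team_return * score(i, t, histories[t][i], controls[t][i])
                deltas[(i, t)] = deltas.get((i, t), 0.0) + g
    return {key: alpha * value / len(episodes) for key, value in deltas.items()}
```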

4.3 Centralized Training for Decentralized Control (CTDC)

The CTDC paradigm has been successfully applied in planning [2, 5, 9, 10, 11, 13, 16, 26, 27, 33, 34] and learning [12, 19, 20, 21] for \(M_n\). In such a paradigm, a centralized coordinator agent learns on behalf of all agents at the training phase and then assigns policies to corresponding agents before the execution phase takes place. Actor-critic algorithms in this paradigm, see Fig. 1 (center), maintain a centralized critic but learn multiple actors, one for each agent.

Recent work in (deep) multi-agent RL builds upon this paradigm [14, 15, 22], but lacks theoretical foundations, resulting in different specific forms of centralized critics, including individual critics with shared parameters [15] or counterfactual-regret-based centralized critics [14]. Theoretical results similar to ours were previously developed for collective multi-agent planning domains [25], i.e., a setting where all agents share the same policy, but their applicability to general Dec-POMDPs remains questionable.

5 Policy Gradient for Dec-POMDPs

In this section, we address the limitations of both the CTCC and DTDC paradigms and extend both the ‘vanilla’ and natural actor-critic algorithmic schemes from \(M_1\) to \(M_n\).

5.1 The Policy Gradient Theorem

Our primary result is an extension of the policy gradient theorem [32] from \(M_1\) to \(M_n\). First, we state the partial derivatives of value functions \(V^\pi _{0:T}\) w.r.t. the parameter vectors \(\theta _{0:T}^{1:n}\) for finite-horizon settings.

Lemma 6

For any arbitrary \(M_n\), target joint policy \(\pi \doteq (a_0,\ldots , a_T)\) and behavior joint policy \(\bar{\pi } \doteq (\bar{a}_0,\ldots , \bar{a}_T)\), the following holds, for any arbitrary \(t=0,1,\ldots ,T\), and agent \(i\in \mathcal {I}_n\), hidden state \(x_t\in \mathcal {X}\), and joint history \(o_t\in \mathcal {O}_t\):

$$\begin{aligned} \frac{\partial V^\pi _t(x_t,o_t) }{\partial \theta _t^i} = \mathbb {E}_{\bar{a}_t}\left[ \frac{a_t(U_t|o_t)}{\bar{a}_t(U_t|o_t)} Q^\pi _t(x_t,o_t,U_t)\frac{\partial \log {a^i_t(U^i_t|o^i_t)}}{\partial \theta _t^i}\right] \!\!. \end{aligned}$$
(11)

We are now ready to state the main result of this section.

Theorem 7

For any arbitrary \(M_n\), target joint policy \(\pi \doteq (a_0,\ldots , a_T)\) and behavior joint policy \(\bar{\pi } \doteq (\bar{a}_0,\ldots , \bar{a}_T)\), the following holds:

  1.

    for finite-horizon settings \(T<\infty \), any arbitrary \(t=0,1,\ldots ,T\) and \(i\in \mathcal {I}_n\),

    $$\begin{aligned} \frac{\partial J(s_0;\theta _{0:T}^{1:n})}{\partial \theta _t^i} = \gamma ^t\mathbb {E}_{\bar{a}_t,M_n} \left[ \frac{a_t(U_t|O_t)}{\bar{a}_t(U_t|O_t)} Q^\pi _t(X_t,O_t,U_t) \frac{\partial \log {a^i_t(U_t^i|O_t^i)}}{\partial \theta _t^i} \right] \!\!. \end{aligned}$$
  2.

    for infinite-horizon settings \(T=\infty \), and any arbitrary agent \(i\in \mathcal {I}_n\),

    $$\begin{aligned} \frac{\partial J(s_0;\theta ^{1:n})}{\partial \theta ^i} = \mathbb {E}_{\bar{s},\bar{a}} \left[ \frac{a(U|\Sigma )}{\bar{a}(U|\Sigma )} Q^\pi (X,\Sigma ,U) \frac{\partial \log {a^i(U^i|\Sigma ^i)}}{\partial \theta ^i} \right] \!\!, \end{aligned}$$

    where \(\bar{s}(x,\varsigma ) \doteq \sum _{t=0}^\infty \gamma ^t\mathbb {P}_{\bar{a}, \psi , M_n}(X_t=x,\Sigma _t = \varsigma )\).

While the policy gradient theorem for \(M_1\) [32] assumes a single agent learning to act in a (PO)MDP, Theorem 7 applies to multiple agents learning to control a POMDP in a decentralized manner. Agents act independently, but their policy gradient estimates are guided by a centralized Q-value function \(Q_{0:T}^\pi \). To use this property in practice, one needs to replace \(Q_{0:T}^\pi \) with a function approximation. To ensure this function approximation is compatible, i.e., that the corresponding gradient still points roughly in the direction of the true gradient, we carefully select its features. The following addresses this issue for \(M_n\).
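In practice, Theorem 7 translates into a simple sample-based estimator: reweight each sampled joint transition by the importance ratio, multiply by a centralized critic estimate, and project onto each agent's own score function. The sketch below assumes this data layout and the per-agent `score` helpers, which are illustrative.

```python
def decpomdp_policy_gradient(batch, agents, gamma):
    """Sample-based estimate of the per-agent gradients of Theorem 7 (finite horizon).

    batch  : list of (t, x, o, u, rho, q) tuples where o and u are joint histories/controls,
             rho = a_t(u|o) / abar_t(u|o) is the importance ratio w.r.t. the behavior policy,
             and q estimates the centralized critic Q^pi_t(x, o, u)
    agents : dict mapping agent index i to a score function score_i(t, o_i, u_i) returning
             d log a^i_t(u^i|o^i) / d theta^i_t (1-D array)
    Returns a dict of gradient estimates keyed by (agent, time step).
    """
    grads, counts = {}, {}
    for (t, x, o, u, rho, q) in batch:
        for i, score_i in agents.items():
            g = gamma ** t * rho * q * score_i(t, o[i], u[i])
            key = (i, t)
            grads[key] = grads.get(key, 0.0) + g
            counts[key] = counts.get(key, 0) + 1
    return {key: grads[key] / counts[key] for key in grads}
```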

5.2 Compatible Function Approximations

The main result of this section characterizes compatible function approximations \(V^\sigma _{0:T}\) and \(A^\nu _{0:T}\) for both the value function \(V_{0:T}^\pi \) and the advantage function \(A_{0:T}^\pi \) of any arbitrary \(M_n\), respectively. These functions together shall provide a function approximation for \(Q^\pi _{0:T}\) assuming \(Q^\pi _t(x_t,o_t,u_t) \doteq V^\pi _t(x_t,o_t) + A^\pi _t(x_t,o_t,u_t)\), for any time step \(t=0,1,\ldots , T\), state \(x_t\), joint history \(o_t\) and joint control \(u_t\).

Theorem 8

For any arbitrary \(M_n\), function approximations \(V^\sigma _{0:T}\) and \(A^\nu _{0:T}\), with parameter vectors \(\sigma _{0:T}^{1:n}\) and \(\nu _{0:T}^{1:n}\) respectively, are compatible with parametric joint policy \(\pi \doteq (a_0,\ldots ,a_T)\), with parameter vector \(\theta _{0:T}^{1:n}\), if one of the following holds: \(\forall t=0,1,\ldots ,T\)

  1.

    for any state \(x_t\in \mathcal {X}\), joint history \(o_t\in \mathcal {O}_t\), and agent \(i\in \mathcal {I}_n\),

    $$\begin{aligned} \frac{\partial V_t^\sigma (x_t, o_t)}{\partial \sigma ^i_t} = \mathbb {E}_{a^i_t}\left[ \frac{\partial \log {a_t^i(U_t^i|o_t^i)}}{\partial \theta _t^i} \right] \!\!. \end{aligned}$$
    (12)

    and \(\sigma \) minimizes the MSE \(\mathbb {E}_{\pi , M_n}[\epsilon _t(X_t,O_t,U_t)^2]\)

  2.

    for any state \(x_t\in \mathcal {X}\), joint history \(o_t\in \mathcal {O}_t\), joint control \(u_t\in \mathcal {U}\), and agent \(i\in \mathcal {I}_n\),

    $$\begin{aligned} \frac{\partial A_t^\nu (x_t,o_t,u_t)}{\partial \nu ^i_t} = \frac{\partial \log {a_t^i(u_t^i|o_t^i)}}{\partial \theta _t^i} \end{aligned}$$
    (13)

    and \(\nu \) minimizes the MSE \(\mathbb {E}_{\pi , M_n}[\epsilon _t(X_t,O_t,U_t)^2]\)

where \(\epsilon _t(x,o,u) \doteq Q^\pi _t(x,o,u) - V^\sigma _t(x,o) - A^\nu _t(x,o,u)\). Then, \(\frac{\partial }{\partial \theta _t^i} V_t^\pi (x_t,o_t)\) is given by

$$\begin{aligned} \mathbb {E}_{\bar{a}_t}\left[ \frac{a_t(U_t|o_t)}{\bar{a}_t(U_t|o_t)}\left( V^\sigma _t(x_t,o_t) + A^\nu _t(x_t,o_t,U_t)\right) \frac{\partial \log {a_t^i(U_t^i|o_t^i)}}{\partial \theta _t^i}\right] \!\!, \end{aligned}$$
(14)

for any behavior joint policy \(\bar{\pi } \doteq (\bar{a}_0,\ldots , \bar{a}_T)\).

We state Theorem 8 for non-stationary policies and \(T<\infty \), but the result naturally extends to infinite-horizon and stationary policies. The theorem essentially demonstrates how compatibility conditions generalize from \(M_1\) to \(M_n\). Notable properties of a compatible centralized critic include the separability w.r.t. individual approximators:

$$\begin{aligned} V_t^\sigma :{}&(x_t,o_t) \mapsto \sum _{i\in I_n} \mathbb {E}_{a^i_t}\left[ \frac{\partial \log {a_t^i(U_t^i|o_t^i)}}{\partial \theta _t^i} \right] ^\top \sigma ^i_t + \beta _t(x_t,o_t), \end{aligned}$$
(15)
$$\begin{aligned} A_t^\nu :{}&(x_t,o_t,u_t) \mapsto \sum _{i\in I_n} \left( \frac{\partial \log {a_t^i(u_t^i|o_t^i)}}{\partial \theta _t^i}\right) ^\top \nu ^i_t + \tilde{\beta }_t(x_t,o_t,u_t), \end{aligned}$$
(16)

where \(\beta _{0:T}\) and \(\tilde{\beta }_{0:T}\) are baselines independent of \(\theta _{0:T}^{1:n}\), \(\nu _{0:T}^{1:n}\) and \(\sigma _{0:T}^{1:n}\). Only one of (12) or (13) needs to hold to preserve the direction of the policy gradient. Similarly to the compatibility theorem for \(M_1\), the freedom granted by the potentially unconstrained approximation and the baselines can be exploited to reduce the variance of the gradient estimation, and also to take advantage of extra joint or hidden information unavailable to the agents at the execution phase. We can also benefit from the separability of both approximators at once to decrease the number of learned parameters and speed up the training phase for large-scale applications. Finally, the separability of the function approximators does not allow us to maintain individual critics independently; the gradient estimation is still guided by a centralized critic.
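The separability in (15) and (16) can be read directly as code: the centralized critic is evaluated by summing per-agent inner products between score features and individual parameter vectors. The function below is a sketch; `v_baseline` and `a_baseline` stand in for \(\beta_t\) and \(\tilde{\beta}_t\) and may use hidden information unavailable at execution time.

```python
def separable_critic(scores, expected_scores, nu, sigma, v_baseline=0.0, a_baseline=0.0):
    """Compatible centralized critic of Eqs. (15)-(16), written as a sum of individual terms.

    scores          : list over agents of d log a^i_t(u^i_t|o^i_t) / d theta^i_t (1-D arrays)
    expected_scores : list over agents of E_{a^i_t}[ d log a^i_t(U^i_t|o^i_t) / d theta^i_t ]
    nu, sigma       : lists of the individual parameter vectors nu^i_t and sigma^i_t
    Returns (V^sigma, A^nu, Q = V^sigma + A^nu) evaluated at (x_t, o_t, u_t).
    """
    v = sum(es @ s for es, s in zip(expected_scores, sigma)) + v_baseline
    a = sum(sc @ n for sc, n in zip(scores, nu)) + a_baseline
    return v, a, v + a
```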

5.3 Actor-Critic for Decentralized Control Algorithms

In this section, we derive actor-critic algorithms for \(M_n\) that exploit insights from Theorem 8, as illustrated in Algorithm 1, namely Actor-Critic for Decentralized Control (ACDC). This algorithm is model-free, centralized, off-policy and iterative. Each iteration consists of a policy evaluation and a policy improvement step. The policy evaluation composes a mini-batch based on trajectories sampled from \(\mathbb {P}_{\bar{\pi }, M_n}(\varOmega _{0:T})\) and the corresponding temporal-difference errors, see lines (6–11). The policy improvement updates \(\theta \), \(\nu \), and \(\sigma \) by taking the average over mini-batch samples and exploiting compatible function approximations, see lines (12–16), where \(\phi ^i_t(o_t,u_t) \doteq \frac{\partial }{\partial \theta ^i_{t,h}}\log a^i_t(u^i_t|o^i_t)\).

[Algorithm 1. Actor-Critic for Decentralized Control (ACDC)]
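Since the listing of Algorithm 1 is not reproduced here, the following sketch reconstructs one ACDC iteration from the textual description above. The helper names, the use of a Monte-Carlo return in place of the temporal-difference target, and the omitted baselines are all simplifying assumptions.

```python
import numpy as np

def acdc_iteration(sample_episode, score, expected_score, theta, nu, sigma,
                   gamma, alpha_theta, alpha_nu, alpha_sigma, batch_size, n_agents):
    """One iteration of ACDC, sketched from the description of Algorithm 1.

    sample_episode : () -> list of (o, u, rho, reward) joint transitions drawn under the
                     behavior joint policy, with rho = a_t(u|o) / abar_t(u|o)
    score          : (i, t, o_i, u_i) -> phi^i_t = d log a^i_t(u^i|o^i) / d theta^i_t
    expected_score : (i, t, o_i)      -> E_{a^i_t}[ d log a^i_t(U^i|o^i) / d theta^i_t ]
    theta, nu, sigma : dicts keyed by (agent, time step) holding actor / critic parameters
    """
    d_theta = {k: np.zeros_like(v) for k, v in theta.items()}
    d_nu = {k: np.zeros_like(v) for k, v in nu.items()}
    d_sigma = {k: np.zeros_like(v) for k, v in sigma.items()}

    for _ in range(batch_size):
        episode = sample_episode()
        # discounted returns-to-go, used here as critic targets
        returns, running = [0.0] * len(episode), 0.0
        for t in reversed(range(len(episode))):
            running = episode[t][3] + gamma * running
            returns[t] = running

        for t, (o, u, rho, _) in enumerate(episode):
            phis = {i: score(i, t, o[i], u[i]) for i in range(n_agents)}
            exp_phis = {i: expected_score(i, t, o[i]) for i in range(n_agents)}
            # separable compatible critic: Q = V^sigma + A^nu (Eqs. (15)-(16), baselines omitted)
            v = sum(exp_phis[i] @ sigma[(i, t)] for i in range(n_agents))
            a = sum(phis[i] @ nu[(i, t)] for i in range(n_agents))
            delta = returns[t] - (v + a)                 # critic regression error
            for i in range(n_agents):
                d_sigma[(i, t)] += delta * exp_phis[i]
                d_nu[(i, t)] += delta * phis[i]
                d_theta[(i, t)] += gamma ** t * rho * (v + a) * phis[i]  # Theorem 7 / Eq. (14)

    for k in theta:
        theta[k] += alpha_theta * d_theta[k] / batch_size
        nu[k] += alpha_nu * d_nu[k] / batch_size
        sigma[k] += alpha_sigma * d_sigma[k] / batch_size
    return theta, nu, sigma
```

Replacing the actor accumulation by \(\rho \,\nu ^i_t\) gives the natural variant NACDC discussed next.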

The step-sizes \(\alpha ^\theta _h\), \(\alpha ^\nu _h\) and \(\alpha ^\sigma _h\) should satisfy the standard Robbins and Monro conditions for stochastic approximation algorithms [29], i.e., \(\sum _{h=0}^\infty \alpha _h = \infty \), \(\sum _{h=0}^\infty \alpha _h^2 < \infty \). Moreover, according to [18], they should be scheduled such that \(\theta \) is updated at a slower time-scale than \(\nu \) and \(\sigma \) to ensure convergence. To achieve the maximum improvement of the joint policy for a fixed change of its parameters, the method of choice is the natural policy gradient [1, 17]. Natural ACDC (NACDC) differs from ACDC only in the update of the actors: \(\theta ^i_{t,h+1} \leftarrow \theta ^i_{t,h} + \alpha ^\theta _h \mathbb {E}_{\mathcal {D}_{t,h}}[ \frac{a_t(u_t|o_t)}{\bar{a}_t(u_t|o_t)} \nu _t^i]\). We elaborate on this analysis of the natural policy gradient in our companion research report [6].

We conclude this section with remarks on the theoretical properties of ACDC algorithms. First, they are guaranteed to converge with probability one to local optima under mild conditions, as they are true gradient descent algorithms [8]. The basic argument is that they minimize the mean square projected error by stochastic gradient descent, see [8] for further details. They further terminate at a local optimum that is also a Nash equilibrium, i.e., the partial derivatives of the centralized critic w.r.t. any parameter are zero only at an equilibrium point.

6 Experiments

In this section, we empirically demonstrate and validate the advantage of CTDC over the CTCC and DTDC paradigms. We show that ACDC methods compare favorably w.r.t. existing algorithms on many decentralized multi-agent domains from the literature. We also highlight limitations that preclude the current implementation of our methods from achieving better performance.

6.1 Experimental Setup

As discussed throughout the paper, there are many key components in actor-critic methods that can affect their performance. These key components include: training paradigms (CTCC vs DTDC vs CTDC); policy representations (stationary vs non-stationary policies); approximation architectures (linear approximations vs deep recurrent neural networks); and history representations (truncated histories vs hidden states of deep neural networks). We implemented three variants of actor-critic methods that combine these components. Unless otherwise mentioned, we will refer to each actor-critic method by the acronym of the paradigm in which it is implemented, e.g., CTDC for ACDC, plus its key components: “CTDC_TRUNC(K)” for ACDC where we use the K last observations instead of full histories (non-stationary policy), or “DTDC_RNN” for distributed Reinforce where we use RNNs (stationary policy), see Fig. 2.

Fig. 2.

Best viewed in color. Recurrent neural network architecture used to represent the actor of agent \(i\in I_n\). The boxes are standard neural network layers; labels denote the intermediate tensors computed during the forward pass and the number of parameters in each layer. An LSTM cell maintains an internal state updated using an embedding of the action-observation pair. A fully connected layer followed by a ReLU generates a feature vector \(\phi ^i\), which is combined by a second fully connected layer and then normalized by a softmax to obtain the conditional decision rule \(a^i(\cdot |\varsigma ^i)\).

We conducted experiments on a Dell Precision Tower 7910 equipped with a 16-core, 3 GHz Intel Xeon CPU, 16 GB of RAM and a 2 GB nVIDIA Quadro K620 GPU. We ran simulations on standard benchmarks from the Dec-POMDP literature, including Dec. Tiger, Broadcast Channel, Mars, Box Pushing, Meeting in a Grid, and Recycling Robots, see http://masplan.org. For the sake of conciseness, we report details on hyper-parameters in the companion research report [6].

6.2 History Representation Matters

In this section, we conducted experiments with the goal of gaining insight into how the representation of histories affects the performance of ACDC methods. Figure 3 depicts the comparison of truncated histories vs hidden states of deep neural networks. Results obtained using an \(\epsilon \)-optimal planning algorithm called FB-HSVI [9] are included as a reference. For short planning horizons, e.g., \(T=10\), CTDC_RNN quickly converges to good solutions in comparison to CTDC_TRUNC(1) and CTDC_TRUNC(3). This suggests CTDC_RNN learns more useful and concise representations of histories than the truncated representation. However, for some of the more complex tasks, such as Dec. Tiger, Box Pushing or Mars, no internal representation was able to perform optimally.

Fig. 3.

Comparison of different structures used to represent histories.

Overall, our experiments on history representations show promising results for RNNs, which have the advantage over truncated histories of automatically learning equivalence classes and compact internal representations based on the gradient back-propagated from the reward signal. Care should be taken, though, as some domains' planning horizons and other specific properties might cause early convergence to poor local optima. We are not entirely sure which specific features of the problems deteriorate performance, and we leave it for future work to explore better methods to train these architectures.

Fig. 4.

Comparison of the three paradigms for \(T=10\).

Fig. 5.

Comparison of the three paradigms for \(T=\infty \).

6.3 Comparing Paradigms Altogether

In this section, we compare the three paradigms: CTCC, DTDC, and CTDC. We complement our experiments with results from other Dec-POMDP algorithms: an \(\epsilon \)-optimal planning algorithm called FB-HSVI [9], and a sampling-based planning algorithm called Monte-Carlo Expectation-Maximization (MCEM) [37], which shares many similarities with actor-critic methods. It is worth noticing that we are not competing against FB-HSVI, as it is model-based. As for MCEM, we report the performances recorded in [37].

In almost all tested benchmarks, CTDC seems to get the better of the other two paradigms, for either \(T=10\) (Fig. 4) or \(T=\infty \) (Fig. 5). CTCC might suffer from the high dimensionality of the joint history space and fail to explore it efficiently before the learning step-sizes become negligible or the predefined number of training episodes is reached. Our on-policy sampling evaluation certainly amplified this effect. Having a much smaller history space to explore, CTDC outperforms CTCC in these experiments. Compared to DTDC, which also explores a smaller history space, there is a clear gain in using a compatible centralized critic in the CTDC paradigm, resulting in better performance. Even though CTDC achieves performance better than or equal to the state-of-the-art MCEM algorithm, there is still some margin for improvement to reach the global optima given by FB-HSVI on every benchmark. As previously mentioned, this is partly due to inefficient representations of histories.

7 Conclusion

This paper establishes the theoretical foundations of centralized actor-critic methods for Dec-POMDPs within the CTDC paradigm. In this paradigm, a centralized actor-critic algorithm learns independent policies, one for each agent, using a centralized critic. In particular, we show that the compatible centralized critic is the sum of individual critics, each of which is linear in the “features” of its corresponding individual policy. Experiments demonstrate that our actor-critic methods, namely ACDC, compare favorably against methods from standard RL paradigms on benchmarks from the literature. Current implementations of ACDC reveal a challenging open issue, namely the representation learning problem for individual histories, e.g., learning to map individual histories to individual occupancy states. We plan to address this limitation in the future. Whenever the representation of individual histories is not an issue, ACDC can exploit the separability of the centralized critic to scale up the number of agents. We are currently investigating a large-scale decentralized multi-agent application, where we plan to exploit this scalability property.