An incremental off-policy search in a model-free Markov decision process using a single sample path
Abstract
In this paper, we consider a modified version of the control problem in a model-free Markov decision process (MDP) setting with large state and action spaces. The control problem most commonly addressed in the contemporary literature is to find an optimal policy which maximizes the value function, i.e., the long-run discounted reward of the MDP. The current settings also assume access to a generative model of the MDP, with the hidden premise that observations of the system behaviour in the form of sample trajectories can be obtained with ease from the model. In this paper, we consider a modified version, where the cost function is the expectation of a non-convex function of the value function, without access to the generative model. Rather, we assume that a sample trajectory generated using an a priori chosen behaviour policy is made available. In this restricted setting, we solve the modified control problem in its true sense, i.e., we find the best possible policy given this limited information. We propose a stochastic approximation algorithm based on the well-known cross entropy method which is data (sample trajectory) efficient, stable, robust, as well as computationally and storage efficient. We provide a proof of convergence of our algorithm to a policy which is globally optimal relative to the behaviour policy. We also present experimental results to corroborate our claims, and we demonstrate the superiority of the solution produced by our algorithm over that of the state-of-the-art algorithms under an appropriately chosen behaviour policy.
Keywords
Markov decision process · Off-policy prediction · Control problem · Stochastic approximation method · Cross entropy method · Linear function approximation · ODE method · Global optimization
1 Summary of notation
2 Introduction and preliminaries
The two fundamental questions most commonly addressed in the MDP literature are (1) the prediction problem and (2) the control problem.
2.1 Model-free algorithms
The prediction and control algorithms of the above section are numerical methods that assume that the probability transition function P and the reward function R are available. In most practical scenarios, it is unrealistic to assume that accurate knowledge of P and R is realizable. However, the behaviour of the system can be observed, and one needs to either predict the value of a given policy or find the optimal control using the available observations. The observations are in the form of a sample trajectory \(\{s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, \dots \}\), where \(s_i \in {\mathbb {S}}\) is the state, \(a_i\) is the action and \(r_i = R(s_i, a_i, s_{i+1})\) is the immediate reward at time instant i. Model-free algorithms are basically of three types: (i) indirect methods, (ii) direct methods and (iii) policy search methods. The last of these searches the policy space to find the optimal policy, where the performance measure used for comparison is the estimate of the value function induced from the observations. Prominent algorithms in this category are actor-critic (Konda and Tsitsiklis 2003), policy gradient (Baxter and Bartlett 2001), natural actor-critic (Bhatnagar et al. 2009) and fast policy search (Mannor et al. 2003). Indirect methods are based on the principle of certainty equivalence: the transition matrix and the expected reward vector are first estimated using the observations, and subsequently the model-based approaches mentioned in the above section are applied to the estimates. A few indirect methods are control learning (Sato et al. 1982, 1988; Kumar and Lin 1982), prioritized sweeping (Moore and Atkeson 1993), adaptive real-time dynamic programming (ARTDP) (Barto et al. 1995) and PILCO (Deisenroth and Rasmussen 2011). In direct methods, which are more appealing, the model is not estimated; rather, the control policy is adapted iteratively using a shadow utility function derived from instantiations of the internal dynamics of the MDP.
The algorithms in this class are generally referred to in the literature as reinforcement learning algorithms. Prominent reinforcement learning algorithms include temporal difference (TD) learning (Sutton 1988) (a prediction method), Q-learning (Watkins 1989) and SARSA (Singh and Sutton 1996) (control methods). There are two variants of the prediction algorithm, depending on how the sample trajectory is generated: on-policy and off-policy algorithms. In the on-policy case, the sample trajectory is generated using the policy \(\pi \) which is being evaluated, i.e., \(s_{i+1} \sim P(s_i, a_i, \cdot )\), where \(a_i \sim \pi (\cdot \vert s_{i})\) and \(r_i = R(s_i, a_i, s_{i+1})\). In the off-policy case, the sample trajectory is generated using a policy \(\pi _b\) which is possibly different from the policy \(\pi \) that is being evaluated, i.e., \(s_{i+1} \sim P(s_i, a_i, \cdot )\), where \(a_i \sim \pi _b(\cdot \vert s_{i})\) and \(r_i = R(s_i, a_i, s_{i+1})\).
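To make the data model concrete, the following sketch generates such an off-policy sample trajectory on a hypothetical toy MDP; the kernel `P`, reward `R` and behaviour policy `pi_b` below are illustrative assumptions, not objects from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP: 3 states, 2 actions (P, R and pi_b are
# illustrative stand-ins).
n_s, n_a = 3, 2
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a] is a distribution over next states
R = rng.standard_normal((n_s, n_a, n_s))          # R(s, a, s')
pi_b = np.full((n_s, n_a), 1.0 / n_a)             # uniform behaviour policy

def sample_trajectory(length, s0=0):
    """Generate {s_0, a_0, r_0, s_1, a_1, r_1, ...} under pi_b."""
    traj, s = [], s0
    for _ in range(length):
        a = rng.choice(n_a, p=pi_b[s])            # a_i ~ pi_b(. | s_i)
        s_next = rng.choice(n_s, p=P[s, a])       # s_{i+1} ~ P(s_i, a_i, .)
        traj.append((s, a, R[s, a, s_next]))      # r_i = R(s_i, a_i, s_{i+1})
        s = s_next
    return traj

traj = sample_trajectory(1000)
```

A single such stream, generated once by the behaviour policy, is all the information assumed available in the setting of this paper.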
Model-free algorithms are shown to be robust and stable, and to exhibit good convergence behaviour under realistic assumptions. However, they suffer from the curse of dimensionality, which arises due to their space complexity. Note that the space complexity of the above-mentioned learning algorithms is \(O(\vert {\mathbb {S}} \vert )\), which becomes unmanageably large with increasing state space.
2.2 Linear function approximation (LFA) methods for model-free Markov decision processes
The accuracy of the function approximation method depends on the representational/expressive ability of \(\mathrm{I\!H}^{\varPhi }\). For example, when \(k_{1} = \vert {\mathbb {S}} \vert \), the representational ability is maximal, since \(\mathrm{I\!H}^{\varPhi } = \mathrm{I\!R}^{\vert \mathbb {S} \vert }\). In general, \(k_1 \ll \vert \mathbb {S} \vert \) and hence \(\mathrm{I\!H}^{\varPhi } \subset \mathrm{I\!R}^{\vert \mathbb {S} \vert }\). So for an arbitrary policy \(\pi \) with \(V^{\pi } \notin \mathrm{I\!H}^{\varPhi }\), the prediction of the value function \(V^{\pi }\) shall always incur an unavoidable approximation error (\(e_{appr}\)) given by \(\inf _{h \in \mathrm{I\!H}^{\varPhi }}\Vert V^{\pi } - h \Vert \). Given \(\mathrm{I\!H}^{\varPhi }\), one cannot do better than \(e_{appr}\). The prediction features \(\{\phi _i\}\) are hand-crafted using prior domain knowledge, and their choice is critical in approximating the value function. There is an abundance of literature available on the topic. In this paper, we assume that an appropriately chosen feature set is available a priori. Also note that the convergence of the prediction methods is in the asymptotic sense. But in most practical scenarios, the algorithm has to be terminated after a finite number of steps. This incurs an estimation error (\(e_{est}\)), which however decays to zero asymptotically.
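The unavoidable approximation error can be illustrated numerically. The sketch below uses toy sizes and the unweighted Euclidean norm as a stand-in for the \(\nu \)-weighted norm of the paper; it computes the best approximation of a value function in the span of \(\varPhi \) by least squares.

```python
import numpy as np

# Illustrative sketch: project a value function V onto the span of
# k_1 hand-crafted features Phi, with k_1 << |S| (toy numbers).
S, k1 = 10, 3
rng = np.random.default_rng(1)
Phi = rng.standard_normal((S, k1))   # feature matrix; row s is phi(s)^T
V = rng.standard_normal(S)           # value function of some policy

# Best approximation in H^Phi under the Euclidean norm (a uniform-weight
# stand-in for the nu-weighted norm used in the paper):
x_star, *_ = np.linalg.lstsq(Phi, V, rcond=None)
e_appr = np.linalg.norm(V - Phi @ x_star)  # unavoidable approximation error
```

No prediction scheme operating within \(\mathrm{I\!H}^{\varPhi }\) can achieve an error below `e_appr`; richer features shrink it, at a storage cost.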
Even though LFA produces suboptimal solutions, since the search is conducted on a restricted subspace of \(\mathrm{I\!R}^{\vert {\mathbb {S}} \vert }\), it yields large computational and storage benefits. So some degree of trade-off between accuracy and tractability is indeed unavoidable.
2.3 Off-policy prediction using LFA
Setup Given \(w, w_b \in {\mathbb {W}}\) and an observation of the system dynamics in the form of a sample trajectory \(\{s_{0}, a_0, r_{0}, s_{1}, a_1, r_{1}, s_{2}, \dots \}\), where at each instant k, \(a_{k} \sim \pi _{w_b}(\cdot \vert s_{k})\), \(s_{k+1} \sim \) \(P(s_k, a_k, \cdot )\) and \(r_{k}\) = \(R(s_{k}, a_{k}, s_{k+1})\), the goal is to estimate the value function \(V^{\pi _{w}}\) of the target policy \(\pi _{w}\) (which is possibly different from \(\pi _{w_b}\)). We assume that the Markov chains defined by \(P_{w}\) and \(P_{w_b}\) are ergodic. Further, let \(\nu _{w}\) and \(\nu _{w_b}\) be the stationary distributions of the Markov chains with transition probability matrices \(P_{w}\) and \(P_{w_b}\) respectively, i.e., \(\lim _{k \rightarrow \infty }P_{w}({\mathbf {s}}_k = s) = \nu _{w}(s)\) and \(\nu _{w}^{\top }P_{w} = \nu _{w}^{\top }\), and likewise for \(\nu _{w_b}\). Note that for brevity the notation has been simplified here, i.e., \(P_{w} \triangleq P_{\pi _{w}}\) and \(P_{w_b} \triangleq P_{\pi _{w_b}}\). We follow the new notation for the rest of the paper. Similarly, \(V^{w} \triangleq V^{\pi _w}\).
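The ergodicity assumption can be made concrete with a short sketch: for an illustrative 4-state row-stochastic kernel standing in for \(P_{w}\), the stationary distribution satisfying \(\nu ^{\top }P = \nu ^{\top }\) can be obtained by power iteration.

```python
import numpy as np

# Sketch: stationary distribution of an ergodic chain by power iteration.
# The 4-state kernel below is an illustrative stand-in for P_w.
rng = np.random.default_rng(2)
P_w = rng.dirichlet(np.ones(4), size=4)  # each row sums to 1

nu = np.full(4, 0.25)                    # start from the uniform distribution
for _ in range(500):
    nu = nu @ P_w                        # nu^T <- nu^T P_w
```

After enough iterations `nu` is (numerically) a fixed point of the update, i.e., the stationary distribution \(\nu _w\) to which \(P_{w}({\mathbf {s}}_k = s)\) converges.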

Off-policy TD(\(\lambda \))

Off-policy LSTD(\(\lambda \))
Theorem 1
2.4 The control problem of interest
In this section, we define a variant of the control problem which is the topic of interest in this paper.
\(\circledast \) Assumption (A2) The Markov chain under any SRP \(\pi _{w}, w \in \mathrm{I\!R}^{k_2}\) is ergodic, i.e., irreducible and aperiodic.
2.5 Motivation
The control problem in Eq. (15) is harder due to the application of the performance function L on the approximate value function. Hence we cannot apply existing direct model-free methods like LSPI or off-policy Q-learning (Maei et al. 2010). Note that the LSPI algorithm [Fig. 8 of Lagoudakis and Parr (2003)] is a policy iteration method, where at each iteration an improved policy parameter is deduced from the projected Q-value of the previous policy parameter. So one cannot directly incorporate the operator \({\mathbb {E}}_{\nu _w}\) into the LSPI iteration. Similar compatibility issues arise with off-policy Q-learning (Maei et al. 2010). However, policy search methods are a direct match for this problem. Not all policy search methods can provide quality solutions, though. The pertinent issue is the non-convexity of \({\mathbb {E}}_{\nu _w}\left[ L(h_{w \vert w})\right] \), which presents a landscape with many local optima. Any gradient-based method, like the state-of-the-art simultaneous perturbation stochastic approximation (SPSA) (Spall 1992) algorithm or the policy gradient methods, can only provide suboptimal solutions. In this paper, we try to solve the control problem in its true sense, i.e., find a solution close to the global optimum of the optimization problem (15). We employ a stochastic approximation variant of the well-known cross entropy (CE) method proposed in Joseph and Bhatnagar (2016a, b, c) to achieve this true-sense behaviour. The CE method has in fact been applied to the model-free control setting before in Mannor et al. (2003), where the algorithm is termed fast policy search. However, the approach in Mannor et al. (2003) has left several practical and computational challenges unaddressed. The method in Mannor et al. (2003) assumes access to a generative model, i.e., the real MDP system itself or a simulator/computational model of the MDP under consideration, which can be configured with moderate ease (within time constraints) and the observations recorded. The existence of generative models for extremely complex MDPs is highly unlikely, since it demands accurate knowledge of the transition dynamics of the MDP. Regarding the computational aspect, the algorithm in Mannor et al. (2003) maintains an evolving \(\vert {\mathbb {S}} \vert \times \vert {\mathbb {A}} \vert \) matrix \(P^{(t)} \triangleq (P^{(t)}_{sa})_{s \in {\mathbb {S}}, a \in {\mathbb {A}}}\), where \(P^{(t)}_{sa}\) is the probability of taking action a in state s at time t. At each discrete time instant t, the algorithm generates multiple sample trajectories using \(P^{(t)}\), each of finite but sufficiently long length. For each trajectory, the discounted cost is calculated and then averaged over the multiple trajectories to deduce the subsequent iterate \(P^{(t+1)}\). This however is an expensive operation, both computation- and storage-wise. Another pertinent issue is the number of sample trajectories required at each time instant t. There is no analysis pertaining to a bound on this trajectory count, which implies that a brute-force approach has to be adopted, further burdening the algorithm. A more recent global optimization algorithm called model reference adaptive search (MRAS) has also been applied in the model-free control setting (Chang et al. 2013). However, it suffers from issues similar to those of the earlier approach.
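The per-iteration workload of this generative-model approach can be sketched as follows; the toy MDP, the horizon and the trajectory count `m` below are illustrative assumptions (indeed, no principled choice of `m` is available), and the reward is simplified to depend on (s, a) only.

```python
import numpy as np

rng = np.random.default_rng(3)

# Schematic sketch of one iteration's cost in the generative-model
# approach: sample m finite trajectories under the evolving |S| x |A|
# randomized policy matrix P_t and average their discounted returns.
n_s, n_a, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # generative model (assumed accessible)
R = rng.standard_normal((n_s, n_a))               # simplified reward R(s, a)
P_t = np.full((n_s, n_a), 1.0 / n_a)              # current policy matrix P^(t)

def discounted_return(P_t, horizon=50):
    s, g, disc = 0, 0.0, 1.0
    for _ in range(horizon):                      # finite but "sufficiently long"
        a = rng.choice(n_a, p=P_t[s])
        g += disc * R[s, a]
        disc *= gamma
        s = rng.choice(n_s, p=P[s, a])
    return g

m = 100                                           # trajectory count: no known bound
est = np.mean([discounted_return(P_t) for _ in range(m)])
```

Every CE iteration repeats this `m`-trajectory simulation, which is exactly the computational burden (and the reliance on a simulator) that the single-trajectory setting of this paper avoids.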
A few relevant works in the literature that do not assume the availability of a generative model include Bellman-residual-minimization-based fitted policy iteration using a single trajectory (Antos et al. 2008) and value-iteration-based fitted policy iteration using a single trajectory (Antos et al. 2007). However, those approaches fall prey to the curse of dimensionality arising from large action spaces. Also, they are abstract in the sense that a generic function space is considered and the value function approximation step is expressed as a formal optimization problem. In the above methods, which are similar in their approach, considerable effort is dedicated to addressing the approximation power of the function space and the sample complexity.
Our approach has two objectives:
1. To reduce the total number of policy evaluations.
2. To find a high performing policy without presuming unlimited access to the generative model.
To accomplish the former objective, the natural choice is to employ the stochastic approximation (SA) version of the CE method instead of the naive CE method used in Mannor et al. (2003). The SA version of CE is a zero-order optimization method which is incremental, adaptive, robust and stable, with the additional attractive attribute of convergence to the global optimum of the objective function. It has been demonstrated empirically in Joseph and Bhatnagar (2016a, b) that the method exhibits efficient utilization of the samples and possesses a better rate of convergence than the naive CE method. The effective sample utilization implies that the method requires a minimal number of objective function evaluations. These attributes are appealing in the context of the control problem we consider here, especially in effectively addressing the former objective. The adaptive nature of the algorithm also eliminates the brute-force tuning which has a detrimental impact on the performance of the naive CE method.
Goal of the Paper To solve the control problem defined in Eq. (15) without having access to any generative model. Formally stated, given an infinitely long sample trajectory \(\{{\mathbf {s}}_0, {\mathbf {a}}_0, {\mathbf {r}}_0, {\mathbf {s}}_1, {\mathbf {a}}_1, {\mathbf {r}}_1, {\mathbf {s}}_2, \dots \}\) generated using the behaviour policy \(\pi _{w_b}\) (\(w_b \in \mathrm{I\!R}^{k_{2}}\)), solve the control problem in (15).
\(\circledast \) Assumption (A4) The behaviour policy \(\pi _{w_b}\), where \(w_b \in {\mathbb {W}}\), satisfies the following condition: \(\pi _{w_b}(a \vert s) > 0\), \(\forall s \in {\mathbb {S}}, \forall a \in {\mathbb {A}}\).
A few remarks are in order. We can classify reinforcement learning algorithms based on the information made available to the algorithm in order to seek the optimal policy. We graphically illustrate this classification as a pyramid in Fig. 4. The bottom of the pyramid contains the classical methods, where the entire model information, i.e., both P and R, is available, while in the middle we have the model-free algorithms, where both P and R are assumed hidden, but access to a generative model/simulator is presumed. At the top of the pyramid, we have the single-trajectory approaches, where a single sample trajectory generated using a behaviour policy is made available, but the algorithms have access neither to the model information nor to a simulator. Observe that as one goes up the pyramid, the amount of information vested upon the algorithm reduces considerably. The algorithm we propose in this paper belongs to the top of the information pyramid and to the upper half of the optimization box, which makes it a unique combination.
3 Proposed algorithm
In this section, we propose an algorithm to solve the control problem defined in Eq. (15). We employ a stochastic approximation variant of the Gaussian-based cross entropy method to find the optimal policy. We delay the discussion of the algorithm until the next subsection and focus first on the objective function estimation. The objective function values \({\mathbb {E}}_{\nu _{w}}\left[ L(h_{w \vert w})\right] \), which are required to efficiently guide the search for \(w^{*}\), are estimated using the off-policy LSTD(\(\lambda \)) method. In LFA, given \(w \in {\mathbb {W}}\), the best approximation of \(V^{w}\) one can hope for is the projection \(\varPi ^{w}V^{w}\). Theorem 1 of Tsitsiklis and Roy (1997) shows that the on-policy LSTD(\(\lambda \)) solution \(\varPhi x_{w \vert w}\) is indeed an approximation of the projection \(\varPi ^{w}V^{w}\). Using the Babylonian–Pythagorean theorem and Theorem 1 of Tsitsiklis and Roy (1997), along with a little arithmetic, we obtain \(\Vert \varPhi x_{w \vert w} - \varPi ^{w}V^{w} \Vert _{\nu _{w}} \le \frac{\sqrt{(1-\lambda )\gamma (\gamma +\gamma \lambda +2)}}{1-\gamma }\Vert \varPi ^{w}V^{w} - V^{w}\Vert _{\nu _{w}}\). Hence for \(\lambda = 1\), we have \(\varPhi x_{w \vert w} = \varPi ^{w}V^{w}\), i.e., on-policy LSTD(1) provides the exact projection; for \(\lambda < 1\), only approximations to it are obtained. When off-policy LSTD(\(\lambda \)) is applied, it adds one more level of approximation, i.e., \(\varPhi x_{w \vert w}\) is approximated by \(\varPhi x_{w \vert w_b}\). Hence, to evaluate the performance of the off-policy approximation, we must quantify the errors incurred in the approximation procedure, and we believe such a comprehensive analysis is long overdue.
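For intuition on this estimation machinery, here is a minimal on-policy LSTD(0) sketch on a toy chain; the off-policy LSTD(\(\lambda \)) used in the paper generalizes this with eligibility traces and importance-sampling corrections, and all numerical values below are illustrative.

```python
import numpy as np

# Minimal LSTD(0) sketch (lambda = 0, on-policy): accumulate
#   A = sum_k phi(s_k)(phi(s_k) - gamma * phi(s_{k+1}))^T,
#   b = sum_k r_k phi(s_k),
# and solve A x = b, so that Phi x approximates the projected value function.
gamma, k1 = 0.9, 2
rng = np.random.default_rng(4)
phi = rng.standard_normal((5, k1))   # phi[s] = feature vector of state s

A = np.zeros((k1, k1))
b = np.zeros(k1)
s = 0
for _ in range(2000):
    s_next = rng.integers(5)         # stand-in for s' ~ P_w(s, a, .)
    r = float(s == s_next)           # stand-in reward signal
    A += np.outer(phi[s], phi[s] - gamma * phi[s_next])
    b += r * phi[s]
    s = s_next

x = np.linalg.solve(A, b)            # Phi x is the LSTD value estimate
```

The off-policy variant replaces each term with an importance-weighted, trace-weighted analogue; the extra level of approximation it introduces is precisely what Theorem 2 below quantifies.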
3.1 Choice of the behaviour policy
The behaviour policy is often an exploration policy which promotes the exploration of the state and action spaces of the MDP. Efficient exploration is a precondition for effective learning. In this paper, we operate in a minimalistic MDP setting, where the only information available for inference is the single stream of transitions and payoffs generated using the behaviour policy. So the choice of the behaviour policy is vital for sound inductive reasoning. The following theorem provides a bound on the approximation error incurred in the off-policy LSTD(\(\lambda \)) method. The bound can be beneficial in choosing a good behaviour policy and also aids in understanding the stability and usefulness of the proposed algorithm.
Theorem 2
Proof
Observation 1
Proof
Using \(\langle \varPi ^{w} V - V, \varPi ^{w} V\rangle _{\nu _{w}} = 0\) and the Babylonian–Pythagorean theorem, we have \(\Vert V \Vert _{\nu _{w}}^{2} = \Vert \varPi ^{w} V - V \Vert _{\nu _{w}}^{2} + \Vert \varPi ^{w} V \Vert _{\nu _{w}}^{2} \Rightarrow \Vert \varPi ^{w} V \Vert _{\nu _{w}} \le \Vert V \Vert _{\nu _{w}}\). This proves (24). \(\square \)
Observation 2
Proof
Observation 3
Proof
Refer to Lemma 4 of Tsitsiklis and Roy (1997).
Observation 4
Proof
Observation 5
\(\varPhi x_{w \vert w_b} = \varPi ^{w_b}T^{(\lambda )}_{w \vert w_b}\varPhi x_{w \vert w_b}\). This is the off-policy projected Bellman equation. A detailed discussion is available in Yu (2012). For the on-policy case, a similar equation exists: \(\varPhi x_{w \vert w} = \varPi ^{w}T^{(\lambda )}_{w \vert w}\varPhi x_{w \vert w}\). For the proof of this equation, refer to Theorem 1 of Tsitsiklis and Roy (1997). A few other relevant fixed point equations are \(T^{(\lambda )}_{w \vert w}V^{w} = V^{w}\) and \(T^{(\lambda )}_{w_b \vert w_b}V^{w_b} = V^{w_b}\). The proof of these equations is provided in Lemma 5 of Tsitsiklis and Roy (1997).
The implications of the bounds given in Theorem 2 are indeed significant. The quantity \(\sup _{s \in {\mathbb {S}}, a \in {\mathbb {A}}}\Big \vert \frac{\pi _{w}(a \vert s)}{\pi _{w_b}(a \vert s)}-1\Big \vert \) given in the hypothesis of the theorem can ostensibly be viewed as a measure of the closeness of the SRPs \(\pi _w\) and \(\pi _{w_b}\), with the minimum value of 0 being achieved in the on-policy case. Under the hypothesis that \(\sup _{s \in {\mathbb {S}}, a \in {\mathbb {A}}}\Big \vert \frac{\pi _{w}(a \vert s)}{\pi _{w_b}(a \vert s)}-1\Big \vert < \epsilon _2\), we obtain in (16) an upper bound on the relative error of the on-policy and off-policy solutions. The bound is dominated by the hypothesis bound \(\epsilon _2\), the eligibility factor \(\lambda \), the discount factor \(\gamma \) and \(\Vert (D^{\nu _{w_b}})^{-1} \Vert _{\infty }\Vert D^{\nu _{w_b}} \Vert _{\infty }\). Note that \(\Vert D^{\nu _{w_b}} \Vert _{\infty } = \max _{s} \nu _{w_b}(s)\) and \(\Vert (D^{\nu _{w_b}})^{-1} \Vert _{\infty } = (\min _{s} \nu _{w_b}(s))^{-1}\). If the behaviour policy is chosen in such a way that all the states are equally likely under its stationary distribution, then \(\Vert (D^{\nu _{w_b}})^{-1} \Vert _{\infty }\Vert D^{\nu _{w_b}} \Vert _{\infty } \approx 1\). Consequently, the upper bound reduces to \(O\big ((\vert {\mathbb {S}} \vert ^{2}\epsilon ^{2}_2 + \vert {\mathbb {S}} \vert \epsilon _2)\frac{(1+\gamma )(1+\gamma \lambda )}{(1-\gamma )(1-\gamma \lambda )}\big )\).
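The effect of exploration quality on this distortion factor is easy to see numerically; the two stationary distributions below are illustrative.

```python
import numpy as np

# Sketch: for diagonal D = diag(nu_wb), ||D||_inf = max_s nu_wb(s) and
# ||D^{-1}||_inf = (min_s nu_wb(s))^{-1}, so the product in the bound is
# max_s nu_wb(s) / min_s nu_wb(s).  Near-uniform exploration keeps it near 1.
nu_uniform = np.array([0.25, 0.25, 0.25, 0.25])  # well-spread exploration
nu_skewed = np.array([0.70, 0.10, 0.10, 0.10])   # exploration stuck in one state

def distortion(nu):
    return nu.max() / nu.min()  # ||D^{-1}||_inf * ||D||_inf

print(distortion(nu_uniform))   # equals 1 for the uniform case
print(distortion(nu_skewed))    # much larger for the skewed case
```

A behaviour policy whose stationary distribution is badly skewed thus directly inflates the error bound of Theorem 2.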
Now regarding the latter bound provided in Eq. (17): given \(w \in {\mathbb {W}}\), using the triangle inequality together with Eq. (17), we obtain a proper quantification of the distance between the solution of off-policy LSTD(\(\lambda \)), i.e., \(\varPhi x_{w \vert w_b}\), and the projection \(\varPi ^{w_b}V^{w}\) in terms of \(\Vert \cdot \Vert _{\nu _{w_b}}\) and \(\epsilon _2\). The above bound can be further improved by obtaining an expedient bound for \(\Vert V^{w} - V^{w_b} \Vert _{\nu _{w_b}}\) as follows:
Corollary 1
Proof
The noteworthy result on the upper bound of the approximation error of on-policy LSTD(\(\lambda \)) provided in Tsitsiklis and Roy (1997) can be easily derived from the above result as follows:
Corollary 2
Proof
In the on-policy case, \(w_b = w\) and hence \(\epsilon _2 = 0\). The corollary follows by direct substitution of these values in (17). \(\square \)
3.2 Estimation of the objective function
For a given \(w \in {\mathbb {W}}\), \(\ell _{k}^{w}\) attempts to find an approximate value of the objective function J(w). The following lemma formally characterizes the limiting behaviour of the iterates \(\ell _{k}^{w}\).
Lemma 1
Proof
We begin the proof by defining the filtration \(\{\mathcal {F}_{k}\}_{k \in {\mathbb {N}}}\), where the \(\sigma \)field \(\mathcal {F}_{k} \triangleq \sigma (\{{\mathbf {x}}_{i}, \ell ^{w}_{i}, {\mathbf {s}}_{i}, {\mathbf {a}}_{i}, {\mathbf {r}}_{i}, 0 \le i \le k \})\).
Define \(h(z) \triangleq {\mathbb {E}}_{\nu _{w_b}}\left[ L(x_{w \vert w_b}^{\top }\phi ({\mathbf {s}}_{k+1}))\right] - z\) and \(c_k \triangleq L({\mathbf {x}}_{k}^{\top }\phi ({\mathbf {s}}_{k+1})) - L(x_{w \vert w_b}^{\top }\phi ({\mathbf {s}}_{k+1})) + {\mathbb {E}}\left[ L(x_{w \vert w_b}^{\top }\phi ({\mathbf {s}}_{k+1})) \big \vert \mathcal {F}_{k}\right] - {\mathbb {E}}_{\nu _{w_b}}\left[ L(x_{w \vert w_b}^{\top }\phi ({\mathbf {s}}_{k+1}))\right] \).
 1.
\(\{{\mathbb {M}}_{k}, k \ge 1\}\) is a martingale difference noise sequence w.r.t. \(\{\mathcal {F}_{k}\}\), i.e., \({\mathbb {M}}_{k}\) is \(\mathcal {F}_{k}\)measurable, integrable and \({\mathbb {E}}[{\mathbb {M}}_{k+1} \vert \mathcal {F}_{k}] = 0\) a.s., \(\forall k \ge 0\).
 2.
\(h(\cdot )\) is a Lipschitz continuous function.
 3.
\(\exists K > 0\) s.t. \({\mathbb {E}}[\vert {\mathbb {M}}_{k+1} \vert ^{2} \vert \mathcal {F}_{k}] \le K(1+\vert \ell _{k} \vert ^{2})\) a.s., \(\forall k \ge 0\).
 4.
By Theorem 1, \(c_k \rightarrow 0\) as \(k \rightarrow \infty \) w.p. 1. This follows directly from the following facts: (a) by Eq. (1), the off-policy LSTD(\(\lambda \)) iterates \(\{{\mathbf {x}}_{k}\}\) converge almost surely to the off-policy solution \(x_{w \vert w_b}\); (b) by assumption (A2), \(P_{w_b}({\mathbf {s}}_{k} = s) \rightarrow \nu _{w_b}(s)\) as \(k \rightarrow \infty \); and (c) \(L(\cdot )\) and \(\phi (\cdot )\) are bounded.
 5. For a given \(w \in {\mathbb {W}}\), the iterates \(\{\ell _{k}^{w}\}_{k \in \mathbb {N}}\) are stable, i.e., \(\sup _{k} \vert \ell _{k}^{w} \vert < \infty \) a.s. A brief proof is provided here: For \(c > 0\), we define
$$\begin{aligned} h_{c}(z) \triangleq \frac{h(cz)}{c} = \frac{{\mathbb {E}}_{\nu _{w_b}}\left[ L(x_{w \vert w_b}^{\top }\phi ({\mathbf {s}}))\right] }{c} - z. \end{aligned}$$(35)
Now consider the ODE corresponding to the \(\infty \)-system:
$$\begin{aligned} \dot{z}(t) = h_{\infty }(z(t)) \triangleq \lim _{c \rightarrow \infty }h_{c}(z(t)). \end{aligned}$$(36)
Note that \(h_{\infty }(z) = -z\). It can easily be verified that this ODE is globally asymptotically stable at the origin. This in turn implies the stability of the iterates \(\{\ell _{k}^{w}\}\) by Theorem 2, Chapter 3 of Borkar (2008).
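The averaging recursion behind Lemma 1 can be sketched in a simplified scalar form, where the LSTD iterate \({\mathbf {x}}_{k}\) is frozen at its limit and the sample stream is an illustrative stand-in.

```python
import numpy as np

# Simplified sketch of the recursion behind Lemma 1: a scalar
# stochastic-approximation iterate ell_k tracks E[L(x^T phi(s))]
# from a single stream of samples.  The step sizes and the sample
# stream are illustrative assumptions.
rng = np.random.default_rng(5)
L = np.tanh                            # a bounded, smooth performance function
samples = rng.standard_normal(20000)   # stand-in for x^T phi(s_k), s_k ~ nu_wb

ell = 0.0
for k, y in enumerate(samples, start=1):
    alpha = 1.0 / k                    # sum alpha_k = inf, sum alpha_k^2 < inf
    ell += alpha * (L(y) - ell)        # ell_{k+1} = ell_k + alpha_k (L(y_k) - ell_k)
```

With \(\alpha _k = 1/k\) this recursion is exactly a running average, so `ell` converges to the stationary expectation, mirroring the tracking claim of the lemma.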
Remark 1
By the above lemma, for a given \(w \in {\mathbb {W}}\), the quantity \(\ell _{k}^{w}\) tracks \({\mathbb {E}}_{\nu _{w_b}}\big [L(x^{\top }_{w \vert w_b}\phi ({\mathbf {s}}))\big ]\). This is however different from the true objective function value \(J(w) = {\mathbb {E}}_{\nu _{w}}\left[ L(h_{w \vert w})\right] \) when \(w \ne w_b\). This additional approximation error is the extra cost one has to pay for the dearth of information (in the form of a generative model) about the underlying MDP. Nevertheless, from Eqs. (16) and (19), we know that the relative errors in the solutions \(x_{w\vert w}\) and \(x_{w \vert w_b}\), as well as in the stationary distributions \(\nu _{w}\) and \(\nu _{w_b}\), are bounded. We also know that \(\varPhi x_{w \vert w} \approx h_{w \vert w}\). Further, if the performance function L is sufficiently smooth, then the deviation of L(y) is contained when the input variable y is perturbed slightly. All these factors affirm that the approximation proposed in (33) is well conditioned. This is indeed significant, considering the restricted setting we operate in, i.e., the non-availability of the generative model.
3.3 Stochastic approximation version of Gaussian cross entropy method and its application to the control problem
CE is a model-based search method (Zlochin et al. 2004) used to solve global optimization problems. CE is a zero-order method (a.k.a. gradient-free method), which implies that the algorithm does not require the gradient or higher-order derivatives of the objective function. This remarkable feature makes it a suitable choice for the “black-box” optimization setting, where neither a closed-form expression nor structural properties of the objective function J are available. The CE method has found successful application in diverse domains, which include continuous multi-extremal optimization (Rubinstein 1999), buffer allocation (Alon et al. 2005), queueing models (de Boer 2000), DNA sequence alignment (Keith and Kroese 2002), control and navigation (Helvik and Wittner 2001), reinforcement learning (Mannor et al. 2003; Menache et al. 2005) and several NP-hard problems (Rubinstein 2002, 1999). We would also like to mention that there are other model-based search methods in the literature; a few pertinent ones are gradient-based adaptive stochastic search for simulation optimization (GASSO) (Zhou et al. 2014), the estimation of distribution algorithm (EDA) (Mühlenbein and Paass 1996) and model reference adaptive search (MRAS) (Hu et al. 2007). However, in this paper, we do not explore the possibility of employing these algorithms in an MDP setting.
Property (P1) \(supp(f_{\theta _{j+1}}) \subseteq \{x \vert J(x) \ge \gamma _{\rho }(J, \theta _{j})\}\),
where \(\rho \in (0,1)\) is fixed a priori. Note that \(\gamma _{\rho }(J, \theta _{j})\) is the \((1-\rho )\)-quantile of J w.r.t. the distribution \(f_{\theta _{j}}\). Hence it is easy to verify that the threshold sequence \(\{\gamma _{\rho }(J, \theta _{j})\}_{j \in {\mathbb {N}}}\) is monotonically non-decreasing. The intuition behind this recursive generation of the model sequence is that, by assigning greater weight to the higher values of J at each iteration, the expected behaviour of the model sequence should improve. We make the following assumption on the model parameter space \(\Theta \):
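Property (P1) can be illustrated with a small numerical sketch of the elite-thresholding step; the sampled objective values below are illustrative.

```python
import numpy as np

# Sketch of property (P1): the next model's support keeps only the
# solutions whose objective value reaches the (1 - rho)-quantile of J
# under the current model f_theta_j.
rng = np.random.default_rng(6)
rho = 0.1
J_vals = rng.standard_normal(1000)          # J evaluated at samples x ~ f_theta_j

gamma_rho = np.quantile(J_vals, 1.0 - rho)  # threshold gamma_rho(J, theta_j)
elite = J_vals[J_vals >= gamma_rho]         # region supp(f_theta_{j+1}) concentrates on
```

Roughly the top \(\rho \)-fraction of samples survives the threshold, which is why the threshold sequence is non-decreasing as the model improves.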
\(\circledast \) Assumption (A5) The parameter space \(\Theta \) is a compact subset of \(\mathrm{I\!R}^{d(d+1)}\).
However, there are certain tractability concerns. The quantities \(\gamma _{\rho }(J, \widehat{\theta }_{j})\), \(\Upsilon _{1}(\widehat{\theta }_{j}, \cdot )\) and \(\Upsilon _{2}(\widehat{\theta }_{j}, \cdot )\) involved in the update rule are intractable, i.e., computationally hard to compute (hence the tag ‘ideal’). To overcome this, a naive approach usually found in the literature is to employ sample averaging, with the sample size increasing to infinity. However, this approach suffers from hefty storage and computational complexity, primarily attributed to the accumulation and processing of a huge number of samples. In Joseph and Bhatnagar (2016a, b, c), a stochastic approximation variant of the extended cross entropy method has been proposed. The proposed approach is efficient both computation- and storage-wise when compared to the rest of the state-of-the-art CE tracking methods (Hu et al. 2012; Wang and Enright 2013; Kroese et al. 2006). It also integrates the mixture approach (44) and hence exhibits convergence to the global optimum.
The goal of the stochastic approximation (SA) version of the Gaussian CE method is to find a sequence of Gaussian model parameters \(\{\theta _j = (\mu _j, \Sigma _j)^{\top }\}\) (where \(\mu _j\) is the mean vector and \(\Sigma _j\) is the covariance matrix) which tracks the ideal CE method. The algorithm efficiently accomplishes this goal by employing multiple stochastic approximation recursions. The algorithm is shown to exhibit global optimum convergence, i.e., the model sequence \(\{\theta _{j}\}\) converges to the degenerate distribution concentrated on any of the global optima of the objective function (Fig. 5), in both the deterministic setting (when the objective function is deterministic) and the stochastic setting, i.e., when only noisy versions of the objective function are available. Successful application of the SA version of CE in stochastic settings is appealing for the control problem we consider in this paper, since the off-policy LSTD(\(\lambda \)) method only provides estimates of the value function. The SA version of CE is a discrete evolutionary procedure in which the model sequence \(\{\theta _j\}\) is adapted towards the degenerate distribution concentrated at the global optima, with a single sample from the solution space used at each discrete step of the evolution. This single-sample requirement, unique to the SA version, is well suited to settings where objective function values are hard to obtain, especially the MDP control problem we consider in this paper, since one does not need to scale the computing machinery for unnecessary value function evaluations.
 1. The learning rates \(\{\overline{\beta }_{j}\}\), \(\{\beta _{j}\}\) and the mixing weight \(\zeta \) are deterministic, non-increasing and satisfy the following:$$\begin{aligned} \zeta \in (0, 1), \quad \beta _{j}> 0, \quad \overline{\beta }_{j} > 0, \quad \sum _{j=1}^{\infty }\beta _{j} = \infty , \quad \sum _{j=1}^{\infty }\overline{\beta }_{j} = \infty , \quad \sum _{j=1}^{\infty }\left( \beta ^{2}_{j}+\overline{\beta }^{2}_{j}\right) < \infty . \end{aligned}$$(45)
 2.
In our algorithm, the objective function is estimated in (50) using the Predict procedure defined in Algorithm 1. Even though an infinitely long sample trajectory is assumed to be available, the Predict procedure has to terminate in practice after processing a finite number of transitions from the trajectory. Hence a user-configured trajectory-length rule \(\{N_{j} \in {\mathbb {N}}\setminus \{0\}\}_{j \in {\mathbb {N}}}\) with \(N_{j} \uparrow \infty \) is used. At each iteration j of the cross entropy method, when the Predict procedure is invoked to estimate the objective function \(L(h_{w_j \vert w_j})\), the procedure terminates after processing the first \(N_{j}\) transitions of the trajectory. It is also important to note that the same sample trajectory is reused across all invocations of Predict. This eliminates the need for any further observations of the MDP.
3. Recall that we employ the stochastic approximation (SA) version of the extended CE method to solve our control problem (15). The SA version (hence Algorithm 2) maintains three variables: \(\gamma _j, \xi ^{(0)}_{j}\) and \(\xi ^{(1)}_{j}\), with \(\gamma _j\) tracking \(\gamma _{\rho }(\cdot , \widehat{\theta }_j)\), while \(\xi ^{(0)}_j\) and \(\xi ^{(1)}_j\) track \(\Upsilon _1(\widehat{\theta }_j, \cdot )\) and \(\Upsilon _2(\widehat{\theta }_j, \cdot )\) respectively. Their stochastic recursions are defined in Eqs. (51), (52) and (53) of Algorithm 2. The increment terms for the respective stochastic recursions are defined as follows:$$\begin{aligned}&\Delta \gamma _{j}(y) \triangleq -(1-\rho )I_{\{y \ge \gamma _j\}}+\rho I_{\{y \le \gamma _j\}}.\end{aligned}$$(46)$$\begin{aligned}&\Delta \xi ^{(0)}_{j}(x, y) \triangleq {\mathbf {g_{1}}}(y, x, \gamma _j) - \xi ^{(0)}_j {\mathbf {g_{0}}}(y, \gamma _j).\end{aligned}$$(47)$$\begin{aligned}&\Delta \xi ^{(1)}_{j}(x, y) \triangleq {\mathbf {g_{2}}}(y, x, \gamma _j, \xi ^{(0)}_j) - \xi ^{(1)}_j {\mathbf {g_{0}}}(y, \gamma _j). \end{aligned}$$(48)
4. The initial distribution parameter \(\theta _0\) is chosen by hand such that the probability density function \(f_{\theta _0}\) takes strictly positive values at every point of the solution space \({\mathbb {W}}\), i.e., \(f_{\theta _0}(w) > 0, \forall w \in {\mathbb {W}}\).
5. The stopping rule we adopt for the control problem is to terminate the algorithm when the model sequence \(\{\theta _j\}\) stays sufficiently close for a sufficiently long (finite) time, i.e., \(\exists \bar{j} \ge 0\) s.t. \(\Vert \theta _j - \theta _{j+1} \Vert < \delta _1\) for all \(\bar{j} \le j \le \bar{j}+N(\delta _1)\), where \(\delta _1 \in \mathrm{I\!R}_{+}\) and \(N(\delta _1) \in {\mathbb {N}}\) are decided a priori.
6. The quantile factor \(\rho \) is also a relevant parameter of the CE method. An empirical analysis in Joseph and Bhatnagar (2016b) revealed that the convergence rate of the algorithm is sensitive to the choice of \(\rho \); the paper recommends the interval [0.01, 0.3] as the most suitable range for \(\rho \).
7. We also extend the algorithm to include Polyak averaging of the model sequence \(\{\theta _j\}\). The sequence \(\{\overline{\theta }_{j}\}\) maintains the Polyak averages of \(\{\theta _j\}\) and its update step is given in (57). Note that Polyak averaging (Polyak and Juditsky 1992) is an iterate-averaging technique which does not hamper the convergence of the original sequence \(\{\theta _j\}\); rather, it reduces the variance of the iterates and accelerates convergence.
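To make the single-sample recursions concrete, the following is a minimal Python sketch of the quantile tracking in (46) together with mean and variance trackers in the spirit of (47)-(48), applied to a hypothetical noisy 1-D objective. This is an illustration, not Algorithm 2 itself: the weight functions \({\mathbf {g_0}}, {\mathbf {g_1}}, {\mathbf {g_2}}\) are replaced by simple indicator-based surrogates, and the model update is performed on a fixed schedule rather than via the \(T_j\)/\(\epsilon \) threshold test.

```python
import numpy as np

rng = np.random.default_rng(0)

def J(x):
    # hypothetical noisy 1-D objective: deterministic part plus observation noise
    return -(x - 2.0) ** 2 + 0.1 * rng.standard_normal()

rho, beta = 0.1, 0.02      # quantile factor and step size
mu, sigma = 0.0, 3.0       # Gaussian model parameters theta_j = (mu, sigma)
gamma = 0.0                # tracks the (1 - rho)-quantile of J under the model
xi0, xi1 = 0.0, 1.0        # track the next mean and variance

for j in range(20000):
    x = rng.normal(mu, sigma)              # a single sample per iteration
    y = J(x)
    # quantile-tracking increment as in Eq. (46); at equilibrium P(y >= gamma) = rho
    gamma -= beta * (-(1 - rho) * (y >= gamma) + rho * (y <= gamma))
    # surrogate weights: g0 = I{y >= gamma}, g1 = x * g0, g2 = (x - xi0)^2 * g0
    g0 = float(y >= gamma)
    xi0 += beta * (x * g0 - xi0 * g0)
    xi1 += beta * ((x - xi0) ** 2 * g0 - xi1 * g0)
    # model update along a subsequence (fixed schedule here for simplicity)
    if (j + 1) % 4000 == 0:
        mu, sigma = xi0, max(np.sqrt(max(xi1, 0.0)), 1e-2)
```

At the equilibrium of the \(\gamma \)-recursion, \(P(y \ge \gamma ) = \rho \), so \(\gamma \) tracks the \((1-\rho )\)-quantile, while the mean \(\mu \) drifts toward the maximizer of the noisy objective; note that only one objective evaluation is consumed per iteration.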
3.4 Convergence analysis of Algorithm 2
The convergence analysis of the generalized variant of Algorithm 2 is addressed in Joseph and Bhatnagar (2016c), and its application to the prediction problem is given in Joseph and Bhatnagar (2016b). For completeness, we restate the results here; we do not reprove them, but provide references instead. The additional Polyak averaging step (step 19 of Algorithm 2) does require analysis, which is covered below.
Note that Algorithm 2 employs the offpolicy prediction method for estimating the objective function. In particular, in step 6 of Algorithm 2, we have \(\hat{J}({\mathbf {w}}_{j+1}) := Predict({\mathbf {w}}_{j+1}, N_{j+1})\), which converges to \({\mathbb {E}}_{\nu _{w_b}}\left[ L(x^{\top }_{w \vert w_b}\phi ({\mathbf {s}}))\right] \) almost surely as \(N_j \rightarrow \infty \) (by Lemma 1). Hence the objective function optimized by Algorithm 2 is \(J_b(w) \triangleq {\mathbb {E}}_{\nu _{w_b}}\left[ L(x^{\top }_{w \vert w_b}\phi ({\mathbf {s}}))\right] \), where \(w_b \in {\mathbb {W}}\) is the chosen behaviour policy vector.
Also note that the model parameter \(\theta _{j}\) in Algorithm 2 is not updated at every iteration j. Rather, it is updated whenever \(T_{j}\) hits the \(\epsilon \) threshold (step 15 of Algorithm 2), where \(\epsilon \in (0, 1)\) is a constant. So the update of \(\theta _{j}\) only happens along a subsequence \(\{j_{(n)}\}_{n \in {\mathbb {N}}}\) of \(\{j\}_{j \in {\mathbb {N}}}\). Between \(j = j_{(n)}\) and \(j = j_{(n+1)}\), the model parameter \(\theta _j\) remains constant and the variable \(\gamma _{j}\) estimates the \((1-\rho )\)-quantile of \(J_b\) w.r.t. \(\widehat{f}_{\theta _{j_{(n)}}}\).
Notation We denote by \(\gamma _{\rho }(J_b, \widehat{\theta })\) the \((1-\rho )\)-quantile of \(J_b\) w.r.t. the mixture distribution \(\widehat{f}_{\theta }\), and let \(E_{\widehat{\theta }}[\cdot ]\) be the expectation w.r.t. \(\widehat{f}_{\theta }\).
Since the model parameter \(\theta _j\) remains constant between \(j = j_{(n)}\) and \(j = j_{(n+1)}\), the convergence behaviour of \(\gamma _j\), \(\xi ^{(0)}_j\) and \(\xi ^{(1)}_j\) can be studied by keeping \(\theta _j\) constant.
Lemma 2
Let \(\theta _{j} \equiv \theta , \forall j\). Also, assume \(\sup _{j} \vert \gamma _{j} \vert < \infty \) a.s. Then the stochastic sequence \(\{\gamma _{j}\}\) defined in Eq. (51) satisfies \(\lim _{j \rightarrow \infty }\gamma _{j} = \gamma _{\rho }(J_b, \widehat{\theta })\) a.s.
Proof
See Lemma 3 of Joseph and Bhatnagar (2016b). \(\square \)
Lemma 3
 (i)$$\begin{aligned} \lim _{j \rightarrow \infty } \xi ^{(0)}_{j} = \xi ^{(0)}_{*} = \frac{{\mathbb {E}}_{\widehat{\theta }}\left[ {\mathbf {g_{1}}}\big (J_b({\mathbf {x}}), {\mathbf {x}}, \gamma _{\rho }(J_b, \widehat{\theta })\varvec{\big )}\right] }{{\mathbb {E}}_{\widehat{\theta }}\left[ {\mathbf {g_{0}}}\big (J_b({\mathbf {x}}), \gamma _{\rho }(J_b, \widehat{\theta })\big )\right] }. \end{aligned}$$
 (ii)$$\begin{aligned} \lim _{j \rightarrow \infty } \xi ^{(1)}_{j} = \xi ^{(1)}_{*} = \frac{{\mathbb {E}}_{\widehat{\theta }}\left[ {\mathbf {g_{2}}}\big (J_b({\mathbf {x}}), {\mathbf {x}}, \gamma _{\rho }(J_b, \widehat{\theta }), \xi ^{(0)}_{*}\big )\right] }{{\mathbb {E}}_{\widehat{\theta }}\left[ {\mathbf {g_{0}}}\big (J_b({\mathbf {x}}), \gamma _{\rho }(J_b, \widehat{\theta })\big )\right] }. \end{aligned}$$
(iii) \(T_j\) defined in Eq. (54) satisfies \(-1< T_j < 1, \forall j\).
(iv) If \(\gamma _{\rho }(J_b, \widehat{\theta }) > \gamma _{\rho }(J_b, \widehat{\theta }^{p})\), then the \(T_{j}\), \(j \ge 1\), in (54) satisfy \(\lim _{j \rightarrow \infty } T_{j} = 1\) a.s.
Proof
For (i), (ii) and (iv), see Lemma 4 of Joseph and Bhatnagar (2016b); for (iii), see Proposition 1 of Joseph and Bhatnagar (2016b). \(\square \)
Notation For the subsequence \(\{j_{(n)}\}_{n > 0}\) of \(\{j\}_{j \in {\mathbb {N}}}\), we denote \(j^{-}_{(n)} \triangleq j_{(n)}-1\) for \(n > 0\).
We now present our main result. The following theorem shows that the model sequence \(\{\theta _{j}\}\) and the averaged sequence \(\{\overline{\theta }_{j}\}\) generated by Algorithm 2 converge to the degenerate distribution concentrated on the global maximum of the objective function \(J_b\).
Theorem 3
Proof
Since \(\overline{\beta }_{j} = o(\beta _{j})\), \(\overline{\beta }_{j} \rightarrow 0\) faster than \(\beta _{j}\). This implies that the updates of \(\theta _j\) in (55) are larger than those of \(\overline{\theta }_j\) in (57). Hence the sequence \(\{\theta _{j}\}\) appears quasi-convergent when viewed from the timescale of the \(\{\overline{\theta }_{j}\}\) sequence.
Theorem 2 of Joseph and Bhatnagar (2016b) analyses the limiting behaviour of the stochastic recursion (55) of Algorithm 2 in detail, establishing the global optimum convergence of the algorithm under mild regularity conditions: the model sequence \(\{\theta _j\}\) converges almost surely to the degenerate distribution concentrated on the global optimum. The required regularity conditions are that the objective function belongs to \(\mathcal {C}^{2}\) and that a Lyapunov function exists on a neighbourhood of the degenerate distribution concentrated on the global optimum. This justifies the hypothesis \(J_{b} \in \mathcal {C}^{2}\) in the statement of the theorem, and we further assume the existence of a Lyapunov function on a neighbourhood of the degenerate distribution \((w^{b*}, 0_{k_2 \times k_2})^{\top }\). Then, by Theorem 2 of Joseph and Bhatnagar (2016b), we deduce that \(\{\theta _j\}\) converges to \((w^{b*}, 0_{k_2 \times k_2})^{\top }\). This completes the proof of (59).
1. \(\overline{b}_{j} \rightarrow 0\) almost surely as \(j \rightarrow \infty \). This follows from the hypothesis \(\overline{\beta }_{j} = o(\beta _j)\) and the fact that \(\theta _j \rightarrow \theta ^{*}\) almost surely.
2. \(\overline{h}\) is Lipschitz continuous.
3. \(\{\overline{{\mathbb {M}}}_{j}\}\) is a martingale difference sequence.
4. \(\{\overline{\theta }_{j}\}\) is stable, i.e., \(\sup _{j}\Vert \overline{\theta }_{j} \Vert < \infty \).
5. The ODE defined by \(\dot{\overline{\theta }}(t) = \overline{h}(\overline{\theta }(t))\) is globally asymptotically stable at \(\theta ^{*}\).
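As a small illustration of why the averaged sequence inherits the limit of the fast iterate while smoothing out its noise, consider a scalar SA recursion driven toward a known target (a toy stand-in for (55)), with a Polyak average run on the slower step sizes \(\overline{\beta }_j = 1/j = o(\beta _j)\). All names and constants here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
target = 2.0
theta, theta_bar = 5.0, 5.0   # fast iterate and its Polyak average

for j in range(1, 20001):
    beta_j = 0.1 / j ** 0.6               # faster step size beta_j
    noise = rng.standard_normal()
    # fast SA recursion: drifts toward `target`, perturbed by noise
    theta += beta_j * (target - theta + noise)
    # Polyak average on the slower timescale beta_bar_j = 1/j = o(beta_j)
    theta_bar += (theta - theta_bar) / j
```

Both iterates converge to the same limit, but the averaged iterate fluctuates far less along the way, which is the variance-reduction effect exploited in step 19 of Algorithm 2.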
4 Experimental illustrations
1. Chain walk MDP.
2. Linearized cart-pole balancing.
3. 5-link actuated pendulum balancing.
4. Random MDP.
4.1 Experiment 1: chain walk
This particular setting (Fig. 6), proposed in Koller and Parr (2000), demonstrates a scenario where policy iteration fails to converge when approximate value functions are employed instead of true ones. The same example is used to empirically evaluate the performance of LSPI in Lagoudakis and Parr (2003). Here, we compare the performance of our algorithm against LSPI and also against the stable Q-learning algorithm with linear function approximation (called Greedy-GQ) proposed in Maei et al. (2010). This demonstration is pertinent in two ways: (1) when LSPI was evaluated on this setting, the maximum state space cardinality considered was 50, whereas we consider a larger MDP with 450 states; and (2) the stable Greedy-GQ algorithm was only evaluated on a small experimental setting in Maei et al. (2010), so by applying it to a relatively harder setting, we attempt to assess its applicability and robustness.
Setup We consider a Markov decision process with \(\vert {\mathbb {S}} \vert = 450\), \({\mathbb {A}} = \{L, R\}\), \(k_1=5\), \(k_2=10\) and the discount factor \(\gamma = 0.99\).
Reward function \(R(\cdot , \cdot , 150) = R(\cdot , \cdot , 300) = 1.0\) and zero for all other transitions. This implies that only transitions into states 150 and 300 acquire a positive payoff, while all other transitions yield zero reward.
Policy features  Prediction features

\(\psi (s,a) = \begin{pmatrix} I_{\{a = L\}}e^{-\frac{(s-m_1)^{2}}{2.0v_{1}^{2}}}\\ \vdots \\ I_{\{a = L\}}e^{-\frac{(s-m_5)^{2}}{2.0v_5^{2}}}\\ I_{\{a = R\}}e^{-\frac{(s-m_1)^{2}}{2.0v_1^{2}}}\\ \vdots \\ I_{\{a = R\}}e^{-\frac{(s-m_5)^{2}}{2.0v_5^{2}}} \end{pmatrix}.\)  \(\phi _{i}(s) = e^{-\frac{(s-m_{i})^{2}}{2.0v_{i}^{2}}},\)

where \(m_i = 5+10(i-1), v_i = 5\), \(1 \le i \le 5\).
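A minimal sketch of these feature maps, with the centres \(m_i\), common width \(v_i = 5\) and the two-action block structure taken from the table above (function names are ours):

```python
import numpy as np

M = np.array([5.0 + 10.0 * i for i in range(5)])   # centres m_i = 5 + 10(i-1)
V = 5.0                                            # common width v_i = 5

def phi(s):
    # prediction features: Gaussian radial basis functions over the state index
    return np.exp(-((s - M) ** 2) / (2.0 * V ** 2))

def psi(s, a):
    # policy features: phi(s) placed in the block of the chosen action (L or R),
    # with zeros in the other action's block
    block = phi(s)
    out = np.zeros(2 * block.size)
    offset = 0 if a == "L" else block.size
    out[offset : offset + block.size] = block
    return out
```

For instance, \(\phi (5)\) has its first component equal to 1 (the state sits on the first centre), and \(\psi (5, R)\) is zero in the first (action L) block.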
Behaviour policy This is the most important choice, and one has to be careful in choosing the behaviour policy. For this setting, we prefer a policy which is unbiased and uniformly covers the action space to provide sufficient exploration. Hence, by choosing \(w_{b} = (0,0,\dots ,0)^{\top }\) we obtain the uniform distribution over the action space for every state in \({\mathbb {S}}\).
Performance function Note that both LSPI and Q-learning search the policy parameter space for an optimal or suboptimal policy by recalibrating the parameter vector at each iteration in the direction of an improved value function. The objective function we consider in this paper is a more general version involving the performance function L and scalarization using \({\mathbb {E}}_{\nu _w}[\cdot ]\). The problem the above algorithms attempt to solve thus becomes a special instance of our generalized version, and hence, to compare our algorithm against them, we take the objective function to be the weighted Euclidean norm of the approximate value function (with the weight being the stationary distribution \(\nu _w\)). Therefore, the performance function L is defined as \(L(h_{w \vert w}) = h^{2}_{w \vert w}\) (where squaring of a vector is defined componentwise). Note that, in our algorithm, we approximate \(h_{w \vert w}\) using the behaviour policy, so the approximation and the stationary distribution actually involved are \(\varPhi x_{w \vert w_b}\) and \(\nu _{w_b}\) respectively. However, since the chosen behaviour policy is the uniform distribution over the action space for each state in \({\mathbb {S}}\), one can easily deduce that the underlying Markov chain of the behaviour policy is a uniform random walk and its stationary distribution is the uniform distribution over the state space \({\mathbb {S}}\).
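Under this choice of L, the objective reduces to a \(\nu \)-weighted squared norm of the approximate value function, computable directly from a feature matrix \(\varPhi \) and a coefficient vector x. A small sketch (names are illustrative):

```python
import numpy as np

def J_b(x_w, Phi, nu):
    # scalarized objective: E_nu[L(h)] with L(h) = h^2 applied componentwise,
    # where h = Phi @ x_w approximates the value function h_{w|w_b}
    h = Phi @ x_w
    return float(nu @ (h ** 2))
```

With the uniform stationary distribution \(\nu (s) = 1/\vert {\mathbb {S}} \vert \), this is exactly the (normalized) squared Euclidean norm of the approximate value function.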
Algorithm parameter values used in the chain walk experiment
\(\beta _{j}\)  0.2 
\(\overline{\beta }_{j}\)  0 
\(\zeta \)  0 
\(c_j\)  0.08 
\(\rho \)  0.05 
\(\epsilon \)  0.9 
\(\tau \)  1.0 
r  0.01 
4.2 Experiment 2: linearized cart-pole balancing (Dann et al. 2014)
Setup A pole with mass m and length l is connected to a cart of mass M. The pole can rotate in the interval \([-\pi , \pi ]\), with negative angles representing rotation in the counterclockwise direction. The cart is free to move in either direction within the bounds of a linear track, with the distance lying in the region \([-4.0, 4.0]\) and negative distances representing movement to the left of the origin. In our experiment, we have \(m = 0.5\), \(M = 0.5\), \(l = 20.5\) and the discount factor \(\gamma = 0.1\).
Goal To bring the cart to the equilibrium position, i.e., to balance the pole upright and the cart at the centre of the track.
State space The state is the 4-tuple \((x, \dot{x}, \psi , \dot{\psi })^{\top }\), where \(\psi \) is the angle of the pendulum w.r.t. the vertical axis, \(\dot{\psi }\) is the angular velocity, x is the cart position relative to the centre of the track and \(\dot{x}\) is its velocity. For better tractability, we restrict \(\dot{x} \in [-5.0, 5.0]\) and \(\dot{\psi } \in [-5.0, 5.0]\).
Control (Policy) space The controller applies a horizontal force a on the cart parallel to the track. The stochastic policy used in this setting is \(\pi (a \vert s) = \mathcal {N}(a \mid \vartheta ^{\top }s, \sigma ^{2})\) (a normal distribution with mean \({\vartheta }^{\top }s\) and standard deviation \(\sigma \)). Here the policy is parametrized by \(\vartheta \in \mathrm{I\!R}^{4}\) and \(\sigma \in \mathrm{I\!R}\).
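Sampling from this Gaussian policy is a one-liner; the following sketch (function name ours) draws a horizontal force for a given state:

```python
import numpy as np

def sample_action(s, vartheta, sigma, rng):
    # pi(a|s) = N(a | vartheta^T s, sigma^2): mean is linear in the state
    return rng.normal(np.dot(vartheta, s), sigma)
```

With \(\sigma = 0\) the policy degenerates to the deterministic linear controller \(a = \vartheta ^{\top }s\).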
Reward function \(R(s, a) = R(\psi , \dot{\psi }, x, \dot{x}, a) = -4\psi ^2 - x^2 - 0.1a^2\). The reward function can be viewed as assigning a penalty directly proportional to the deviation from the equilibrium state.
Prediction features \(\phi (s \in \mathrm{I\!R}^{4}) = (1, s_{1}^{2}, s_{2}^{2} \dots , s_{1}s_{2}, s_{1}s_{3}, \dots , s_{3}s_{4})^{\top } \in \mathrm{I\!R}^{11}\).
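These quadratic features (a bias term plus all degree-2 monomials) can be generated mechanically; a sketch, with the function name ours:

```python
from itertools import combinations_with_replacement
import numpy as np

def quad_features(s):
    # (1, s_i s_j for all i <= j): bias plus every degree-2 monomial of the state
    s = np.asarray(s, dtype=float)
    prods = [s[i] * s[j] for i, j in combinations_with_replacement(range(s.size), 2)]
    return np.array([1.0] + prods)
```

For the 4-dimensional cart-pole state this yields \(1 + \binom{4+1}{2} = 11\) features, matching \(\phi (s) \in \mathrm{I\!R}^{11}\) above.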
Behaviour policy \(\pi _{b}(a \vert s) = \mathcal {N}(a \mid \vartheta _{b}^{\top }s, \sigma _{b}^{2})\), where \(\vartheta _{b} = (3.684, 3.193, 4.252,\) \(3.401)^{\top }\) and \(\sigma _{b} = 5.01\). The behaviour policy is obtained by approximately solving the problem using true value functions and then choosing the behaviour policy vector \(\vartheta _b\) by perturbing each component of the approximate solution so obtained. The margin of perturbation is chosen randomly from the interval \([-5.0, 5.0]\).
Algorithm parameter values used in the experiments
Cartpole experiment  Actuated pendulum balancing  

\(\beta _{j}\)  0.7  0.7 
\(\overline{\beta }_{j}\)  \(j^{-1}_{(n)}\)  \(j^{-1}_{(n)}\)
\(\zeta \)  \(j^{-1}_{(n)}\)  \(j^{-1}_{(n)}\)
\(\lambda \)  0.1  0.1 
\(c_j\)  0.1  0.1 
\(\rho \)  0.01  0.01 
\(\epsilon \)  0.9  0.9 
r  0.01  0.01 
\(N_{j}\)  \(4000, \forall j\)  \(4000, \forall j\) 
4.3 Experiment 3: 5-link actuated pendulum balancing (Dann et al. 2014)
Setup Five independent poles, each with mass m and length l, are connected by 5 rotational joints, with the top pole being a pendulum. In our experiment, we take \(m = 1.5\), \(l = 10.0\) and the discount factor \(\gamma = 0.1\).
Goal To keep all the poles in the horizontal position by applying independent torques at each joint.
State space The state \(s = (q, \dot{q})^{\top } \in \mathrm{I\!R}^{10}\), where \(q = (\psi _{1}, \psi _{2}, \psi _{3}, \psi _{4}, \psi _{5}) \in \mathrm{I\!R}^{5}\) and \(\dot{q} = (\dot{\psi }_{1}, \dot{\psi }_{2}, \dot{\psi }_{3}, \dot{\psi }_{4}, \dot{\psi }_{5}) \in \mathrm{I\!R}^{5}\), with \(\psi _{i}\) being the angle of pole i w.r.t. the horizontal axis and \(\dot{\psi }_{i}\) its angular velocity. In our experiment, we consider the following bounds on the state space: \(\psi _i \in [-\pi , \pi ]\), \(\forall 1 \le i \le 5\) and \(\dot{\psi }_i \in [-5.0, 5.0]\), \(\forall 1 \le i \le 5\).
Feature vectors \(\phi (s \in \mathrm{I\!R}^{10}) = (1, s_{1}^{2}, s_{2}^{2} \dots , s_{1}s_{2}, s_{1}s_{3}, \dots , s_{9}s_{10})^{\top } \in \mathrm{I\!R}^{46}\).
The various parameter values employed and the results obtained in the experiment are provided in Table 2 and Fig. 9 respectively.
4.4 Experiment 4: random MDP
Setup We consider a randomly generated Markov decision process with \(\vert {\mathbb {S}} \vert = 500\), \(\vert {\mathbb {A}} \vert = 30\), \(k_1=5\), \(k_2=5\) and \(\gamma = 0.8\).
Policy features  Prediction features 

\(\psi (s, a) = B[s\vert \mathbb {A} \vert + a]\)  \(\phi _{i}(s) = e^{-\frac{(s-m_{i})^{2}}{2.0v_{i}^{2}}}\)
\(\text {where }{B = \begin{pmatrix} 1 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 0 & 1\\ 1 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0\\ \vdots & & \ddots & &\vdots \\ \end{pmatrix}.}_{15000 \times 5}\)  where \(m_i = 5+10(i-1), v_i = 5.\)
In this experimental setting, we employ the Gibbs “softmax” policies defined in Eq. (7).
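A minimal sketch of a Gibbs softmax policy in the spirit of Eq. (7), where the matrix `psi_s` stacks \(\psi (s,a)\) row-wise over actions (names are ours):

```python
import numpy as np

def gibbs_policy(w, psi_s, tau=1.0):
    # pi(a|s) proportional to exp(psi(s,a)^T w / tau); row a of psi_s is psi(s,a)
    logits = psi_s @ w / tau
    logits -= logits.max()        # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()
```

Note that \(w = 0\) yields the uniform distribution over actions regardless of the features, which is exactly the reasoning behind choosing \(w_b = 0\) as the behaviour policy in Experiment 1.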
Behaviour policy The behaviour policy vector \(w_b\) considered for the experiment is \(w_{b} = (12.774, 15.615, 20.626, 25.877, 11.945)^{\top }\).
Performance function The performance function L is defined as follows:
Algorithm parameter values used in the random MDP experiment
\(\beta _{j}\)  0.7 
\(\overline{\beta }_{j}\)  \(j^{-1}_{(n)}\)
\(\zeta \)  \(j^{-1}_{(n)}\)
\(c_j\)  0.1 
\(\rho \)  0.01 
\(\epsilon \)  0.9 
\(\tau \)  \(10^{-3}\)
r  0.001 
\(N_{j}\)  \(1000, \forall j\) 
As with the previous two experiments, Algorithm 2 was run in the off-policy setting, while SPSA, MRAS and fast policy search were run in the on-policy setting.
The various parameter values employed and the results obtained in the experiment are provided in Table 3 and Fig. 10 respectively.
4.5 Exegesis of the experiments
In this section, we summarize the inferences drawn from the above experiments:
(1) The proposed algorithm performed better than the state-of-the-art methods without compromising the rate of convergence. The choice of the underlying behaviour policy indeed influenced this improved performance: to obtain high-quality solutions, the choice of the behaviour policy is pivotal. In Experiment 1, we considered a uniform policy, where every action is equally likely to be chosen in each state of \({\mathbb {S}}\). The results obtained in that experiment are quite promising, since by utilizing only a uniform behaviour policy we were able to obtain superior-quality solutions. These results require justification, considering the fact that LSPI is shown to produce an optimal policy given a generative model. Note that in the original LSPI paper, the LSPI method utilizes a sample trajectory provided in the form of tuples \(\{(s_i, a_i, r_i, s_i^{\prime })\}_{i \in {\mathbb {N}}}\), where \(s_i\) and \(a_i\) are drawn uniformly at random from \({\mathbb {S}}\) and \({\mathbb {A}}\) respectively, while \(s_{i}^{\prime }\) is the state reached from \(s_i\) under \(a_i\) by following the underlying transition dynamics of the MDP and \(r_i\) is the immediate reward for that transition. One can immediately see that the information content required to generate such a trajectory is equivalent to that of maintaining a generative model.
Further, in Lagoudakis and Parr (2003), where LSPI is empirically evaluated, a trajectory length of 5000 is used in the 20-state chain walk to obtain optimal performance. However, in our experiment (Experiment 1) with 450 states, we still provide only a trajectory of length 5000 to LSPI, which explains its suboptimal performance. One should also consider the fact that the behaviour policy utilized by our algorithm in the same experiment is uniform (no prior information about the MDP is availed) and the trajectory length is only half that of LSPI. Regarding the performance of Q-learning, we know [from Theorem 1 of Maei et al. (2010)] that the method can only provide suboptimal solutions.
In Experiments 2, 3 and 4, we chose the behaviour policy based on more than passable knowledge of the MDP. To make the comparison unbiased (since our algorithm utilized prior information about the MDP to induce the behaviour policy), the algorithms against which our method is compared (MRAS, fast policy search and SPSA) employed the more accurate on-policy approximation, which requires the generative model. This is in contrast to our method, where the off-policy approximation is used. Our algorithm performed as well as the state-of-the-art methods in the cart-pole experiment and noticeably best in the actuated pendulum experiment. This is despite the fact that our algorithm is primarily designed for the discrete, finite MDP setting, while the cart-pole and actuated pendulum experiments are MDPs with continuous state and action spaces. The suboptimal performance of fast policy search and MRAS is primarily attributed to insufficient sample size: the computing machine we used for the experiments is a 64-bit Intel i3 processor with 4 GB of memory, and with such limited resources there is a finite limit to which the sample size can be scaled. This illustrates the effectiveness of our approach in a resource-restricted setting. Regarding the random MDP experiment, the performance of our algorithm is on par with (in fact superior to) the state-of-the-art schemes.
(4) Finally, in the experiments, we found that the parameter requiring the most tuning is \(\beta _j\), which is intuitive since \(\beta _{j}\) drives most of the stochastic recursions. The other parameters required minimal tuning, with almost all of them taking common values.
4.6 Data efficiency
This independence of our algorithm from the dimension of the policy space has a real pragmatic advantage: as a result, our algorithm can be applied to very large and complex MDPs with wider policy spaces, where fast policy search and MRAS might become intractable.
Another advantage of our approach is its applicability to legacy systems. In such systems, information on the dynamics of the system, whether in digital or paper form, may be hard to find. However, human experience from long-term interaction with the system is available in most cases. Utilizing this experience to develop a generative model of the system might be hard; using it to find a behaviour policy with average performance is far more plausible, and such a policy can in turn be exploited by our algorithm to find an optimal policy.
5 Conclusion
We presented an algorithm which solves the modified control problem in a model-free MDP setting and showed its convergence to the globally optimal policy relative to the choice of the behaviour policy. The algorithm is data efficient, robust and stable, as well as computationally and storage efficient. Under an appropriately chosen behaviour policy, it is also seen to consistently outperform, or remain competitive with, the current state-of-the-art off-policy and on-policy methods.
References
Alon, G., Kroese, D. P., Raviv, T., & Rubinstein, R. Y. (2005). Application of the cross-entropy method to the buffer allocation problem in a simulation-based environment. Annals of Operations Research, 134(1), 137–151.
Antos, A., Szepesvári, C., & Munos, R. (2007). Value-iteration based fitted policy iteration: Learning with a single trajectory. In 2007 IEEE international symposium on approximate dynamic programming and reinforcement learning (pp. 330–337).
Antos, A., Szepesvári, C., & Munos, R. (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1), 89–129.
Bagnell, J. A., & Schneider, J. G. (2001). Autonomous helicopter control using reinforcement learning policy search methods. In Proceedings 2001 ICRA. IEEE international conference on robotics and automation, vol. 2 (pp. 1615–1620).
Balleine, B. W., & Dickinson, A. (1998). Goal-directed instrumental action: Contingency and incentive learning and their cortical substrates. Neuropharmacology, 37(4), 407–419.
Barreto, A. D. M. S., Pineau, J., & Precup, D. (2014). Policy iteration based on stochastic factorization. Journal of Artificial Intelligence Research, 50, 763–803.
Barto, A. G., Bradtke, S. J., & Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1), 81–138.
Baxter, J., & Bartlett, P. L. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 319–350.
Bertsekas, D. P. (1995). Dynamic programming and optimal control (Vol. 1). Belmont, MA: Athena Scientific.
Bertsekas, D. P., & Castanon, D. A. (1989). Adaptive aggregation methods for infinite horizon dynamic programming. IEEE Transactions on Automatic Control, 34(6), 589–598.
Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., & Lee, M. (2009). Natural actor-critic algorithms. Automatica, 45(11), 2471–2482.
Borkar, V. S. (2008). Stochastic approximation. Cambridge: Cambridge University Press.
Chang, H. S., Hu, J., Fu, M. C., & Marcus, S. I. (2013). Simulation-based algorithms for Markov decision processes. Berlin: Springer.
Dann, C., Neumann, G., & Peters, J. (2014). Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research, 15(1), 809–883.
Deisenroth, M., & Rasmussen, C. E. (2011). Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th international conference on machine learning (ICML) (pp. 465–472).
de Boer, P. T. (2000). Analysis and efficient simulation of queueing models of telecommunication systems. Centre for Telematics and Information Technology, University of Twente.
Ertin, E., Dean, A. N., Moore, M. L., & Priddy, K. L. (2001). Dynamic optimization for optimal control of water distribution systems. Applications and Science of Computational Intelligence IV, 4390, 142–149.
Feinberg, E. A., & Shwartz, A. (2012). Handbook of Markov decision processes: Methods and applications. Berlin: Springer.
Fracasso, P., Barnes, F., & Costa, A. (2014). Optimized control for water utilities. Procedia Engineering, 70, 678–687.
Glynn, P. W., & Iglehart, D. L. (1989). Importance sampling for stochastic simulations. Management Science, 35(11), 1367–1392.
Helvik, B. E., & Wittner, O. (2001). Using the cross-entropy method to guide/govern mobile agents path finding in networks. In International workshop on mobile agents for telecommunication applications (pp. 255–268). Springer.
Higham, N. J. (1994). A survey of componentwise perturbation theory in numerical linear algebra. In W. Gautschi (Ed.), Mathematics of computation 1943–1993: A half century of computational mathematics (Proceedings of Symposia in Applied Mathematics) (Vol. 48, pp. 49–77). Providence, RI: American Mathematical Society.
Hu, J., Fu, M. C., & Marcus, S. I. (2007). A model reference adaptive search method for global optimization. Operations Research, 55(3), 549–568.
Hu, J., Hu, P., & Chang, H. S. (2012). A stochastic approximation framework for a class of randomized optimization algorithms. IEEE Transactions on Automatic Control, 57(1), 165–178.
Ikonen, E., & Bene, J. (2011). Scheduling and disturbance control of a water distribution network. IFAC Proceedings Volumes, 44(1), 7138–7143.
Joseph, A. G., & Bhatnagar, S. (2016a). A randomized algorithm for continuous optimization. In Winter simulation conference, WSC 2016, Washington, DC, USA, December 11–14 (pp. 907–918).
Joseph, A. G., & Bhatnagar, S. (2016b). A cross entropy based stochastic approximation algorithm for reinforcement learning with linear function approximation. CoRR abs/1207.0016.
Joseph, A. G., & Bhatnagar, S. (2016c). Revisiting the cross entropy method with applications in stochastic global optimization and reinforcement learning. Frontiers in Artificial Intelligence and Applications, 285(ECAI 2016), 1026–1034. https://doi.org/10.3233/978-1-61499-672-9-1026.
Keith, J., & Kroese, D. P. (2002). Rare event simulation and combinatorial optimization using cross entropy: Sequence alignment by rare event simulation. In Proceedings of the 34th conference on winter simulation: Exploring new frontiers, winter simulation conference (pp. 320–327).
Koller, D., & Parr, R. (2000). Policy iteration for factored MDPs. In Proceedings of the sixteenth conference on uncertainty in artificial intelligence (pp. 326–334). Morgan Kaufmann Publishers Inc.
Konda, V. R., & Tsitsiklis, J. N. (2003). Actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4), 1143–1166.
Kroese, D. P., Porotsky, S., & Rubinstein, R. Y. (2006). The cross-entropy method for continuous multi-extremal optimization. Methodology and Computing in Applied Probability, 8(3), 383–407.
Kumar, P., & Lin, W. (1982). Optimal adaptive controllers for unknown Markov chains. IEEE Transactions on Automatic Control, 27(4), 765–774.
Lagoudakis, M. G., & Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4, 1107–1149.
Lee, S. W., Shimojo, S., & O’Doherty, J. P. (2014). Neural computations underlying arbitration between model-based and model-free learning. Neuron, 81(3), 687–699.
Maei, H. R., Szepesvári, C., Bhatnagar, S., & Sutton, R. S. (2010). Toward off-policy learning control with function approximation. In Proceedings of the 27th international conference on machine learning (ICML) (pp. 719–726).
Mannor, S., Rubinstein, R. Y., & Gat, Y. (2003). The cross entropy method for fast policy search. In Proceedings of the 20th international conference on machine learning (ICML) (pp. 512–519).
Menache, I., Mannor, S., & Shimkin, N. (2005). Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research, 134(1), 215–238.
Moore, A. W., & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13(1), 103–130.
Mühlenbein, H., & Paass, G. (1996). From recombination of genes to the estimation of distributions I. Binary parameters. In International conference on parallel problem solving from nature (pp. 178–187). Springer.
O’Doherty, J. P., Lee, S. W., & McNamee, D. (2015). The structure of reinforcement-learning mechanisms in the human brain. Current Opinion in Behavioral Sciences, 1, 94–100.
Polyak, B. T., & Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4), 838–855.
Puterman, M. L. (2014). Markov decision processes: Discrete stochastic dynamic programming. New York: Wiley.
Rubinstein, R. (1999). The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 1(2), 127–190.
Rubinstein, R. Y. (2002). Cross-entropy and rare events for maximal cut and partition problems. ACM Transactions on Modeling and Computer Simulation (TOMACS), 12(1), 27–53.
Rubinstein, R. Y., & Kroese, D. P. (2013). The cross-entropy method: A unified approach to combinatorial optimization, Monte-Carlo simulation and machine learning. Berlin: Springer.
Sato, M., Abe, K., & Takeda, H. (1982). Learning control of finite Markov chains with unknown transition probabilities. IEEE Transactions on Automatic Control, 27(2), 502–505.
Sato, M., Abe, K., & Takeda, H. (1988). Learning control of finite Markov chains with an explicit trade-off between estimation and control. IEEE Transactions on Systems, Man, and Cybernetics, 18(5), 677–684.
Singh, S. P., & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1–3), 123–158.
Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3), 332–341.
 Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9–44.Google Scholar
 Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge: MIT Press.Google Scholar
 Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporaldifference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674–690.MathSciNetzbMATHCrossRefGoogle Scholar
 Varga, R. S. (1976). On diagonal dominance arguments for bounding \(\Vert A^{1}\Vert _{\infty }\). Linear Algebra and its Applications, 14(3), 211–217.MathSciNetzbMATHGoogle Scholar
 Wang, B., & Enright, W. (2013). Parameter estimation for ODEs using a crossentropy approach. SIAM Journal on Scientific Computing, 35(6), A2718–A2737.MathSciNetzbMATHCrossRefGoogle Scholar
 Watkins, C. J. C. H. (1989). Learning from delayed rewards. Ph.D. Thesis, University of Cambridge England.Google Scholar
 Xue, J. (1997). A note on entrywise perturbation theory for Markov chains. Linear Algebra and its Applications, 260, 209–213.MathSciNetzbMATHCrossRefGoogle Scholar
 Yu, H. (2012). Least squares temporal difference methods: An analysis under general conditions. SIAM Journal on Control and Optimization, 50(6), 3310–3343.MathSciNetzbMATHCrossRefGoogle Scholar
 Yu, H. (2015). On convergence of emphatic temporaldifference learning. In Proceedings of the conference on computational learning theory.Google Scholar
 Zhou, E., Bhatnagar, S., Chen, X. (2014). Simulation optimization via gradientbased stochastic search. In Proceedings of the 2014 winter simulation conference (pp. 3869–3879). IEEE Press.Google Scholar
 Zlochin, M., Birattari, M., Meuleau, N., & Dorigo, M. (2004). Modelbased search for combinatorial optimization: A critical survey. Annals of Operations Research, 131(1–4), 373–395.MathSciNetzbMATHCrossRefGoogle Scholar