Neural Networks and Deep Learning, pp. 373–417
Deep Reinforcement Learning
Abstract
“The reward of suffering is experience.”—Harry S. Truman
9.1 Introduction
Human beings do not learn from a concrete notion of training data. Learning in humans is a continuous experience-driven process in which decisions are made, and the reward/punishment received from the environment is used to guide the learning process for future decisions. In other words, learning in intelligent beings is by reward-guided trial and error. Furthermore, much of human intelligence and instinct is encoded in genetics, which has evolved over millions of years with another environment-driven process, referred to as evolution. Therefore, almost all of biological intelligence, as we know it, originates in one form or another through an interactive process of trial and error with the environment. In his interesting book on artificial intelligence [453], Herbert Simon proposed the ant hypothesis:
“Human beings, viewed as behaving systems, are quite simple. The apparent complexity of our behavior over time is largely a reflection of the complexity of the environment in which we find ourselves.”
Human beings are considered simple because they are one-dimensional, selfish, and reward-driven entities (when viewed as a whole), and all of biological intelligence is therefore attributable to this simple fact. Since the goal of artificial intelligence is to simulate biological intelligence, it is therefore natural to draw inspiration from the successes of biological greed in simplifying the design of highly complex learning algorithms.
 1.
Deep learners have been trained to play video games by using only the raw pixels of the video console as feedback. A classical example of this setting is the Atari 2600 console, which is a platform supporting multiple games. The input to the deep learner from the Atari platform is the display of pixels from the current state of the game. The reinforcement learning algorithm predicts the actions based on the display and inputs them into the Atari console. Initially, the computer algorithm makes many mistakes, which are reflected in the virtual rewards given by the console. As the learner gains experience from its mistakes, it makes better decisions. This is exactly how humans learn to play video games. The performance of a recent algorithm on the Atari platform has been shown to surpass human-level performance for a large number of games [165, 335, 336, 432]. Video games are excellent test beds for reinforcement learning algorithms, because they can be viewed as highly simplified representations of the choices one has to make in various decision-centric settings. Simply speaking, video games represent toy microcosms of real life.
 2.
DeepMind has trained a deep learning algorithm, AlphaGo [445], to play the game of Go by using the reward outcomes of the moves in games drawn from both human play and computer self-play. Go is a complex game that requires significant human intuition, and the large tree of possibilities (compared to other games like chess) makes it an incredibly difficult candidate for building a game-playing algorithm. AlphaGo has not only convincingly defeated all top-ranked Go players it has played against [602, 603], but has contributed to innovations in the style of human play by using unconventional strategies in defeating these players. These innovations were a result of the reward-driven experience gained by AlphaGo by playing itself over time. Recently, the approach has also been generalized to chess, and it has convincingly defeated one of the top conventional engines [447].
 3.
In recent years, deep reinforcement learning has been harnessed in self-driving cars by using the feedback from various sensors around the car to make decisions. Although it is more common to use supervised learning (or imitation learning) in self-driving cars, the option of using reinforcement learning has always been recognized as a viable possibility [604]. During the course of driving, these cars now consistently make fewer errors than do human beings.
 4.
The quest for creating self-learning robots is a task in reinforcement learning [286, 296, 432]. For example, robot locomotion turns out to be surprisingly difficult in nimble configurations. Teaching a robot to walk can be couched as a reinforcement learning task, if we do not show a robot what walking looks like. In the reinforcement learning paradigm, we only incentivize the robot to get from point A to point B as efficiently as possible using its available limbs and motors [432]. Through reward-guided trial and error, robots learn to roll, crawl, and eventually walk.
Reinforcement learning is appropriate for tasks that are simple to evaluate but hard to specify. For example, it is easy to evaluate a player’s performance at the end of a complex game like chess, but it is hard to specify the precise action in every situation. As in biological organisms, reinforcement learning provides a path to the simplification of learning complex behaviors by only defining the reward and letting the algorithm learn reward-maximizing behaviors. The complexity of these behaviors is automatically inherited from that of the environment. This is the essence of Herbert Simon’s ant hypothesis [453] at the beginning of this chapter. Reinforcement learning systems are inherently end-to-end systems in which a complex task is not broken up into smaller components, but viewed through the lens of a simple reward.
The simplest example of a reinforcement learning setting is the multi-armed bandit problem, which addresses the problem of a gambler choosing one of many slot machines in order to maximize his payoff. The gambler suspects that the (expected) rewards from the various slot machines are not the same, and therefore it makes sense to play the machine with the largest expected reward. Since the expected payoffs of the slot machines are not known in advance, the gambler has to explore different slot machines by playing them and also exploit the learned knowledge to maximize the reward. Although exploration of a particular slot machine might gain some additional knowledge about its payoff, it incurs the risk of the (potentially fruitless) cost of playing it. Multi-armed bandit algorithms provide carefully crafted strategies to optimize the tradeoff between exploration and exploitation. However, in this simplified setting, each decision of choosing a slot machine is identical to the previous one. This is not quite the case in settings such as video games and self-driving cars with raw sensory inputs (e.g., video game screen or traffic conditions), which define the state of the system. Deep learners are excellent at distilling these sensory inputs into state-sensitive actions by wrapping their learning process within the exploration/exploitation framework.
Chapter Organization
This chapter is organized as follows. The next section introduces multi-armed bandits, which constitute one of the simplest stateless settings in reinforcement learning. The notion of states is introduced in Section 9.3. The Q-learning method is introduced in Section 9.4. Policy gradient methods are discussed in Section 9.5. The use of Monte Carlo tree search strategies is discussed in Section 9.6. A number of case studies are discussed in Section 9.7. The safety issues associated with deep reinforcement learning methods are discussed in Section 9.8. A summary is given in Section 9.9.
9.2 Stateless Algorithms: Multi-Armed Bandits
We revisit the problem of a gambler who repeatedly plays slot machines based on previous experience. The gambler suspects that one of the slot machines has a better expected reward than the others and attempts to both explore and exploit his experience with the slot machines. Trying the slot machines randomly is wasteful but helps in gaining experience. Trying the slot machines for a very small number of times and then always picking the best machine might lead to solutions that are poor in the long term. How should one navigate this tradeoff between exploration and exploitation? Note that every trial provides the same probabilistically distributed reward as previous trials for a given action, and therefore there is no notion of state in such a system. This is a simplified case of traditional reinforcement learning in which the notion of state is important. In a computer video game, moving the cursor in a particular direction has a reward that heavily depends on the state of the video game.
There are a number of strategies that the gambler can use to regulate the tradeoff between exploration and exploitation of the search space. In the following, we will briefly describe some of the common strategies used in multi-armed bandit systems. All these methods are instructive because they provide the basic ideas and framework, which are used in generalized settings of reinforcement learning. In fact, some of these stateless algorithms are also used as subroutines in general forms of reinforcement learning. Therefore, it is important to explore this simplified setting.
9.2.1 Naïve Algorithm
In this approach, the gambler plays each machine for a fixed number of trials in the exploration phase. Subsequently, the machine with the highest payoff is used forever in the exploitation phase. Although this approach might seem reasonable at first sight, it has a number of drawbacks. The first problem is that it is hard to determine the number of trials at which one can confidently predict whether a particular slot machine is better than another machine. The process of estimation of payoffs might take a long time, especially in cases where the payoff events are rare compared to non-payoff events. Using many exploratory trials will waste a significant amount of effort on suboptimal strategies. Furthermore, if the wrong strategy is selected in the end, the gambler will use the wrong slot machine forever. Therefore, the approach of fixing a particular strategy forever is unrealistic in real-world problems.
9.2.2 ε-Greedy Algorithm
The ε-greedy algorithm is designed to use the best strategy as soon as possible, without wasting a significant number of trials. The basic idea is to choose a random slot machine for a fraction ε of the trials. These exploratory trials are also chosen at random (with probability ε) from all trials, and are therefore fully interleaved with the exploitation trials. In the remaining (1 − ε) fraction of the trials, the slot machine with the best average payoff so far is used. An important advantage of this approach is that one is guaranteed to not be trapped in the wrong strategy forever. Furthermore, since the exploitation stage starts early, one is often likely to use the best strategy a large fraction of the time.
The value of ε is an algorithm parameter. For example, in practical settings, one might set ε = 0.1, although the best choice of ε will vary with the application at hand. It is often difficult to know the best value of ε to use in a particular setting. Nevertheless, the value of ε needs to be reasonably small in order to gain significant advantages from the exploitation portion of the approach. However, at small values of ε it might take a long time to identify the correct slot machine. A common approach is to use annealing, in which large values of ε are initially used, with the values declining with time.
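As a concrete illustration, the annealed ε-greedy strategy can be sketched as follows. The Gaussian payoffs, the initial value ε = 0.5, and the multiplicative annealing schedule are illustrative assumptions rather than choices made in the text.

```python
import random

def epsilon_greedy_bandit(true_means, num_trials=5000, eps0=0.5, decay=0.999):
    """Play k slot machines with an annealed epsilon-greedy policy and
    return the estimated mean payoffs and play counts."""
    k = len(true_means)
    counts = [0] * k       # number of plays of each machine
    means = [0.0] * k      # running average payoff of each machine
    eps = eps0
    for _ in range(num_trials):
        if random.random() < eps:                       # explore
            i = random.randrange(k)
        else:                                           # exploit current best
            i = max(range(k), key=lambda j: means[j])
        reward = random.gauss(true_means[i], 1.0)       # stochastic payoff
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]     # incremental average
        eps *= decay                                    # anneal exploration
    return means, counts
```

With enough trials, the machine with the largest true mean accumulates most of the plays, while the early exploratory trials prevent a premature lock-in on a suboptimal machine.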
9.2.3 Upper Bounding Methods
In upper-bounding methods, the gambler plays the slot machine with the largest statistical upper bound U_{i} = Q_{i} + C_{i} on its expected payoff, where Q_{i} is the mean observed payoff of machine i and C_{i} is a confidence-interval bonus that shrinks as machine i accumulates plays. Unlike ε-greedy, the trials are no longer divided into two categories of exploration and exploitation; the process of selecting the slot machine with the largest upper bound has the dual effect of encoding both the exploration and exploitation aspects within each trial. One can regulate the tradeoff between exploration and exploitation by using a specific level of statistical confidence. The choice of K = 3 leads to a 99.99% confidence interval for the upper bound under the Gaussian assumption. In general, increasing K will give large bonuses C_{i} for uncertainty, thereby causing exploration to comprise a larger proportion of the plays compared to an algorithm with smaller values of K.
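The selection rule can be sketched in a few lines. The exact form of the bonus is not reproduced in this excerpt, so the choice below (K standard errors of the mean, with a known payoff standard deviation σ) is an assumption consistent with the Gaussian discussion above.

```python
import math

def ucb_choice(means, counts, K=3.0, sigma=1.0):
    """Select the machine with the largest upper bound Q_i + C_i, where the
    bonus C_i = K * sigma / sqrt(n_i) shrinks as machine i accumulates plays."""
    def upper_bound(i):
        if counts[i] == 0:
            return float("inf")   # unplayed machines are tried first
        return means[i] + K * sigma / math.sqrt(counts[i])
    return max(range(len(means)), key=upper_bound)
```

Note how a rarely played machine with a slightly worse mean can still win the selection because its bonus term is large; this is the exploratory component folded into every trial.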
9.3 The Basic Framework of Reinforcement Learning
The bandit algorithms of the previous section are stateless. In other words, the decision made at each time stamp has an identical environment, and the actions in the past only affect the knowledge of the agent (not the environment itself). This is not the case in generic reinforcement learning settings like video games or self-driving cars, which have a notion of state.
In generic reinforcement learning settings, no action is associated with a reward in isolation. While playing a video game, you do not get a reward simply because you made a particular move. The reward of a move depends on all the other moves you made in the past, which are incorporated in the state of the environment. In a video game or self-driving car, we need a different way of performing the credit assignment in a particular system state. For example, in a self-driving car, the reward for violently swerving a car in a normal state would be different from that of performing the same action in a state that indicates the danger of a collision. In other words, we need a way to quantify the reward of each action in a way that is specific to a particular system state.
The learning process helps the agent choose actions based on the inherent values of the actions in different states. This general principle applies to all forms of reinforcement learning in biological organisms, such as a mouse learning a path through a maze to earn a reward. The rewards earned by the mouse depend on an entire sequence of actions, rather than on only the latest action. When a reward is earned, the synaptic weights in the mouse’s brain adjust to reflect how sensory inputs should be used to decide future actions in the maze. This is exactly the approach used in deep reinforcement learning, where a neural network is used to predict actions from sensory inputs (e.g., the pixels of a video game). This relationship between the agent and the environment is shown in Figure 9.1.
Examples
 1.
Game of tic-tac-toe, chess, or Go: The state is the position of the board at any point, and the actions correspond to the moves made by the agent. The reward is +1, 0, or −1 (depending on win, draw, or loss), which is received at the end of the game. Note that rewards are often not received immediately after strategically astute actions.
 2.
Robot locomotion: The state corresponds to the current configuration of robot joints and its position. The actions correspond to the torques applied to robot joints. The reward at each time stamp is a function of whether the robot stays upright and the amount of forward movement from point A to point B.
 3.
Self-driving car: The states correspond to the sensor inputs from the car, and the actions correspond to the steering, acceleration, and braking choices. The reward is a hand-crafted function of car progress and safety.
Some effort usually needs to be invested in defining the state representations and corresponding rewards. However, once these choices have been made, reinforcement learning frameworks are end-to-end systems.
9.3.1 Challenges of Reinforcement Learning
 1.
When a reward is received (e.g., winning a game of chess), it is not exactly known how much each action has contributed to that reward. This problem lies at the heart of reinforcement learning, and is referred to as the credit-assignment problem. Furthermore, rewards may be probabilistic (e.g., when pulling the lever of a slot machine), and their expected values can only be estimated approximately in a data-driven manner.
 2.
The reinforcement learning system might have a very large number of states (such as the number of possible positions in a board game), and must be able to make sensible decisions in states it has not seen before. This task of model generalization is the primary function of deep learning.
 3.
A specific choice of action affects the data collected about future actions. As in multi-armed bandits, there is a natural tradeoff between exploration and exploitation. Taking actions only to learn their rewards incurs a cost to the player. On the other hand, sticking to known actions might result in suboptimal decisions.
 4.
Reinforcement learning merges the notion of data collection with learning. Realistic simulations of large physical systems such as robots and self-driving cars are limited by the need to physically perform these tasks and gather responses to actions in the presence of the practical dangers of failures. In many cases, the early portion of learning in a task may have few successes and many failures. The inability to gather sufficient data in real settings beyond simulated and game-centric environments is arguably the single largest challenge to reinforcement learning.
9.3.2 Simple Reinforcement Learning for Tic-Tac-Toe
One can generalize the stateless ε-greedy algorithm of the previous section to learn to play the game of tic-tac-toe. In this case, each board position is a state, and an action corresponds to placing ‘X’ or ‘O’ at a valid position. The number of valid states of the 3 × 3 board is bounded above by 3^{9} = 19683, which corresponds to three possibilities (‘X’, ‘O’, and blank) for each of 9 positions. Instead of estimating the value of each (stateless) action as in multi-armed bandits, we now estimate the value of each state-action pair (s, a) based on the historical performance of action a in state s against a fixed opponent. Shorter wins are preferred because of the discount factor γ < 1, and therefore the unnormalized value of action a in state s is increased by γ^{r−1} in case of a win and decreased by γ^{r−1} in case of a loss after r moves (including the current move). Draws are credited with 0. The discount also reflects the fact that the significance of an action decays with time in real-world settings.

In this case, the table is updated only after all moves of a game are made (although later methods in this chapter allow online updates after each move). The normalized values of the actions in the table are obtained by dividing the unnormalized values by the number of times the state-action pair was updated (which is maintained separately). The table starts with small random values, and the action a in state s is chosen greedily to be the action with the highest normalized value with probability 1 − ε, and is chosen to be a random action otherwise. All moves in a game are credited after the termination of each game. Over time, the values of all state-action pairs will be learned, and the resulting moves will also adapt to the play of the fixed opponent. Furthermore, one can even use self-play to generate these tables. When self-play is used, the table is updated with a value from {γ^{r−1}, 0, −γ^{r−1}} depending on win/draw/loss from the perspective of the player for whom moves are made.
At inference time, the move with the highest normalized value from the perspective of the player is made.
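The tabular scheme above might be sketched as follows. The class name, the state encoding (any hashable board representation), and the trajectory format are illustrative assumptions; in particular, r is counted here over the player's own moves in the trajectory.

```python
from collections import defaultdict
import random

class TableLearner:
    """Epsilon-greedy learner over state-action pairs for an episodic game."""
    def __init__(self, eps=0.1, gamma=0.9):
        self.eps, self.gamma = eps, gamma
        self.total = defaultdict(float)   # unnormalized value of (state, action)
        self.count = defaultdict(int)     # number of updates to (state, action)

    def value(self, state, action):
        """Normalized value: unnormalized total divided by update count."""
        n = self.count[(state, action)]
        return self.total[(state, action)] / n if n > 0 else 0.0

    def choose(self, state, legal_actions):
        """Random action with probability eps; greedy action otherwise."""
        if random.random() < self.eps:
            return random.choice(legal_actions)
        return max(legal_actions, key=lambda a: self.value(state, a))

    def credit_game(self, trajectory, outcome):
        """After the game ends, credit every (state, action) in the trajectory.
        outcome is +1 (win), 0 (draw), or -1 (loss); r counts the moves
        remaining, including the current one, so earlier moves in a win are
        discounted more heavily."""
        n = len(trajectory)
        for t, (state, action) in enumerate(trajectory):
            r = n - t
            self.total[(state, action)] += outcome * self.gamma ** (r - 1)
            self.count[(state, action)] += 1
```

The same object can be trained against a fixed opponent or by self-play, by calling `credit_game` once per finished game from each player's perspective.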
9.3.3 Role of Deep Learning and a Straw-Man Algorithm
The aforementioned algorithm for tic-tac-toe did not use neural networks or deep learning, and this is also the case in many traditional algorithms for reinforcement learning [483]. The overarching goal of the ε-greedy algorithm for tic-tac-toe was to learn the inherent long-term value of each state-action pair, since the rewards are received long after valuable actions are performed. The goal of the training process is to perform the value discovery task of identifying which actions are truly beneficial in the long term at a particular state. For example, making a clever move in tic-tac-toe might set a trap, which eventually results in assured victory. Examples of two such scenarios are shown in Figure 9.2(a) (although the trap on the right is somewhat less obvious). Therefore, one needs to credit a strategically good move favorably in the table of state-action pairs and not just the final winning move. The trial-and-error technique based on the ε-greedy method of Section 9.3.2 will indeed assign high values to clever traps. Examples of typical values from such a table are shown in Figure 9.2(b). Note that the less obvious trap of Figure 9.2(a) has a slightly lower value because moves assuring wins after longer periods are discounted by γ, and ε-greedy trial-and-error might have a harder time finding the win after setting the trap.
The main problem with this approach is that the number of states in many reinforcement learning settings is too large to tabulate explicitly. For example, the number of possible states in a game of chess is so large that the set of all positions known to humanity is a minuscule fraction of the valid positions. In fact, the algorithm of Section 9.3.2 is a refined form of rote learning in which Monte Carlo simulations are used to refine and remember the long-term values of seen states. One learns about the value of a trap in tic-tac-toe only because previous Monte Carlo simulations have experienced victory many times from that exact board position. In most challenging settings like chess, one must generalize knowledge learned from prior experiences to a state that the learner has not seen before. All forms of learning (including reinforcement learning) are most useful when they are used to generalize known experiences to unknown situations. In such cases, the table-centric forms of reinforcement learning are woefully inadequate. Deep learning models serve the role of function approximators. Instead of learning and tabulating the values of all moves in all positions (using reward-driven trial and error), one learns the value of each move as a function of the input state, based on a trained model using the outcomes of prior positions. Without this approach, reinforcement learning cannot be used beyond toy settings like tic-tac-toe.
For example, a straw-man (but not very good) algorithm for chess might use the same ε-greedy algorithm of Section 9.3.2, but the values of actions are computed by using the board state as input to a convolutional neural network. The output is the evaluation of the board position. The ε-greedy algorithm is simulated to termination with the output values, and the discounted ground-truth value of each move in the simulation is selected from the set {γ^{r−1}, 0, −γ^{r−1}} depending on win/draw/loss and the number of moves r to game completion (including the current move). Instead of updating a table of state-action pairs, the parameters of the neural network are updated by treating each move as a training point. The board position is input, and the output of the neural network is compared with the ground-truth value from {γ^{r−1}, 0, −γ^{r−1}} to update the parameters. At inference time, the move with the best output score (with some minimax lookahead) can be used.
Although the aforementioned approach is too naive, a sophisticated system with Monte Carlo tree search, known as Alpha Zero, has recently been trained [447] to play chess. Two examples of positions [447] from different games in the match between Alpha Zero and a conventional chess program, Stockfish 8.0, are provided in Figure 9.2(c). In the chess position on the left, the reinforcement learning system makes a strategically astute move of cramping the opponent’s bishop at the expense of immediate material loss, which most hand-crafted computer evaluations would not prefer. In the position on the right, Alpha Zero has sacrificed two pawns and a piece exchange in order to incrementally constrict black to a point where all its pieces are completely paralyzed. Even though Alpha Zero (probably) never encountered these specific positions during training, its deep learner has the ability to extract relevant features and patterns from previous trial-and-error experience in other board positions. In this particular case, the neural network seems to recognize the primacy of spatial patterns representing subtle positional factors over tangible material factors (much like a human’s neural network).
In real-life settings, states are often described using sensory inputs. The deep learner uses this input representation of the state to learn the values of specific actions (e.g., making a move in a game) in lieu of the table of state-action pairs. Even when the input representation of the state (e.g., pixels) is quite primitive, neural networks are masters at squeezing out the relevant insights. This is similar to the approach used by humans to process primitive sensory inputs to define the state of the world and make decisions about actions using our biological neural network. We do not have a table of pre-memorized state-action pairs for every possible real-life situation. The deep-learning paradigm converts the forbiddingly large table of state-action values into a parameterized model mapping state-action pairs to values, which can be trained easily with backpropagation.
9.4 Bootstrapping for Value Function Learning
The simple generalization of the ε-greedy algorithm to tic-tac-toe (cf. Section 9.3.2) is a rather naive approach that does not work for non-episodic settings. In episodic settings like tic-tac-toe, a fixed-length sequence of at most nine moves can be used to characterize the full and final reward. In non-episodic settings like robot control, the Markov decision process may not be finite or might be very long. Creating a sample of the ground-truth reward by Monte Carlo sampling becomes difficult, and online updating might be desirable. This is achieved with the methodology of bootstrapping.
Intuition 9.4.1 (Bootstrapping).
Consider a Markov decision process in which we are predicting values (e.g., long-term rewards) at each timestamp. We do not need the ground truth at each timestamp, as long as we can use a partial simulation of the future to improve the prediction at the current timestamp. This improved prediction can be used as the ground truth at the current timestamp for a model without knowledge of the future.
For example, Samuel’s checkers program [421] used the difference in evaluation at the current position and the minimax evaluation obtained by looking several moves ahead with the same function as a “prediction error” in order to update the evaluation function. The idea is that the minimax evaluation from looking ahead is stronger than the one without lookahead and can therefore be used as a “ground truth” to compute the error.
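The bootstrapped target can be written in two lines. The function names below are hypothetical, and the lookahead (or next-step) evaluation is abstracted as a single next_value argument.

```python
def td_target(reward, next_value, gamma=0.99, terminal=False):
    """Bootstrapped target: the (discounted) prediction one step ahead serves
    as a stand-in ground truth for the value at the current step."""
    return reward if terminal else reward + gamma * next_value

def td_update(value, reward, next_value, alpha=0.1, gamma=0.99, terminal=False):
    """Move the current prediction a step of size alpha toward the target."""
    return value + alpha * (td_target(reward, next_value, gamma, terminal) - value)
```

The difference between the target and the current prediction plays exactly the role of Samuel's "prediction error": no external ground truth is required, only a stronger estimate from looking one step further ahead.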
9.4.1 Deep Learning Models as Function Approximators
The Q-Learning Algorithm
The weights \(\overline{W}\) of the neural network need to be learned via training. Here, we encounter an interesting problem. We can learn the vector of weights only if we have observed values of the Q-function. With observed values of the Q-function, we could easily set up a loss in terms of \(Q(s_{t},a) - \hat{Q}(s_{t},a)\) in order to perform the learning after each action. The problem is that the Q-function represents the maximum discounted reward over all future combinations of actions, and there is no way of observing it at the current time.
 1.
Perform a forward pass through the network with input \(\overline{X}_{t+1}\) to compute \(\hat{Q}_{t+1} = \max_{a}F(\overline{X}_{t+1},\overline{W},a)\). The value is 0 in case of termination after performing a_{t}. Treating the terminal state specially is important. According to the Bellman equations, the Q-value at the previous timestamp t should be \(r_{t} +\gamma \hat{Q}_{t+1}\) for the observed action a_{t} at time t. Therefore, instead of using an observed value of the target, we have created a surrogate for the target value at time t, and we pretend that this surrogate is an observed value given to us.
 2.
Perform a forward pass through the network with input \(\overline{X}_{t}\) to compute \(F(\overline{X}_{t},\overline{W},a_{t})\).
 3.
Set up the loss function \(L_{t} = (r_{t} +\gamma \hat{Q}_{t+1} - F(\overline{X}_{t},\overline{W},a_{t}))^{2}\), and backpropagate in the network with input \(\overline{X}_{t}\). Note that this loss is associated with the neural network output node corresponding to action a_{t}, and the loss for all other actions is 0.
 4.
One can now use backpropagation on this loss function in order to update the weight vector \(\overline{W}\). Even though the term \(r_{t} +\gamma \hat{Q}_{t+1}\) in the loss function is also obtained as a prediction from input \(\overline{X}_{t+1}\) to the neural network, it is treated as a (constant) observed value during gradient computation by the backpropagation algorithm.
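The four steps above can be sketched with a linear function approximator standing in for the neural network, so that the gradient step is explicit. The array shapes, the learning rate, and the use of one weight row per action are illustrative assumptions.

```python
import numpy as np

def q_learning_update(W, x_t, a_t, r_t, x_next, gamma=0.99, lr=0.01,
                      terminal=False):
    """One Q-learning step with a linear approximator Q(s, a) = W[a] . x.
    Step 1: bootstrapped target from the next state (0 past a terminal state).
    Step 2: prediction for the action actually taken.
    Steps 3-4: squared loss on the a_t output only, then a gradient step;
    the target is treated as a constant during the gradient computation."""
    q_next = 0.0 if terminal else np.max(W @ x_next)   # step 1
    target = r_t + gamma * q_next
    q_pred = W[a_t] @ x_t                              # step 2
    W[a_t] += lr * (target - q_pred) * x_t             # steps 3-4
    return W
```

Replacing the linear map with a deep network changes only how the prediction and its gradient are computed; the target construction is identical.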
Both the training and the prediction are performed simultaneously, as the values of actions are used to update the weights and select the next action. It is tempting to select the action with the largest Q-value as the relevant prediction. However, such an approach might perform inadequate exploration of the search space. Therefore, one couples the optimality prediction with a policy such as the ε-greedy algorithm in order to select the next action. The action with the largest predicted payoff is selected with probability (1 − ε). Otherwise, a random action is selected. The value of ε can be annealed by starting with large values and reducing them over time. Therefore, the target prediction value for the neural network is computed using the best possible action in the Bellman equation (which might eventually be different from the observed action a_{t+1} based on the ε-greedy policy). This is the reason that Q-learning is referred to as an off-policy algorithm, in which the target prediction values for the neural network update are computed using actions that might be different from the actually observed actions in the future.
There are several modifications to this basic approach in order to make the learning more stable. Many of these are presented in the context of the Atari video game setting [335]. First, presenting the training examples exactly in the sequence they occur can lead to local minima because of the strong similarity among training examples. Therefore, a fixed-length history of actions/rewards is used as a pool. One can view this as a history of experiences. Multiple experiences are sampled from this pool to perform minibatch gradient descent. In general, it is possible to sample the same action multiple times, which leads to greater efficiency in leveraging the learning data. Note that the pool is updated over time as old actions drop out of the pool and newer ones are added. Therefore, the training is still temporal in an approximate sense, but not strictly so. This approach is referred to as experience replay, as experiences are replayed multiple times in a somewhat different order than the original actions.
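A minimal sketch of such an experience pool, assuming a simple deque-backed buffer with uniform sampling (prioritized variants weight the sample instead):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-length pool of transitions; old experiences drop out as new
    ones are added, and minibatches are drawn in a shuffled order."""
    def __init__(self, capacity=10000):
        self.pool = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, terminal):
        self.pool.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size):
        """Uniformly sample a minibatch (without replacement within a batch)."""
        return random.sample(self.pool, min(batch_size, len(self.pool)))
```

Each sampled transition feeds one Q-learning update of the kind described above, which breaks the strong temporal correlation among consecutive training examples.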
Another modification is that the network used for estimating the target Q-values with the Bellman equations (step 1 above) is not the same as the network used for predicting Q-values (step 2 above). The network used for estimating the target Q-values is updated more slowly in order to encourage stability. Finally, one problem with these systems is the sparsity of the rewards, especially at the initial stage of the learning when the moves are random. For such cases, a variety of tricks such as prioritized experience replay [428] can be used. The basic idea is to make more efficient use of the training data collected during reinforcement learning by prioritizing actions from which more can be learned.
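The slowly updated target network might be sketched as follows. The soft (Polyak-style) update below is one common scheme; periodically hard-copying the weights every fixed number of steps is an equally common alternative, and the dict-of-weights representation is an illustrative assumption.

```python
def soft_update(target, online, tau=0.005):
    """Move target-network weights a small step toward the online network's,
    so that the bootstrapped targets in step 1 change only gradually."""
    for key in target:
        target[key] = (1.0 - tau) * target[key] + tau * online[key]
    return target
```

Targets are then computed with the `target` weights while gradient steps are applied only to the `online` weights, decoupling the two networks.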
9.4.2 Example: Neural Network for Atari Setting
All hidden layers used the ReLU activation, and the output used linear activation in order to predict the real-valued Q-value. No pooling was used, and the strides in the convolution provided spatial compression. The Atari platform supports many games, and the same broader architecture was used across different games in order to showcase its generalizability. There was some variation in performance across different games, although human performance was exceeded in many cases. The algorithm faced the greatest challenges in games in which longer-term strategies were required. Nevertheless, the robust performance of a relatively homogeneous framework across many games was encouraging.
9.4.3 OnPolicy Versus OffPolicy Methods: SARSA
The Q-learning methodology belongs to the class of methods referred to as temporal difference learning. In Q-learning, the actions are chosen according to an ε-greedy policy. However, the parameters of the neural network are updated based on the best possible action at each step with the Bellman equation, and the best possible action at each step is not quite the same as the ε-greedy action used to perform the simulation. Therefore, Q-learning is an off-policy reinforcement learning method. Choosing a policy for executing actions that differs from the one used for performing updates does not worsen the ability to find the optimal solutions that the updates seek. In fact, since more exploration is performed with a randomized policy, local optima are avoided more easily.
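The off-policy/on-policy distinction is easiest to see in the tabular updates; this sketch (function names hypothetical) contrasts the Q-learning target, which uses the best action in the next state, with the SARSA target, which uses the action the ε-greedy policy actually chose:

```python
import random

def eps_greedy(Q, s, actions, eps=0.1):
    """The ε-greedy behavior policy used to run the simulation."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    # Off-policy: bootstrap from the *best* action in s2,
    # regardless of which action the ε-greedy policy will actually play.
    target = r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    # On-policy: bootstrap from the action a2 that the ε-greedy
    # policy actually chose in s2.
    target = r + gamma * Q[(s2, a2)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

The two updates coincide only when the behavior policy happens to pick the greedy action in the next state.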
Learning Without Function Approximators
9.4.4 Modeling States Versus StateAction Pairs
Temporal difference learning was used in Samuel's celebrated checkers program [421], and it also motivated the development of TD-Gammon for backgammon by Tesauro [492]. A neural network was used for state-value estimation, and its parameters were updated using temporal-difference bootstrapping over successive moves. The final inference was performed with minimax evaluation of the improved evaluation function over a shallow depth, such as 2 or 3. TD-Gammon was able to defeat several expert players. It also exhibited some unusual strategies of game play that were eventually adopted by top-level players.
9.5 Policy Gradient Methods
The neural network for estimating the policy is referred to as a policy network; its input is the current state of the system, and its output is a set of probabilities associated with the various actions in the video game (e.g., moving up, down, left, or right). As in the case of the Q-network, the input can be an observed representation of the agent state. For example, in the Atari video game setting, the observed state can be the last four screens of pixels. An example of a policy network is shown in Figure 9.6, which is relevant for the Atari setting. It is instructive to compare this policy network with the Q-network of Figure 9.3. Given an output of probabilities for the various actions, we throw a biased die with faces associated with these probabilities, and select one of the actions. Therefore, for each action a, observed state representation \(\overline{X}_{t}\), and current parameters \(\overline{W}\), the neural network computes the function \(P(\overline{X}_{t},\overline{W},a)\), which is the probability that action a should be performed. One of the actions is sampled, and a reward is observed for that action. If the policy is poor, the action is more likely to be a mistake, and the reward will be poor as well. Based on the reward obtained from executing the action, the weight vector \(\overline{W}\) is updated for the next iteration. The update of the weight vector is based on the notion of the policy gradient with respect to \(\overline{W}\). One challenge in estimating the policy gradient is that the reward of an action is often not observed immediately, but is tightly integrated into the future sequence of rewards. Often, Monte Carlo policy rollouts must be used, in which the neural network follows a particular policy in order to estimate the discounted rewards over a longer horizon.
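The "biased die" step can be sketched directly; given the probability vector output by the policy network, one action index is drawn according to those probabilities (the helper name is hypothetical):

```python
import random

def sample_action(probs):
    """Throw the 'biased die': draw an action index with the given probabilities."""
    r, cum = random.random(), 0.0
    for a, p in enumerate(probs):
        cum += p
        if r < cum:
            return a
    return len(probs) - 1  # guard against floating-point round-off
```

For example, with `probs = [0.7, 0.2, 0.1]` the first action is returned roughly 70% of the time.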
9.5.1 Finite Difference Methods
The method of finite differences sidesteps the problem of stochasticity with empirical simulations that provide estimates of the gradient. Finite difference methods use weight perturbations in order to estimate gradients of the reward. The idea is to use s different perturbations of the neural network weights, and examine the expected change ΔJ in the reward. Note that this will require us to run the perturbed policy for the horizon of H moves in order to estimate the change in reward. Such a sequence of H moves is referred to as a rollout. For example, in the case of the Atari game, we will need to play it for a trajectory of H moves for each of these s different sets of perturbed weights in order to estimate the changed reward. In games where an opponent of sufficient strength is not available to train against, it is possible to play a game against a version of the opponent based on parameters learned a few iterations back.
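As a simplified illustration of the finite-difference idea, the sketch below estimates the gradient with coordinate-wise forward differences rather than the regression over s simultaneous perturbations described above; `rollout_return(w)` is an assumed callback that plays a rollout of H moves under weights w and returns its total reward:

```python
def finite_difference_gradient(rollout_return, weights, delta=1e-4):
    """Coordinate-wise forward differences: perturb one weight at a time,
    replay a rollout, and divide the change in return by the step size."""
    base = rollout_return(weights)
    grad = []
    for i in range(len(weights)):
        w = list(weights)
        w[i] += delta
        grad.append((rollout_return(w) - base) / delta)
    return grad
```

In practice each evaluation of `rollout_return` is itself noisy, which is why multiple perturbations and rollouts are averaged.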
9.5.2 Likelihood Ratio Methods
It is easy to use this trick for neural network parameter estimation. Each action a sampled by the simulation is associated with the long-term reward Q^{p}(s, a), which is obtained by Monte Carlo simulation. Based on the relationship above, the gradient of the expected advantage is obtained by multiplying the gradient of the log-probability log(p(a)) of that action (computable from the neural network in Figure 9.6 using backpropagation) with the long-term reward Q^{p}(s, a) (obtained by Monte Carlo simulation).
Note that the gradient of the log-probability of the ground-truth class is often used to update softmax classifiers with cross-entropy loss in order to increase the probability of the correct class (which is intuitively similar to the update here). The difference here is that we are weighting the update with the Q-values, because we want to push the parameters more aggressively in the direction of highly rewarding actions. One could also use mini-batch gradient ascent over the actions in the sampled rollouts. Randomly sampling from different rollouts can be helpful in avoiding the local minima arising from correlations, because the successive samples from each rollout are closely related to one another.
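For a softmax policy, the gradient of the log-probability with respect to the logits has the familiar closed form (indicator minus probability), so the Q-weighted ascent step above can be sketched directly (function names hypothetical):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_update(logits, action, q_value, lr=0.1):
    """One gradient-ascent step on the logits of a softmax policy.
    The gradient of log p(action) w.r.t. logit k is (1[k == action] - p_k),
    and the whole step is weighted by the long-term reward q_value."""
    p = softmax(logits)
    return [z + lr * q_value * ((1.0 if k == action else 0.0) - p[k])
            for k, z in enumerate(logits)]
```

With a positive `q_value` the sampled action's probability rises; with a negative one it falls, exactly as described above.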
Reducing Variance with Baselines: Although we have used the long-term reward Q^{p}(s, a) as the quantity to be optimized, it is more common to subtract a baseline value from this quantity in order to obtain its advantage (i.e., the differential impact of the action over expectation). The baseline is ideally state-specific, but it can be a constant as well. In the original work on REINFORCE, a constant baseline was used (typically some measure of the average long-term reward over all states). Even this type of simple measure can help in speeding up learning, because it reduces the probabilities of less-than-average performers and increases the probabilities of more-than-average performers (rather than increasing both at differential rates). A constant choice of baseline does not affect the bias of the procedure, but it reduces the variance. A state-specific option for the baseline is the value V^{p}(s) of the state s immediately before sampling the action a. Such a choice results in the advantage (Q^{p}(s, a) − V^{p}(s)) becoming identical to the temporal difference error. This choice makes intuitive sense, because the temporal difference error contains additional information about the differential reward of an action beyond what we would know before choosing the action. Discussions on baseline choice may be found in [374, 433].
Consider an example of an Atari game-playing agent, in which a rollout samples the move UP, and the output probability of UP was 0.2. Assume that the (constant) baseline is 0.17, and the long-term reward of the action is +1, since the game results in a win (and there is no reward discount). Therefore, the score of every action in that rollout is 0.83 (after subtracting the baseline). Then, the gain associated with all actions (output nodes of the neural network) other than UP at that time-step would be 0, and the gain associated with the output node corresponding to UP would be 0.83 × log(0.2). One can then backpropagate this gain in order to update the parameters of the neural network.
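The arithmetic of this example can be captured in a one-line helper (the function name is hypothetical):

```python
import math

def action_gain(prob_action, long_term_reward, baseline):
    """Gain backpropagated for a sampled action: advantage × log-probability."""
    advantage = long_term_reward - baseline
    return advantage * math.log(prob_action)
```

For the example above, `action_gain(0.2, 1.0, 0.17)` reproduces the 0.83 × log(0.2) gain for the UP node.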
Adjustment with a state-specific baseline is easy to explain intuitively. Consider the example of a chess game between agents Alice and Bob. If we use a baseline of 0, then each move will only be credited with a reward corresponding to the final result, and the difference between good moves and bad moves will not be evident. In other words, we would need to simulate many more games in order to differentiate positions. On the other hand, if we use the value of the state (before performing the action) as the baseline, then the (more refined) temporal difference error is used as the advantage of the action. In such a case, moves that have greater state-specific impact will be recognized with a higher advantage (within a single game). As a result, fewer games will be required for learning.
9.5.3 Combining Supervised Learning with Policy Gradients
Supervised learning is useful for initializing the weights of the policy network before applying reinforcement learning. For example, in a game of chess, one might have prior examples of expert moves that are already known to be good. In such a case, we simply perform gradient ascent with the same policy network, except that each expert move is assigned the fixed credit of 1 for evaluating the gradient according to Equation 9.24. The problem then becomes identical to softmax classification, where the goal of the policy network is to predict the same move as the expert. One can sharpen the quality of the training data with some examples of bad moves with a negative credit obtained from computer evaluations. This approach would be considered supervised learning rather than reinforcement learning, because we are simply using prior data, and not generating/simulating the data that we learn from (as is common in reinforcement learning). The general idea can be extended to any reinforcement learning setting in which some prior examples of actions and associated rewards are available. Supervised learning is extremely common in these settings for initialization, because of the difficulty in obtaining high-quality data in the early stages of the process. Many published works also interleave supervised learning and reinforcement learning in order to achieve greater data efficiency [286].
9.5.4 ActorCritic Methods
 1.
The Q-learning and TD(λ) methods work with the notion of a value function that is optimized. This value function is a critic, and the policy (e.g., ε-greedy) of the actor is directly derived from this critic. Therefore, the actor is subservient to the critic, and such methods are considered critic-only methods.
 2.
The policy-gradient methods do not use a value function at all; they directly learn the probabilities of the policy actions. The values are often estimated using Monte Carlo sampling. Therefore, these methods are considered actor-only methods.
Note that the policy-gradient methods do need to evaluate the advantage of intermediate actions, and this estimation has so far been done with the use of Monte Carlo simulations. The main problems with Monte Carlo simulation are its high complexity and its unsuitability for use in an online setting.
However, it turns out that one can learn the advantage of intermediate actions using value function methods. As in the previous section, we use the notation Q^{p}(s_{t}, a) to denote the value of action a when the policy p followed by the policy network is used. Therefore, we would now have two coupled neural networks: a policy network and a Q-network. The policy network learns the probabilities of actions, and the Q-network learns the values Q^{p}(s_{t}, a) of the various actions in order to provide an estimate of the advantage to the policy network. Therefore, the policy network uses Q^{p}(s_{t}, a) (with baseline adjustments) to weight its gradient ascent updates. The Q-network is updated using an on-policy update as in SARSA, where the policy is controlled by the policy network (rather than being ε-greedy). The Q-network, however, does not directly decide the actions as in Q-learning, because the policy decisions are outside its control (beyond its role as a critic). Therefore, the policy network is the actor, and the value network is the critic. To distinguish between the policy network and the Q-network, we denote the parameter vector of the policy network by \(\overline{\varTheta }\), and that of the Q-network by \(\overline{W}\).
 1.
Sample the action a_{t+1} using the current state of the parameters in the policy network. Note that the current state is s_{t+1} because the action a_{t} is already observed.
 2. Let \(F(\overline{X}_{t},\overline{W},a_{t}) =\hat{Q}^{p}(s_{t},a_{t})\) represent the estimated value of Q^{p}(s_{t}, a_{t}) by the Q-network using the observed representation \(\overline{X}_{t}\) of the states and parameters \(\overline{W}\). Estimate Q^{p}(s_{t}, a_{t}) and Q^{p}(s_{t+1}, a_{t+1}) using the Q-network. Compute the TD-error δ_{t} as follows:$$\displaystyle\begin{array}{rcl} \delta _{t}& =& r_{t} +\gamma \hat{Q}^{p}(s_{t+1},a_{t+1}) -\hat{Q}^{p}(s_{t},a_{t}) {}\\ & =& r_{t} +\gamma F(\overline{X}_{t+1},\overline{W},a_{t+1}) - F(\overline{X}_{t},\overline{W},a_{t}) {}\\ \end{array}$$
 3. [Update policy network parameters]: Let \(P(\overline{X}_{t},\overline{\varTheta },a_{t})\) be the probability of the action a_{t} predicted by the policy network. Update the parameters of the policy network as follows:$$\displaystyle{ \overline{\varTheta } \leftarrow \overline{\varTheta } +\alpha \hat{Q}^{p}(s_{t},a_{t})\nabla _{\varTheta }\mbox{ log}(P(\overline{X}_{t},\overline{\varTheta },a_{t})) }$$Here, α is the learning rate for the policy network, and the value of \(\hat{Q}^{p}(s_{t},a_{t}) = F(\overline{X}_{t},\overline{W},a_{t})\) is obtained from the Q-network.
 4. [Update Q-network parameters]: Update the Q-network parameters as follows:$$\displaystyle{ \overline{W} \leftarrow \overline{W} +\beta \delta _{t}\nabla _{W}F(\overline{X}_{t},\overline{W},a_{t}) }$$Here, β is the learning rate for the Q-network. A caveat is that the learning rate of the Q-network is generally higher than that of the policy network.
The action a_{t+1} is then executed in order to observe state s_{t+2}, and the value of t is incremented. The next iteration of the approach is executed (by repeating the above steps) at this incremented value of t, and the iterations are repeated until convergence. Note that the estimate \(\hat{Q}^{p}(s_{t},a_{t})\) can equivalently be obtained as \(r_{t} +\gamma \hat{V}^{p}(s_{t+1})\) when a network that estimates state values is used as the critic.
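The four steps above can be sketched with tabular stand-ins for the two networks; `theta[s]` holds the action logits of the policy network and `W[(s, a)]` holds the critic's Q-estimates (the gradient of a table entry with respect to itself is 1, which replaces the backpropagated gradients of the real networks):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(theta, W, s, a, r, s2, a2, alpha=0.05, beta=0.1, gamma=0.9):
    """One iteration of the coupled actor-critic updates (tabular sketch)."""
    # Step 2: TD error from the critic.
    delta = r + gamma * W[(s2, a2)] - W[(s, a)]
    # Step 3: actor update, theta <- theta + alpha * Q_hat * grad log P(a | s).
    p = softmax(theta[s])
    q_hat = W[(s, a)]
    theta[s] = [z + alpha * q_hat * ((1.0 if k == a else 0.0) - p[k])
                for k, z in enumerate(theta[s])]
    # Step 4: critic update, W <- W + beta * delta * grad F.
    W[(s, a)] += beta * delta
    return delta
```

The actor moves its logits in the direction of actions the critic currently values, while the critic chases the SARSA-style TD target.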
9.5.5 Continuous Action Spaces
The methods discussed to this point are all associated with discrete action spaces. For example, in a video game, one might have a discrete set of choices, such as whether to move the cursor up, down, left, or right. However, in a robotics application, one might have continuous action spaces, in which we wish to move a robot's arm a certain distance. One possibility is to discretize the action into a set of fine-grained intervals, and use the midpoint of each interval as its representative value. One can then treat the problem as one of discrete choice. However, this is not a particularly satisfying design choice. First, the ordering among the different choices is lost by treating inherently ordered (numerical) values as categorical values. Second, it blows up the space of possible actions, especially if the action space is multidimensional (e.g., separate dimensions for the distances moved by the robot's arm and leg). Such an approach can cause overfitting, and it greatly increases the amount of data required for learning.
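A common alternative to discretization, assumed here as an illustration, is to have the network output the parameters of a continuous distribution (e.g., the mean and standard deviation of a Gaussian) and to sample the action from it:

```python
import math, random

def sample_continuous_action(mean, std):
    """Sample a continuous action (e.g., a distance to move a robot arm)
    from a Gaussian whose mean and std are produced by the policy network."""
    return random.gauss(mean, std)

def gaussian_log_prob(action, mean, std):
    # Log-density of the Gaussian; its gradient w.r.t. the mean,
    # (action - mean) / std**2, plugs into the same policy-gradient
    # update used in the discrete softmax case.
    return -0.5 * ((action - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))
```

This preserves the numerical ordering of actions and keeps the parameter count independent of any discretization granularity.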
9.5.6 Advantages and Disadvantages of Policy Gradients
Policy gradient methods represent the most natural choice in applications like robotics that have continuous sequences of states and actions. In cases with multidimensional and continuous action spaces, the number of possible combinations of actions can be very large. Since Q-learning methods require the computation of the maximum Q-value over all such actions, this step can turn out to be computationally intractable. Furthermore, policy gradient methods tend to be stable and have good convergence properties. However, policy gradient methods are susceptible to local minima. While Q-learning methods are less stable in terms of convergence behavior than policy-gradient methods, and can sometimes oscillate around particular solutions, they have a better capacity to reach near-global optima.
Policy-gradient methods possess one additional advantage: they can learn stochastic policies, which leads to better performance in settings where deterministic policies are known to be suboptimal (such as guessing games), because a deterministic policy can be exploited by the opponent. Q-learning provides deterministic policies, and so policy gradients are preferable in these settings, because they provide a probability distribution over the possible actions from which the action is sampled.
9.6 Monte Carlo Tree Search
Monte Carlo tree search is a way of improving the strengths of learned policies and values at inference time by combining them with lookahead-based exploration. This improvement also provides a basis for lookahead-based bootstrapping, as in temporal difference learning. It is also leveraged as a probabilistic alternative to the deterministic minimax trees used by conventional game-playing software (although its applicability is not restricted to games). Each node in the tree corresponds to a state, and each branch corresponds to a possible action. The tree grows over time during the search as new states are encountered. The goal of the tree search is to select the best branch to recommend as the predicted action of the agent. Each branch is associated with a value based on previous outcomes of tree search from that branch, as well as an upper-bound “bonus” that reduces with increased exploration. This value is used to set the priority of the branches during exploration. The learned goodness of a branch is adjusted after each exploration, so that branches leading to positive outcomes are favored in later explorations.
At any given state, the action a with the largest value of u(s, a) is followed. This approach is applied recursively until following the optimal action does not lead to an existing node. This new state s′ is now added to the tree as a leaf node with initialized values of each N(s′, a) and Q(s′, a) set to 0. Note that the simulation up to a leaf node is fully deterministic, and no randomization is involved because P(s, a) and Q(s, a) are deterministically computable. Monte Carlo simulations are used to estimate the value of the newly added leaf node s′. Specifically, Monte Carlo rollouts from the policy network (e.g., using P(s, a) to sample actions) return either + 1 or − 1, depending on win or loss. In Section 9.7.1, we will discuss some alternatives for leafnode evaluation that use a value network as well. After evaluating the leaf node, the values of Q(s″, a″) and N(s″, a″) on all edges (s″, a″) on the path from the current state s to the leaf s′ are updated. The value of Q(s″, a″) is maintained as the average value of all the evaluations at leaf nodes reached from that branch during the Monte Carlo tree search. After multiple searches have been performed from s, the most visited edge is selected as the relevant one, and is reported as the desired action.
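The selection and backup steps can be sketched as follows; the exact bonus of Equation 9.27 is assumed here to take a PUCT-style form (prior P(s, a) times a term that shrinks as N(s, a) grows), and the function names are hypothetical:

```python
import math

def select_action(Q, N, P, s, actions, K=1.0):
    """Choose the branch maximizing u(s, a): the average value Q(s, a)
    plus an exploration bonus that shrinks as the visit count N(s, a) grows."""
    total = sum(N[(s, b)] for b in actions)
    def u(a):
        return Q[(s, a)] + K * P[(s, a)] * math.sqrt(total) / (1.0 + N[(s, a)])
    return max(actions, key=u)

def backup(path, leaf_value, Q, N):
    """After evaluating a new leaf, update the visit counts and running
    average values on every edge (s'', a'') on the path to the leaf."""
    for edge in path:
        N[edge] += 1
        Q[edge] += (leaf_value - Q[edge]) / N[edge]
```

The incremental-average form of the backup keeps Q(s″, a″) equal to the mean of all leaf evaluations reached through that edge, as described above.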
Use in Bootstrapping
Traditionally, Monte Carlo tree search has been used during inference rather than during training. However, since Monte Carlo tree search provides an improved estimate Q(s, a) of the value of a state-action pair (as a result of lookaheads), it can also be used for bootstrapping (Intuition 9.4.1). Monte Carlo tree search provides an excellent alternative to n-step temporal-difference methods. One problem with on-policy n-step temporal-difference methods is that they explore a single sequence of n moves with the ε-greedy policy, and therefore tend to be too weak (with increased depth but not width of exploration). One way to strengthen them is to examine all possible n-sequences and use the optimal one with an off-policy technique (i.e., generalizing Bellman's 1-step approach). In fact, this was the approach used in Samuel's checkers program [421], which used the best option in the minimax tree for bootstrapping (an approach later referred to as TD-Leaf [22]). However, this results in the increased complexity of exploring all possible n-sequences. Monte Carlo tree search can provide a robust alternative for bootstrapping, because it can explore multiple branches from a node to generate averaged target values. For example, the lookahead-based ground truth can use the averaged performance over all the explorations starting at a given node.
AlphaGo Zero [447] bootstraps policies rather than state values, which is extremely rare. AlphaGo Zero uses the relative visit probabilities of the branches at each node as posterior probabilities of the actions at that state. These posterior probabilities are improvements over the probabilistic outputs of the policy network, by virtue of the fact that the visit decisions use knowledge about the future (i.e., evaluations at deeper nodes of the Monte Carlo tree). The posterior probabilities are therefore bootstrapped as ground-truth values with respect to the policy network probabilities and used to update the weight parameters (cf. Section 9.7.1.1).
9.7 Case Studies
In the following, we present case studies from real domains to showcase different reinforcement learning settings. We will present examples of reinforcement learning in Go, robotics, conversational systems, self-driving cars, and neural-network hyperparameter learning.
9.7.1 AlphaGo: Championship Level Play at Go
Go is a two-person board game like chess. The complexity of a two-person board game largely depends on the size of the board and the number of valid moves at each position. The simplest example of a board game is tic-tac-toe with a 3 × 3 board, and most humans can solve it optimally without the need for a computer. Chess is a significantly more complex game with an 8 × 8 board, although clever variations of the brute-force approach of selectively exploring the minimax tree of moves up to a certain depth can perform significantly better than the best human today. Go occurs at the extreme end of complexity because of its 19 × 19 board.
Whereas one can make about 35 possible moves (i.e., tree branch factor) in a particular position in chess, the average number of possible moves at a particular position in Go is 250, which is almost an order of magnitude larger. Furthermore, the average number of sequential moves (i.e., tree depth) of a game of Go is about 150, which is around twice as large as in chess. All these aspects make Go a much harder candidate from the perspective of automated game playing. The typical strategy of chess-playing software is to construct a minimax tree with all combinations of moves the players can make up to a certain depth, and then evaluate the final board positions with chess-specific heuristics (such as the amount of remaining material and the safety of various pieces). Suboptimal parts of the tree are pruned in a heuristic manner. This approach is simply an improved version of a brute-force strategy in which all possible positions are explored up to a given depth. The number of nodes in the minimax tree of Go is larger than the number of atoms in the observable universe, even at modest depths of analysis (20 moves for each player). As a result of the importance of spatial intuition in these settings, humans have historically performed better than brute-force strategies at Go. The use of reinforcement learning in Go is much closer to what humans attempt to do. We rarely try to explore all possible combinations of moves; rather, we visually learn patterns on the board that are predictive of advantageous positions, and try to make moves in directions that are expected to improve our advantage.
The automated learning of spatial patterns that are predictive of good performance is achieved with a convolutional neural network. The state of the system is encoded in the board position at a particular point, although the board representation in AlphaGo includes some additional features about the status of junctions or the number of moves since a stone was played. Multiple such spatial maps are required in order to provide full knowledge of the state. For example, one feature map would represent the status of each intersection, another would encode the number of turns since a stone was played, and so on. Integer feature maps were encoded into multiple one-hot planes. Altogether, the game board could be represented using 48 binary planes of 19 × 19 pixels.
AlphaGo uses its win-loss experience with repeated game playing (both using the moves of expert players and with games played against itself) to learn good policies for moves in various positions with a policy network. Furthermore, the evaluation of each position on the Go board is achieved with a value network. Subsequently, Monte Carlo tree search is used for final inference. Therefore, AlphaGo is a multi-stage model, whose components are discussed in the following sections.
Policy Networks
The policy network takes as its input the aforementioned visual representation of the board, and outputs the probability of action a in state s. This output probability is denoted by p(s, a). Note that the actions in the game of Go correspond to placing a stone at each legal position on the board. Therefore, the output layer uses the softmax activation. Two separate policy networks were trained using different approaches. The two networks were identical in structure, containing 13 layers of convolutions with ReLU nonlinearities. Most of the convolutional layers use 3 × 3 filters, except for the first and final layers, which use 5 × 5 and 1 × 1 filters, respectively. The convolutional layers were zero-padded to maintain their size, and 192 filters were used. No max-pooling was used, in order to maintain the spatial footprint.

Supervised learning: Randomly chosen samples from expert players were used as training data. The input was the state of the board, while the output was the action performed by the expert player. The score (advantage) of such a move was always +1, because the goal was to train the network to imitate expert moves; this is also referred to as imitation learning. Therefore, the neural network was backpropagated with the log-likelihood of the probability of the chosen move as its gain. This network is referred to as the SL-policy network. It is noteworthy that these supervised forms of imitation learning are quite common in reinforcement learning for avoiding cold-start problems. However, subsequent work [446] showed that dispensing with this part of the learning was a better option.

Reinforcement learning: In this case, reinforcement learning was used to train the network. One issue is that Go requires two players, and therefore the network was played against itself in order to generate moves. The current network was always played against a randomly chosen network from a few iterations back, so that the reinforcement learning could have a pool of randomized opponents. The game was played until the very end, and then an advantage of +1 or −1 was associated with each move, depending on win or loss. This data was then used to train the policy network, which is referred to as the RL-policy network.
Note that these networks were already quite formidable Go players compared to state-of-the-art software, and they were combined with Monte Carlo tree search to strengthen them further.
Value Networks
This network is also a convolutional neural network, which uses the state of the board as input and the predicted score in [−1, +1] as output, where +1 indicates a win probability of 1. The output is the predicted score of the next player, whether white or black, and therefore the input also encodes the “color” of the pieces in terms of “player” or “opponent” rather than white or black. The architecture of the value network was very similar to that of the policy network, except for some differences in the input and output. The input contained an additional feature corresponding to whether the next player to play was white or black. The score was computed using a single tanh unit at the end, and therefore the value lies in the range [−1, +1]. The early convolutional layers of the value network are the same as those in the policy network, although an additional convolutional layer is added in layer 12. A fully connected layer with 256 units and ReLU activation follows the final convolutional layer. In order to train the network, one possibility is to use positions from a data set [606] of Go games. However, the preferred choice was to generate the data set using self-play with the SL-policy and RL-policy networks all the way to the end, so that the final outcomes were generated. The state-outcome pairs were used to train the convolutional neural network. Since the positions in a single game are correlated, using them sequentially in training causes overfitting. It was important to sample positions from different games in order to prevent overfitting caused by closely related training examples. Therefore, each training example was obtained from a distinct game of self-play.
Monte Carlo Tree Search
A simplified variant of Equation 9.27 was used for exploration, which is equivalent to setting \(K = 1/\sqrt{\sum _{b } N(s, b)}\) at each node s. Section 9.6 described a version of the Monte Carlo tree search method in which only the RL-policy network is used for evaluating leaf nodes. In the case of AlphaGo, two approaches are combined. First, fast Monte Carlo rollouts were used from the leaf node to create an evaluation e_{1}. While it is possible to use the policy network for rollouts, AlphaGo trained a simplified softmax classifier with a database of human games and some handcrafted features for faster rollouts. Second, the value network created a separate evaluation e_{2} of the leaf nodes. The final evaluation e is a convex combination of the two evaluations, e = βe_{1} + (1 − β)e_{2}. The value of β = 0.5 provided the best performance, although using only the value network also provided closely matching performance (and a viable alternative). The most visited branch in the Monte Carlo tree search was reported as the predicted move.
9.7.1.1 Alpha Zero: Enhancements to Zero Human Knowledge
A later enhancement of the idea, referred to as AlphaGo Zero [446], removed the need for human expert moves (or an SL-network). Instead of separate policy and value networks, a single network outputs both the policy (i.e., action probabilities) p(s, a) and the value v(s) of the position. The cross-entropy loss on the output policy probabilities and the squared loss on the value output were added to create a single loss. Whereas the original version of AlphaGo used Monte Carlo tree search only for inference from the trained networks, the zero-knowledge versions also use the visit counts in Monte Carlo tree search for training. One can view the visit count of each branch in tree search as a policy improvement operator over p(s, a), by virtue of its lookahead-based exploration. This provides a basis for creating bootstrapped ground-truth values (Intuition 9.4.1) for neural network learning. While temporal difference learning bootstraps state values, this approach bootstraps visit counts for learning policies. The predicted probability of Monte Carlo tree search for action a in board state s is π(s, a) ∝ N(s, a)^{1∕τ}, where τ is a temperature parameter. The value of N(s, a) is computed using a Monte Carlo search algorithm similar to the one used for AlphaGo, where the prior probabilities p(s, a) output by the neural network are used for computing Equation 9.27. The value of Q(s, a) in Equation 9.27 is set to the average value output v(s′) from the neural network over the newly created leaf nodes s′ reached from state s.
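The bootstrapped policy target π(s, a) ∝ N(s, a)^{1∕τ} is simple to compute from the visit counts (the function name is hypothetical):

```python
def visit_count_policy(visit_counts, tau=1.0):
    """Bootstrapped policy target pi(s, a) proportional to N(s, a)^(1/tau);
    a small temperature tau sharpens the distribution toward the
    most visited branch."""
    powered = [n ** (1.0 / tau) for n in visit_counts]
    total = sum(powered)
    return [x / total for x in powered]
```

For example, visit counts [1, 3] yield targets [0.25, 0.75] at τ = 1, and a sharper [0.1, 0.9] at τ = 0.5.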
Further advancements were proposed in the form of Alpha Zero [447], which could play multiple games, such as Go, shogi, and chess. Alpha Zero has handily defeated the best chess-playing software, Stockfish, and has also defeated the best shogi software (Elmo). The victory in chess was particularly unexpected by most top players, because it was always assumed that chess required too much domain knowledge for a reinforcement learning system to win over a system with handcrafted evaluations.
Comments on Performance
AlphaGo has shown extraordinary performance against a variety of computer and human opponents. Against a variety of computer opponents, it won 494 out of 495 games [445]. Even when AlphaGo was handicapped by providing four free stones to the opponent, it won 77%, 86%, and 99% of the games played against (the software programs named) Crazy Stone, Zen, and Pachi, respectively. It also defeated notable human opponents, such as the European champion, the World champion, and the top-ranked player.
A more notable aspect of its performance was the way in which it achieved its victories. In several of its games, AlphaGo made many unconventional and brilliantly unorthodox moves, which would sometimes make sense only in hindsight after the victory of the program [607, 608]. There were cases in which the moves made by AlphaGo were contrary to conventional wisdom, but eventually revealed innovative insights acquired by AlphaGo during self-play. After this match, some top Go players reconsidered their approach to the entire game.
The performance of Alpha Zero in chess was similar, where it often made material sacrifices in order to incrementally improve its position and constrict its opponent. This type of behavior is a hallmark of human play and is very different from conventional chess software (which is already much better than humans). Unlike handcrafted evaluations, it seemed to have no preconceived notions on the material values of pieces, or on when a king was safe in the center of the board. Furthermore, it discovered most well-known chess openings on its own using self-play, and seemed to have its own opinions on which ones were “better.” In other words, it had the ability to discover knowledge on its own. A key difference of reinforcement learning from supervised learning is that it has the ability to innovate beyond known knowledge through learning by reward-guided trial and error. This behavior holds promise for other applications as well.
9.7.2 Self-Learning Robots
Self-learning robots represent an important frontier in artificial intelligence, in which robots can be trained to perform various tasks such as locomotion, mechanical repairs, or object retrieval by using a reward-driven approach. For example, consider the case in which one has constructed a robot that is physically capable of locomotion (in terms of how it is constructed and the movement choices available to it), but it has to learn the precise choice of movements in order to keep itself balanced and move from point A to point B. As bipedal humans, we are able to walk and keep our balance naturally without even thinking about it, but this is not a simple matter for a bipedal robot in which an incorrect choice of joint movement could easily cause it to topple over. The problem becomes even more difficult when uncertain terrain and obstacles are placed in the way of a robot.
This type of problem is naturally suited to reinforcement learning, because it is easy to judge whether a robot is walking correctly, but it is hard to specify precise rules about what the robot should do in every possible situation. In the reward-driven approach of reinforcement learning, the robot is given (virtual) rewards every time it makes progress in locomotion from point A to point B. Otherwise, the robot is free to take any actions, and it is not pretrained with knowledge about the specific choice of actions that would help keep it balanced and walk. In other words, the robot is not seeded with any knowledge of what walking looks like (beyond the fact that it will be rewarded for using its available actions for making progress from point A to point B). This is a classical example of reinforcement learning, because the robot now needs to learn the specific sequence of actions to take in order to earn the goal-driven rewards. Although we use locomotion as a specific example in this case, this general principle applies to any type of learning in robots. For example, a second problem is that of teaching a robot manipulation tasks such as grasping an object or screwing the cap on a bottle. In the following, we will provide a brief discussion of both cases.
9.7.2.1 Deep Learning of Locomotion Skills
The approach in [433] trained simulated robot models for locomotion. The humanoid model has 33 state dimensions and 10 actuated degrees of freedom, while the quadruped model has 29 state dimensions and 8 actuated degrees of freedom. Models were rewarded for forward progress, although episodes were terminated when the center of mass of the robot fell below a certain point. The actions of the robot were controlled by joint torques. A number of features were available to the robot, such as sensors providing the positions of obstacles, the joint positions, angles, and so on. These features were fed into the neural networks. Two neural networks were used; one was used for value estimation, and the other was used for policy estimation. Therefore, a policy gradient method was used in which the value network played the role of estimating the advantage. Such an approach is an instantiation of an actor-critic method.
A feedforward neural network was used with three hidden layers, with 100, 50, and 25 tanh units, respectively. The approach in [433] requires the estimation of both a policy function and a value function, and the same architecture was used in both cases for the hidden layers. However, the value estimator required only one output, whereas the policy estimator required as many outputs as the number of actions. Therefore, the main difference between the two architectures was in terms of how the output layer and the loss function were designed. The generalized advantage estimator (GAE) was used in combination with trust region policy optimization (TRPO). The bibliographic notes contain pointers to specific details of these methods. On training the neural network for 1000 iterations with reinforcement learning, the robot learned to walk with a visually pleasing gait. A video of the final results of the robot walking is available at [610]. Similar results were also later released by Google DeepMind with more extensive abilities of avoiding obstacles or other challenges [187].
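The advantage estimation used by the critic can be sketched as follows, assuming a finite rollout and treating the value estimates as given (the function name is hypothetical; the real system computes these quantities with learned networks):

```python
import numpy as np

def generalized_advantage_estimates(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages A_t = sum_l (gamma * lam)^l * delta_{t+l},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    `values` contains one extra entry for the state after the last reward."""
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):     # accumulate discounted deltas backwards
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting λ = 0 recovers one-step temporal-difference advantages, while λ = 1 recovers Monte Carlo advantage estimates.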
9.7.2.2 Deep Learning of Visuomotor Skills
A natural approach is to use a convolutional neural network for mapping image frames to actions. As in the case of Atari games, spatial features need to be learned in the layers of the convolutional neural network that are suitable for earning the relevant rewards in a task-sensitive manner. The convolutional neural network had 7 layers and 92,000 parameters. The first three layers were convolutional layers, the fourth layer was a spatial softmax, and the fifth layer was a fixed transformation from spatial feature maps to a concise set of two coordinates. The idea was to apply a softmax function to the responses across the spatial feature map. This provides a probability for each position in the feature map. The expected position under this probability distribution provides the 2-dimensional coordinate, which is referred to as a feature point. Note that each spatial feature map in the convolution layer creates a feature point. The feature point can be viewed as a kind of soft argmax over the spatial probability distribution. The fifth layer was quite different from what one normally sees in a convolutional neural network, and was designed to create a precise representation of the visual scene that was suitable for feedback control. The spatial feature points are concatenated with the robot’s configuration, which is an additional input occurring only after the convolution layers. This concatenated feature set is fed into two fully connected layers, each with 40 rectified units, followed by linear connections to the torques. Note that only the observations corresponding to the camera were fed to the first layer of the convolutional neural network, and the observations corresponding to the robot state were fed to the first fully connected layer. This is because the convolutional layers cannot make much use of the robot states, and it makes sense to concatenate the state-centric inputs after the visual inputs have been processed by the convolutional layers.
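The spatial softmax and feature-point computation can be sketched as follows (a minimal NumPy version; in the actual system this is a differentiable layer inside the network, and the function name is hypothetical):

```python
import numpy as np

def spatial_softmax_feature_points(feature_maps):
    """Map each spatial feature map (shape C x H x W) to a 2-D feature
    point: the expected (x, y) position under a softmax over all H * W
    responses -- a soft argmax over the spatial probability distribution."""
    C, H, W = feature_maps.shape
    points = np.zeros((C, 2))
    ys, xs = np.mgrid[0:H, 0:W]
    for c in range(C):
        logits = feature_maps[c] - feature_maps[c].max()  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum()                              # spatial probability map
        points[c] = [(probs * xs).sum(), (probs * ys).sum()]
    return points
```

A sharp peak in a feature map yields a feature point at (approximately) the peak's coordinates, while a diffuse map yields a point near the map's center of mass.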
The entire network contained about 92,000 parameters, of which 86,000 were in the convolutional layers. The architecture of the convolutional neural network is shown in Figure 9.9(b). The observations consist of the RGB camera image, joint encoder readings, velocities, and end-effector pose.
The full robot states contained between 14 and 32 dimensions, such as the joint angles, end-effector pose, object positions, and their velocities. This provided a practical notion of a state. As in all policy-based methods, the outputs correspond to the various actions (motor torques). One interesting aspect of the approach discussed in [286] is that it transforms the reinforcement learning problem into supervised learning. A guided policy search method was used, which is not discussed in this chapter. This approach converts portions of the reinforcement learning problem into supervised learning. Interested readers are referred to [286], where a video of the performance of the robot (trained using this system) may also be found.
9.7.3 Building Conversational Systems: Deep Learning for Chatbots
Chatbots are also referred to as conversational systems or dialog systems. The ultimate goal of a chatbot is to build an agent that can freely converse with a human about a variety of topics in a natural way. We are very far from achieving this goal. However, significant progress has been made in building chatbots for specific domains and particular applications (e.g., negotiation or shopping assistant). An example of a relatively general-purpose system is Apple’s Siri, which is a digital personal assistant. One can view Siri as an open-domain system, because it is possible to have conversations with it about a wide variety of topics. It is reasonably clear to anyone using Siri that the assistant is sometimes unable to provide satisfactory responses to difficult questions, and in some cases hilarious responses to common questions are hardcoded. This is, of course, natural because the system is relatively general-purpose, and we are nowhere close to building a human-level conversational system. In contrast, closed-domain systems have a specific task in mind, and can therefore be more easily trained in a reliable way.
In the following, we will describe a system built by Facebook for end-to-end learning of negotiation skills [290]. This is a closed-domain system because it is designed for the particular purpose of negotiation. As a testbed, the following negotiation task was used. Two agents are shown a collection of items of different types (e.g., two books, one hat, three balls). The agents are instructed to divide these items among themselves by negotiating a split of the items. A key point is that the value of each of the types of items is different for the two agents, but they are not aware of the value of the items for each other. This is often the case in real-life negotiations, where users attempt to reach a mutually satisfactory outcome by negotiating for items of value to them.
The values of the items are always assumed to be nonnegative and generated randomly in the testbed under some constraints. First, the total value of all items for a user is 10. Second, each item has nonzero value to at least one user so that it makes little sense to ignore an item. Last, some items have nonzero values to both users. Because of these constraints, it is impossible for both users to achieve the maximum score of 10, which ensures a competitive negotiation process. After 10 turns, the agents are allowed the option to complete the negotiation with no agreement, which has a value of 0 points for both users. The three item types of books, hats, and balls were used, and a total of between 5 and 7 items existed in the pool. The fact that the values of the items are different for the two users (without knowledge about each other’s assigned values) is significant; if both negotiators are capable, they will be able to achieve a total value larger than 10 for the items between them. Nevertheless, the better negotiator will be able to capture the larger share by optimally negotiating for items with a high value to them.
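A minimal sketch of a value generator satisfying these constraints might look as follows (integer values are assumed, both helper names are hypothetical, and the actual testbed's sampling procedure may differ):

```python
import random

def random_item_values(num_items, total=10, rng=random):
    """Sample one agent's nonnegative integer item values summing to `total`
    (a uniform weak composition via the stars-and-bars construction)."""
    cuts = sorted(rng.sample(range(total + num_items - 1), num_items - 1))
    vals, prev = [], -1
    for c in cuts + [total + num_items - 1]:
        vals.append(c - prev - 1)
        prev = c
    return vals

def sample_scenario(num_items=3, rng=random):
    """Rejection-sample values for two agents: each agent's values sum to 10,
    every item is worth something to at least one agent, and at least one
    item is worth something to both agents."""
    while True:
        a = random_item_values(num_items, rng=rng)
        b = random_item_values(num_items, rng=rng)
        if all(x + y > 0 for x, y in zip(a, b)) and \
           any(x > 0 and y > 0 for x, y in zip(a, b)):
            return a, b
```

Rejection sampling is used here purely for simplicity; with three item types the constraints are satisfied after only a handful of attempts on average.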
The reward function for this reinforcement learning setting is the final value of the items attained by the user. One can use supervised learning on previous dialogs in order to maximize the likelihood of utterances. A straightforward use of recurrent networks to maximize the likelihood of utterances resulted in agents that were too eager to compromise. Therefore, the approach combined supervised learning with reinforcement learning. The incorporation of supervised learning within the reinforcement learning helps in ensuring that the models do not diverge from human language. A form of planning for dialogs called dialog rollout was introduced. The approach uses an encoder-decoder recurrent architecture, in which the decoder maximizes the reward function rather than the likelihood of utterances. This encoder-decoder architecture is based on sequence-to-sequence learning, as discussed in Section 7.7.2 of Chapter 7.
To facilitate supervised learning, dialogs were collected from Amazon Mechanical Turk. A total of 5808 dialogs were collected in 2236 unique scenarios, where a scenario is defined by the assignment of a particular set of values to the items. Of these cases, 252 scenarios corresponding to 526 dialogs were held out. Each scenario results in two training examples, which are derived from the perspective of each agent. A concrete training example could be one in which the items to be divided among the two agents correspond to 3 books, 2 hats, and 1 ball. These are part of the input to each agent. The second input could be the value of each item to the agent, which is (i) Agent A: book:1, hat:3, ball:1, and (ii) Agent B: book:2, hat:1, ball:2. Note that this means that agent A should secretly try to get as many hats as possible in the negotiation, whereas agent B should focus on books and balls. An example of a dialog in the training data is given below [290]:
Agent A: I want the books and the hats, you get the ball.
Agent B: Give me a book too and we have a deal.
Agent A: Ok, deal.
Agent B: 〈choose〉
The final output for agent A is 2 books and 2 hats, whereas the final output for agent B is 1 book and 1 ball. Therefore, each agent has her own set of inputs and outputs, and the dialogs for each agent are also viewed from their own perspective in terms of the portions that are reads and the portions that are writes. Therefore, each scenario generates two training examples and the same recurrent network is shared for generating the writes and the final output of each agent. The dialog x is a list of tokens x_{0}…x_{T}, containing the turns of each agent interleaved with symbols marking whether the turn was written by an agent or their partner. A special token at the end indicates that one agent has marked that an agreement has been reached.
The supervised learning procedure uses four different gated recurrent units (GRUs). The first gated recurrent unit GRU_{g} encodes the input goals, the second gated recurrent unit GRU_{q} generates the terms in the dialog, a forward-output gated recurrent unit \( GRU_{\overrightarrow{O}} \), and a backward-output gated recurrent unit \( GRU_{\overleftarrow{O}} \). The output is essentially produced by a bidirectional GRU. These GRUs are hooked up in end-to-end fashion. In the supervised learning approach, the parameters are trained using the inputs, dialogs, and outputs available from the training data. The loss for the supervised model was a weighted sum of the token-prediction loss of the dialog and the output-choice prediction loss of the items.
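The weighted-sum loss can be sketched as follows, treating the per-token and per-output log-probabilities as given (the weight α and the function name are hypothetical; the paper's exact weighting may differ):

```python
import numpy as np

def supervised_loss(token_logps, output_logps, alpha=0.5):
    """Weighted sum of the token-prediction loss over the dialog and the
    output-choice prediction loss over the items. Both arguments are
    log-probabilities assigned by the model to the ground-truth targets."""
    token_loss = -np.mean(token_logps)    # negative log-likelihood of tokens
    output_loss = -np.mean(output_logps)  # negative log-likelihood of outputs
    return token_loss + alpha * output_loss
```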
However, for reinforcement learning, dialog rollouts are used. Note that the group of GRUs in the supervised model is, in essence, providing probabilistic outputs. Therefore, one can adapt the same model to work for reinforcement learning by simply changing the loss function. In other words, the GRU combination can be considered a type of policy network. One can use this policy network to generate Monte Carlo rollouts of various dialogs and their final rewards. Each of the sampled actions becomes a part of the training data, and the action is associated with the final reward of the rollout. In other words, the approach uses self-play in which the agent negotiates with itself to learn better strategies. The final reward achieved by a rollout is used to update the policy network parameters. This reward is computed based on the value of the items negotiated at the end of the dialog. This approach can be viewed as an instance of the REINFORCE algorithm [533]. One issue with self-play is that the agents tend to learn their own language, which deviates from natural human language when both sides use reinforcement learning. Therefore, one of the agents is constrained to be a supervised model.
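The resulting REINFORCE update can be sketched as follows, treating the gradients of the log-probabilities of the sampled actions as given (a simplified sketch; in practice these gradients are obtained by backpropagation through the policy network, and the function name is hypothetical):

```python
import numpy as np

def reinforce_gradient(log_prob_grads, reward, baseline=0.0):
    """REINFORCE: the policy-gradient estimate for one rollout is
    (reward - baseline) * sum_t grad log pi(a_t | s_t) over the
    sampled actions in the dialog."""
    grads = np.asarray(log_prob_grads, dtype=float)
    return (reward - baseline) * grads.sum(axis=0)
```

Subtracting a baseline (e.g., the average reward of recent rollouts) leaves the estimate unbiased while reducing its variance.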
For the final prediction, one possibility is to directly sample from the probabilities output by the GRU. However, such an approach is often not optimal when working with recurrent networks. Therefore, a two-stage approach is used. First, c candidate utterances are created by using sampling. The expected reward of each candidate utterance is computed and the one with the largest expected value is selected. In order to compute the expected reward, the output was scaled by the probability of the dialog because low-probability dialogs were unlikely to be selected by either agent.
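This two-stage selection can be sketched as follows (the function and argument names are hypothetical; `expected_reward` and `probability` stand for the rollout-based estimates described above):

```python
def select_utterance(candidates, expected_reward, probability):
    """Two-stage decoding: among c sampled candidate utterances, pick the
    one maximizing expected reward scaled by the dialog's probability,
    discounting low-probability dialogs unlikely to be chosen by either agent."""
    return max(candidates, key=lambda u: expected_reward(u) * probability(u))
```

For example, a candidate with a high reward but a negligible probability of actually being agreed to loses to a moderately rewarded but likely candidate.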
A number of interesting observations were made in [290] about the performance of the approach. First, the supervised learning methods often tended to give up easily, whereas the reinforcement learning methods were more persistent in attempting to obtain a good deal. Second, the reinforcement learning method would often exhibit humanlike negotiation tactics. In some cases, it feigned interest in an item that was not really of much value in order to obtain a better deal for another item.
9.7.4 Self-Driving Cars
As in the case of the robot locomotion task, the car is rewarded for progressing from point A to point B without causing accidents or other undesirable road incidents. The car is equipped with various types of video, audio, proximity, and motion sensors in order to record observations. The objective of the reinforcement learning system is for the car to go from point A to point B safely irrespective of road conditions.
Driving is a task for which it is hard to specify the proper rules of action in every situation; on the other hand, it is relatively easy to judge when one is driving correctly. This is precisely the setting that is well suited to reinforcement learning. Although a fully self-driving car would have a vast array of components corresponding to inputs and sensors of various types, we focus on a simplified setting in which a single camera is used [46, 47]. This system is instructive because it shows that even a single front-facing camera is sufficient to accomplish quite a lot when paired with reinforcement learning. Interestingly, this work was inspired by the 1989 work of Pomerleau [381], who built the Autonomous Land Vehicle in a Neural Network (ALVINN) system, and the main difference from that work, done over 25 years earlier, was the increase in available data and computational power. In addition, the work uses some advances in convolutional neural networks for modeling. Therefore, this work showcases the great importance of increased data and computational power in building reinforcement learning systems.
Scenarios involving imitation learning are often similar to those involving reinforcement learning. It is relatively easy to formulate a reinforcement learning setting in this scenario by giving a reward when the car makes progress without human intervention. On the other hand, if the car either does not make progress or requires human intervention, it is penalized. However, this does not seem to be the way in which the self-driving system of [46, 47] is trained. One issue with settings like self-driving cars is that one always has to account for safety issues during training. Although published details on most of the available self-driving cars are limited, it seems that supervised learning has been the method of choice compared to reinforcement learning in this setting. Nevertheless, the differences between using supervised learning and reinforcement learning are not significant in terms of the broader architecture of the neural network that would be useful. A general discussion of reinforcement learning in the context of self-driving cars may be found in [612].
The convolutional neural network architecture is shown in Figure 9.10. The network consists of 9 layers, including a normalization layer, 5 convolutional layers, and 3 fully connected layers. The first three convolutional layers used 5 × 5 filters with a stride of 2, while the final two convolutional layers used non-strided convolution with 3 × 3 filters. These convolutional layers were followed by three fully connected layers. The final output value was a control value, corresponding to the inverse turning radius. The network had 27 million connections and 250,000 parameters. Specific details of how the deep neural network performs the steering are provided in [47].
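The spatial dimensions can be traced through the convolutional layers with a small helper. The sketch below assumes, for illustration, three 5 × 5 stride-2 convolutions followed by two non-strided 3 × 3 convolutions, and a 66 × 200 input image (the input resolution is an assumption here, not a figure stated in this text):

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a valid (no-padding) convolution."""
    return (size - kernel) // stride + 1

def trace_shapes(h, w, layers):
    """Trace (height, width) through a list of (kernel, stride) conv layers."""
    shapes = [(h, w)]
    for k, s in layers:
        h, w = conv_out(h, k, s), conv_out(w, k, s)
        shapes.append((h, w))
    return shapes
```

With those assumptions, the feature maps shrink from 66 × 200 down to 1 × 18 before the fully connected layers.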
The resulting car was tested both in simulation and in actual road conditions. A human driver was always present in the road tests to perform interventions when necessary. On this basis, a measure was computed of the percentage of time that the car ran autonomously without requiring human intervention. It was found that the vehicle was autonomous 98% of the time. A video demonstration of this type of autonomous driving is available in [611]. Some interesting observations were obtained by visualizing the activation maps of the trained convolutional neural network (based on the methodology discussed in Chapter 8). In particular, it was observed that the features were heavily biased towards learning aspects of the image that were important to driving. In the case of unpaved roads, the feature activation maps were able to detect the outlines of the roads. On the other hand, if the car was located in a forest, the feature activation maps were full of noise. Note that this does not happen in a convolutional neural network that is trained on ImageNet because the feature activation maps would typically contain useful characteristics of trees, leaves, and so on. This difference in the two cases is because the convolutional network of the self-driving setting is trained in a goal-driven manner, and it learns to detect features that are relevant to driving. The specific characteristics of the trees in a forest are not relevant to driving.
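One plausible way to compute such an autonomy measure is to charge a fixed time penalty per human intervention; the six-second penalty below is an illustrative assumption, not a figure taken from this text:

```python
def autonomy_percentage(num_interventions, elapsed_seconds, penalty_seconds=6.0):
    """Percentage of time the car drove autonomously, charging a fixed
    time penalty per human intervention (penalty value is an assumption)."""
    return 100.0 * (1.0 - num_interventions * penalty_seconds / elapsed_seconds)
```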
9.7.5 Inferring Neural Architectures with Reinforcement Learning
The reinforcement learning method uses a recurrent network as the controller to decide the parameters of the convolutional network, which is also referred to as the child network [569]. The overall architecture of the recurrent network is illustrated in Figure 9.11. The choice of a recurrent network is motivated by the sequential dependence between different architectural parameters. The softmax classifier is used to predict each output as a token rather than a numerical value. This token is then used as an input into the next layer, which is shown by the dashed lines in Figure 9.11. The generation of the parameter as a token results in a discrete action space, which is generally more common in reinforcement learning as compared to a continuous action space.
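The sequential token generation of the controller can be sketched as follows (a minimal NumPy sketch with random, untrained weights purely to show the sampling mechanics; the function name, dimensions, and embedding scheme are all hypothetical):

```python
import numpy as np

def sample_architecture(vocab_sizes, hidden=16, seed=0):
    """Sample one architecture description token by token. Each step emits
    a softmax distribution over the tokens of one architectural parameter
    (e.g., filter height, stride, number of filters); the sampled token is
    embedded and fed back as the input to the next step."""
    rng = np.random.default_rng(seed)
    h = np.zeros(hidden)           # recurrent state
    x = np.zeros(hidden)           # embedding of the previously sampled token
    tokens = []
    for vocab in vocab_sizes:
        # Random untrained weights for this step; a real controller learns
        # shared weights via REINFORCE on the child network's accuracy.
        W = rng.standard_normal((hidden, hidden)) * 0.1
        U = rng.standard_normal((vocab, hidden)) * 0.1
        h = np.tanh(W @ h + x)
        logits = U @ h
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()       # softmax over this parameter's tokens
        t = int(rng.choice(vocab, p=probs))
        tokens.append(t)
        x = np.zeros(hidden)       # crude one-hot embedding of the token
        x[t % hidden] = 1.0
    return tokens
```

The key point illustrated is the discrete action space: each architectural parameter is a sampled token, and the sample conditions the distribution of every later token.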
The performance of the child network on a validation set drawn from CIFAR-10 is used to generate the reward signal. Note that the child network needs to be trained on the CIFAR-10 data set in order to test its accuracy. Therefore, this process requires a full training procedure of the child network, which is quite expensive. This reward signal is used in conjunction with the REINFORCE algorithm in order to train the parameters of the controller network. Therefore, the controller network is really the policy network in this case, which generates a sequence of interdependent parameters.
A key point concerns the number of layers of the child network (which also determines the number of layers in the recurrent network). This value is not held constant but follows a certain schedule as training progresses. In the early iterations, the number of layers is small, and therefore the learned architecture of the convolutional network is shallow. As training progresses, the number of layers slowly increases. The policy gradient method is not very different from what is discussed earlier in this chapter, except that a recurrent network is trained with the reward signal rather than a feedforward network. Various types of optimizations are also discussed in [569], such as efficient implementations with parallelism and the learning of advanced architectural designs like skip connections.
9.8 Practical Challenges Associated with Safety
Simplifying the design of highly complex learning algorithms with reinforcement learning can sometimes have unexpected effects. Because reinforcement learning systems have greater degrees of freedom than other learning systems, they naturally raise some safety-related concerns. While biological greed is a powerful factor in human intelligence, it is also a source of many undesirable aspects of human behavior. The simplicity that is the greatest strength of reward-driven learning is also its greatest pitfall in biological systems. Simulating such systems therefore results in similar pitfalls from the perspective of artificial intelligence. For example, poorly designed rewards can lead to unforeseen consequences, because of the exploratory way in which the system learns its actions. Reinforcement learning systems can frequently learn unknown “cheats” and “hacks” in imperfectly designed video games, which is a cautionary tale of what might happen in a less-than-perfect real world. Robots can learn that simply pretending to screw caps on bottles earns rewards faster, as long as the human or automated evaluator is fooled by the action. In other words, the design of the reward function is sometimes not a simple matter.
Furthermore, a system might try to earn virtual rewards in an “unethical” way. For example, a cleaning robot might try to earn rewards by first creating messes and then cleaning them [10]. One can imagine even darker scenarios for robot nurses. Interestingly, these types of behaviors are sometimes also exhibited by humans. These undesirable similarities are a direct result of simplifying the learning process in machines by leveraging the simple greed-centric principles with which biological organisms learn. Striving for simplicity results in ceding more control to the machine, which can have unexpected effects. In some cases, there are ethical dilemmas in even designing the reward function. For example, if it becomes inevitable that an accident is going to occur, should a self-driving car save its driver or two pedestrians? Most humans would save themselves in this setting as a matter of reflexive biological instinct; however, it is an entirely different matter to incentivize a learning system to do so. At the same time, it would be hard to convince a human operator to trust a vehicle where her safety is not the first priority for the learning system. Reinforcement learning systems are also susceptible to the ways in which their human operators interact with them and manipulate the effects of their underlying reward function; there have been occasions where a chatbot was taught to make offensive or racist remarks.
Learning systems have a harder time generalizing their experiences to new situations. This problem is referred to as distributional shift. For example, a self-driving car trained in one country might perform poorly in another. Similarly, the exploratory actions in reinforcement learning can sometimes be dangerous. Imagine a robot trying to solder wires in an electronic device, where the wires are surrounded by fragile electronic components. Trying exploratory actions in this setting is fraught with peril. These issues tell us that we cannot build AI systems without regard to safety. Indeed, some organizations like OpenAI [613] have taken the lead in these matters of ensuring safety. Some of these issues are also discussed in [10] with broader frameworks of possible solutions. In many cases, it seems that the human would have to be involved in the loop to some extent in order to ensure safety [424].
9.9 Summary
This chapter studies the problem of reinforcement learning in which agents interact with the environment in a reward-driven manner in order to learn the optimal actions. There are several classes of reinforcement learning methods, of which the Q-learning methods and the policy-driven methods are the most common. Policy-driven methods have become increasingly popular in recent years. Many of these methods are end-to-end systems that integrate deep neural networks to take in sensory inputs and learn policies that optimize rewards. Reinforcement learning algorithms are used in many settings like playing video or other types of games, robotics, and self-driving cars. The ability of these algorithms to learn via experimentation often leads to innovative solutions that are not possible with other forms of learning. Reinforcement learning algorithms also pose unique challenges associated with safety because of the oversimplification of the learning process with reward functions.
9.10 Bibliographic Notes
An excellent overview on reinforcement learning may be found in the book by Sutton and Barto [483]. A number of surveys on reinforcement learning are available at [293]. David Silver’s lectures on reinforcement learning are freely available on YouTube [619]. The method of temporal differences was proposed by Samuel in the context of a checkers program [421] and formalized by Sutton [482]. Q-learning was proposed by Watkins in [519], and a convergence proof is provided in [520]. The SARSA algorithm was introduced in [412]. Early methods for using neural networks in reinforcement learning were proposed in [296, 349, 492, 493, 494]. The work in [492] developed TD-Gammon, which was a backgammon-playing program.
A system that used a convolutional neural network to create a deep Q-learning algorithm with raw pixels was pioneered in [335, 336]. It has been suggested in [335] that the approach presented in the paper can be improved with other well-known ideas such as prioritized sweeping [343]. Asynchronous methods that use multiple agents in order to perform the learning are discussed in [337]. The use of multiple asynchronous threads avoids the problem of correlation within a thread, which improves convergence to higher-quality solutions. This type of asynchronous approach is often used in lieu of the experience replay technique. Furthermore, an n-step technique, which uses a lookahead of n steps (instead of 1 step) to predict the Q-values, was proposed in the same work.
One drawback of Q-learning is that it is known to overestimate the values of actions under certain circumstances. An improvement over Q-learning, referred to as double Q-learning, was proposed in [174]. In the original form of Q-learning, the same values are used to both select and evaluate an action. In the case of double Q-learning, these values are decoupled, and therefore one is now learning two separate values for selection and evaluation. This change tends to make the approach less sensitive to the overestimation problem. The use of prioritized experience replay to improve the performance of reinforcement learning algorithms under sparse data is discussed in [428]. Such an approach significantly improves the performance of the system on Atari games.
In recent years, policy gradients have become more popular than Q-learning methods. An interesting and simplified description of this approach for the Atari game of Pong is provided in [605]. Early methods for using finite-difference methods for policy gradients are discussed in [142, 355]. Likelihood-ratio methods for policy gradients were pioneered by the REINFORCE algorithm [533]. A number of analytical results on this class of algorithms are provided in [484]. Policy gradients have been used for learning in the game of Go [445], although the overall approach combines a number of different elements. Natural policy gradients were proposed in [230]. One such method [432] has been shown to perform well at learning locomotion in robots. The use of generalized advantage estimation (GAE) with continuous rewards is discussed in [433]. The approach in [432, 433] uses natural policy gradients for optimization, and is referred to as trust region policy optimization (TRPO). The basic idea is that bad steps in learning are penalized more severely in reinforcement learning (than in supervised learning) because the quality of the collected data worsens. Therefore, the TRPO method prefers second-order methods with conjugate gradients (see Chapter 3), in which the updates tend to stay within good regions of trust. Surveys are also available on specific types of reinforcement learning methods such as actor-critic methods [162].
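For a softmax policy over discrete actions, the likelihood-ratio (REINFORCE) gradient with respect to the logits has a particularly simple closed form: the reward times (one-hot action minus policy probabilities). A minimal pure-Python sketch under that assumption, with illustrative names:

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_logit_grad(logits, action, reward):
    """Gradient of reward * log pi(action) with respect to the logits
    of a softmax policy: reward * (one_hot(action) - pi)."""
    pi = softmax(logits)
    return [reward * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(pi)]

# Two equally likely actions, action 0 taken, reward +1:
reinforce_logit_grad([0.0, 0.0], 0, 1.0)  # [0.5, -0.5]
```

Note that the components of the gradient always sum to zero, reflecting the shift-invariance of the softmax.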
Monte Carlo tree search was proposed in [246]. Subsequently, it was used in the game of Go [135, 346, 445, 446]. A survey on these methods may be found in [52]. Later versions of AlphaGo dispensed with the supervised portions of learning, adapted to chess and shogi, and performed better with zero initial knowledge [446, 447]. The AlphaGo approach combines several ideas, including the use of policy networks, Monte Carlo tree search, and convolutional neural networks. The use of convolutional neural networks for playing the game of Go has been explored in [73, 307, 481]. Many of these methods use supervised learning in order to mimic human experts at Go. Some TD-learning methods for chess, such as NeuroChess [496], KnightCap [22], and Giraffe [259], have been explored, but were not as successful as conventional engines. The pairing of convolutional neural networks and reinforcement learning for spatial games seems to be a new (and successful) recipe that distinguishes Alpha Zero from these methods. Several methods for training self-learning robots are presented in [286, 432, 433]. An overview of deep reinforcement learning methods for dialog generation is provided in [291]. Conversation models that use only supervised learning with recurrent networks are discussed in [440, 508]. The negotiation chatbot discussed in this chapter is described in [290]. The description of self-driving cars is based on [46, 47]. An MIT course on self-driving cars is available at [612]. Reinforcement learning has also been used to generate structured queries from natural language [563], and for learning neural architectures in various tasks [19, 569].
Reinforcement learning can also improve deep learning models. This is achieved with the notion of attention [338, 540], in which reinforcement learning is used to focus on selective parts of the data. The idea is that large parts of the data are often irrelevant for learning, and learning how to focus on selective portions of the data can significantly improve results. The selection of relevant portions of the data is achieved with reinforcement learning. Attention mechanisms are discussed in Section 10.2 of Chapter 10. In this sense, reinforcement learning is one of the topics in machine learning that is more tightly integrated with deep learning than it might seem at first sight.
9.10.1 Software Resources and Testbeds
Although significant progress has been made in designing reinforcement learning algorithms in recent years, commercial software using these methods is still relatively limited. Nevertheless, numerous software testbeds are available that can be used to test various algorithms. Perhaps the best source of high-quality reinforcement learning baselines is OpenAI [623]. TensorFlow [624] and Keras [625] implementations of reinforcement learning algorithms are also available.
Most frameworks for the testing and development of reinforcement learning algorithms are specialized to specific types of reinforcement learning scenarios. Some frameworks are lightweight, and can be used for quick testing. For example, the ELF framework [498], created by Facebook, is an open-source and lightweight reinforcement learning framework designed for real-time strategy games. The OpenAI Gym [620] provides environments for the development of reinforcement learning algorithms for Atari games and simulated robots. The OpenAI Universe [621] can be used to turn reinforcement learning programs into Gym environments. For example, self-driving car simulations have been added to this environment. An Arcade learning environment for developing agents in the context of Atari games is described in [25]. The MuJoCo simulator [609], whose name stands for Multi-Joint dynamics with Contact, is a physics engine designed for robotics simulations. An application that uses MuJoCo is described in this chapter. ParlAI [622] is an open-source framework for dialog research by Facebook, and is implemented in Python. Baidu has created an open-source platform for its self-driving car project, referred to as Apollo [626].
9.11 Exercises
 1.
The chapter gives a proof of the likelihood ratio trick (cf. Equation 9.24) for the case in which the action a is discrete. Generalize this result to continuous-valued actions.
 2.
Throughout this chapter, a neural network, referred to as the policy network, has been used in order to implement the policy gradient. Discuss the importance of the choice of network architecture in different settings.
 3.
You have two slot machines, each of which has an array of 100 lights. The probability distribution of the reward from playing each machine is an unknown (and possibly machine-specific) function of the pattern of lights that are currently lit up. Playing a slot machine changes its light pattern in some well-defined but unknown way. Discuss why this problem is more difficult than the multi-armed bandit problem. Design a deep learning solution to optimally choose machines in each trial that will maximize the average reward per trial at steady state.
 4.
Consider the well-known game of rock-paper-scissors. Human players often try to use the history of previous moves to guess the next move. Would you use a Q-learning or a policy-based method to learn to play this game? Why? Now consider a situation in which a human player samples one of the three moves with a probability that is an unknown function of the history of 10 previous moves of each side. Propose a deep learning method that is designed to play with such an opponent. Would a well-designed deep learning method have an advantage over this human player? What policy should a human player use to ensure probabilistic parity with a deep learning opponent?
 5.
Consider the game of tic-tac-toe in which a reward drawn from { − 1, 0, +1} is given at the end of the game. Suppose you learn the values of all states (assuming optimal play from both sides). Discuss why states in nonterminal positions will have nonzero value. What does this tell you about credit assignment of intermediate moves to the reward value received at the end?
 6.
Write a Q-learning implementation that learns the value of each state-action pair for a game of tic-tac-toe by repeatedly playing against human opponents. No function approximators are used and therefore the entire table of state-action pairs is learned using Equation 9.5. Assume that you can initialize each Q-value to 0 in the table.
 7.
The two-step TD-error is defined as follows:$$\displaystyle{ \delta _{t}^{(2)} = r_{t} +\gamma r_{t+1} +\gamma ^{2}V (s_{t+2}) - V (s_{t}) }$$
 (a)
Propose a TD-learning algorithm for the two-step case.
 (b)
Propose an on-policy n-step learning algorithm like SARSA. Show that the update is a truncated variant of Equation 9.16 after setting λ = 1. What happens when n = ∞?
 (c)
Propose an off-policy n-step learning algorithm like Q-learning and discuss its advantages/disadvantages with respect to (b).
Bibliography
 [10] D. Amodei et al. Concrete problems in AI safety. arXiv:1606.06565, 2016. https://arxiv.org/abs/1606.06565
 [19] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. arXiv:1611.02167, 2016. https://arxiv.org/abs/1611.02167
 [22] J. Baxter, A. Tridgell, and L. Weaver. KnightCap: a chess program that learns by combining TD(lambda) with game-tree search. arXiv cs/9901002, 1999.
 [25] M. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, pp. 253–279, 2013.
 [26] R. E. Bellman. Dynamic Programming. Princeton University Press, 1957.
 [46] M. Bojarski et al. End to end learning for self-driving cars. arXiv:1604.07316, 2016. https://arxiv.org/abs/1604.07316
 [47] M. Bojarski et al. Explaining How a Deep Neural Network Trained with End-to-End Learning Steers a Car. arXiv:1704.07911, 2017. https://arxiv.org/abs/1704.07911
 [52] C. Browne et al. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1), pp. 1–43, 2012.
 [73] C. Clark and A. Storkey. Training deep convolutional neural networks to play Go. ICML Conference, pp. 1766–1774, 2015.
 [135] S. Gelly et al. The grand challenge of computer Go: Monte Carlo tree search and extensions. Communications of the ACM, 55, pp. 106–113, 2012.
 [142] P. Glynn. Likelihood ratio gradient estimation: an overview. Proceedings of the 1987 Winter Simulation Conference, pp. 366–375, 1987.
 [162] I. Grondman, L. Busoniu, G. A. Lopes, and R. Babuska. A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, 42(6), pp. 1291–1307, 2012.
 [165] X. Guo, S. Singh, H. Lee, R. Lewis, and X. Wang. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. Advances in NIPS Conference, pp. 3338–3346, 2014.
 [174] H. van Hasselt, A. Guez, and D. Silver. Deep Reinforcement Learning with Double Q-Learning. AAAI Conference, 2016.
 [187] N. Heess et al. Emergence of Locomotion Behaviours in Rich Environments. arXiv:1707.02286, 2017. https://arxiv.org/abs/1707.02286 Video 1 at: https://www.youtube.com/watch?v=hx_bgoTF7bs Video 2 at: https://www.youtube.com/watch?v=gn4nRCC9TwQ&feature=youtu.be
 [230] S. Kakade. A natural policy gradient. NIPS Conference, pp. 1057–1063, 2002.
 [246] L. Kocsis and C. Szepesvari. Bandit based Monte-Carlo planning. ECML Conference, pp. 282–293, 2006.
 [259] M. Lai. Giraffe: Using deep reinforcement learning to play chess. arXiv:1509.01549, 2015.
 [286] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39), pp. 1–40, 2016. Video at: https://sites.google.com/site/visuomotorpolicy/
 [290] M. Lewis, D. Yarats, Y. Dauphin, D. Parikh, and D. Batra. Deal or No Deal? End-to-End Learning for Negotiation Dialogues. arXiv:1706.05125, 2017. https://arxiv.org/abs/1706.05125
 [291] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation. arXiv:1606.01541, 2016. https://arxiv.org/abs/1606.01541
 [293] Y. Li. Deep reinforcement learning: An overview. arXiv:1701.07274, 2017. https://arxiv.org/abs/1701.07274
 [296] L.-J. Lin. Reinforcement learning for robots using neural networks. Technical Report, DTIC Document, 1993.
 [307] C. Maddison, A. Huang, I. Sutskever, and D. Silver. Move evaluation in Go using deep convolutional neural networks. International Conference on Learning Representations, 2015.
 [335] V. Mnih et al. Human-level control through deep reinforcement learning. Nature, 518(7540), pp. 529–533, 2015.
 [336] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv:1312.5602, 2013. https://arxiv.org/abs/1312.5602
 [337] V. Mnih et al. Asynchronous methods for deep reinforcement learning. ICML Conference, pp. 1928–1937, 2016.
 [338] V. Mnih, N. Heess, and A. Graves. Recurrent models of visual attention. NIPS Conference, pp. 2204–2212, 2014.
 [343] A. Moore and C. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13(1), pp. 103–130, 1993.
 [346] M. Müller, M. Enzenberger, B. Arneson, and R. Segal. Fuego: an open-source framework for board games and Go engine based on Monte-Carlo tree search. IEEE Transactions on Computational Intelligence and AI in Games, 2, pp. 259–270, 2010.
 [349] K. S. Narendra and K. Parthasarathy. Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1), pp. 4–27, 1990.
 [355] A. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. Uncertainty in Artificial Intelligence, pp. 406–415, 2000.
 [374] J. Peters and S. Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4), pp. 682–697, 2008.
 [381] D. Pomerleau. ALVINN, an autonomous land vehicle in a neural network. Technical Report, Carnegie Mellon University, 1989.
 [412] G. Rummery and M. Niranjan. On-line Q-learning using connectionist systems (Vol. 37). University of Cambridge, Department of Engineering, 1994.
 [421] A. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3, pp. 210–229, 1959.
 [424] W. Saunders, G. Sastry, A. Stuhlmueller, and O. Evans. Trial without Error: Towards Safe Reinforcement Learning via Human Intervention. arXiv:1707.05173, 2017. https://arxiv.org/abs/1707.05173
 [427] S. Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6), pp. 233–242, 1999.
 [428] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. arXiv:1511.05952, 2015. https://arxiv.org/abs/1511.05952
 [432] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. ICML Conference, 2015.
 [433] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. ICLR Conference, 2016.
 [440] I. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. AAAI Conference, pp. 3776–3784, 2016.
 [445] D. Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), pp. 484–489, 2016.
 [446] D. Silver et al. Mastering the game of Go without human knowledge. Nature, 550(7676), pp. 354–359, 2017.
 [447] D. Silver et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv:1712.01815, 2017. https://arxiv.org/abs/1712.01815
 [453] H. Simon. The Sciences of the Artificial. MIT Press, 1996.
 [481] I. Sutskever and V. Nair. Mimicking Go experts with convolutional neural networks. International Conference on Artificial Neural Networks, pp. 101–110, 2008.
 [482] R. Sutton. Learning to predict by the method of temporal differences. Machine Learning, 3, pp. 9–44, 1988.
 [483] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
 [484] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. NIPS Conference, pp. 1057–1063, 2000.
 [492] G. Tesauro. Practical issues in temporal difference learning. Advances in NIPS Conference, pp. 259–266, 1992.
 [493] G. Tesauro. TD-Gammon: A self-teaching backgammon program. Applications of Neural Networks, Springer, pp. 267–285, 1992.
 [494] G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3), pp. 58–68, 1995.
 [496] S. Thrun. Learning to play the game of chess. NIPS Conference, pp. 1069–1076, 1995.
 [498] Y. Tian, Q. Gong, W. Shang, Y. Wu, and L. Zitnick. ELF: An extensive, lightweight and flexible research platform for real-time strategy games. arXiv:1707.01067, 2017. https://arxiv.org/abs/1707.01067
 [508] O. Vinyals and Q. Le. A Neural Conversational Model. arXiv:1506.05869, 2015. https://arxiv.org/abs/1506.05869
 [519] C. J. H. Watkins. Learning from delayed rewards. PhD Thesis, King’s College, Cambridge, 1989.
 [520] C. J. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3–4), pp. 279–292, 1992.
 [533] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4), pp. 229–256, 1992.
 [540] K. Xu et al. Show, attend, and tell: Neural image caption generation with visual attention. ICML Conference, 2015.
 [563] V. Zhong, C. Xiong, and R. Socher. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv:1709.00103, 2017. https://arxiv.org/abs/1709.00103
 [569] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv:1611.01578, 2016. https://arxiv.org/abs/1611.01578
 [583]
 [602]
 [603]
 [604]
 [605]
 [606]
 [607]
 [608] https://qz.com/639952/googles-ai-won-the-game-go-by-defying-millennia-of-basic-human-instinct/
 [609]
 [610]
 [611]
 [612]
 [613]
 [619]
 [620]
 [621]
 [622]
 [623]
 [624]
 [625]
 [626]