# Reward-Based Learning, Model-Based and Model-Free

## Definition

Reinforcement learning (RL) techniques are a set of solutions for optimal long-term action choice such that actions take into account both immediate and delayed consequences. They fall into two broad classes: model-based and model-free approaches. Model-based approaches assume an explicit model of the environment and the agent. The model describes the consequences of actions and the associated returns. From this, optimal policies can be inferred. Psychologically, model-based descriptions apply to goal-directed decisions, in which choices reflect current preferences over outcomes. Model-free approaches forgo any explicit knowledge of the dynamics of the environment or the consequences of actions and instead evaluate how good actions are through trial-and-error learning. Model-free values underlie habitual and Pavlovian conditioned responses that are emitted reflexively when faced with certain stimuli. While model-based techniques have substantial computational demands, model-free techniques require extensive experience.

## Detailed Description

### Theory

#### Reinforcement Learning

Formally, RL problems are described as Markov decision processes (MDPs) with the following components:

\( \mathcal{S} \): a set of states \( s\in \mathcal{S} \).

\( \mathcal{A} \): a set of actions \( a\in \mathcal{A} \).

\( \mathcal{T}\left({s}^{\prime }|s,a\right) \): the transition function, which maps each state-action pair to a distribution over successor states \( {s}^{\prime } \), with \( s,{s}^{\prime}\in \mathcal{S};a\in \mathcal{A} \) and \( {\sum}_{s^{\prime }}\mathcal{T}\left({s}^{\prime }|s,a\right)=1 \).

\( \mathcal{R}\left(s,a,{s}^{\prime}\right)\to r \): the reinforcement function, which maps state-action-successor state triples to a scalar return \( r \).

The aim is to find a policy \( a\leftarrow \pi (s) \) that maps each state to the action maximizing the total expected future return of actions \( a \) in state \( s \):

\[ \mathcal{Q}\left(s,a\right)=\mathbb{E}\left[{\sum}_{t^{\prime }=0}^{\infty }{r}_{t^{\prime }}|s,a\right] \tag{1} \]

The sum in Eq. 1 may not be finite. For this reason, it is often replaced by the discounted total expected reward \( \mathbb{E}\left[{\sum}_{t^{\prime }=0}^{\infty }{\gamma}^{t^{\prime }}{r}_{t^{\prime }}|s,a\right] \) with the discount factor 0 ≤ *γ* ≤ 1. The discount factor sets the relative importance of immediate and future rewards: *γ* = 0 means that only the next reward is considered, whereas *γ* = 1 considers all rewards to have equal importance no matter how far in the future they occur.
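The effect of the discount factor can be checked numerically on a finite reward sequence. The following is a minimal sketch; the function name `discounted_return` and the example rewards are illustrative assumptions, not part of the original text.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence (t = 0, 1, ...)."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 1.0, 1.0, 1.0]           # hypothetical reward sequence
print(discounted_return(rewards, 0.0))   # 1.0: only the first reward counts
print(discounted_return(rewards, 0.5))   # 1.875
print(discounted_return(rewards, 1.0))   # 4.0: all rewards count equally
```

With an infinite stream of bounded rewards, the sum stays finite only for \( \gamma < 1 \), which is why the undiscounted case needs separate care.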

#### Model-Based RL

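Model-based RL computes these values from the model using the *Bellman equation* (Bellman 1957):

\[ \mathcal{Q}\left(s,a\right)={\sum}_{s^{\prime }}\mathcal{T}\left({s}^{\prime }|s,a\right)\left[\mathcal{R}\left(s,a,{s}^{\prime}\right)+\gamma \underset{a^{\prime }}{\max}\mathcal{Q}\left({s}^{\prime },{a}^{\prime}\right)\right] \]

where \( \mathcal{Q}\left(s,a\right) \) is the total expected future return of action \( a \) in state \( s \). The optimal policy maps each state to the action with the highest \( \mathcal{Q} \) value:

\[ \pi (s)=\underset{a}{\mathrm{argmax}}\;\mathcal{Q}\left(s,a\right) \]

Evaluating this recursion directly amounts to searching a decision tree with branching factor \( w \) (determined by the number of actions \( \left|\mathcal{A}\right| \) and the size of the state-space reached by these actions). The computational cost of simple tree search is \( \mathcal{O}\left({w}^d\right) \), where \( d \) is the depth of the tree (see Fig. 1 for an example). Although dynamic programming methods such as policy iteration reduce this cost to \( \mathcal{O}\left({\left|\mathcal{S}\right|}^3\right) \), this is still computationally prohibitive for most real-life problems and additionally difficult to implement neurally as it involves matrix inversion. Psychological and neurobiological accounts of model-based RL thus emphasize sequential evaluations of decision trees.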
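The growth of tree-search cost with depth can be illustrated with a depth-limited recursive evaluation of state-action values. This is a minimal sketch under stated assumptions: the two-state MDP in `T`, the function names, and the discount factor of 0.9 are all hypothetical and chosen only for illustration.

```python
# Hypothetical two-state MDP: T[s][a] is a list of (probability, next_state, reward).
T = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)],
           "go":   [(1.0, "s0", 0.0)]},
}


def q_value(s, a, depth, gamma, counter):
    """Depth-limited recursive backup for Q(s, a); counter[0] counts backups."""
    counter[0] += 1
    total = 0.0
    for prob, s_next, reward in T[s][a]:
        future = 0.0
        if depth > 1:
            # Expand every action in every reachable successor state.
            future = max(q_value(s_next, b, depth - 1, gamma, counter)
                         for b in T[s_next])
        total += prob * (reward + gamma * future)
    return total


def greedy_action(s, depth, gamma=0.9):
    """Return the best action, all Q values, and the number of backups used."""
    counter = [0]
    q = {a: q_value(s, a, depth, gamma, counter) for a in T[s]}
    return max(q, key=q.get), q, counter[0]


for depth in (1, 2, 4, 8):
    action, q, n_backups = greedy_action("s0", depth)
    print(depth, action, n_backups)
```

The number of backups grows roughly geometrically with the search depth, which is the cost that pruning and caching schemes try to avoid.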

#### Model-Free RL

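*Temporal difference reinforcement learning (TDRL)* constructs estimates of state or state-action values from these samples by bootstrapping. To achieve this, the total future reward is written as the sum of the immediate reward plus the average value of the successor state:

\[ \mathcal{V}(s)=\mathbb{E}\left[r+\gamma \mathcal{V}\left({s}^{\prime}\right)|s\right] \]

Writing the difference between the right-hand and left-hand sides as a prediction error \( {\delta}_{\mathcal{V}} \), one can arrive at correct values by iterative updates

\[ \mathcal{V}(s)\leftarrow \mathcal{V}(s)+\epsilon {\delta}_{\mathcal{V}} \]

with learning rate \( \epsilon \). In the model-free setting, the expectation is not available. Instead, the agent chooses actions \( {a}_t\sim \pi \left({s}_t\right) \) and on the \( t \)’th such interaction obtains state and reward samples from the world:

\[ {s}_{t+1}\sim \mathcal{T}\left(\cdot |{s}_t,{a}_t\right),\kern1em {r}_t=\mathcal{R}\left({s}_t,{a}_t,{s}_{t+1}\right) \tag{9} \]

The expectation is then replaced by these samples, giving the prediction error \( {\delta}_t={r}_t+\gamma {\mathcal{V}}_{t-1}\left({s}_{t+1}\right)-{\mathcal{V}}_{t-1}\left({s}_t\right) \) and the update \( {\mathcal{V}}_t\left({s}_t\right)\leftarrow {\mathcal{V}}_{t-1}\left({s}_t\right)+\epsilon {\delta}_t \).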

A similar approach can be applied to learning state-action values (Watkins and Dayan 1992). Thus, while model-based RL methods prospectively predict the consequences of actions based on an understanding of the structure of the world, model-free methods retrospectively approximate these based on past experience. Nevertheless, under certain conditions, model-free methods have strong convergence guarantees (Bertsekas and Tsitsiklis 1996; Sutton and Barto 1998; Puterman 2005). Policies *π* are in turn often formalized as parametric functions of the value functions \( \mathcal{V} \) or \( \mathcal{Q} \) themselves, although this may break some of these guarantees (Bertsekas and Tsitsiklis 1996).
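Temporal difference learning of state values from sampled transitions can be sketched as follows. The example assumes a hypothetical deterministic chain of three states that ends in a terminal state with reward 1; the function name `td0_chain` and all parameter values are illustrative assumptions. Each update moves the value of the visited state by a fraction of the prediction error, which is the sampled reward plus the discounted value of the successor state minus the current estimate.

```python
def td0_chain(n_states=3, gamma=0.9, epsilon=0.1, n_episodes=500):
    """TD(0) value learning on a chain 0 -> 1 -> ... -> terminal.

    The agent receives reward 1 on entering the terminal state and 0 otherwise.
    """
    # The last entry is the terminal state, whose value stays fixed at 0.
    V = [0.0] * (n_states + 1)
    for _ in range(n_episodes):
        s = 0
        while s < n_states:
            s_next = s + 1                        # sampled successor state
            r = 1.0 if s_next == n_states else 0.0  # sampled reward
            delta = r + gamma * V[s_next] - V[s]  # prediction error
            V[s] += epsilon * delta               # bootstrapped update
            s = s_next
    return V[:n_states]


print(td0_chain())
```

In this chain the learned values should approach \( {\gamma}^2 \), \( \gamma \), and 1 for the three states, with reward information propagating backwards one state per episode at first. No transition or reward model is stored at any point.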

**Actor-Critic** (Barto et al. 1983). The Critic uses TD to estimate the value \( {\mathcal{V}}_t(s) \) for states, while the Actor maintains the policy used to select actions. After each action \( {a}_t \), the Critic calculates the prediction error and sends it to the Actor. A positive prediction error indicates that the action improved the potential for future rewards, and the tendency to select the action should be increased. An example of using the prediction error is to select actions based on the Gibbs softmax method

\[ {\pi}_t\left(s,a\right)=\frac{e^{p_t\left(s,a\right)}}{\sum_b{e}^{p_t\left(s,b\right)}} \]

where \( {\pi}_t\left(s,a\right) \) is the probability of choosing action \( a \) in state \( s \) and \( {p}_t\left(s,a\right) \) defines the “propensity” to take action \( a \) in state \( s \). These propensities are updated by the prediction error: \( {p}_t\left(s,a\right)\leftarrow {p}_{t-1}\left(s,a\right)+\epsilon {\delta}_t \).
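The interplay of Critic and Actor can be sketched on a task with a single state and two actions. This is only an illustration: the task, the function names, the shared learning rate, and the fixed random seed are assumptions made here, and each trial ends after one action so that the value of the successor state is zero.

```python
import math
import random


def softmax(propensities):
    """Gibbs softmax over propensities (shifted by the maximum for stability)."""
    m = max(propensities)
    exps = [math.exp(p - m) for p in propensities]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic_bandit(rewards=(1.0, 0.0), epsilon=0.1, n_trials=2000, seed=0):
    """One-state Actor-Critic; rewards[a] is the deterministic reward of action a."""
    rng = random.Random(seed)
    V = 0.0                      # Critic: value of the single state
    p = [0.0] * len(rewards)     # Actor: propensities p(s, a)
    for _ in range(n_trials):
        probs = softmax(p)
        a = rng.choices(range(len(rewards)), weights=probs)[0]
        # Prediction error; the trial ends here, so the successor value is 0.
        delta = rewards[a] - V
        V += epsilon * delta     # Critic update
        p[a] += epsilon * delta  # Actor update of the chosen action only
    return softmax(p), V


probs, V = actor_critic_bandit()
print(probs, V)
```

The propensity of the rewarded action rises because choosing it yields positive prediction errors, while the propensity of the unrewarded action falls once the Critic has come to expect reward.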

#### Sampling and Computational Costs

The algorithms discussed so far suffer either from catastrophic computational requirements or from equally drastic dependence on extensive sampling in realistic environments. Solutions to these drawbacks fall into four general categories: (1) subdivision into smaller subtasks (possibly each having their own subgoal; cf. Dietterich 1999; Sutton et al. 1999); (2) pruning of the decision tree (cf. Knuth and Moore 1975; Huys et al. 2012); (3) approximations (e.g., neural networks for function approximation; Sutton and Barto 1998); and (4) structured representations (Boutilier et al. 1995) and sampling techniques (Kearns and Singh 2002; Kocsis and Szepesvari 2006).

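For a fixed policy \( \pi \), the value of a state is the expected immediate reward plus the expected value of the successor state:

\[ \mathcal{V}(s)={\sum}_{s^{\prime }}\mathcal{T}\left({s}^{\prime }|s,\pi (s)\right)\mathcal{R}\left(s,\pi (s),{s}^{\prime}\right)+{\sum}_{s^{\prime }}\mathcal{T}\left({s}^{\prime }|s,\pi (s)\right)\mathcal{V}\left({s}^{\prime}\right) \tag{13} \]

Collecting the state values into a vector \( \mathbf{V} \) and the transition probabilities into a matrix \( \mathbf{P} \) under the policy \( \pi \), we can rewrite this as

\[ \mathbf{V}=\mathbf{R}+\mathbf{PV}={\left(\mathbf{I}-\mathbf{P}\right)}^{-1}\mathbf{R} \]

where \( {\left[\mathbf{R}\right]}_s \) is the first sum in Eq. 13 above. That is, the values of the states are linear in the immediate rewards \( \mathbf{R} \), with the weights given by \( \mathbf{I}+\mathbf{P}+{\mathbf{P}}^2+{\mathbf{P}}^3+\dots ={\left(\mathbf{I}-\mathbf{P}\right)}^{-1} \), which is the total time spent in each state-action pair. This matrix of weights is known as the successor representation.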
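The linear relation between values and immediate rewards can be checked numerically by summing the powers of the transition matrix. The sketch below uses only the standard library; the three-state matrix `P`, the reward vector `R`, and the helper names are hypothetical. Each row of `P` sums to less than one in the last state, which stands for leaving the task, so that the series converges without discounting.

```python
def mat_mul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def mat_vec(A, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def occupancy_matrix(P, n_terms=200):
    """Truncated series I + P + P^2 + ..., approximating (I - P)^(-1)."""
    n = len(P)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total, power = identity, identity
    for _ in range(n_terms):
        power = mat_mul(power, P)
        total = [[t + p for t, p in zip(t_row, p_row)]
                 for t_row, p_row in zip(total, power)]
    return total


# Hypothetical policy-dependent transition matrix and expected immediate rewards.
P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 0.5]]
R = [0.0, 0.0, 1.0]

M = occupancy_matrix(P)   # expected number of visits to each state
V = mat_vec(M, R)         # values are linear in the immediate rewards
print(M)
print(V)
```

Once `M` is known, a change in the rewards only requires a new matrix-vector product, whereas a change in the transitions requires `M` itself to be relearned.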

The strengths of model-based and model-free computations can also be combined to offset their respective weaknesses. In Dyna-Q (Sutton 1990), samples as in Eq. 9 are generated from the agent’s internal estimates of \( \mathcal{T} \) and ℛ and used to update model-free values. Conversely, model-free state values can be substituted for subtrees to reduce the size of decision trees (e.g., Campbell et al. 2002).
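The Dyna-Q idea of replaying simulated samples can be sketched on a small task. Everything specific here is an assumption for illustration: the four-state chain, the lookup-table model (sufficient only because the task is deterministic), the exploration scheme, and all parameter values.

```python
import random

N_STATES = 4      # hypothetical chain: states 0..3, state 3 is terminal
ACTIONS = (0, 1)  # 0 = left, 1 = right


def step(s, a):
    """True environment: reward 1 on entering the terminal state."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, r


def dyna_q(n_episodes=50, n_planning=20, alpha=0.5, gamma=0.9,
           explore=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    model = {}  # learned internal model: (s, a) -> (s_next, r)

    def update(s, a, r, s_next):
        # Model-free update; the terminal state's Q values stay at 0.
        target = r + gamma * max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])

    for _ in range(n_episodes):
        s = 0
        for _ in range(1000):  # safety cap on episode length
            if s == N_STATES - 1:
                break
            if rng.random() < explore:
                a = rng.choice(ACTIONS)
            else:
                best = max(Q[s])
                a = rng.choice([b for b in ACTIONS if Q[s][b] == best])
            s_next, r = step(s, a)
            update(s, a, r, s_next)       # learn from real experience
            model[(s, a)] = (s_next, r)   # update the internal model
            for _ in range(n_planning):   # learn from simulated experience
                ps, pa = rng.choice(list(model))
                ps_next, pr = model[(ps, pa)]
                update(ps, pa, pr, ps_next)
            s = s_next
    return Q


for s, q in enumerate(dyna_q()):
    print(s, q)
```

The planning steps reuse each real transition many times, so far fewer interactions with the world are needed than with the model-free update alone, at the price of additional computation per step.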

If the states \( \mathcal{S} \) are not fully observable, the problem becomes a partially observable MDP (Kaelbling et al. 1998), which presents substantial additional complexities.

### Behavior

Model-based and model-free accounts of behavior were held to be incompatible for much of the last century (Hull 1943; Tolman 1948). However, key signatures of both systems can be discerned within individual animals’ (Balleine and Dickinson 1994; Killcross and Coutureau 2003; Yin et al. 2004, 2005) and humans’ (Valentin et al. 2007; Daw et al. 2011) behavior and neurobiology. These signatures reflect central differences in their utilization of information. For a discussion, see Daw et al. (2005), Dayan and Berridge (2014), and Huys et al. (2014). There is also evidence for the use of the successor representation in humans (Russek et al. 2017; Momennejad et al. 2017).

In instrumental paradigms, actions \( a \) are reinforced in the presence of certain stimuli or in situations \( s \). These experiments are modelled using \( \mathcal{Q}\left(s,a\right) \) values. In Pavlovian paradigms, stimuli \( s \) lead to reinforcements independent of subjects’ actions. These paradigms are modelled using stimulus values \( \mathcal{V}(s) \). Importantly, there can be model-based and model-free versions of both, leading to a quartet of values \( {\mathcal{V}}^{\mathrm{MF}}(s),{\mathcal{V}}^{\mathrm{MB}}(s),{\mathcal{Q}}^{\mathrm{MF}}\left(s,a\right) \) and \( {\mathcal{Q}}^{\mathrm{MB}}\left(s,a\right) \). Both model-free values \( {\mathcal{V}}^{\mathrm{MF}}(s) \) and \( {\mathcal{Q}}^{\mathrm{MF}}\left(s,a\right) \) are *scalar* representations that change *slowly*. These two features account for their key behavioral signatures (Fig. 2).

The consequences of the *scalar* nature of model-free values are most clearly seen in Pavlovian scenarios, where \( {\mathcal{V}}^{\mathrm{MF}}(s) \) reflects only the magnitude of reinforcements but not other aspects such as whether an action was rewarded by food or water. One paradigmatic example is blocking experiments (Kamin 1969). In these, learning the reward association of a stimulus “B” in a compound “AB” is prevented if “A” already fully predicts the reward. Because the reward is fully predicted, no prediction error occurs, model-free values are not updated, and hence no learning occurs. Thus, if the model-free system makes no prediction about certain aspects of stimuli, then shifts in these aspects should not lead to learning. In transreinforcer blocking, animals treat a reward reduction and delivery of a shock punishment as equivalent (Dickinson and Dearing 1979), arguing for a linear and unitary representation of rewards and punishments as encapsulated in the single value \( r \) in Eq. 11. In Pavlovian unblocking, animals similarly can show an insensitivity toward shifts between rewards of equal magnitude but different modality (e.g., water and food; McDannald et al. 2011), showing that only the reward value, but not its other sensory features, is encoded.

Being scalar, model-free values can, however, replace reinforcements and be approached (if positive; Dayan et al. 2006) or avoided (if negative; Guitart-Masip et al. 2011). In conditioned reinforcement experiments, behavior is motivated by stimuli associated with the rewards (i.e., having positive model-free value \( {\mathcal{V}}^{\mathrm{MF}}(s) \)) even in the absence of the rewards themselves (Bouton 2006). This can be captured by Actor-Critic models (Barto et al. 1983). By the same argument, model-free state or stimulus values \( {\mathcal{V}}^{\mathrm{MF}}(s) \) can also influence the vigor with which ongoing actions are performed (Pavlovian-instrumental transfer; Huys et al. 2011). These three features are also central to the notion of incentive value (McClure et al. 2003).
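The prediction-error account of blocking can be illustrated with a small simulation. The sketch assumes, as a simplification not stated in the text, that the prediction for a compound is the sum of the values of its component stimuli (as in the Rescorla-Wagner rule); the function names, trial numbers, and learning rate are likewise illustrative.

```python
def train(weights, stimuli, reward, n_trials, epsilon=0.1):
    """Update the values of the presented stimuli by the shared prediction error."""
    for _ in range(n_trials):
        prediction = sum(weights[s] for s in stimuli)  # assumed additive compound
        delta = reward - prediction                    # scalar prediction error
        for s in stimuli:
            weights[s] += epsilon * delta
    return weights


def blocking_experiment(pretrain):
    """Return stimulus values after compound training, with or without pretraining A."""
    w = {"A": 0.0, "B": 0.0}
    if pretrain:
        train(w, ["A"], 1.0, 100)   # phase 1: A alone predicts the reward
    train(w, ["A", "B"], 1.0, 100)  # phase 2: compound AB with the same reward
    return w


print(blocking_experiment(pretrain=True))   # B acquires almost no value
print(blocking_experiment(pretrain=False))  # A and B share the value
```

Because the update depends only on the scalar error, the same result would hold if the reward in the second phase changed in identity but not in magnitude.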

The consequences of the *slow* change of model-free values are most clearly seen after outcome devaluation. Once an outcome has been devalued, model-free values change only through newly experienced prediction errors \( \delta \) and hence would predict continued responding. Conversely, by considering the now undesired outcome of actions, model-based evaluation should lead to a reduction in lever pressing on the very first trial after the devaluation. Accounts of the shift from early model-based and goal-directed to later model-free and habitual behavior rely on their statistical properties (Daw et al. 2005) or the tradeoff between the cost of cognition and the value of improved choices (Keramati et al. 2011).
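The contrast between the two systems after devaluation can be made concrete with a toy example. All names and numbers below are hypothetical: a lever press deterministically yields food, the cached model-free values are assumed to have converged before devaluation, and devaluation is modelled as a change in the utility of food from 1 to -1.

```python
# Hypothetical action-outcome model used by the model-based system.
outcome_of = {"press": "food", "rest": "nothing"}


def model_based_q(utility):
    """Recompute action values from the model and the current outcome utilities."""
    return {a: utility[o] for a, o in outcome_of.items()}


def model_free_update(q, action, reward, epsilon=0.1):
    """Return a copy of q with one prediction-error update for the chosen action."""
    q = dict(q)
    q[action] += epsilon * (reward - q[action])
    return q


def greedy(q):
    return max(q, key=q.get)


# Cached values learned before devaluation (assumed converged).
q_mf = {"press": 1.0, "rest": 0.0}
utility = {"food": -1.0, "nothing": 0.0}  # food has been devalued

print(greedy(model_based_q(utility)))  # model-based choice changes immediately
print(greedy(q_mf))                    # model-free choice is unchanged

# The model-free system has to experience the devalued outcome repeatedly.
trials = 0
while greedy(q_mf) == "press":
    q_mf = model_free_update(q_mf, "press", utility["food"])
    trials += 1
print(trials)
```

With these parameters the model-free system keeps pressing for several trials, whereas the model-based system switches before experiencing the devalued outcome at all.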

### Neurobiology

The best-understood component of model-free learning is the representation of the temporal prediction error *δ*. Interpreting earlier work by Schultz and Romo (1990), Montague et al. (1996) pointed out that the phasic firing of dopaminergic midbrain neurons corresponds closely to the positive portion of the prediction error *δ*. This has been extensively validated with single-electrode recordings (even in humans; Zaghloul et al. 2009), functional neuroimaging (D’Ardenne et al. 2008), cyclic voltammetry (Day et al. 2007), optogenetic manipulations (Steinberg et al. 2013), and in diseases of the dopamine neurons (Frank et al. 2004). This is true both in Pavlovian (Waelti et al. 2001; Flagel et al. 2011) and instrumental scenarios (Morris et al. 2006; Roesch et al. 2007). These phasic prediction errors are not just a linear reflection of the magnitude and probability of the expected reward (Tobler et al. 2005; Bayer and Glimcher 2005) but also of the summed long-term future rewards (Schultz et al. 1997; Enomoto et al. 2011). Dopamine neurons have a low baseline firing rate and therefore appear to represent the negative portion of the prediction errors *δ* by the length of the pause in firing (Bayer et al. 2007). Phasic firing covaries with the development of behavioral responses (Waelti et al. 2001; Flagel et al. 2011) and can causally drive learning (Steinberg et al. 2013; Saunders et al. 2018). Furthermore, pharmacological manipulations of dopamine alter the behavioral expression of model-free vs model-based behaviors (Nelson and Killcross 2006; Wunderlich et al. 2012).

In comparison, the neural location where prediction errors are summated into model-free values is much less well understood, although multiple parts of the affective neural circuitry appear to be involved, from the ventral (Cardinal et al. 2002; Corbit and Balleine 2011; McDannald et al. 2011) and dorsal portions of the striatum (Yin et al. 2004, 2005), the ventromedial prefrontal cortex (Killcross and Coutureau 2003; Smith and Graybiel 2013), to the amygdala (Corbit and Balleine 2005).

Similarly, the neural bases of the model-based system are also poorly understood. Depending on the nature of the structure represented in \( \mathcal{T} \), different neural substrates will be required. Hence, there is a priori no reason to expect a unitary representation of a single model-based system. However, particular features of the system can probably be pinpointed. For instance, learning about a stimulus-stimulus transition matrix recruits the posterior parietal cortex (Gläscher et al. 2010), while model-based expectations of stimulus value involve the ventromedial prefrontal cortex (Hampton et al. 2006; Schoenbaum et al. 2009). Recordings from spatial navigation tasks in the rodent hippocampus are so far unique in yielding direct neural evidence of the implementation of sequential tree search (Johnson and Redish 2007; Pfeiffer and Foster 2013).

### Psychopathology

Given the representation of a key model-free component by dopaminergic neurons, pathological excesses of dopamine have been suggested to involve a shift from model-based toward model-free decision-making (Redish et al. 2008; Robbins et al. 2012; Huys et al. 2014). This has been clearly demonstrated in laboratory animals (Dickinson et al. 2000; Nelson and Killcross 2006), though data in humans has been less clear-cut (Voon et al. 2015; Sebold et al. 2017; Nebe et al. 2017). Similar arguments have been made about other disorders with a striatal component, particularly obsessive-compulsive disorders (Gillan et al. 2011, 2016), and models incorporating additional neurobiological details about the striatum can account for some of the choice patterns seen in Parkinson’s disease, ADHD, and Tourette’s (Maia and Frank 2011).

