Rescorla–Wagner Models with Sparse Dynamic Attention


The Rescorla–Wagner (R–W) model describes human associative learning by proposing that an agent updates associations between stimuli, such as events in their environment or predictive cues, in proportion to a prediction error. While this model has proven informative in experiments, it has been posited that humans selectively attend to certain cues to overcome a problem the R–W model faces in scaling to large cue dimensions. We formally characterize this scaling problem and provide a solution that limits attention in an R–W model to a sparse set of cues. Given the universal difficulty of selecting features for prediction, sparse attention faces challenges beyond those faced by the R–W model. We demonstrate several ways in which a naive attention model can fail, explain those failures, and leverage that understanding to produce the Sparse Attention R–W with Inference (SAR-WI) framework. The SAR-WI framework not only satisfies a constraint on the number of attended cues; it also performs as well as the R–W model on a number of natural learning tasks, can correctly infer associative strengths, and focuses attention on predictive cues while ignoring uninformative cues. Given the simplicity of the proposed alterations, we hope this work informs future development and empirical validation of associative learning models that seek to incorporate sparse attention.




  1. If \(R = (\sum C_i) \mod 2\), then no single cue contains any usable information, though collectively the cues contain complete information.
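The footnote's parity construction can be checked directly. The sketch below (plain Python, with all binary cue patterns over a small N assumed equally likely, purely for illustration) confirms that each cue alone is uninformative about R while the cues jointly determine it:

```python
import itertools

# Parity reward from the footnote: R = (sum of C_i) mod 2, with all 2^N binary
# cue patterns assumed equally likely (a toy setup for illustration).
N = 3
outcomes = list(itertools.product([0, 1], repeat=N))

# Each cue alone is uninformative: P(R = 1 | C_i = c) = 0.5 for every i and c.
for i in range(N):
    for c in (0, 1):
        rewards = [sum(C) % 2 for C in outcomes if C[i] == c]
        assert sum(rewards) / len(rewards) == 0.5

# Yet jointly the cues determine R exactly: R is a function of the full pattern.
reward_of = {C: sum(C) % 2 for C in outcomes}
assert len(reward_of) == 2 ** N
```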


  1. Alexander WH (2007) Shifting attention using a temporal difference prediction error and high-dimensional input. Adapt Behav 15(2):121–133

  2. Bellman R (1966) Dynamic programming. Science 153(3731):34–37

  3. Blair MR, Watson MR, Walshe RC, Maj F (2009) Extremely selective attention: eye-tracking studies of the dynamic allocation of attention to stimulus features in categorization. J Exp Psychol Learn Memory Cognit 35(5):1196

  4. Cochran AL, Cisler JM (2019) A flexible and generalizable model of online latent-state learning. PLoS Comput Biol 15(9):e1007331

  5. Denton SE, Kruschke JK (2006) Attention and salience in associative blocking. Learn Behav 34(3):285–304

  6. Esber GR, Haselgrove M (2011) Reconciling the influence of predictiveness and uncertainty on stimulus salience: a model of attention in associative learning. Proc R Soc B Biol Sci 278(1718):2553–2561

  7. Fan J, Lv J (2010) A selective overview of variable selection in high dimensional feature space. Stat Sin 20(1):101

  8. Frank MJ, Badre D (2011) Mechanisms of hierarchical reinforcement learning in corticostriatal circuits 1: computational analysis. Cereb Cortex 22(3):509–526

  9. Frey PW, Sears RJ (1978) Model of conditioning incorporating the Rescorla–Wagner associative axiom, a dynamic attention process, and a catastrophe rule. Psychol Rev 85(4):321

  10. Gluck MA, Bower GH (1988) From conditioning to category learning: an adaptive network model. J Exp Psychol Gen 117(3):227

  11. Gordon GJ (2001) Reinforcement learning with function approximation converges to a region. In: Advances in neural information processing systems, pp 1040–1046

  12. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3(Mar):1157–1182

  13. Hagerup T, Mehlhorn K, Munro J (1993) Optimal algorithms for generating discrete random variables with changing distributions. Lect Notes Comput Sci 700:253–264

  14. Harris JA (2006) Elemental representations of stimuli in associative learning. Psychol Rev 113(3):584

  15. Hauser TU, Iannaccone R, Walitza S, Brandeis D, Brem S (2015) Cognitive flexibility in adolescence: neural and behavioral mechanisms of reward prediction error processing in adaptive decision making during development. Neuroimage 104:347–354

  16. Hitchcock P, Niv Y, Radulescu A, Rothstein NJ, Sims CR (2019) Measuring trial-wise choice difficulty in multi-feature reinforcement learning. PsyArXiv

  17. Kim S, Rehder B (2011) How prior knowledge affects selective attention during category learning: an eyetracking study. Memory Cognit 39(4):649–665

  18. Koenig S, Kadel H, Uengoer M, Schubö A, Lachnit H (2017) Reward draws the eye, uncertainty holds the eye: associative learning modulates distractor interference in visual search. Front Behav Neurosci 11:128

  19. Kokkola NH, Mondragón E, Alonso E (2019) A double error dynamic asymptote model of associative learning. Psychol Rev 126(4):506

  20. Kruschke JK (1992) ALCOVE: an exemplar-based connectionist model of category learning. Psychol Rev 99(1):22

  21. Lawrence DH (1949) Acquired distinctiveness of cues: I. Transfer between discriminations on the basis of familiarity with the stimulus. J Exp Psychol 39(6):770

  22. Lawrence DH (1950) Acquired distinctiveness of cues: II. Selective association in a constant stimulus situation. J Exp Psychol 40(2):175

  23. Le Pelley ME (2004) The role of associative history in models of associative learning: a selective review and a hybrid model. Q J Exp Psychol Sect B 57(3b):193–243

  24. Le Pelley M, Beesley T, Griffiths O (2011) Overt attention and predictiveness in human contingency learning. J Exp Psychol Anim Behav Process 37(2):220

  25. Leong YC, Radulescu A, Daniel R, DeWoskin V, Niv Y (2017) Dynamic interaction between reinforcement learning and attention in multidimensional environments. Neuron 93(2):451–463

  26. Li J, Schiller D, Schoenbaum G, Phelps EA, Daw ND (2011) Differential roles of human striatum and amygdala in associative learning. Nat Neurosci 14(10):1250

  27. Lovejoy E (1968) Attention in discrimination learning: a point of view and a theory. Holden-Day, San Francisco

  28. Mackintosh NJ (1975) A theory of attention: variations in the associability of stimuli with reinforcement. Psychol Rev 82(4):276

  29. McLaren I, Mackintosh N (2000) An elemental model of associative learning: I. Latent inhibition and perceptual learning. Anim Learn Behav 28(3):211–246

  30. Meier KM, Blair MR (2013) Waiting and weighting: information sampling is a balance between efficiency and error-reduction. Cognition 126(2):319–325

  31. Niv Y, Daniel R, Geana A, Gershman SJ, Leong YC, Radulescu A, Wilson RC (2015) Reinforcement learning in multidimensional environments relies on attention mechanisms. J Neurosci 35(21):8145–8157

  32. Nosofsky RM, Palmeri TJ, McKinley SC (1994) Rule-plus-exception model of classification learning. Psychol Rev 101(1):53

  33. Pearce JM, Hall G (1980) A model for Pavlovian learning: variations in the effectiveness of conditioned but not of unconditioned stimuli. Psychol Rev 87(6):532

  34. Rehder B, Hoffman AB (2005) Eyetracking and selective attention in category learning. Cognit Psychol 51(1):1–41

  35. Rescorla RA, Wagner AR (1972) A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement. Class Cond II Curr Res Theory 2:64–99

  36. Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat 22:400–407

  37. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning internal representations by error propagation. In: Rumelhart DE, McClelland JL (eds) Parallel distributed processing: explorations in the microstructure of cognition, vol 1. MIT Press, Cambridge, MA

  38. Schmajuk NA, Lam YW, Gray J (1996) Latent inhibition: a neural network approach. J Exp Psychol Anim Behav Process 22(3):321

  39. Sutherland NS, Mackintosh NJ (2016) Mechanisms of animal discrimination learning. Academic Press, New York

  40. Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. MIT Press, Cambridge

  41. Trabasso T, Bower GH (1975) Attention in learning: theory and research. Krieger Pub Co, Malabar

  42. Wang J, Zhao P, Hoi SC, Jin R (2014) Online feature selection and its applications. IEEE Trans Knowl Data Eng 26(3):698–710

  43. Wills AJ, Lavric A, Croft G, Hodgson TL (2007) Predictive learning, prediction errors, and attention: evidence from event-related potentials and eye tracking. J Cognit Neurosci 19(5):843–854

  44. Young ME, Wasserman EA (2002) Limited attention and cue order consistency affect predictive learning: a test of similarity measures. J Exp Psychol Learn Memory Cognit 28(3):484

  45. Yu K, Wu X, Ding W, Pei J (2016) Scalable and accurate online feature selection for big data. ACM Trans Knowl Discov Data (TKDD) 11(2):16

  46. Zeaman D, House BJ (1963) The role of attention in retardate discrimination learning. In: Ellis NR (ed) Handbook of mental deficiency, vol 1(3). McGraw-Hill, New York, pp 159–223

  47. Zhou P, Hu X, Li P, Wu X (2017) Online feature selection for high-dimensional class-imbalanced data. Knowl Based Syst 136:187–199


Author information



Corresponding author

Correspondence to Amy L. Cochran.



A Limits

Here, we provide details on the expressions for the expectation \(\theta _t\) and covariance matrix \(\varSigma _t\) of associative strength on trial t. Throughout, we assume only that observations \((R_t,C_t)\) are mutually independent between trials (Assumption 2). Taking the expectation of the R–W update in (1) gives a recursive expression for \(\theta _t\):

$$\begin{aligned} \theta _{t+1}= & {} \theta _t + \mathbb {E}\left[ \alpha C_t\left( R_t - C_t' V_t\right) \right] \\= & {} \theta _t + \alpha \mathbb {E}\left[ R_tC_t\right] - \alpha \mathbb {E}\left[ C_tC_t' \right] \theta _t\\= & {} \theta _t - \alpha \mathbb {E}\left[ C_tC_t' \right] (\theta _t-v^*) \\= & {} v^* + \left( I-\alpha \mathbb {E}\left[ C_tC_t' \right] \right) (\theta _t-v^*), \end{aligned}$$

where Eq. 2 allows for the substitution introducing \(v^*\). Applying this recursive formula repeatedly over trials t, a direct formula can be recovered:

$$\begin{aligned} \theta _{t+1} = v^* + \left( I-\alpha \mathbb {E}\left[ C_tC_t' \right] \right) ^t (\theta _1-v^*). \end{aligned}$$

Thus, how quickly expected associative strength \(\theta _{t+1}\) goes to \(v^*\) depends on \(I-\alpha \mathbb {E}\left[ C_tC_t' \right] \). Repeated application of the sub-multiplicative property yields the following bound:

$$\begin{aligned} \begin{Vmatrix} \theta _{t+1}-v^* \end{Vmatrix}= & {} \begin{Vmatrix} \left( I-\alpha \mathbb {E}\left[ C_tC_t' \right] \right) ^t (v^*-\theta _1) \end{Vmatrix} \\\le & {} \begin{Vmatrix} I-\alpha \mathbb {E}\left[ C_tC_t' \right] \end{Vmatrix}^t \begin{Vmatrix}v^*-\theta _1\end{Vmatrix}, \end{aligned}$$

where we are using the 2-norm for vectors and matrices. Furthermore, the 2-norm of a symmetric matrix such as \(I-\alpha \mathbb {E}\left[ C_tC_t' \right] \) is the largest of the absolute values of its eigenvalues. Additionally, all eigenvalues \(\rho _{C,1},\ldots ,\rho _{C,N}\) of \(\mathbb {E}\left[ C_tC_t' \right] \) are real and nonnegative, since \(\mathbb {E}\left[ C_tC_t' \right] \) is symmetric positive semidefinite, and the eigenvalues of \(I-\alpha \mathbb {E}\left[ C_tC_t' \right] \) are \(1-\alpha \rho _{C,1},\ldots ,1-\alpha \rho _{C,N}\), since \(I-\alpha \mathbb {E}\left[ C_tC_t' \right] \) shares eigenvectors with \(\mathbb {E}\left[ C_tC_t' \right] \). Thus,

$$\begin{aligned} \begin{Vmatrix} I-\alpha \mathbb {E}\left[ C_tC_t' \right] \end{Vmatrix} = \max _{i} \begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix}. \end{aligned}$$

The condition \(\max _{i} \begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix} < 1\) ensures mean associative strength \(\theta _t\) converges exponentially to \(v^*\). Consequently,

$$\begin{aligned} \theta _{t+1} = v^* + \mathcal {O}\left( \max _{i} \begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix}^t\right) , \end{aligned}$$

as noted in the main text.
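These expressions are easy to check numerically. The sketch below (a toy setup with independent Bernoulli cues and a noiseless linear reward, all assumed purely for illustration) iterates the mean recursion and compares it against the direct formula and the stated convergence rate:

```python
import numpy as np

rng = np.random.default_rng(0)
N, alpha, T = 4, 0.1, 200

# Toy setup (assumed): independent Bernoulli cues with activation
# probabilities p and a noiseless linear reward R_t = C_t' v_true,
# so that v* = v_true solves E[C C'] v* = E[R C].
p = rng.uniform(0.2, 0.8, N)
v_true = rng.normal(size=N)
ECC = np.outer(p, p)
np.fill_diagonal(ECC, p)          # E[C_i C_j] = p_i p_j (i != j), E[C_i^2] = p_i
ERC = ECC @ v_true

# Iterate the mean recursion theta_{t+1} = theta_t + alpha (E[RC] - E[CC'] theta_t):
theta = np.zeros(N)               # theta_1 = V_1 = 0
for _ in range(T):
    theta = theta + alpha * (ERC - ECC @ theta)

# Direct formula: theta_{T+1} = v* + (I - alpha E[CC'])^T (theta_1 - v*)
direct = v_true + np.linalg.matrix_power(np.eye(N) - alpha * ECC, T) @ (-v_true)
assert np.allclose(theta, direct)

# The error contracts at rate max_i |1 - alpha rho_{C,i}| per trial:
rate = np.abs(1 - alpha * np.linalg.eigvalsh(ECC)).max()
assert rate < 1
assert np.linalg.norm(theta - v_true) <= rate**T * np.linalg.norm(v_true) + 1e-12
```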

For the covariance matrix of \(V_{t+1}\), denoted by \(\varSigma _{t+1}\), we first note that

$$\begin{aligned} \varSigma _{t+1}= & {} \text {Var}\left[ V_t + \alpha C_t R_t - \alpha C_tC_t'V_t \right] \\= & {} \text {Var}\left[ V_t + \alpha \left( R_t C_t - C_tC_t'\theta _t\right) - \alpha C_tC_t'\left( V_t-\theta _t\right) \right] . \end{aligned}$$

Upon expanding the expression above using properties of variance, we have that

$$\begin{aligned} \varSigma _{t+1}= & {} \varSigma _t + \alpha ^2 \text {Var}\left[ R_t C_t - C_tC_t'\theta _t\right] + \alpha ^2 \text {Var}\left[ C_tC_t'\left( V_t-\theta _t\right) \right] \\&+\, \alpha \mathrm {Cov}\left[ V_t,R_t C_t - C_t C_t' \theta _t\right] + \alpha \mathrm {Cov}\left[ R_t C_t - C_t C_t' \theta _t,V_t\right] \\&-\, \alpha \mathrm {Cov}\left[ V_t,C_t C_t' (V_t-\theta _t)\right] - \alpha \mathrm {Cov}\left[ C_t C_t' (V_t-\theta _t),V_t\right] \\&-\, \alpha ^2 \mathrm {Cov}\left[ R_t C_t - C_t C_t' \theta _t, C_t C_t' (V_t - \theta _t) \right] \\&-\, \alpha ^2 \mathrm {Cov}\left[ C_t C_t' (V_t - \theta _t), R_t C_t - C_t C_t' \theta _t \right] \end{aligned}$$

We can use the fact that \(V_t\) is independent from \(R_t\) and \(C_t\) to show that several terms above are zero, since

$$\begin{aligned}&\mathrm {Cov}\left[ V_t,R_t C_t - C_t C_t' \theta _t\right] \\&\quad = \mathbb {E}\left[ \left( V_t-\theta _t\right) \left( R_t C_t - C_t C_t' \theta _t - \mathbb {E}[R_t C_t - C_t C_t'\theta _t]\right) '\right]&= 0, \\&\mathrm {Cov}\left[ R_t C_t - C_t C_t' \theta _t,V_t\right] \\&\quad = \mathbb {E}\left[ \left( R_t C_t - C_t C_t' \theta _t - \mathbb {E}[R_t C_t - C_t C_t'\theta _t]\right) \left( V_t-\theta _t\right) '\right]&= 0, \\&\mathrm {Cov}\left[ R_t C_t - C_t C_t' \theta _t, C_t C_t' (V_t - \theta _t) \right] \\&\quad = \mathbb {E}\left[ \left( R_t C_t - C_t C_t' \theta _t - \mathbb {E}[R_t C_t - C_t C_t'\theta _t]\right) (V_t - \theta _t)' C_t C_t'\right]&= 0, \\&\mathrm {Cov}\left[ C_t C_t' (V_t - \theta _t), R_t C_t - C_t C_t' \theta _t \right] \\&\quad = \mathbb {E}\left[ C_t C_t' (V_t - \theta _t)\left( R_t C_t - C_t C_t' \theta _t- \mathbb {E}[R_t C_t - C_t C_t'\theta _t]\right) '\right]&= 0. \end{aligned}$$

Setting these terms to zero yields

$$\begin{aligned} \varSigma _{t+1}= & {} \varSigma _t + \alpha ^2 \text {Var}\left[ R_t C_t - C_tC_t'\theta _t\right] + \alpha ^2 \text {Var}\left[ C_tC_t'\left( V_t-\theta _t\right) \right] \\&- \,\alpha \mathrm {Cov}\left[ V_t,C_t C_t' (V_t-\theta _t)\right] - \alpha \mathrm {Cov}\left[ C_t C_t' (V_t-\theta _t),V_t\right] . \end{aligned}$$

We can use \(\varSigma _t\) to rewrite some remaining terms in the above expression:

$$\begin{aligned} \text {Var}\left[ C_tC_t'(V_t-\theta _t)\right]= & {} \mathbb {E}\left[ C_tC_t'\varSigma _t C_tC_t'\right] , \\ \mathrm {Cov}\left[ V_t,C_tC_t'(V_t-\theta _t)\right]= & {} \mathrm {Cov}\left[ V_t-\theta _t,C_tC_t'(V_t-\theta _t)\right] = \varSigma _t\mathbb {E}\left[ C_tC_t'\right] ; \\ \mathrm {Cov}\left[ C_tC_t'(V_t-\theta _t),V_t\right]= & {} \mathrm {Cov}\left[ C_tC_t'(V_t-\theta _t),V_t-\theta _t\right] =\mathbb {E}\left[ C_tC_t'\right] \varSigma _t. \end{aligned}$$

Plugging these expressions in the prior expression leads to

$$\begin{aligned} \varSigma _{t+1}= & {} \varSigma _t - \alpha \varSigma _t\mathbb {E}\left[ C_tC_t'\right] - \alpha \mathbb {E}\left[ C_tC_t'\right] \varSigma _t + \alpha ^2 \mathbb {E}\left[ C_tC_t'\varSigma _t C_tC_t'\right] \\&+ \, \alpha ^2 \text {Var}\left[ R_t C_t - C_tC_t'\theta _t\right] . \end{aligned}$$

Because \(\varSigma _t\) is involved in left and right matrix multiplication, we can use vectorization to simplify this expression. Namely,

$$\begin{aligned}&\mathtt {vec}\left( \varSigma _t - \alpha \varSigma _t\mathbb {E}\left[ C_tC_t'\right] - \alpha \mathbb {E}\left[ C_tC_t'\right] \varSigma _t + \alpha ^2 \mathbb {E}\left[ C_tC_t'\varSigma _t C_tC_t'\right] \right) \\&\quad = \mathtt {vec}(\varSigma _t) - \alpha \, \mathtt {vec}\left( \varSigma _t\mathbb {E}\left[ C_tC_t'\right] \right) - \alpha \, \mathtt {vec}\left( \mathbb {E}\left[ C_tC_t'\right] \varSigma _t\right) \\&\qquad + \alpha ^2 \mathtt {vec}\left( \mathbb {E}\left[ C_tC_t'\varSigma _t C_tC_t'\right] \right) \\&\quad = \mathbb {E}\left[ I - \alpha (I\otimes C_t C_t')-\alpha (C_t C_t'\otimes I) +\alpha ^2 (C_t C_t'\otimes C_t C_t') \right] \mathtt {vec}(\varSigma _t) \\&\quad = (I-\alpha K_{\alpha }) \mathtt {vec}(\varSigma _t), \end{aligned}$$

where \(\otimes \) is the Kronecker product, \(\mathtt {vec}\) reshapes a matrix into a vector, and

$$\begin{aligned} K_{\alpha } = \mathbb {E}\left[ (I\otimes C_t C_t')+(C_t C_t'\otimes I) -\alpha (C_t C_t'\otimes C_t C_t') \right] . \end{aligned}$$

Thus, vectorizing the equation for \(\varSigma _{t+1}\) gives

$$\begin{aligned} \mathtt {vec}(\varSigma _{t+1})= & {} (I-\alpha K_{\alpha })\mathtt {vec}(\varSigma _t) + \alpha ^2 \mathtt {vec}\left( \text {Var}\left[ R_tC_t-C_tC_t'\theta _t\right] \right) . \end{aligned}$$

This recursive formula can be used to recover a direct formula:

$$\begin{aligned} \mathtt {vec}(\varSigma _{t+1})&= (I-\alpha K_{\alpha })^t\mathtt {vec}(\varSigma _1)\\&\quad + \alpha ^2 \sum _{s=1}^{t} (I-\alpha K_{\alpha })^{t-s} \mathtt {vec}(\text {Var}\left[ R_s C_s-C_s C_s' \theta _s\right] ) \\&=\alpha ^2 \sum _{s=1}^{t} (I-\alpha K_{\alpha })^{t-s} \mathtt {vec}(\text {Var}\left[ R_s C_s-C_s C_s' \theta _s\right] ), \end{aligned}$$

where \(\varSigma _1=0\) since \(V_1\) is set to a fixed value.
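The vectorized recursion can likewise be checked numerically. The sketch below (with a small empirical cue distribution and a stand-in noise term, both assumed for illustration) iterates the matrix-form and vectorized updates in parallel and confirms they agree; note that the column-major convention for \(\mathtt {vec}\) must match the Kronecker identities used above:

```python
import numpy as np

rng = np.random.default_rng(1)
N, alpha, T = 3, 0.1, 50

# Assumed toy cue distribution: 8 equally likely binary cue patterns.
Cs = rng.integers(0, 2, size=(8, N)).astype(float)
CCs = [np.outer(c, c) for c in Cs]
ECC = np.mean(CCs, axis=0)
noise = 0.05 * np.eye(N)          # stand-in for Var[R_t C_t - C_t C_t' theta_t]

# Column-major vec, matching vec(A X B) = (B' kron A) vec(X):
vec = lambda M: M.flatten(order="F")
I, I2 = np.eye(N), np.eye(N * N)
K = np.mean([np.kron(I, M) + np.kron(M, I) - alpha * np.kron(M, M) for M in CCs],
            axis=0)               # K_alpha from the text

# Matrix-form recursion for Sigma_{t+1}:
S = np.zeros((N, N))
for _ in range(T):
    quad = np.mean([M @ S @ M for M in CCs], axis=0)   # E[C C' Sigma C C']
    S = S - alpha * (S @ ECC + ECC @ S) + alpha**2 * (quad + noise)

# Vectorized recursion with the same K_alpha:
s = np.zeros(N * N)
for _ in range(T):
    s = (I2 - alpha * K) @ s + alpha**2 * vec(noise)

assert np.allclose(vec(S), s)     # the two recursions agree
```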

To arrive at the final expression provided in the main text, we need the following three results:

  1. (1)

    If \(K_{\alpha }^{-1}\) exists, then \(\alpha ^2 \sum _{s=1}^{t} (I-\alpha K_{\alpha })^{t-s} = \alpha K_{\alpha }^{-1}\left( I- (I-\alpha K_{\alpha })^t \right) .\)

  2. (2)

    If \(\max _{i} \begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix} < 1\), then \(\text {Var}\left[ R_t C_t-C_t C_t' \theta _t\right] = \text {Var}\left[ R_t C_t-C_t C_t' v^*\right] + \mathcal {O}\left( \max _{i} \begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix}^t\right) \).

  3. (3)

    \(I - \alpha K_{\alpha } = \mathcal {O}\left( 1-\alpha \rho _{K,\alpha }\right) \) where \(\rho _{K,\alpha }\) is the smallest eigenvalue of \(K_{\alpha }\).

Result 1 follows from the existence of \(K_{\alpha }^{-1}\), since then

$$\begin{aligned} \sum _{s=0}^{\infty } (I-\alpha K_{\alpha })^s = (I-(I-\alpha K_{\alpha }))^{-1} = \frac{1}{\alpha } K_{\alpha }^{-1}, \end{aligned}$$

so that

$$\begin{aligned} \alpha ^2 \sum _{s=1}^{t} (I-\alpha K_{\alpha })^{t-s}&= \alpha ^2 \sum _{s=0}^{t-1} (I-\alpha K_{\alpha })^{s} \\&= \alpha ^2 \sum _{s=0}^{\infty } (I-\alpha K_{\alpha })^s - \alpha ^2 \sum _{s=t}^{\infty } (I-\alpha K_{\alpha })^s \\&= \alpha ^2 \sum _{s=0}^{\infty } (I-\alpha K_{\alpha })^s - \alpha ^2 (I-\alpha K_{\alpha })^t \sum _{s=0}^{\infty } (I-\alpha K_{\alpha })^s\\&= \alpha ^2 \sum _{s=0}^{\infty } (I-\alpha K_{\alpha })^s \left( I- (I-\alpha K_{\alpha })^t\right) \\&= \alpha K_{\alpha }^{-1}\left( I- (I-\alpha K_{\alpha })^t \right) . \end{aligned}$$
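Result 1 is a finite geometric series identity and holds for any invertible \(K_{\alpha }\). A quick numerical check (with a randomly generated positive definite matrix standing in for \(K_{\alpha }\), an assumption made only for the demonstration) is:

```python
import numpy as np

rng = np.random.default_rng(2)
n, alpha, t = 4, 0.1, 30

# Random symmetric positive definite matrix standing in for K_alpha (assumed):
A = rng.normal(size=(n, n))
K = A @ A.T / n + 0.5 * np.eye(n)     # eigenvalues >= 0.5, so K is invertible
I = np.eye(n)

# Left side: alpha^2 * sum_{s=1}^{t} (I - alpha K)^{t-s}
lhs = alpha**2 * sum(np.linalg.matrix_power(I - alpha * K, t - s)
                     for s in range(1, t + 1))
# Right side: alpha * K^{-1} (I - (I - alpha K)^t)
rhs = alpha * np.linalg.solve(K, I - np.linalg.matrix_power(I - alpha * K, t))

assert np.allclose(lhs, rhs)          # the finite geometric series identity
```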

Remark 4

Assuming \(K_{\alpha }\) is invertible is reasonable when exploring convergence of the covariance matrix \(\varSigma _t\): to guarantee convergence, we later impose the stronger assumption that all eigenvalues of \(K_{\alpha }\) are strictly positive, which in turn ensures that \(K_{\alpha }\) is invertible. Thus \(K_{\alpha }\) is invertible in the situation we care about, namely when \(\varSigma _t\) converges. Of course, the covariance matrix does not always converge, as we saw in the haystack problem described in the main text, and in those situations \(K_{\alpha }\) may or may not be invertible.

Result 2 uses \(\mathbb {E}[R_t C_t-C_t C_t'v^*]=0\) from the definition of \(v^*\) and \((\theta _t-v^*)=\mathcal {O}\left( \max _i \begin{vmatrix} 1 - \alpha \rho _{C,i}\end{vmatrix}^t\right) \) from our analysis of the mean of \(V_t\) to get that

$$\begin{aligned} \text {Var}\left[ R_t C_t-C_t C_t' \theta _t\right] =&\text {Var}\left[ R_t C_t-C_t C_t' v^* - C_t C_t'(\theta _t-v^*)\right] \\ =&\text {Var}\left[ R_t C_t-C_t C_t' v^*\right] +\,\text {Var}\left( C_t C_t'(\theta _t-v^*)\right) \\&-\,\mathrm {Cov}\left( R_t C_t-C_t C_t' v^*, C_t C_t'(\theta _t-v^*)\right) \\&-\,\mathrm {Cov}\left( C_t C_t'(\theta _t-v^*),R_t C_t-C_t C_t' v^*\right) \\ =&\text {Var}\left[ R_t C_t-C_t C_t' v^*\right] \\&+\,\mathbb {E}\left[ (C_t C_t'-\mathbb {E}[C_t C_t'])(\theta _t-v^*)(\theta _t-v^*)'(C_t C_t'-\mathbb {E}[C_t C_t'])\right] \\&-\,\mathbb {E}\left[ \left( R_t C_t-C_t C_t' v^*\right) (\theta _t-v^*)'\left( C_t C_t'-\mathbb {E}[C_t C_t']\right) \right] \\&-\,\mathbb {E}\left[ \left( C_t C_t'-\mathbb {E}[C_t C_t']\right) (\theta _t-v^*)\left( R_t C_t-C_t C_t' v^*\right) '\right] \\ =&\text {Var}\left[ R_t C_t-C_t C_t' v^*\right] \\&+ \mathcal {O}\left( \max \{\max _{i} \begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix}^t, \max _{i} \begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix}^{2t} \}\right) . \end{aligned}$$

If \(\max _{i}\begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix} < 1\) (an assumption we later need to ensure convergence of the covariance matrix), then we can drop \(\max _{i} \begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix}^{2t}\) from the last term since then \(\begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix}^2< \begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix} < 1\) for all i. Hence, we arrive at the desired result:

$$\begin{aligned} \text {Var}\left[ R_t C_t-C_t C_t' \theta _t\right] =&\text {Var}\left[ R_t C_t-C_t C_t' v^*\right] + \mathcal {O}\left( \max _{i} \begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix}^t \right) \end{aligned}$$

Result 3 follows from noting that symmetric matrix \(I - \alpha K_{\alpha }\) shares eigenvectors with symmetric matrix \(K_{\alpha }\) and is also positive semi-definite, since

$$\begin{aligned} I - \alpha K_{\alpha }= & {} \mathbb {E}\left[ (I-\alpha C_t C_t')\otimes (I-\alpha C_t C_t') \right] . \end{aligned}$$

Thus, eigenvalues of \(I - \alpha K_{\alpha }\) are nonnegative and are all of the form \(1-\alpha \rho \) for each eigenvalue \(\rho \) of \(K_{\alpha }.\) Consequently,

$$\begin{aligned} I - \alpha K_{\alpha } = \mathcal {O}\left( 1-\alpha \rho _{K,\alpha }\right) \end{aligned}$$

where \(\rho _{K,\alpha }\) is the smallest eigenvalue of \(K_{\alpha }\).

Putting these last three results together when \(K_{\alpha }^{-1}\) exists and \(\max _i\begin{vmatrix}1-\alpha \rho _{C,i}\end{vmatrix} < 1\) leads to

$$\begin{aligned} \mathtt {vec}(\varSigma _{t+1})&= \alpha ^2 \sum _{s=1}^{t} (I-\alpha K_{\alpha })^{t-s} \mathtt {vec}\left( \text {Var}\left[ R_s C_s-C_s C_s' v^*\right] \right. \\&\quad \left. + \mathcal {O}\left( \max _{i} \begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix}^s\right) \right) \\&= \alpha K_{\alpha }^{-1} (I-(I-\alpha K_{\alpha })^t)\,\mathtt {vec}\left( \text {Var}\left[ R_t C_t-C_t C_t' v^*\right] \right) \\&\quad +\, \alpha ^2 \sum _{s=1}^{t} (I-\alpha K_{\alpha })^{t-s} \mathcal {O}\left( \max _{i} \begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix}^s\right) \\&= \alpha K_{\alpha }^{-1}\left( I-\mathcal {O}\left( \left( 1-\alpha \rho _{K,\alpha }\right) ^t\right) \right) \,\mathtt {vec}\left( \text {Var}\left[ R_t C_t-C_t C_t' v^*\right] \right) \\&\quad +\, \alpha ^2 \sum _{s=1}^{t} \mathcal {O}\left( \max \{\max _i \begin{vmatrix} 1-\alpha \rho _{C,i}\end{vmatrix}^t,\,\left( 1-\alpha \rho _{K,\alpha }\right) ^t\}\right) \\&=\alpha K_{\alpha }^{-1} \mathtt {vec}\left( \text {Var}\left[ R_tC_t-C_tC_t'v^*\right] \right) \\&\quad +\, \mathcal {O}\left( t\max \{\max _i \begin{vmatrix} 1-\alpha \rho _{C,i}\end{vmatrix}^t,\,\left( 1-\alpha \rho _{K,\alpha }\right) ^t\}\right) . \end{aligned}$$

In conclusion, if \(\rho _{K,\alpha } > 0\) (which implies \(K_{\alpha }^{-1}\) exists) and \(\max _i\begin{vmatrix}1-\alpha \rho _{C,i}\end{vmatrix} < 1\), then the covariance matrix \(\varSigma _t\) converges to a long-term covariance matrix that scales linearly with \(\alpha \).

B Adding a Decay Term

Associative models of learning often incorporate a decay term into their updates that shrinks associative strengths toward zero. While usually only absent cues have their associative strengths decay (cf. Niv et al. 2015), to simplify subsequent analysis, we consider an R–W model modified to decay all associative strengths toward zero using the following update:

$$\begin{aligned} V_{t+1} = V_t + \alpha C_t \left( R_t - C_t'V_t\right) - \alpha \eta V_t, \end{aligned}$$

for some decay parameter \(\eta \) with \(1>\alpha \eta \ge 0\). Under Assumptions 1–2, this version of the R–W model is stochastic gradient descent applied to the following objective function:

$$\begin{aligned} \mathbb {E}\left[ \left( R_t - C_t'V_t\right) ^2 +\eta V_t'V_t\right] , \end{aligned}$$

which seeks associative strengths that minimize a combination of the squared prediction error and the squared norm of the associative strengths. Thus, this R–W model with decay addresses the L2-regularized version of the original prediction problem addressed by the R–W model. A solution \(v^*_{\eta }\) to this regularized objective function satisfies:

$$\begin{aligned} \mathbb {E}\left[ \eta I + C_t C_t'\right] v^*_{\eta } = \mathbb {E}\left[ R_t C_t \right] . \end{aligned}$$
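As a quick check that the decay update indeed targets this regularized solution, the sketch below (again assuming independent Bernoulli cues and a noiseless linear reward, purely for illustration) iterates the mean recursion for the R–W model with decay and compares its fixed point against \(v^*_{\eta }\):

```python
import numpy as np

rng = np.random.default_rng(3)
N, alpha, eta, T = 4, 0.05, 0.5, 5000

# Assumed toy setup: independent Bernoulli cues, reward R_t = C_t' v_true.
p = rng.uniform(0.3, 0.7, N)
v_true = rng.normal(size=N)
ECC = np.outer(p, p)
np.fill_diagonal(ECC, p)
ERC = ECC @ v_true                      # E[R_t C_t]

# Regularized fixed point: (eta I + E[CC']) v*_eta = E[R C]
v_eta = np.linalg.solve(eta * np.eye(N) + ECC, ERC)

# Mean recursion for the R-W model with decay:
theta = np.zeros(N)
for _ in range(T):
    theta = theta + alpha * (ERC - ECC @ theta) - alpha * eta * theta

assert np.allclose(theta, v_eta, atol=1e-8)          # converges to v*_eta
assert np.linalg.norm(v_eta) < np.linalg.norm(v_true)  # decay shrinks associations
```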

Following similar arguments in the R–W model, we can recover an expression for the expectation \(\theta _t\) and covariance matrix \(\varSigma _t\) of associative strength on trial t for this R–W model with decay. Expectation of the update rule gives:

$$\begin{aligned} \theta _{t+1}= & {} \theta _t + \mathbb {E}\left[ \alpha C_t\left( R_t - C_t' V_t\right) \right] - \alpha \eta \theta _t\\= & {} v^*_{\eta } + \left( I - \alpha (\eta I + \mathbb {E}\left[ C_tC_t' \right] )\right) (\theta _t-v^*_{\eta }), \end{aligned}$$

and thus:

$$\begin{aligned} \theta _{t+1} = v^*_{\eta } + \left( I-\alpha (\eta I + \mathbb {E}\left[ C_tC_t' \right] )\right) ^t (\theta _1-v^*_{\eta }). \end{aligned}$$

In other words, \(\theta _{t+1}\) approaches \(v^*_{\eta }\) at a rate that depends on

$$\begin{aligned} \begin{Vmatrix}I-\alpha \left( \eta I + \mathbb {E}\left[ C_tC_t' \right] \right) \end{Vmatrix} = \max _{i} \begin{vmatrix} 1-\alpha (\eta + \rho _{C,i}) \end{vmatrix} \end{aligned}$$

provided \(\max _{i} \begin{vmatrix} 1-\alpha (\eta + \rho _{C,i}) \end{vmatrix} < 1\). As before, \(\rho _{C,i}\) denotes an eigenvalue of \(\mathbb {E}[C_t C_t']\). For sufficiently small \(\alpha \) and \(\eta \), the decay term can speed up convergence in the mean.

For the covariance matrix \(\varSigma _{t+1}\), we have

$$\begin{aligned} \varSigma _{t+1}= & {} \text {Var}\left[ V_t + \alpha C_t R_t - \alpha C_tC_t'V_t - \alpha \eta V_t \right] \\= & {} \text {Var}\left[ V_t + \alpha \left( R_t C_t - (\eta I+ C_tC_t')\theta _t\right) - \alpha (\eta I + C_tC_t')\left( V_t-\theta _t\right) \right] , \end{aligned}$$

which expands to

$$\begin{aligned} \varSigma _{t+1}= & {} \varSigma _t - \alpha \varSigma _t\mathbb {E}\left[ \eta I + C_tC_t'\right] - \alpha \mathbb {E}\left[ \eta I + C_tC_t'\right] \varSigma _t\\&+ \alpha ^2 \mathbb {E}\left[ (\eta I + C_tC_t')\varSigma _t (\eta I + C_tC_t')\right] \\&+ \, \alpha ^2 \text {Var}\left[ R_t C_t - (\eta I +C_tC_t')\theta _t\right] . \end{aligned}$$


If we define

$$\begin{aligned} K_{\alpha ,\eta } = \mathbb {E}\left[ I\otimes (\eta I + C_t C_t')+(\eta I + C_t C_t')\otimes I -\alpha (\eta I + C_t C_t')\otimes (\eta I + C_t C_t') \right] \end{aligned}$$

with minimum eigenvalue \(\rho _{K,\alpha ,\eta }\), then vectorization yields

$$\begin{aligned} \mathtt {vec}(\varSigma _{t+1})&= (I-\alpha K_{\alpha ,\eta })\mathtt {vec}(\varSigma _t) + \alpha ^2 \mathtt {vec}(\text {Var}\left[ R_t C_t-(\eta I + C_t C_t') \theta _t\right] ) \\&=\alpha ^2 \sum _{s=1}^{t} (I-\alpha K_{\alpha ,\eta })^{t-s} \mathtt {vec}(\text {Var}\left[ R_s C_s-(\eta I + C_s C_s') \theta _s\right] ). \end{aligned}$$

Using similar arguments as before, we have that when \(K_{\alpha ,\eta }^{-1}\) exists:

$$\begin{aligned} \mathtt {vec}(\varSigma _{t+1})&= \alpha K_{\alpha ,\eta }^{-1} \mathtt {vec}\left( \text {Var}\left[ R_tC_t-(\eta I + C_tC_t')v^*_{\eta }\right] \right) \\&\quad +\, \mathcal {O}\left( t\max \{\max _i \begin{vmatrix} 1-\alpha (\eta + \rho _{C,i})\end{vmatrix}^t,\,\left( 1-\alpha \rho _{K,\alpha ,\eta }\right) ^t\}\right) \end{aligned}$$

Note when \(\eta =0\), then \(v^*_0 = v^*\), \(K_{\alpha ,0}=K_{\alpha }\), and \(\rho _{K,\alpha ,0}=\rho _{K,\alpha }\), and hence, we recover our expressions for an R–W model without decay.

Based on these derived expressions, we reproduced figures from the main text for the R–W model with decay. Comparing Fig. 11 to Fig. 3 from the main text, the R–W model with decay leads to faster convergence of the mean and variance of associative strengths than the R–W model without decay for sufficiently small \(\alpha \) and N, but does not expand the region of \(\alpha ,N\) values with convergent means and variances. Decay has little effect on long-term variance.

Fig. 11

Statistical properties of associative strength \(V_t\) for the R–W model with decay in the haystack problem. Dashed line marks the boundary between regions where long-term variance exists and does not exist. Decay parameter \(\eta \) was set to 0.5 (Color figure online)

Fig. 12

Statistical properties of associative strength \(V_t\) for the Sparse R–W model with decay in the haystack problem. The dashed line is \(k=N\), which represents the R–W model with decay. Decay parameter \(\eta \) was set to 0.5, and learning rate \(\alpha \) was 0.2 (Color figure online)

Fig. 13

Estimated error in prediction for R–W, Leong, SAR-WI, and SAR-WI with decay on four learning tasks. All models have \(\alpha = 0.1\) and \(N=20\); the Leong model was run with \(\beta =10\), the two SAR-WI models were run with an attention bandwidth k of 5, and the SAR-WI model with decay was run with \(\eta =0.5\). Expectations were estimated by averaging \(4 \times 10^4\) simulations of the task (Color figure online)

We can extend the R–W model with decay to the sparse attention setting:

$$\begin{aligned} V_{t+1} = V_t + \alpha X_t C_t \left( R_t - (X_t C_t)'V_t\right) - \alpha \eta V_t. \end{aligned}$$

Notice that this version of association decay applies to all associations equally, and thus represents a general forgetting process, presumably due to some biological or cognitive mechanism that decays or overwrites all associations. Since unattended associations are also updated by the decay term, the computations are no longer sparse. The net effect is that repeatedly unattended associations decay toward zero, while attended ones can be sustained. As before, the derivations for the mean and variance hold with \(C_t\) replaced by \(X_t C_t\). Comparing Fig. 12 to Fig. 4 from the main text, the Sparse R–W model with decay leads to faster convergence of the mean and variance of associative strengths than the Sparse R–W model without decay, but leads to worse long-term variance.
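A single update step of the Sparse R–W model with decay can be sketched as follows (the uniformly random choice of attended cues is purely illustrative; the attention schemes in the main text weight cues by their estimated relevance):

```python
import numpy as np

rng = np.random.default_rng(4)
N, k, alpha, eta = 6, 2, 0.1, 0.5

V = rng.normal(size=N)                   # current associative strengths
C = rng.integers(0, 2, N).astype(float)  # cue vector on this trial
R = 1.0                                  # observed reward (assumed value)

# X_t: diagonal attention matrix selecting k cues (chosen uniformly at random
# here, purely for illustration).
attended = rng.choice(N, size=k, replace=False)
x = np.zeros(N)
x[attended] = 1.0

# Sparse R-W with decay: V_{t+1} = V_t + alpha X C (R - (X C)' V) - alpha eta V
xc = x * C                               # X_t C_t for a diagonal X_t
V_next = V + alpha * xc * (R - xc @ V) - alpha * eta * V

# The error-driven part of the update touches only attended cues...
assert np.all((alpha * xc * (R - xc @ V))[x == 0] == 0)
# ...but the decay term shrinks every association, so the update is not sparse.
assert np.all(np.abs(V_next[x == 0]) < np.abs(V[x == 0]))
```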

We can also add a general decay term to the SAR-WI framework:

$$\begin{aligned} V_{t+1}&= V_t + \alpha X_t C^0_t(R^0_t -(X_tC^0_t)'V_t) - \eta \alpha V_t, \\ \bar{C}_{t+1}&= \bar{C}_t + \alpha X_t(C_t-\bar{C}_t) - \eta \alpha \bar{C}_t \\ \bar{R}_{t+1}&= \bar{R}_t + \alpha (R_t-\bar{R}_t) - \eta \alpha \bar{R}_t \end{aligned}$$

where as before

$$\begin{aligned} C^0_t&= C_t - \bar{C}_t\\ R^0_t&= R_t - \bar{R}_t. \end{aligned}$$

Figure 13 adds the SAR-WI model with decay to the comparison of models on four tasks. On these tasks and for these sets of parameters, the SAR-WI model with decay performs worse than the SAR-WI model without decay.

C Selecting Cues Without a Fixed Attention Bandwidth

When satisfying the fixed attention bandwidth constraint is less important, or when it is natural to let the number of attended cues change in response to their associations, another approach is to sample cues independently of one another:

$$\begin{aligned} P(X_{t,i,i}=1) = \frac{ V_{t,i}^2 \bar{C}_{t,i}(1-\bar{C}_{t,i}) + \epsilon }{D + \epsilon }, \end{aligned}$$

for some \(\epsilon >0\) and some normalization constant D. If rewards are Bernoulli, then one could use \(D = \bar{R}_t(1-\bar{R}_t)\), since independence of the Bernoulli cues between trials ensures

$$\begin{aligned} \sum _{i} (v_{i}^*)^2 \mathbb {E}[C_{t,i}] (1-\mathbb {E}[C_{t,i}]) \le \mathbb {E}[R_t](1-\mathbb {E}[R_t]). \end{aligned}$$
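A sketch of this independent sampling scheme follows (the running estimates below are hypothetical stand-ins for \(V_t\), \(\bar{C}_t\), and \(\bar{R}_t\)):

```python
import numpy as np

rng = np.random.default_rng(5)
N, eps = 5, 0.01

# Hypothetical running estimates (stand-ins for V_t, C-bar_t, R-bar_t):
V = np.array([0.8, 0.0, 0.3, -0.5, 0.05])
C_bar = np.full(N, 0.5)
R_bar = 0.4

# Attention probability per cue, with the Bernoulli-reward normalization
# D = R-bar (1 - R-bar):
D = R_bar * (1 - R_bar)
probs = (V**2 * C_bar * (1 - C_bar) + eps) / (D + eps)
probs = np.clip(probs, 0.0, 1.0)   # guard: running estimates can transiently exceed the bound

# Sample each cue's attention indicator independently; the number of attended
# cues now varies trial to trial instead of being fixed at a bandwidth k.
X_diag = (rng.random(N) < probs).astype(float)

assert np.all((0.0 <= probs) & (probs <= 1.0))
assert set(X_diag) <= {0.0, 1.0}
```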


About this article


Cite this article

Nishimura, J., Cochran, A.L. Rescorla–Wagner Models with Sparse Dynamic Attention. Bull Math Biol 82, 69 (2020).



  • Rescorla–Wagner
  • Attention
  • Learning