Abstract
The Rescorla–Wagner (R–W) model describes human associative learning by proposing that an agent updates associations between stimuli, such as events in their environment or predictive cues, proportionally to a prediction error. While this model has proven informative in experiments, it has been posited that humans selectively attend to certain cues to overcome a problem with the R–W model scaling to large cue dimensions. We formally characterize this scaling problem and provide a solution that involves limiting attention in an R–W model to a sparse set of cues. Given the universal difficulty of selecting features for prediction, sparse attention faces challenges beyond those faced by the R–W model. We demonstrate several ways in which a naive attention model can fail, explain those failures, and leverage that understanding to produce a Sparse Attention R–W with Inference framework (SAR-WI). The SAR-WI framework not only satisfies a constraint on the number of attended cues but also performs as well as the R–W model on a number of natural learning tasks, can correctly infer associative strengths, and focuses attention on predictive cues while ignoring uninformative cues. Given the simplicity of the proposed alterations, we hope this work informs future development and empirical validation of associative learning models that seek to incorporate sparse attention.
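As a concrete anchor for the model discussed throughout, here is a minimal sketch of the prediction-error update the R–W model proposes. The task, cue count, and parameter values are illustrative assumptions of ours, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def rw_update(V, C, R, alpha=0.05):
    """One Rescorla-Wagner step: nudge associative strengths V
    in proportion to the prediction error R - C'V."""
    delta = R - C @ V          # prediction error on this trial
    return V + alpha * delta * C

# Hypothetical task (not from the paper): 5 binary cues, and only
# cue 0 actually drives the reward.
N = 5
V = np.zeros(N)
for _ in range(5000):
    C = rng.integers(0, 2, size=N).astype(float)
    R = C[0]
    V = rw_update(V, C, R)

# V[0] ends up near 1 while the other strengths hover near 0.
```

The point of the sketch is only the shape of the update: associations move toward whatever weights make the prediction error small, which is the mechanism the abstract refers to.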
Notes
If \(R = (\sum C_i) \mod 2\), then no single cue contains any usable information, though collectively the cues contain complete information.
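The parity construction in this note can be checked directly by enumeration; the cue count N = 4 below is an arbitrary choice:

```python
import itertools

# With R = (sum C_i) mod 2 over independent fair binary cues, knowing any
# single cue leaves R uniform, yet all cues together determine R exactly.
N = 4
outcomes_given_c0 = {0: [], 1: []}
for C in itertools.product([0, 1], repeat=N):
    R = sum(C) % 2
    outcomes_given_c0[C[0]].append(R)

mean_r_c0_off = sum(outcomes_given_c0[0]) / len(outcomes_given_c0[0])
mean_r_c0_on = sum(outcomes_given_c0[1]) / len(outcomes_given_c0[1])
# Both conditional means are 0.5: cue 0 alone carries no information
# about R, even though (C_1, ..., C_N) jointly pin down R.
```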
References
Alexander WH (2007) Shifting attention using a temporal difference prediction error and high-dimensional input. Adapt Behav 15(2):121–133
Bellman R (1966) Dynamic programming. Science 153(3731):34–37
Blair MR, Watson MR, Walshe RC, Maj F (2009) Extremely selective attention: eye-tracking studies of the dynamic allocation of attention to stimulus features in categorization. J Exp Psychol Learn Memory Cognit 35(5):1196
Cochran AL, Cisler JM (2019) A flexible and generalizable model of online latent-state learning. PLoS Comput Biol 15(9):e1007331
Denton SE, Kruschke JK (2006) Attention and salience in associative blocking. Learn Behav 34(3):285–304
Esber GR, Haselgrove M (2011) Reconciling the influence of predictiveness and uncertainty on stimulus salience: a model of attention in associative learning. Proc R Soc B Biol Sci 278(1718):2553–2561
Fan J, Lv J (2010) A selective overview of variable selection in high dimensional feature space. Stat Sin 20(1):101
Frank MJ, Badre D (2011) Mechanisms of hierarchical reinforcement learning in corticostriatal circuits 1: computational analysis. Cereb Cortex 22(3):509–526
Frey PW, Sears RJ (1978) Model of conditioning incorporating the Rescorla–Wagner associative axiom, a dynamic attention process, and a catastrophe rule. Psychol Rev 85(4):321
Gluck MA, Bower GH (1988) From conditioning to category learning: an adaptive network model. J Exp Psychol Gen 117(3):227
Gordon GJ (2001) Reinforcement learning with function approximation converges to a region. In: Advances in neural information processing systems, pp 1040–1046
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3(Mar):1157–1182
Hagerup T, Mehlhorn K, Munro J (1993) Optimal algorithms for generating discrete random variables with changing distributions. Lect Notes Comput Sci 700:253–264
Harris JA (2006) Elemental representations of stimuli in associative learning. Psychol Rev 113(3):584
Hauser TU, Iannaccone R, Walitza S, Brandeis D, Brem S (2015) Cognitive flexibility in adolescence: neural and behavioral mechanisms of reward prediction error processing in adaptive decision making during development. Neuroimage 104:347–354
Hitchcock P, Niv Y, Radulescu A, Rothstein NJ, Sims CR (2019) Measuring trial-wise choice difficulty in multi-feature reinforcement learning. PsyArXiv. https://doi.org/10.31234/osf.io/ma3cf
Kim S, Rehder B (2011) How prior knowledge affects selective attention during category learning: an eyetracking study. Memory Cognit 39(4):649–665
Koenig S, Kadel H, Uengoer M, Schubö A, Lachnit H (2017) Reward draws the eye, uncertainty holds the eye: associative learning modulates distractor interference in visual search. Front Behav Neurosci 11:128
Kokkola NH, Mondragón E, Alonso E (2019) A double error dynamic asymptote model of associative learning. Psychol Rev 126(4):506
Kruschke JK (1992) ALCOVE: an exemplar-based connectionist model of category learning. Psychol Rev 99(1):22
Lawrence DH (1949) Acquired distinctiveness of cues: I. Transfer between discriminations on the basis of familiarity with the stimulus. J Exp Psychol 39(6):770
Lawrence DH (1950) Acquired distinctiveness of cues: II. Selective association in a constant stimulus situation. J Exp Psychol 40(2):175
Le Pelley ME (2004) The role of associative history in models of associative learning: a selective review and a hybrid model. Q J Exp Psychol Sect B 57(3b):193–243
Le Pelley M, Beesley T, Griffiths O (2011) Overt attention and predictiveness in human contingency learning. J Exp Psychol Anim Behav Process 37(2):220
Leong YC, Radulescu A, Daniel R, DeWoskin V, Niv Y (2017) Dynamic interaction between reinforcement learning and attention in multidimensional environments. Neuron 93(2):451–463
Li J, Schiller D, Schoenbaum G, Phelps EA, Daw ND (2011) Differential roles of human striatum and amygdala in associative learning. Nat Neurosci 14(10):1250
Lovejoy E (1968) Attention in discrimination learning: a point of view and a theory. Holden-Day, San Francisco
Mackintosh NJ (1975) A theory of attention: variations in the associability of stimuli with reinforcement. Psychol Rev 82(4):276
McLaren I, Mackintosh N (2000) An elemental model of associative learning: I. Latent inhibition and perceptual learning. Anim Learn Behav 28(3):211–246
Meier KM, Blair MR (2013) Waiting and weighting: information sampling is a balance between efficiency and error-reduction. Cognition 126(2):319–325
Niv Y, Daniel R, Geana A, Gershman SJ, Leong YC, Radulescu A, Wilson RC (2015) Reinforcement learning in multidimensional environments relies on attention mechanisms. J Neurosci 35(21):8145–8157
Nosofsky RM, Palmeri TJ, McKinley SC (1994) Rule-plus-exception model of classification learning. Psychol Rev 101(1):53
Pearce JM, Hall G (1980) A model for Pavlovian learning: variations in the effectiveness of conditioned but not of unconditioned stimuli. Psychol Rev 87(6):532
Rehder B, Hoffman AB (2005) Eyetracking and selective attention in category learning. Cognit Psychol 51(1):1–41
Rescorla RA, Wagner AR (1972) A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement. Class Cond II Curr Res Theory 2:64–99
Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat 22:400–407
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning internal representations by error propagation. In: Rumelhart DE, McClelland JL (eds) Parallel distributed processing: explorations in the microstructure of cognition, vol 1. MIT Press, Cambridge, MA
Schmajuk NA, Lam YW, Gray J (1996) Latent inhibition: a neural network approach. J Exp Psychol Anim Behav Process 22(3):321
Sutherland NS, Mackintosh NJ (2016) Mechanisms of animal discrimination learning. Academic Press, New York
Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. MIT press, Cambridge
Trabasso T, Bower GH (1975) Attention in learning: theory and research. Krieger Pub Co, Malabar
Wang J, Zhao P, Hoi SC, Jin R (2014) Online feature selection and its applications. IEEE Trans Knowl Data Eng 26(3):698–710
Wills AJ, Lavric A, Croft G, Hodgson TL (2007) Predictive learning, prediction errors, and attention: evidence from event-related potentials and eye tracking. J Cognit Neurosci 19(5):843–854
Young ME, Wasserman EA (2002) Limited attention and cue order consistency affect predictive learning: a test of similarity measures. J Exp Psychol Learn Memory Cognit 28(3):484
Yu K, Wu X, Ding W, Pei J (2016) Scalable and accurate online feature selection for big data. ACM Trans Knowl Discov Data (TKDD) 11(2):16
Zeaman D, House BJ (1963) The role of attention in retardate discrimination learning. In: Ellis NR (ed) Handbook of mental deficiency, vol 1(3). McGraw-Hill, New York, pp 159–223
Zhou P, Hu X, Li P, Wu X (2017) Online feature selection for high-dimensional class-imbalanced data. Knowl Based Syst 136:187–199
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
A Limits
Here, we provide details on the expression for the expectation and covariance matrix of associative strength on trial t: \(\theta _t\) and \(\varSigma _t\). Throughout, we assume only that observations \((R_t,C_t)\) are mutually independent between trials (Assumption 2). Taking expectation of the R–W update in (1) gives a recursive expression for \(\theta _t\):
where Eq. 2 allows for the substitution introducing \(v^*\). Applying this recursive formula repeatedly over trials t, a direct formula can be recovered:
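For readability, here is a plausible reconstruction of the two omitted displays, assuming the standard R–W update \(V_{t+1} = V_t + \alpha C_t (R_t - C_t' V_t)\) and the defining property \(\mathbb{E}[C_t C_t']\,v^* = \mathbb{E}[R_t C_t]\) (a sketch, not the published equations verbatim):

```latex
% Recursion for the mean, after substituting v^*:
\theta_{t+1}
  = \theta_t + \alpha\left(\mathbb{E}[R_t C_t] - \mathbb{E}[C_t C_t']\,\theta_t\right)
  = \left(I - \alpha\,\mathbb{E}[C_t C_t']\right)\left(\theta_t - v^*\right) + v^*

% Unrolling from trial 1 gives the direct formula:
\theta_{t+1} - v^* = \left(I - \alpha\,\mathbb{E}[C_t C_t']\right)^{t}\left(\theta_1 - v^*\right)
```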
Thus, how quickly expected associative strength \(\theta _{t+1}\) goes to \(v^*\) depends on \(I-\alpha \mathbb {E}\left[ C_tC_t' \right] \). Repeated application of the sub-multiplicative property yields the following bound:
where we are using the 2-norm for vectors and matrices. Furthermore, the 2-norm of a symmetric matrix such as \(I-\alpha \mathbb {E}\left[ C_tC_t' \right] \) is the largest absolute value among its eigenvalues. Additionally, all eigenvalues \(\rho _{C,1},\ldots ,\rho _{C,N}\) of \(\mathbb {E}\left[ C_tC_t' \right] \) are real and nonnegative, since \(\mathbb {E}\left[ C_tC_t' \right] \) is positive semidefinite, and the eigenvalues of \(I-\alpha \mathbb {E}\left[ C_tC_t' \right] \) are \(1-\alpha \rho _{C,1},\ldots ,1-\alpha \rho _{C,N}\), since \(I-\alpha \mathbb {E}\left[ C_tC_t' \right] \) shares eigenvectors with \(\mathbb {E}\left[ C_tC_t' \right] \). Thus,
The condition \(\max _{i} \begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix} < 1\) ensures mean associative strength \(\theta _t\) converges exponentially to \(v^*\). Consequently,
as noted in the main text.
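Written out, the bound and limit being invoked are plausibly:

```latex
\left\| \theta_{t+1} - v^* \right\|
  \le \left\| I - \alpha\,\mathbb{E}[C_t C_t'] \right\|^{t}\,\left\| \theta_1 - v^* \right\|
  = \Big( \max_i \left| 1 - \alpha \rho_{C,i} \right| \Big)^{t}\,\left\| \theta_1 - v^* \right\|
  \;\longrightarrow\; 0
```

with convergence exactly when \(\max_i |1-\alpha\rho_{C,i}| < 1\).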
For the covariance matrix of \(V_{t+1}\), denoted by \(\varSigma _{t+1}\), we first note that
Upon expanding the expression above using properties of variance, we have that
We can use the fact that \(V_t\) is independent from \(R_t\) and \(C_t\) to show that several terms above are zero, since
Setting these terms to zero yields
We can use \(\varSigma _t\) to rewrite some remaining terms in the above expression:
Plugging these expressions in the prior expression leads to
Because \(\varSigma _t\) is involved in left and right matrix multiplication, we can use vectorization to simplify this expression. Namely,
where \(\otimes \) is the Kronecker product, \(\mathtt {vec}\) reshapes a matrix into a vector, and
Thus, vectorizing the equation for \(\varSigma _{t+1}\) gives
This recursive formula can be used to recover a direct formula:
where \(\varSigma _1=0\) since \(V_1\) is set to a fixed value.
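The vectorization relies on the standard Kronecker identity \(\mathtt{vec}(A\varSigma B) = (B' \otimes A)\,\mathtt{vec}(\varSigma)\); with that, the recursion and direct formula plausibly read (notation reconstructed from the surrounding text, not verbatim):

```latex
\mathtt{vec}(\varSigma_{t+1})
  = (I - \alpha K_{\alpha})\,\mathtt{vec}(\varSigma_t)
    + \alpha^2\,\mathtt{vec}\!\left(\mathrm{Var}\!\left[R_t C_t - C_t C_t'\,\theta_t\right]\right)

\mathtt{vec}(\varSigma_{t+1})
  = \alpha^2 \sum_{s=1}^{t} (I - \alpha K_{\alpha})^{t-s}\,
    \mathtt{vec}\!\left(\mathrm{Var}\!\left[R_s C_s - C_s C_s'\,\theta_s\right]\right)
```

where the second line follows from the first since \(\varSigma_1 = 0\).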
To arrive at the final expression provided in the main text, we need the following three results:
(1) If \(K_{\alpha }^{-1}\) exists, then \(\alpha ^2 \sum _{s=1}^{t} (I-\alpha K_{\alpha })^{t-s} = \alpha K_{\alpha }^{-1}\left( I- (I-\alpha K_{\alpha })^t \right) .\)

(2) If \(\max _{i} |1-\alpha \rho _{C,i}| < 1\), then \(\text {Var}\left[ R_t C_t-C_t C_t' \theta _t\right] = \text {Var}\left[ R_t C_t-C_t C_t' v^*\right] + \mathcal {O}\left( \max _{i} |1-\alpha \rho _{C,i}|^t\right) \).

(3) \(I - \alpha K_{\alpha } = \mathcal {O}\left( 1-\alpha \rho _{K,\alpha }\right) \) where \(\rho _{K,\alpha }\) is the smallest eigenvalue of \(K_{\alpha }\).
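Result 1 is a matrix geometric-series identity and can be sanity-checked numerically; the matrix K below is an arbitrary symmetric positive-definite stand-in for \(K_{\alpha }\):

```python
import numpy as np

rng = np.random.default_rng(1)

# Check: a^2 * sum_{s=1}^{t} (I - a K)^{t-s} = a K^{-1} (I - (I - a K)^t)
# for an invertible K, as in Result 1.
n, t, a = 4, 25, 0.1
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)      # symmetric positive definite -> invertible
I = np.eye(n)
M = I - a * K

lhs = a**2 * sum(np.linalg.matrix_power(M, t - s) for s in range(1, t + 1))
rhs = a * np.linalg.inv(K) @ (I - np.linalg.matrix_power(M, t))
# lhs and rhs agree up to floating-point error
```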
Result 1 follows from the existence of \(K_{\alpha }^{-1}\), since then
and
Remark 4
Assuming \(K_{\alpha }\) is invertible is reasonable when exploring convergence of the covariance matrix \(\varSigma _t\): to guarantee convergence, we later impose a stronger assumption that all eigenvalues of \(K_{\alpha }\) are strictly positive, and strictly positive eigenvalues ensure that \(K_{\alpha }\) is invertible. Thus \(K_{\alpha }\) is invertible in the situation we care about: when \(\varSigma _t\) converges. Of course, the covariance matrix does not always converge, as we saw in the haystack problem described in the main text, and in these situations \(K_{\alpha }\) may or may not be invertible.
Result 2 uses \(\mathbb {E}[R_t C_t-C_t C_t'v^*]=0\) from the definition of \(v^*\) and \((\theta _t-v^*)=\mathcal {O}\left( \max _i \begin{vmatrix} 1 - \alpha \rho _{C,i}\end{vmatrix}^t\right) \) from our analysis of the mean of \(V_t\) to get that
If \(\max _{i}\begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix} < 1\) (an assumption we later need to ensure convergence of the covariance matrix), then we can drop \(\max _{i} \begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix}^{2t}\) from the last term since then \(\begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix}^2< \begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix} < 1\) for all i. Hence, we arrive at the desired result:
Result 3 follows from noting that symmetric matrix \(I - \alpha K_{\alpha }\) shares eigenvectors with symmetric matrix \(K_{\alpha }\) and is also positive semi-definite, since
Thus, eigenvalues of \(I - \alpha K_{\alpha }\) are nonnegative and are all of the form \(1-\alpha \rho \) for each eigenvalue \(\rho \) of \(K_{\alpha }.\) Consequently,
where \(\rho _{K,\alpha }\) is the smallest eigenvalue of \(K_{\alpha }\).
Putting these last three results together when \(K_{\alpha }^{-1}\) exists and \(\max _i\begin{vmatrix}1-\alpha \rho _{C,i}\end{vmatrix} < 1\) leads to
In conclusion, if \(\rho _{K,\alpha } > 0\) (which implies \(K_{\alpha }^{-1}\) exists) and \(\max _i\begin{vmatrix}1-\alpha \rho _{C,i}\end{vmatrix} < 1\), then the covariance matrix \(\varSigma _t\) converges to a long-term covariance matrix that scales linearly with \(\alpha \).
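Combining Results 1–3, the limit being described plausibly takes the form:

```latex
\mathtt{vec}(\varSigma_t)
  \;\longrightarrow\;
  \alpha\, K_{\alpha}^{-1}\,
  \mathtt{vec}\!\left( \mathrm{Var}\!\left[ R_t C_t - C_t C_t'\, v^* \right] \right)
  \qquad (t \to \infty)
```

which is \(\mathcal{O}(\alpha)\), consistent with the stated linear scaling in \(\alpha\).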
B Adding a Decay Term
Associative models of learning often incorporate a decay term into model updates that shrinks associative strengths to zero. While usually only absent cues have their associative strengths decay (cf. Niv et al. 2015), to simplify subsequent analysis, we consider an R–W model modified to decay all associative strengths to zero using the following update:
for some decay parameter \(\eta \) with \(1>\alpha \eta \ge 0\). Under Assumptions 1– 2, this version of the R–W model is stochastic gradient descent applied to the following objective function:
which tries to find associative strengths that minimize a combination of square prediction error and its square norm. Thus, this R–W model with decay addresses the L2 regularized version of the original prediction problem addressed by the R–W model. A solution \(v^*_{\eta }\) to this regularized objective function satisfies:
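Assuming the decay update \(V_{t+1} = V_t + \alpha\left(C_t(R_t - C_t'V_t) - \eta V_t\right)\), the objective and its minimizer are plausibly (exact constants may differ from the published displays):

```latex
\ell_{\eta}(v)
  = \tfrac{1}{2}\,\mathbb{E}\!\left[ (R_t - C_t' v)^2 \right]
    + \tfrac{\eta}{2}\,\| v \|^2,
\qquad
v^*_{\eta}
  = \left( \mathbb{E}[C_t C_t'] + \eta I \right)^{-1} \mathbb{E}[R_t C_t]
```

One can check that the negative stochastic gradient of \(\ell_{\eta}\) at a single trial is \(C_t(R_t - C_t'V_t) - \eta V_t\), matching the update above.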
Following arguments similar to those for the R–W model, we can recover an expression for the expectation \(\theta _t\) and covariance matrix \(\varSigma _t\) of associative strength on trial t for this R–W model with decay. Taking expectation of the update rule gives:
and thus:
In other words, \(\theta _{t+1}\) approaches \(v^*_{\eta }\) at a rate that depends on
provided \(\max _{i} \begin{vmatrix} 1-\alpha (\eta + \rho _{C,i}) \end{vmatrix} < 1\). As before, \(\rho _{C,i}\) denotes an eigenvalue of \(\mathbb {E}[C_t C_t']\). For sufficiently small \(\alpha \) and \(\eta \), the decay term can speed up convergence in the mean.
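The speed-up claim can be illustrated with the rate factors themselves; the eigenvalues below are hypothetical:

```python
import numpy as np

# The mean converges at rate max_i |1 - alpha*(eta + rho_i)| with decay,
# versus max_i |1 - alpha*rho_i| without, where rho_i are eigenvalues of
# E[C_t C_t']. For small alpha and eta, decay shrinks the dominant rate.
rho = np.array([0.2, 0.5, 1.0])      # hypothetical eigenvalues
alpha, eta = 0.1, 0.3

rate_without = np.max(np.abs(1 - alpha * rho))
rate_with = np.max(np.abs(1 - alpha * (eta + rho)))
# rate_with < rate_without < 1, so the mean (now targeting v*_eta rather
# than v*) converges faster when the decay term is present.
```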
For the covariance matrix \(\varSigma _{t+1}\), we have
which expands to
Letting
with minimum eigenvalue \(\rho _{K,\alpha ,\eta }\), then vectorization yields
Using similar arguments as before, we have that when \(K_{\alpha ,\eta }^{-1}\) exists:
Note that when \(\eta =0\), we have \(v^*_0 = v^*\), \(K_{\alpha ,0}=K_{\alpha }\), and \(\rho _{K,\alpha ,0}=\rho _{K,\alpha }\); hence, we recover our expressions for an R–W model without decay.
Based on these derived expressions, we reproduced figures from the main text for the R–W model with decay. Comparing Fig. 11 with Fig. 3 from the main text, the R–W model with decay leads to faster convergence of the mean and variance of associative strengths than the R–W model without decay for sufficiently small \(\alpha \) and N, but does not expand the region of \(\alpha ,N\) values with convergent means and variances. Decay has little effect on long-term variance.
We can extend the R–W model with decay to the sparse attention setting:
Notice that this version of association decay applies to all associations equally, and thus represents a generalized decay, or forgetfulness, attributable to some general biological or cognitive process that acts on all associations. Since unattended associations are also updated, computations are not sparse. The net effect of this model is that consecutively unattended associations decay, while attended ones may sustain their associations. As before, the derivations for the mean and variance hold with \(C_t\) replaced by \(X_t C_t\). Comparing Fig. 12 with Fig. 4 from the main text, the Sparse R–W model with decay leads to faster convergence of the mean and variance of associative strengths than the Sparse R–W model without decay, but leads to worse long-term variance.
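A minimal sketch of this update, assuming \(X_t\) acts as a 0/1 attention mask applied elementwise; function and variable names are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(2)

def sparse_rw_decay_update(V, X, C, R, alpha=0.1, eta=0.05):
    """Sparse R-W step with a general decay term: attended cues (X = 1)
    are updated by the prediction error, while every association,
    attended or not, is shrunk toward zero by eta."""
    attended = X * C                   # masked cue vector X_t C_t
    delta = R - attended @ V           # prediction error from attended cues
    return V + alpha * (delta * attended - eta * V)

# Hypothetical trial: attend to k = 2 of N = 5 cues.
N, k = 5, 2
V = np.zeros(N)
C = rng.integers(0, 2, size=N).astype(float)
X = np.zeros(N)
X[rng.choice(N, size=k, replace=False)] = 1.0
V = sparse_rw_decay_update(V, X, C, R=1.0)
```

Because the decay term multiplies the full vector V, every component can change on every trial, which is the non-sparse computation noted above.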
We can also add a general decay term to the SAR-WI framework:
where as before
Figure 13 adds the SAR-WI model with decay to the comparison of models on four tasks. On these tasks and for these sets of parameters, the SAR-WI model with decay performs worse than the SAR-WI model without decay.
C Selecting Cues Without a Fixed Attention Bandwidth
When satisfying the fixed attention bandwidth constraint is less important, or when it is natural to consider the number of attended cues changing in response to their associations, another approach is to sample cues independently from each other:
for some \(\epsilon >0\) and some normalization constant D. If rewards are Bernoulli, then one could use \(D = \bar{R}_t(1-\bar{R}_t)\), since independence of the Bernoulli cues between trials ensures
Cite this article
Nishimura, J., Cochran, A.L. Rescorla–Wagner Models with Sparse Dynamic Attention. Bull Math Biol 82, 69 (2020). https://doi.org/10.1007/s11538-020-00743-w