Abstract
The Rescorla–Wagner (R–W) model describes human associative learning by proposing that an agent updates associations between stimuli, such as events in their environment or predictive cues, proportionally to a prediction error. While this model has proven informative in experiments, it has been posited that humans selectively attend to certain cues to overcome a problem with the R–W model scaling to large cue dimensions. We formally characterize this scaling problem and provide a solution that involves limiting attention in an R–W model to a sparse set of cues. Given the universal difficulty of selecting features for prediction, sparse attention faces challenges beyond those faced by the R–W model. We demonstrate several ways in which a naive attention model can fail, explain those failures, and leverage that understanding to produce a Sparse Attention R–W with Inference framework (SAR-WI). The SAR-WI framework not only satisfies a constraint on the number of attended cues but also performs as well as the R–W model on a number of natural learning tasks, can correctly infer associative strengths, and focuses attention on predictive cues while ignoring uninformative cues. Given the simplicity of the proposed alterations, we hope this work informs future development and empirical validation of associative learning models that seek to incorporate sparse attention.
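As a concrete anchor for the model discussed throughout, here is a minimal sketch of the prediction-error update the R–W model proposes. The task, cue count, and parameter values are illustrative assumptions of ours, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def rw_update(V, C, R, alpha=0.05):
    """One Rescorla-Wagner step: nudge associative strengths V
    in proportion to the prediction error R - C'V."""
    delta = R - C @ V          # prediction error on this trial
    return V + alpha * delta * C

# Hypothetical task (not from the paper): 5 binary cues, and only
# cue 0 actually drives the reward.
N = 5
V = np.zeros(N)
for _ in range(5000):
    C = rng.integers(0, 2, size=N).astype(float)
    R = C[0]
    V = rw_update(V, C, R)

# V[0] ends up near 1 while the other strengths hover near 0.
```

The point of the sketch is only the shape of the update: associations move toward whatever weights make the prediction error small, which is the mechanism the abstract refers to.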
Notes
If \(R = (\sum C_i) \mod 2\), then no single cue contains any usable information, though collectively the cues contain complete information.
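The parity construction in this note can be checked directly by enumeration; the cue count N = 4 below is an arbitrary choice:

```python
import itertools

# With R = (sum C_i) mod 2 over independent fair binary cues, knowing any
# single cue leaves R uniform, yet all cues together determine R exactly.
N = 4
outcomes_given_c0 = {0: [], 1: []}
for C in itertools.product([0, 1], repeat=N):
    R = sum(C) % 2
    outcomes_given_c0[C[0]].append(R)

mean_r_c0_off = sum(outcomes_given_c0[0]) / len(outcomes_given_c0[0])
mean_r_c0_on = sum(outcomes_given_c0[1]) / len(outcomes_given_c0[1])
# Both conditional means are 0.5: cue 0 alone carries no information
# about R, even though (C_1, ..., C_N) jointly pin down R.
```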
References
Alexander WH (2007) Shifting attention using a temporal difference prediction error and high-dimensional input. Adapt Behav 15(2):121–133
Bellman R (1966) Dynamic programming. Science 153(3731):34–37
Blair MR, Watson MR, Walshe RC, Maj F (2009) Extremely selective attention: eye-tracking studies of the dynamic allocation of attention to stimulus features in categorization. J Exp Psychol Learn Memory Cognit 35(5):1196
Cochran AL, Cisler JM (2019) A flexible and generalizable model of online latent-state learning. PLoS Comput Biol 15(9):e1007331
Denton SE, Kruschke JK (2006) Attention and salience in associative blocking. Learn Behav 34(3):285–304
Esber GR, Haselgrove M (2011) Reconciling the influence of predictiveness and uncertainty on stimulus salience: a model of attention in associative learning. Proc R Soc B Biol Sci 278(1718):2553–2561
Fan J, Lv J (2010) A selective overview of variable selection in high dimensional feature space. Stat Sin 20(1):101
Frank MJ, Badre D (2011) Mechanisms of hierarchical reinforcement learning in corticostriatal circuits 1: computational analysis. Cereb Cortex 22(3):509–526
Frey PW, Sears RJ (1978) Model of conditioning incorporating the Rescorla–Wagner associative axiom, a dynamic attention process, and a catastrophe rule. Psychol Rev 85(4):321
Gluck MA, Bower GH (1988) From conditioning to category learning: an adaptive network model. J Exp Psychol Gen 117(3):227
Gordon GJ (2001) Reinforcement learning with function approximation converges to a region. In: Advances in neural information processing systems, pp 1040–1046
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3(Mar):1157–1182
Hagerup T, Mehlhorn K, Munro J (1993) Optimal algorithms for generating discrete random variables with changing distributions. Lect Notes Comput Sci 700:253–264
Harris JA (2006) Elemental representations of stimuli in associative learning. Psychol Rev 113(3):584
Hauser TU, Iannaccone R, Walitza S, Brandeis D, Brem S (2015) Cognitive flexibility in adolescence: neural and behavioral mechanisms of reward prediction error processing in adaptive decision making during development. Neuroimage 104:347–354
Hitchcock P, Niv Y, Radulescu A, Rothstein NJ, Sims CR (2019) Measuring trial-wise choice difficulty in multi-feature reinforcement learning. PsyArXiv. https://doi.org/10.31234/osf.io/ma3cf
Kim S, Rehder B (2011) How prior knowledge affects selective attention during category learning: an eyetracking study. Memory Cognit 39(4):649–665
Koenig S, Kadel H, Uengoer M, Schubö A, Lachnit H (2017) Reward draws the eye, uncertainty holds the eye: associative learning modulates distractor interference in visual search. Front Behav Neurosci 11:128
Kokkola NH, Mondragón E, Alonso E (2019) A double error dynamic asymptote model of associative learning. Psychol Rev 126(4):506
Kruschke JK (1992) ALCOVE: an exemplar-based connectionist model of category learning. Psychol Rev 99(1):22
Lawrence DH (1949) Acquired distinctiveness of cues: I. Transfer between discriminations on the basis of familiarity with the stimulus. J Exp Psychol 39(6):770
Lawrence DH (1950) Acquired distinctiveness of cues: II. Selective association in a constant stimulus situation. J Exp Psychol 40(2):175
Le Pelley ME (2004) The role of associative history in models of associative learning: a selective review and a hybrid model. Q J Exp Psychol Sect B 57(3b):193–243
Le Pelley M, Beesley T, Griffiths O (2011) Overt attention and predictiveness in human contingency learning. J Exp Psychol Anim Behav Process 37(2):220
Leong YC, Radulescu A, Daniel R, DeWoskin V, Niv Y (2017) Dynamic interaction between reinforcement learning and attention in multidimensional environments. Neuron 93(2):451–463
Li J, Schiller D, Schoenbaum G, Phelps EA, Daw ND (2011) Differential roles of human striatum and amygdala in associative learning. Nat Neurosci 14(10):1250
Lovejoy E (1968) Attention in discrimination learning: a point of view and a theory. Holden-Day, San Francisco
Mackintosh NJ (1975) A theory of attention: variations in the associability of stimuli with reinforcement. Psychol Rev 82(4):276
McLaren I, Mackintosh N (2000) An elemental model of associative learning: I. Latent inhibition and perceptual learning. Anim Learn Behav 28(3):211–246
Meier KM, Blair MR (2013) Waiting and weighting: information sampling is a balance between efficiency and error-reduction. Cognition 126(2):319–325
Niv Y, Daniel R, Geana A, Gershman SJ, Leong YC, Radulescu A, Wilson RC (2015) Reinforcement learning in multidimensional environments relies on attention mechanisms. J Neurosci 35(21):8145–8157
Nosofsky RM, Palmeri TJ, McKinley SC (1994) Rule-plus-exception model of classification learning. Psychol Rev 101(1):53
Pearce JM, Hall G (1980) A model for Pavlovian learning: variations in the effectiveness of conditioned but not of unconditioned stimuli. Psychol Rev 87(6):532
Rehder B, Hoffman AB (2005) Eyetracking and selective attention in category learning. Cognit Psychol 51(1):1–41
Rescorla RA, Wagner AR (1972) A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement. Class Cond II Curr Res Theory 2:64–99
Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat 22:400–407
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning internal representations by error propagation. In: Rumelhart DE, McClelland JL (eds) Parallel distributed processing: explorations in the microstructure of cognition, vol 1. MIT Press, Cambridge, MA
Schmajuk NA, Lam YW, Gray J (1996) Latent inhibition: a neural network approach. J Exp Psychol Anim Behav Process 22(3):321
Sutherland NS, Mackintosh NJ (2016) Mechanisms of animal discrimination learning. Academic Press, New York
Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. MIT press, Cambridge
Trabasso T, Bower GH (1975) Attention in learning: theory and research. Krieger Pub Co, Malabar
Wang J, Zhao P, Hoi SC, Jin R (2014) Online feature selection and its applications. IEEE Trans Knowl Data Eng 26(3):698–710
Wills AJ, Lavric A, Croft G, Hodgson TL (2007) Predictive learning, prediction errors, and attention: evidence from event-related potentials and eye tracking. J Cognit Neurosci 19(5):843–854
Young ME, Wasserman EA (2002) Limited attention and cue order consistency affect predictive learning: a test of similarity measures. J Exp Psychol Learn Memory Cognit 28(3):484
Yu K, Wu X, Ding W, Pei J (2016) Scalable and accurate online feature selection for big data. ACM Trans Knowl Discov Data (TKDD) 11(2):16
Zeaman D, House BJ (1963) The role of attention in retardate discrimination learning. In: Ellis NR (ed) Handbook of mental deficiency, vol 1(3). McGraw-Hill, New York, pp 159–223
Zhou P, Hu X, Li P, Wu X (2017) Online feature selection for high-dimensional class-imbalanced data. Knowl Based Syst 136:187–199
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
A Limits
Here, we provide details on the expression for the expectation and covariance matrix of associative strength on trial t: \(\theta _t\) and \(\varSigma _t\). Throughout, we assume only that observations \((R_t,C_t)\) are mutually independent between trials (Assumption 2). Taking expectation of the R–W update in (1) gives a recursive expression for \(\theta _t\):
where Eq. 2 allows for the substitution introducing \(v^*\). Applying this recursive formula repeatedly over trials t, a direct formula can be recovered:
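For readability, here is a plausible reconstruction of the two omitted displays, assuming the standard R–W update \(V_{t+1} = V_t + \alpha C_t (R_t - C_t' V_t)\) and the defining property \(\mathbb{E}[C_t C_t']\,v^* = \mathbb{E}[R_t C_t]\) (a sketch, not the published equations verbatim):

```latex
% Recursion for the mean, after substituting v^*:
\theta_{t+1}
  = \theta_t + \alpha\left(\mathbb{E}[R_t C_t] - \mathbb{E}[C_t C_t']\,\theta_t\right)
  = \left(I - \alpha\,\mathbb{E}[C_t C_t']\right)\left(\theta_t - v^*\right) + v^*

% Unrolling from trial 1 gives the direct formula:
\theta_{t+1} - v^* = \left(I - \alpha\,\mathbb{E}[C_t C_t']\right)^{t}\left(\theta_1 - v^*\right)
```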
Thus, how quickly expected associative strength \(\theta _{t+1}\) goes to \(v^*\) depends on \(I-\alpha \mathbb {E}\left[ C_tC_t' \right] \). Repeated application of the sub-multiplicative property yields the following bound:
where we are using the 2-norm for vectors and matrices. Furthermore, the 2-norm of a symmetric matrix such as \(I-\alpha \mathbb {E}\left[ C_tC_t' \right] \) is the largest absolute value among its eigenvalues. Additionally, all eigenvalues \(\rho _{C,1},\ldots ,\rho _{C,N}\) of \(\mathbb {E}\left[ C_tC_t' \right] \) are real and nonnegative, since \(\mathbb {E}\left[ C_tC_t' \right] \) is positive semidefinite, and the eigenvalues of \(I-\alpha \mathbb {E}\left[ C_tC_t' \right] \) are \(1-\alpha \rho _{C,1},\ldots ,1-\alpha \rho _{C,N}\), since \(I-\alpha \mathbb {E}\left[ C_tC_t' \right] \) shares eigenvectors with \(\mathbb {E}\left[ C_tC_t' \right] \). Thus,
The condition \(\max _{i} \begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix} < 1\) ensures mean associative strength \(\theta _t\) converges exponentially to \(v^*\). Consequently,
as noted in the main text.
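Written out, the bound and limit being invoked are plausibly:

```latex
\left\| \theta_{t+1} - v^* \right\|
  \le \left\| I - \alpha\,\mathbb{E}[C_t C_t'] \right\|^{t}\,\left\| \theta_1 - v^* \right\|
  = \Big( \max_i \left| 1 - \alpha \rho_{C,i} \right| \Big)^{t}\,\left\| \theta_1 - v^* \right\|
  \;\longrightarrow\; 0
```

with convergence exactly when \(\max_i |1-\alpha\rho_{C,i}| < 1\).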
For the covariance matrix of \(V_{t+1}\), denoted by \(\varSigma _{t+1}\), we first note that
Upon expanding the expression above using properties of variance, we have that
We can use the fact that \(V_t\) is independent from \(R_t\) and \(C_t\) to show that several terms above are zero, since
Setting these terms to zero yields
We can use \(\varSigma _t\) to rewrite some remaining terms in the above expression:
Plugging these expressions in the prior expression leads to
Because \(\varSigma _t\) is involved in left and right matrix multiplication, we can use vectorization to simplify this expression. Namely,
where \(\otimes \) is the Kronecker product, \(\mathtt {vec}\) reshapes a matrix into a vector, and
Thus, vectorizing the equation for \(\varSigma _{t+1}\) gives
This recursive formula can be used to recover a direct formula:
where \(\varSigma _1=0\) since \(V_1\) is set to a fixed value.
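The vectorization relies on the standard Kronecker identity \(\mathtt{vec}(A\varSigma B) = (B' \otimes A)\,\mathtt{vec}(\varSigma)\); with that, the recursion and direct formula plausibly read (notation reconstructed from the surrounding text, not verbatim):

```latex
\mathtt{vec}(\varSigma_{t+1})
  = (I - \alpha K_{\alpha})\,\mathtt{vec}(\varSigma_t)
    + \alpha^2\,\mathtt{vec}\!\left(\mathrm{Var}\!\left[R_t C_t - C_t C_t'\,\theta_t\right]\right)

\mathtt{vec}(\varSigma_{t+1})
  = \alpha^2 \sum_{s=1}^{t} (I - \alpha K_{\alpha})^{t-s}\,
    \mathtt{vec}\!\left(\mathrm{Var}\!\left[R_s C_s - C_s C_s'\,\theta_s\right]\right)
```

where the second line follows from the first since \(\varSigma_1 = 0\).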
To arrive at the final expression provided in the main text, we need the following three results:
(1) If \(K_{\alpha }^{-1}\) exists, then \(\alpha ^2 \sum _{s=1}^{t} (I-\alpha K_{\alpha })^{t-s} = \alpha K_{\alpha }^{-1}\left( I- (I-\alpha K_{\alpha })^t \right) .\)

(2) If \(\max _{i} |1-\alpha \rho _{C,i}| < 1\), then \(\text {Var}\left[ R_t C_t-C_t C_t' \theta _t\right] = \text {Var}\left[ R_t C_t-C_t C_t' v^*\right] + \mathcal {O}\left( \max _{i} |1-\alpha \rho _{C,i}|^t\right) \).

(3) \(I - \alpha K_{\alpha } = \mathcal {O}\left( 1-\alpha \rho _{K,\alpha }\right) \) where \(\rho _{K,\alpha }\) is the smallest eigenvalue of \(K_{\alpha }\).
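Result 1 is a matrix geometric-series identity and can be sanity-checked numerically; the matrix K below is an arbitrary symmetric positive-definite stand-in for \(K_{\alpha }\):

```python
import numpy as np

rng = np.random.default_rng(1)

# Check: a^2 * sum_{s=1}^{t} (I - a K)^{t-s} = a K^{-1} (I - (I - a K)^t)
# for an invertible K, as in Result 1.
n, t, a = 4, 25, 0.1
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)      # symmetric positive definite -> invertible
I = np.eye(n)
M = I - a * K

lhs = a**2 * sum(np.linalg.matrix_power(M, t - s) for s in range(1, t + 1))
rhs = a * np.linalg.inv(K) @ (I - np.linalg.matrix_power(M, t))
# lhs and rhs agree up to floating-point error
```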
Result 1 follows from the existence of \(K_{\alpha }^{-1}\), since then
and
Remark 4
Assuming \(K_{\alpha }\) is invertible is reasonable when exploring convergence of the covariance matrix \(\varSigma _t\): to guarantee convergence, we later impose a stronger assumption that all eigenvalues of \(K_{\alpha }\) are strictly positive, and strictly positive eigenvalues ensure that \(K_{\alpha }\) is invertible. Thus \(K_{\alpha }\) is invertible in the situation we care about: when \(\varSigma _t\) converges. Of course, the covariance matrix does not always converge, as we saw in the haystack problem described in the main text, and in these situations \(K_{\alpha }\) may or may not be invertible.
Result 2 uses \(\mathbb {E}[R_t C_t-C_t C_t'v^*]=0\) from the definition of \(v^*\) and \((\theta _t-v^*)=\mathcal {O}\left( \max _i \begin{vmatrix} 1 - \alpha \rho _{C,i}\end{vmatrix}^t\right) \) from our analysis of the mean of \(V_t\) to get that
If \(\max _{i}\begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix} < 1\) (an assumption we later need to ensure convergence of the covariance matrix), then we can drop \(\max _{i} \begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix}^{2t}\) from the last term since then \(\begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix}^2< \begin{vmatrix} 1-\alpha \rho _{C,i} \end{vmatrix} < 1\) for all i. Hence, we arrive at the desired result:
Result 3 follows from noting that symmetric matrix \(I - \alpha K_{\alpha }\) shares eigenvectors with symmetric matrix \(K_{\alpha }\) and is also positive semi-definite, since
Thus, eigenvalues of \(I - \alpha K_{\alpha }\) are nonnegative and are all of the form \(1-\alpha \rho \) for each eigenvalue \(\rho \) of \(K_{\alpha }.\) Consequently,
where \(\rho _{K,\alpha }\) is the smallest eigenvalue of \(K_{\alpha }\).
Putting these last three results together when \(K_{\alpha }^{-1}\) exists and \(\max _i\begin{vmatrix}1-\alpha \rho _{C,i}\end{vmatrix} < 1\) leads to
In conclusion, if \(\rho _{K,\alpha } > 0\) (which implies \(K_{\alpha }^{-1}\) exists) and \(\max _i\begin{vmatrix}1-\alpha \rho _{C,i}\end{vmatrix} < 1\), then the covariance matrix \(\varSigma _t\) converges to a long-term covariance matrix that scales linearly with \(\alpha \).
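Combining Results 1–3, the limit being described plausibly takes the form:

```latex
\mathtt{vec}(\varSigma_t)
  \;\longrightarrow\;
  \alpha\, K_{\alpha}^{-1}\,
  \mathtt{vec}\!\left( \mathrm{Var}\!\left[ R_t C_t - C_t C_t'\, v^* \right] \right)
  \qquad (t \to \infty)
```

which is \(\mathcal{O}(\alpha)\), consistent with the stated linear scaling in \(\alpha\).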
B Adding a Decay Term
Associative models of learning often incorporate a decay term into model updates that shrinks associative strengths to zero. While usually only absent cues have their associative strengths decay (cf. Niv et al. 2015), to simplify subsequent analysis, we consider an R–W model modified to decay all associative strengths to zero using the following update:
for some decay parameter \(\eta \) with \(1>\alpha \eta \ge 0\). Under Assumptions 1– 2, this version of the R–W model is stochastic gradient descent applied to the following objective function:
which tries to find associative strengths that minimize a combination of square prediction error and its square norm. Thus, this R–W model with decay addresses the L2 regularized version of the original prediction problem addressed by the R–W model. A solution \(v^*_{\eta }\) to this regularized objective function satisfies:
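Assuming the decay update \(V_{t+1} = V_t + \alpha\left(C_t(R_t - C_t'V_t) - \eta V_t\right)\), the objective and its minimizer are plausibly (exact constants may differ from the published displays):

```latex
\ell_{\eta}(v)
  = \tfrac{1}{2}\,\mathbb{E}\!\left[ (R_t - C_t' v)^2 \right]
    + \tfrac{\eta}{2}\,\| v \|^2,
\qquad
v^*_{\eta}
  = \left( \mathbb{E}[C_t C_t'] + \eta I \right)^{-1} \mathbb{E}[R_t C_t]
```

One can check that the negative stochastic gradient of \(\ell_{\eta}\) at a single trial is \(C_t(R_t - C_t'V_t) - \eta V_t\), matching the update above.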
Following arguments similar to those for the R–W model, we can recover an expression for the expectation \(\theta _t\) and covariance matrix \(\varSigma _t\) of associative strength on trial t for this R–W model with decay. Taking expectation of the update rule gives:
and thus:
In other words, \(\theta _{t+1}\) approaches \(v^*_{\eta }\) at a rate that depends on
provided \(\max _{i} \begin{vmatrix} 1-\alpha (\eta + \rho _{C,i}) \end{vmatrix} < 1\). As before, \(\rho _{C,i}\) denotes an eigenvalue of \(\mathbb {E}[C_t C_t']\). For sufficiently small \(\alpha \) and \(\eta \), the decay term can speed up convergence in the mean.
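The speed-up claim can be illustrated with the rate factors themselves; the eigenvalues below are hypothetical:

```python
import numpy as np

# The mean converges at rate max_i |1 - alpha*(eta + rho_i)| with decay,
# versus max_i |1 - alpha*rho_i| without, where rho_i are eigenvalues of
# E[C_t C_t']. For small alpha and eta, decay shrinks the dominant rate.
rho = np.array([0.2, 0.5, 1.0])      # hypothetical eigenvalues
alpha, eta = 0.1, 0.3

rate_without = np.max(np.abs(1 - alpha * rho))
rate_with = np.max(np.abs(1 - alpha * (eta + rho)))
# rate_with < rate_without < 1, so the mean (now targeting v*_eta rather
# than v*) converges faster when the decay term is present.
```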
For the covariance matrix \(\varSigma _{t+1}\), we have
which expands to
Letting
with minimum eigenvalue \(\rho _{K,\alpha ,\eta }\), then vectorization yields
Using similar arguments as before, we have that when \(K_{\alpha ,\eta }^{-1}\) exists:
Note that when \(\eta =0\), we have \(v^*_0 = v^*\), \(K_{\alpha ,0}=K_{\alpha }\), and \(\rho _{K,\alpha ,0}=\rho _{K,\alpha }\); hence, we recover our expressions for an R–W model without decay.
Based on these derived expressions, we reproduced figures from the main text for the R–W model with decay. Comparing Fig. 11 with Fig. 3 from the main text, the R–W model with decay leads to faster convergence of the mean and variance of associative strengths than the R–W model without decay for sufficiently small \(\alpha \) and N, but does not expand the region of \(\alpha ,N\) values with convergent means and variances. Decay has little effect on long-term variance.
We can extend the R–W model with decay to the sparse attention setting:
Notice that this version of association decay applies to all associations equally, and thus represents a generalized decay, or forgetfulness, attributable to some general biological or cognitive process that acts on all associations. Since unattended associations are also updated, computations are not sparse. The net effect of this model is that consecutively unattended associations decay, while attended ones may sustain their associations. As before, the derivations for the mean and variance hold with \(C_t\) replaced by \(X_t C_t\). Comparing Fig. 12 with Fig. 4 from the main text, the Sparse R–W model with decay leads to faster convergence of the mean and variance of associative strengths than the Sparse R–W model without decay, but leads to worse long-term variance.
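A minimal sketch of this update, assuming \(X_t\) acts as a 0/1 attention mask applied elementwise; function and variable names are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(2)

def sparse_rw_decay_update(V, X, C, R, alpha=0.1, eta=0.05):
    """Sparse R-W step with a general decay term: attended cues (X = 1)
    are updated by the prediction error, while every association,
    attended or not, is shrunk toward zero by eta."""
    attended = X * C                   # masked cue vector X_t C_t
    delta = R - attended @ V           # prediction error from attended cues
    return V + alpha * (delta * attended - eta * V)

# Hypothetical trial: attend to k = 2 of N = 5 cues.
N, k = 5, 2
V = np.zeros(N)
C = rng.integers(0, 2, size=N).astype(float)
X = np.zeros(N)
X[rng.choice(N, size=k, replace=False)] = 1.0
V = sparse_rw_decay_update(V, X, C, R=1.0)
```

Because the decay term multiplies the full vector V, every component can change on every trial, which is the non-sparse computation noted above.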
We can also add a general decay term to the SAR-WI framework:
where as before
Figure 13 adds the SAR-WI model with decay to the comparison of models on four tasks. On these tasks and for these sets of parameters, the SAR-WI model with decay performs worse than the SAR-WI model without decay.
C Selecting Cues Without a Fixed Attention Bandwidth
When satisfying the fixed attention bandwidth constraint is less important, or when it is natural to consider the number of attended cues changing in response to their associations, another approach is to sample cues independently from each other:
for some \(\epsilon >0\) and some normalization constant D. If rewards are Bernoulli, then one could use \(D = \bar{R}_t(1-\bar{R}_t)\), since independence of the Bernoulli cues between trials ensures
Cite this article
Nishimura, J., Cochran, A.L. Rescorla–Wagner Models with Sparse Dynamic Attention. Bull Math Biol 82, 69 (2020). https://doi.org/10.1007/s11538-020-00743-w