The reliability and stability of visual working memory capacity

Xu, Z.; Adam, K. C. S.; Fang, X.; Vogel, E. K.

doi:10.3758/s13428-017-0886-6


Published: 07 April 2017

Volume 50, pages 576–588, (2018)

Behavior Research Methods


Abstract

Because of the central role of working memory capacity in cognition, many studies have used short measures of working memory capacity to examine its relationship to other domains. Here, we measured the reliability and stability of visual working memory capacity, measured using a single-probe change detection task. In Experiment 1, the participants (N = 135) completed a large number of trials of a change detection task (540 in total, 180 each of set sizes 4, 6, and 8). With large numbers of both trials and participants, reliability estimates were high (α > .9). We then used an iterative down-sampling procedure to create a look-up table for expected reliability in experiments with small sample sizes. In Experiment 2, the participants (N = 79) completed 31 sessions of single-probe change detection. The first 30 sessions took place over 30 consecutive days, and the last session took place 30 days later. This unprecedented number of sessions allowed us to examine the effects of practice on stability and internal reliability. Even after much practice, individual differences were stable over time (average between-session r = .76).


Working memory (WM) capacity is a core cognitive ability that predicts performance across many domains. For example, WM capacity predicts attentional control, fluid intelligence, and real-world outcomes such as perceiving hazards while driving (Engle, Tuholski, Laughlin, & Conway, 1999; Fukuda, Vogel, Mayr, & Awh, 2010; Wood, Hartley, Furley, & Wilson, 2016). For these reasons, researchers are often interested in devising brief measures of WM capacity to investigate the relationship of WM capacity to other cognitive processes. However, truncated versions of WM capacity tasks could potentially be inadequate for reliably measuring an individual’s capacity. Inadequate measurement could obscure correlations between measures, or even differences in performance between experimental conditions. Furthermore, although WM capacity is considered to be a stable trait of the observer, little work has directly examined the role of extensive practice in the measurement of WM capacity over time. This is of particular concern because of the popularity of research examining whether training affects WM capacity (Melby-Lervåg & Hulme, 2013; Shipstead, Redick, & Engle, 2012). Extensive practice on any given cognitive task has the potential to significantly alter the nature of the variance that determines performance. For example, extensive practice has the potential to induce a restriction-of-range problem, in which the bulk of the observers reach similar performance levels—thus reducing any opportunity to observe correlations with other measures. Consequently, a systematic study of the reliability and stability of WM capacity measures is critical for improving the measurement and reproducibility of major phenomena in this field.

In the present study, we sought to establish the reliability and stability of one particular WM capacity measure: change detection. Change detection measures of visual WM have gained popularity as a means of assessing individual differences in capacity. In a typical change detection task, participants briefly view an array of simple visual items (for ~100–500 ms), such as colored squares, and remember these items across a short delay (~1–2 s). At test, observers are presented with an item at one of the remembered locations, and they indicate whether the presented test item is the same as the remembered item (“no-change” trial) or is different (“change” trial). Performance can be quantified as raw accuracy or converted into a capacity estimate (“K”). In capacity estimates, performance for change trials and no-change trials is calculated separately as hits (the proportion of correct change trials) and false alarms (the proportion of incorrect no-change trials) and converted into a set-size-dependent score (Cowan, 2001; Pashler, 1988; Rouder, Morey, Morey, & Cowan, 2011).

Several beneficial features of change detection tasks have led to their increased popularity. First, change detection memory tasks are simple and short enough to be used with developmental and clinical populations (e.g., Cowan, Fristoe, Elliott, Brunner, & Saults, 2006; Gold, Wilk, McMahon, Buchanan, & Luck, 2003; Lee et al., 2010). Second, the relatively short length of trials lends the task well to neural measures that require large numbers of trials. In particular, neural studies employing change detection tasks have provided strong corroborating evidence of capacity limits in WM (Todd & Marois, 2004; Vogel & Machizawa, 2004) and have yielded insights into potential mechanisms underlying individual differences in WM capacity (for a review, see Luria, Balaban, Awh, & Vogel, 2016). Finally, change detection tasks and closely related memory-guided saccade tasks can be used with animal models from pigeons (Gibson, Wasserman, & Luck, 2011) to nonhuman primates (Buschman, Siegel, Roy, & Miller, 2011), providing a rare opportunity to directly compare behavior and neural correlates of task performance across species (Elmore, Magnotti, Katz, & Wright, 2012; Reinhart et al., 2012).

A main aim of this study is to quantify the effect of measurement error and sample size on the reliability of change detection estimates. In previous studies, change detection estimates of capacity have yielded good reliability estimates (e.g., Pailian & Halberda, 2015; Unsworth, Fukuda, Awh, & Vogel, 2014). However, measurement error can vary dramatically with the number of trials in a task, thus impacting reliability; Pailian and Halberda found that reliability of change detection estimates greatly improved when the number of trials was increased. Researchers frequently employ vastly different numbers of trials and participants in studies of individual differences, but the effect of trial number on change detection reliability has never been fully characterized. In studies using large batteries of tasks, time and measurement error are forces working in opposition to one another. When researchers want to minimize the amount of time that a task takes, measures are often truncated to expedite administration. Such truncated measures increase measurement noise and potentially harm the reliability of the measure. At present, there is no clear understanding of the minimum number of either participants or trials that is necessary to obtain reliable estimates of change detection capacity.

In addition to measurement error within a session, the reliability of individual differences could be compromised by extensive practice. Previously, it was found that visual WM capacity estimates were stable (r = .77) after 1.5 years between testing sessions (Johnson et al., 2013). However, the effect of extensive practice on change detection estimates of capacity has yet to be characterized. Extensive practice could harm the reliability and stability of measures in two ways. First, it is possible that participants could improve so much that they reach performance ceiling, thus eliminating variability between individuals. Second, if individual differences are due to the utilization of optimal versus suboptimal strategies, then participants might converge to a common mean after engaging in extensive practice and finding optimal task strategies. Both of these hypothetical possibilities would call into question the true stability of WM capacity estimates, and likewise severely harm the statistical reliability of the measure. As such, in Experiment 2 we directly quantified the effect of extensive practice on the stability of WM capacity estimates.

Overview of experiments

We measured the reliability and stability of a single-probe change detection measure of visual WM capacity. In Experiment 1, we measured the reliability of capacity estimates obtained with a commonly used version of the color change detection task for a relatively large number of participants (n = 135) and a larger than typical number of trials (t = 540). In Experiment 2, we measured the stability of capacity estimates across an unprecedented number of testing sessions (31). Because of the large number of sessions, we could investigate the stability of change detection estimates after extended practice and over a period of 60 days.

Experiment 1

Materials and method

Participants

A total of 137 individuals (102 females, 35 males; mean age = 19.97, SD = 1.07) with normal or corrected-to-normal vision participated in the experiment. Participants provided written informed consent, and the study was approved by the Ethics Committee at Southwest University. Participants received monetary compensation for their participation. Two participants were excluded because they had negative average capacity values, resulting in a final sample of 135 participants.

Stimuli

The stimuli were presented on monitors with a refresh rate of 75 Hz and a screen resolution of 1,024 × 768. Participants sat approximately 60 cm from the screen, though a chinrest was not used so all visual angle estimates are approximate. In addition, there were some small variations in monitor size (five 16-in. CRT monitors, three 19-in. LCD monitors) in testing rooms, leading to small variations in the size of the colored squares from monitor to monitor. Details are provided about the approximate range in degrees of visual angle.

All stimuli were generated in MATLAB (The MathWorks, Natick, MA) using the Psychophysics Toolbox (Brainard, 1997). Colored squares (51 pixels; range of 1.55° to 2.0° visual angle) served as memoranda. Squares could appear anywhere within an area of the monitor subtending approximately 10.3° to 13.35° horizontally and 7.9° to 9.8° vertically. Squares could appear in any of nine distinct colors, and colors were sampled without replacement within each trial (RGB values: red = 255 0 0; green = 0 255 0; blue = 0 0 255; magenta = 255 0 255; yellow = 255 255 0; cyan = 0 255 255; orange = 255 128 0; white = 255 255 255; black = 0 0 0). Participants were instructed to fixate a small black dot (approximate range: .36° to .47° of visual angle) at the center of the display.

Procedures

Each trial began with a blank fixation period of 1,000 ms. Then, participants briefly viewed an array of four, six, or eight colored squares (150 ms), which they remembered across a blank delay period (1,000 ms). At test, one colored square was presented at one of the remembered locations. The probed square was equally likely to be the same color (no-change trial) or a different color (change trial). Participants made an unspeeded response by pressing the “z” key if the color was the same, or the “/” key if the color was different. Participants completed 180 trials at each of set sizes 4, 6, and 8 (540 trials total). Trials were divided into nine blocks, and participants were given a brief rest period (30 s) after each block. To calculate capacity, change detection accuracy was transformed into a K estimate using Cowan’s (2001) formula K = N × (H − FA), where N represents the set size, H is the hit rate (proportion of correct responses to change trials), and FA is the false alarm rate (proportion of incorrect responses to no-change trials). Cowan’s formula is best for single-probe displays like the one employed here. For change detection tasks using whole-display probes, Pashler’s (1988) formula may be more appropriate (Rouder et al., 2011).
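As an illustration of Cowan's formula, the following minimal sketch computes K from trial counts. The function name `cowan_k` and the counts in the example are hypothetical and are not taken from the study's data or analysis code.

```python
def cowan_k(set_size, hits, n_change, false_alarms, n_no_change):
    """Cowan's K for single-probe change detection: K = N * (H - FA)."""
    hit_rate = hits / n_change              # proportion of correct change trials
    fa_rate = false_alarms / n_no_change    # proportion of incorrect no-change trials
    return set_size * (hit_rate - fa_rate)


# Hypothetical counts for one participant at set size 6
# (90 change trials and 90 no-change trials are assumed for illustration).
k = cowan_k(set_size=6, hits=63, n_change=90, false_alarms=18, n_no_change=90)
print(round(k, 2))  # 6 * (0.7 - 0.2) = 3.0
```

Note that K becomes negative whenever the false alarm rate exceeds the hit rate, which is how a participant can end up with a negative average capacity value.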

Results

Descriptive statistics for each set size condition are shown in Table 1, and data for both Experiments 1 and 2 are available online at the website of the Open Science Framework, at https://osf.io/g7txf/. We observed a significant difference in performance across set sizes, F(2, 268) = 20.6, p < .001, ηp² = .133, and polynomial contrasts revealed a significant linear trend, F(1, 134) = 36.48, p < .001, ηp² = .214, indicating that the average performance declined slightly with increased memory load.

Table 1 Descriptive statistics for Experiment 1


Reliability of the full sample: Cronbach’s alpha

We computed Cronbach’s alpha (unstandardized) using K scores from the three set sizes as items (180 trials contributing to each item), and obtained a value of α = .91 (Cronbach, 1951). We also computed Cronbach’s alpha using K scores from the nine blocks of trials (60 trials contributing to each item) and obtained a nearly identical value of α = .92. Finally, we computed Cronbach’s alpha using raw accuracy for single trials (540 items), and obtained an identical value of α = .92. Thus, change detection estimates had high internal reliability for this large sample of participants, and the precise method used to divide trials into “items” does not impact Cronbach’s alpha estimates of reliability for the full sample. Furthermore, using raw accuracy versus bias-corrected K scores did not impact reliability.
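The unstandardized Cronbach's alpha used here can be sketched as follows. This is a generic implementation of the standard formula, not the authors' code, and the simulated scores (true capacity plus independent noise on three "items") are purely illustrative assumptions.

```python
import numpy as np


def cronbach_alpha(scores):
    """Unstandardized Cronbach's alpha for a participants x items array."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of the summed score
    return (n_items / (n_items - 1)) * (1 - item_vars.sum() / total_var)


# Simulated example (assumed values): 135 participants, 3 items (e.g., set sizes).
rng = np.random.default_rng(0)
true_k = rng.normal(2.3, 0.8, size=135)
items = true_k[:, None] + rng.normal(0, 0.4, size=(135, 3))
alpha = cronbach_alpha(items)
print(f"alpha = {alpha:.2f}")
```

The same function applies whether the items are set sizes, blocks, or single trials; only the shape of the input array changes.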

Reliability of the full sample: Split-half

The split-half correlation of the K scores for even and odd trials was reliable, r = .88, p < .001, 95% CI [.84, .91]. Correcting for attenuation yielded a split-half correlation value of r = .94 (Brown, 1910; Spearman, 1910). Likewise, the capacity scores from individual set sizes correlated with each other: set sizes 4 and 6, r = .84, p < .001, 95% CI [.78, .88]; set sizes 6 and 8, r = .79, p < .001, 95% CI [.72, .85]; set sizes 4 and 8, r = .76, p < .001, 95% CI [.68, .83]. Split-half correlations for individual set sizes yielded Spearman–Brown-corrected correlation values of r = .91 for set size 4, r = .86 for set size 6, and r = .76 for set size 8.
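The Spearman–Brown correction applied to the even/odd correlation can be sketched as below. The function names are hypothetical; the only value taken from the text is the uncorrected correlation of .88.

```python
import numpy as np


def spearman_brown(r_half):
    """Step a half-test correlation up to the reliability of the full-length test."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(even_scores, odd_scores):
    """Return (uncorrected r, Spearman-Brown-corrected r) for two half-test scores."""
    r_half = np.corrcoef(even_scores, odd_scores)[0, 1]
    return r_half, spearman_brown(r_half)


# The reported even/odd correlation of .88 steps up to about .94.
print(round(spearman_brown(0.88), 2))  # 0.94
```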

The drop in capacity from set size 4 to set size 8 has been used in the literature as a measure of filtering ability. However, the internal reliability of this difference score has typically been low (Pailian & Halberda, 2015; Unsworth et al., 2014). Likewise, we found here that the split-half reliability of the performance decline from set size 4 to set size 8 (“4–8 Drop”) was low, with a Spearman–Brown-corrected correlation value of r = .24. Although weak, this correlation is of the same strength that was reported in earlier work (Unsworth et al., 2014). The split-half reliability of the performance decline from set size 4 to set size 6 was slightly higher, r = .39, and the split-half reliability of the difference between set size 6 and set size 8 performance was very low, r = .08. The reliability of difference scores can be impacted both by (1) the internal reliability of each measure used to compute the difference and (2) the degree of correlation between the two measures (Rodebaugh et al., 2016). Although the internal reliability of each individual set size was high, the positive correlation between set sizes may have decreased the reliability of the set size difference scores.
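To see why a strong positive correlation between two measures depresses the reliability of their difference, consider the classical test theory expression for a difference score under the simplifying assumption that both measures have equal variances. The sketch below is an illustration only: the equal-variance assumption does not hold exactly for these data, and the function name is hypothetical.

```python
def difference_score_reliability(r_xx, r_yy, r_xy):
    """Reliability of D = X - Y under classical test theory, assuming equal variances.

    r_xx, r_yy: reliabilities of the two measures; r_xy: their intercorrelation.
    """
    return (0.5 * (r_xx + r_yy) - r_xy) / (1 - r_xy)


# Reported values: reliabilities of .91 (set size 4) and .76 (set size 8),
# and a correlation of .76 between the two set sizes.
print(round(difference_score_reliability(0.91, 0.76, 0.76), 2))  # 0.31
```

Under these assumptions the predicted reliability of the 4–8 drop is roughly .3, in the same low range as the observed value of .24, even though each set size is itself measured reliably.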

An iterative down-sampling approach

To investigate the effects of sample size and trial number on the reliability estimates, we used an iterative down-sampling procedure. Two reliability metrics were assessed: (1) Cronbach’s alpha using single-trial accuracy as items, and (2) split-half correlations using all trials. For the down-sampling procedure, we randomly sampled participants and trials from the full dataset. The number of participants (n) was varied from 5 to 135 in steps of 5. The number of trials (t) was varied from 5 to 540 in steps of 5. Number of participants and number of trials were factorially combined (2,916 cells total). For each cell in the design, we ran 100 sampling iterations. On each iteration, n participants and t trials were randomly sampled from the full dataset and reliability metrics were calculated for the sample.
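The procedure above can be sketched in a few lines. This is a simplified illustration rather than the authors' implementation: it uses simulated single-trial accuracy instead of the real data, computes split-half reliability on raw accuracy instead of K scores, and assumes that participants and trials are sampled without replacement. All function names are hypothetical.

```python
import numpy as np


def split_half_r(acc):
    """Spearman-Brown-corrected even/odd split-half correlation of mean accuracy."""
    even = acc[:, 0::2].mean(axis=1)
    odd = acc[:, 1::2].mean(axis=1)
    r = np.corrcoef(even, odd)[0, 1]
    return 2 * r / (1 + r)


def downsample_reliability(acc, n, t, n_iter=100, seed=0):
    """Reliability for n_iter random subsamples of n participants and t trials."""
    rng = np.random.default_rng(seed)
    n_total, t_total = acc.shape
    out = np.empty(n_iter)
    for i in range(n_iter):
        rows = rng.choice(n_total, size=n, replace=False)  # sample participants
        cols = rng.choice(t_total, size=t, replace=False)  # sample trials
        out[i] = split_half_r(acc[np.ix_(rows, cols)])
    return out


# Simulated accuracy matrix (assumed): 135 participants x 540 trials.
rng = np.random.default_rng(1)
p_correct = rng.uniform(0.55, 0.95, size=135)
acc = (rng.random((135, 540)) < p_correct[:, None]).astype(float)

rels = downsample_reliability(acc, n=30, t=120)
print(f"mean = {rels.mean():.2f}, worst = {rels.min():.2f}")
```

Looping this function over every combination of n and t would fill one cell of the design per call, with the mean and minimum across iterations corresponding to the average and worst-case reliabilities.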

Figure 1 shows the results of the down-sampling procedure for Cronbach’s alpha. Figure 2 shows the results of the down-sampling procedure for split-half reliability estimates. In each plot, we show both the average reliabilities obtained across the 100 iterations (Figs. 1a and 2a) and the worst reliabilities obtained across the 100 iterations (Figs. 1b and 2b). Conceptually, we could think of each iteration of the down-sampling procedure as akin to running one “experiment,” with participants randomly sampled from our “population” of 135. Although it is good to know the average expected reliability across many experiments, the typical experimenter will run an experiment only once. Thus, considering the “worst case scenario” is instructive for planning the number of participants and the number of trials to be collected. For a more complete picture of the breadth of the reliabilities obtained, we can also consider the variability in reliabilities across iterations (SD) and the range of reliability values (Fig. 2c and d). Finally, we repeated this iterative down-sampling approach for each individual set size. The average reliability as well as the variability of the reliabilities for individual set sizes are shown in Fig. 3. Note that each set size begins with 1/3 as many trials as in Figs. 1 and 2.

Next, we looked at some potential characteristics of samples with low reliability (e.g., iterations with particularly low vs. high reliability). We ran 500 sampling iterations of 30 participants and 120 trials, then we did a median split for high- versus low-reliability samples. No significant differences emerged in the mean (p = .86), skewness (p = .60), or kurtosis (p = .70) values of high- versus low-reliability samples. There were, however, significant effects of sample range and variability. As would be expected, samples with higher reliability had larger standard deviations, t(498) = 26.7, p < .001, 95% CI [.14, .17], and wider ranges, t(498) = 15.2, p < .001, 95% CI [.52, .67], than samples with low reliability.

A note for fixed capacity + attention estimates of capacity

So far, we have discussed only the most commonly used methods of estimating WM capacity (K scores and percentages correct). Other methods of estimating capacity have been used, and we now briefly mention one of them. Rouder and colleagues (2008) suggested adding an attentional lapse parameter to estimates of visual WM capacity, a model referred to as fixed capacity + attention. Adding an attentional lapse parameter accounts for trials in which participants are inattentive to the task at hand. Specifically, participants commonly make errors on trials that should be well within capacity limits (e.g., set size 1), and adding a lapse parameter can help explain these anomalous dips in performance. Unlike typical estimates of capacity, in which a K value is computed directly for performance for each set size and then averaged, this model uses a log-likelihood estimation technique that estimates a single capacity parameter by simultaneously considering performance across all set sizes and/or change probability conditions. Critically, this model assumes that data are obtained for at least one subcapacity set size, and that any error made on this set size reflects an attentional lapse. If the model is fit to data that lack at least one subcapacity set size (e.g., one or two items), then the model will fit poorly and provide nonsensical parameter estimates.

Recently, Van Snellenberg, Conway, Spicer, Read, and Smith (2014) used the fixed capacity + attention model to calculate capacity for a change detection task, and they found that the reliability of the model’s capacity parameter was low (r = .32) and did not correlate with other WM tasks. Critically, however, this study used only relatively high set sizes (4 and 8) and lacked a subcapacity set size, so model fits were likely poor. Using code made available by Rouder et al., we fit a fixed capacity + attention model to our data (Rouder, n.d.). We found that when this model is misapplied (i.e., used on data without at least one subcapacity set size), the internal reliability of the capacity parameter was low (uncorrected r = .35) and was negatively correlated with raw change detection accuracy, r = −.25, p = .004. If we had only applied this model to our data, we would have mistakenly concluded that change detection measures offer poor reliability and do not correlate with other measures of WM capacity.

Discussion

Here, we have shown that when sufficient numbers of trials and participants are collected, the reliability of change detection capacity is remarkably high (r > .9). On the other hand, a systematic down-sampling method revealed that insufficient trials or insufficient participant numbers could dramatically reduce the reliability obtained in a single experiment. If researchers hope to measure the correlation between visual WM capacity and some other measure, Figs. 1 and 2 can serve as an approximate guide to expected reliability. Because we had only a single sample of the largest n (135), we cannot make definitive claims about the reliabilities of future samples of this size. However, given the stabilization of correlation coefficients with large sample sizes and the extremely high correlation coefficient obtained, we can be relatively confident that the reliability estimate for our full sample (n = 135) would not change substantially in future samples of university students. Furthermore, we can make claims about how the reliability of small, well-defined subsamples of this “population” can systematically deviate from an empirical upper bound.

The average capacity obtained for this sample was slightly lower than some other values in the literature, typically cited as around three or four items. The slightly lower average for this sample could potentially cause some concern about the generalizability of these reliability values for future samples. For the present study’s sample, the average K scores for set sizes 4 and 8 were K = 2.3 and 2.0, respectively. The largest, most comparable sample to the present sample is a 495-participant sample in a work by Fukuda, Woodman, and Vogel (2015). The average K scores for set sizes 4 and 8 were K = 2.7 and 2.4, respectively, and the task design was nearly identical (150-ms encoding time, 1,000-ms retention interval, no color repetitions allowed, and set sizes 4 and 8). The difference of 0.3–0.4 items between these two samples is relatively small, though likely significant. However, for the purposes of estimating reliability, the variance of the distribution is more important than the mean. The variabilities observed in the present sample (SD = 0.7 for set size 4, SD = 0.97 for set size 8) were very similar to those observed in the Fukuda et al. sample (SD = 0.6 for set size 4 and SD = 1.2 for set size 8), though unfortunately the Fukuda et al. study did not report reliability. Because of the nearly identical variabilities of scores across these two samples, we can infer that our reliability results would indeed generalize to other large samples for which change detection scores have been obtained.

We recommend applying an iterative down-sampling approach to other measures when expediency of task administration is valued, but reliability is paramount. The stats-savvy reader may note that the Spearman–Brown prophecy formula also allows one to calculate how many observations must be added to improve the expected reliability, according to the formula

$$ N = \frac{\rho^{*}_{xx'}\left(1-\rho_{xx'}\right)}{\rho_{xx'}\left(1-\rho^{*}_{xx'}\right)} $$

where $ \rho^{*}_{xx'} $ is the desired correlation strength, $ \rho_{xx'} $ is the observed correlation, and N is the number of times that a test length must be multiplied to achieve the desired correlation strength. Critically, however, this formula does not account for the accuracy of the observed correlation. Thus, if one starts from an unreliable correlation coefficient obtained with a small number of participants and trials, one will obtain an unreliable estimate of the number of observations needed to improve the correlation strength. In experiments such as this one, both the number of trials and the number of participants will drastically change estimates of the number of participants needed to observe correlations of a desired strength.

Let’s take an example from our iterative down-sampling procedure. Imagine that we ran 100 experiments, each with 15 participants and 150 total trials of change detection. Doing so, we would obtain 100 different estimates of the strength of the true split-half correlation. We could then apply the Spearman-Brown formula to each of these 100 estimates in order to calculate the number of trials needed to obtain a desired reliability of r = .8. So doing, we would find that, on average, we would need around 140 trials to obtain the desired reliability. However, because of the large variability in the observed correlation strength (r = .37 to .97), if we had only run the “best case” experiment (r = .97), we would estimate that we need only 18 trials to obtain our desired reliability of r = .8 with 15 participants. On the other hand, if we had run the “worst case” experiment (r = .37), then we would estimate that we need 1,030 trials. There are downsides to both types of estimation errors. Although a pessimistic estimate of the number of trials needed (>1,000) would certainly ensure adequate reliability, this might come at the cost of time and participants’ frustration. Conversely, an overly optimistic estimate of the number of trials needed (<20) would lead to underpowered studies that would waste time and funds.
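The best-case and worst-case estimates in this example can be reproduced approximately with the prophecy formula. The sketch below is illustrative: the function name is hypothetical, and because it starts from the rounded correlations (.97 and .37), it gives roughly 19 and 1,020 trials instead of the 18 and 1,030 quoted above, which were presumably computed from unrounded correlations.

```python
def length_multiplier(r_observed, r_desired):
    """Spearman-Brown prophecy: factor by which test length must be multiplied."""
    return (r_desired * (1 - r_observed)) / (r_observed * (1 - r_desired))


base_trials = 150   # trials in each simulated "experiment"
target = 0.8        # desired reliability

for label, r_obs in [("best case", 0.97), ("worst case", 0.37)]:
    needed = base_trials * length_multiplier(r_obs, target)
    print(f"{label}: observed r = {r_obs}, trials needed = {needed:.0f}")
```

The two estimates differ by more than a factor of 50 even though both come from experiments with identical designs, which is the core reason to treat a prophecy estimate based on a single small sample with caution.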

Finally, we investigated an alternative parameterization of capacity based on a model that assumes a fixed capacity and an attention lapse parameter (Rouder et al., 2008). Critically, this model attempts to explain errors for set sizes that are well within capacity limits (e.g., one item). If researchers inappropriately apply this model to change detection data with only large set sizes, they would erroneously conclude that change detection tasks yield poor reliability and fail to correlate with other estimates of capacity (e.g., Van Snellenberg et al., 2014).

In Experiment 2, we shifted our focus to the stability of change detection estimates. That is, how consistent are estimates of capacity from day to day? We collected an unprecedented number of sessions of change detection performance (31) spanning 60 days. We examined the stability of capacity estimates, defined as the correlation between individuals’ capacity estimates from one day to the next. Since capacity is thought to be a stable trait of the individual, we predicted that individual differences in capacity should be reliable across many testing sessions.

Experiment 2