Eyetracking is increasingly being used to investigate attentional bias (Armstrong & Olatunji, 2012; Liossi, Schoth, Godwin, & Liversedge, 2014; Mogg, Bradley, Field, & De Houwer, 2003). Compared to traditional reaction-time-based methods of measuring attentional bias, such as the dot-probe task, eyetracking is proposed to provide a more direct, and therefore superior, measure of sensory processing (Armstrong & Olatunji, 2012; Toh, Rossell, & Castle, 2011). However, few reports have been published on the reliability of eyetracking, so it is unknown whether study results are valid: the reliability of a measure influences Type II error rates, effect sizes, and confidence intervals (Kopriva & Shaw, 1991; Loken & Gelman, 2017; Meyer, 2010).

Attentional bias

Attentional bias describes the preferential allocation of cognitive resources to the detection of salient stimuli (Crombez, Van Ryckeghem, Eccleston, & Van Damme, 2013). Attentional bias to threat stimuli has been implicated in the development and maintenance of clinical conditions such as addiction, anxiety, depression, and chronic pain (Sharpe, Haggman, Nicholas, Dear, & Refshauge, 2014; White, Suway, Pine, Bar-Haim, & Fox, 2011). Recently, attentional-bias modification training has been found to reduce symptoms of affective and pain disorders (Amir, Beard, Burns, & Bomyea, 2009; Amir, Weber, Beard, Bomyea, & Taylor, 2008; Sharpe et al., 2012).

Models of attentional bias, such as the “vigilance–avoidance” model (Mogg, Bradley, Miles, & Dixon, 2004) and the “threat interpretation” model (Todd et al., 2015), consider attentional bias to be dynamic; attentional bias may shift toward or away from a stimulus during the stimulus exposure. For example, the vigilance–avoidance model posits that individuals may attend to a threat stimulus during initial exposure (vigilance) but, after detection, avoid it (avoidance) (Mogg et al., 2004). These models incorporate a temporal component of processing, broadly categorized into overall, early, and late processing. To investigate such models, methods that can consistently distinguish between the temporal components of attentional processing are needed.

Eyetracking

Eyetracking continuously measures eye movements toward stimuli presented on either a computer screen or a mobile head-mounted video device. Prespecified spatial (e.g., displacement) and temporal (e.g., velocity and acceleration) eye movement parameters are used to derive “fixations” and “saccades.” Fixation-based measures can be categorized according to the component of attention they are proposed to measure: overall, early, or late. Overall attention combines early- and late-stage processing and reflects the viewing pattern across the total stimulus duration. For example, if stimuli are presented for 4,000 ms, the total dwell time on the salient stimulus is considered an indicator of overall attention. Early attention reflects the initial viewing pattern when stimuli are first presented and has been used to indicate initial vigilance, which may be important in threat detection (Armstrong & Olatunji, 2012). Examples include the location of the first fixation, first-fixation latency, and the duration of the first fixation on a salient stimulus. Late attention reflects the viewing pattern that follows the initial viewing pattern and is thought to reflect rumination or maintenance, which are important in theories of depression (Donaldson, Lam, & Mathews, 2007). Examples of late attention outcome measures are second- or last-run dwell times and the dwell time for the second half of the stimulus duration.
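To make these categories concrete, the sketch below derives one example measure for each attentional component from a single trial's fixation sequence. It is a minimal illustration: the (onset, duration, area-of-interest) tuple format is our assumption, not the output format of any particular eyetracker.

```python
# Sketch: deriving overall, early, and late attention measures from one trial.
# Each fixation is assumed to be (onset_ms, duration_ms, aoi), aoi in {"threat", "neutral"}.

def bias_measures(fixations, trial_ms=4000):
    measures = {}
    # Overall attention: total dwell time on the threat word across the whole trial
    measures["total_dwell_threat_ms"] = sum(
        dur for onset, dur, aoi in fixations if aoi == "threat")
    # Early attention: properties of the very first fixation
    if fixations:
        onset, dur, aoi = min(fixations, key=lambda f: f[0])
        measures["first_fixation_on_threat"] = (aoi == "threat")
        measures["first_fixation_latency_ms"] = onset
        measures["first_fixation_duration_ms"] = dur
    # Late attention: dwell time on the threat word in the second half of the trial
    measures["late_dwell_threat_ms"] = sum(
        dur for onset, dur, aoi in fixations
        if aoi == "threat" and onset >= trial_ms / 2)
    return measures

# Example trial: three fixations in a 4,000-ms exposure
print(bias_measures([(180, 250, "threat"), (520, 400, "neutral"), (2300, 600, "threat")]))
# -> overall dwell 850 ms; first fixation on threat at 180 ms for 250 ms; late dwell 600 ms
```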

Eyetracking has been used to investigate attentional bias in clinical conditions such as depression (Armstrong & Olatunji, 2012), anxiety (Armstrong & Olatunji, 2012), addictive disorders (Mogg et al., 2003), obesity (Gao et al., 2011), posttraumatic stress disorder (Felmingham, Rennie, Manor, & Bryant, 2011), and pain (Fashler & Katz, 2014; Yang, Jackson, & Chen, 2013). Distinguishing the temporal components allows researchers to more clearly define the role of attentional bias in these clinical conditions. For example, Duque (2015) found that participants with major depressive disorder had an attentional bias to sad faces for maintenance indices (late processing) such as total fixation time, but not for orientating (early processing) attention indices.

There is considerable variability in the tasks and procedural variables used in eyetracking research (Radach & Kennedy, 2004). For example, different tasks (e.g., preferential-looking tasks, visual-search tasks, dot-probe tasks), outcome measures (e.g., first-fixation latency, percentage of initial fixations, average visit duration), and stimuli (e.g., words, images, faces) are common (Fashler & Katz, 2014; Felmingham et al., 2011; Yang, Jackson, Gao, & Chen, 2012). However, the reliability of attentional-bias tasks has been questioned (Rodebaugh et al., 2016), and good-quality information on the reliability of procedural variables will help inform which tasks, outcome measures, and stimuli to use in future research studies.

Reliability

It is important that paradigms and procedural variables, such as those used to investigate attentional bias, produce measurements that are reliable. Poor reliability has statistical and conceptual implications. It has been demonstrated, for example, that measurement error combined with small samples can exaggerate effect size estimates (Loken & Gelman, 2017), and that statistical power is reduced as the reliability of a task decreases (Meyer, 2010). Conceptually, it is difficult to reproduce study findings if tasks and procedural variables are not reliable. Conclusions from experiments with poor reliability are therefore questionable (Loken & Gelman, 2017).

Because there is some variation in descriptions of reliability, we used the taxonomy described by Mokkink et al. (2010). Reliability comprises three measurement properties: test–retest reliability, measurement error and internal consistency (Mokkink et al., 2010). These three measurement properties reflect conceptually different aspects of reliability and should all be considered when investigating reliability (Mokkink et al., 2010; Scholtes, Terwee, & Poolman, 2011). A minimum of two testing sessions is required to assess test–retest reliability and agreement, whereas internal consistency can be evaluated using data from a single testing session.

Test–retest reliability indicates how well a task can distinguish between participants with reference to the consistency between measurements (de Vet, Terwee, Knol, & Bouter, 2006). Both the consistency of results between measurements and the variance between participants are used to calculate test–retest reliability—that is, did all participants score the same, or was there adequate variability in the results to distinguish participants from each other? The preferred method for assessing test–retest reliability is the intraclass correlation coefficient (ICC). ICCs typically vary between 0 and 1, although values below 0 can occur. Higher values reflect stronger evidence of test–retest reliability (Weir, 2005).
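In variance terms (a textbook formulation rather than one specific to the present study), test–retest reliability is the proportion of the total variance that is attributable to true differences between participants: \( \mathrm{ICC}=\frac{\sigma_{\mathrm{p}}^2}{\sigma_{\mathrm{p}}^2+\sigma_{\mathrm{error}}^2} \), where \( \sigma_{\mathrm{p}}^2 \) is the between-participants variance and \( \sigma_{\mathrm{error}}^2 \) is the error variance.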

Measurement error reflects the consistency of results between measurements—that is, how similar are the results between testing sessions (de Vet et al., 2006)? Unlike test–retest reliability, the variance between participants is not considered when calculating measurement error (Kottner & Streiner, 2011). Measurement error is reported in the same unit as the task. Low measurement error is preferred.

Low measurement error (i.e., consistent results between testing sessions) can coexist with poor test–retest reliability (i.e., an inability to distinguish participants) when there is too little variance between participants (their scores are too similar). For example, if newborn human babies were weighed twice on the same day using scales designed for newborn elephants, all the babies would have consistent scores between measurements (low measurement error); however, the scores could not distinguish between the babies, because of the low variance in scores (poor test–retest reliability). Test–retest reliability is therefore considered more relevant for discriminative testing—that is, when aiming to differentiate participants on the basis of a set of scores from a certain task, as in cross-sectional studies. Measurement error is preferred for evaluative testing, when testing participants over time and measuring within-subjects change; variance in participant scores is regarded as less important for evaluative testing. When researchers investigate attentional-bias tasks, procedural variables are required that can produce scores that both accurately discriminate between participants (discriminative testing, indicated by test–retest reliability) and accurately measure change over time for individual participants (evaluative testing, indicated by agreement) (Guyatt, Walter, & Norman, 1987).
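A toy calculation with invented numbers illustrates the weighing example above: holding measurement error constant, shrinking the spread between participants collapses the ICC while leaving the SEM untouched.

```python
# Toy illustration (invented values): identical measurement error yields a high
# ICC when participants differ widely, and a low ICC when their true scores
# are nearly identical, per ICC = var_p / (var_p + var_error).
error_sd = 1.0  # measurement error (SEM), the same in both scenarios

for subject_sd in (10.0, 0.5):  # wide vs. narrow between-participant spread
    icc = subject_sd**2 / (subject_sd**2 + error_sd**2)
    print(f"between-participant SD = {subject_sd:>4}: SEM = {error_sd}, ICC = {icc:.2f}")
# -> ICC = 0.99 with a wide spread, but 0.20 when all participants score alike
```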

Internal consistency indicates how subjects respond to individual items on a task—that is, the homogeneity (interrelatedness) of the items on a scale (Streiner, 2003c). For example, when investigating attentional bias to threat-related anxiety, participants would be expected to view each threat-related word in a similar manner. A high level of internal consistency provides confidence that the composite score is an accurate measure of the underlying construct being investigated. Cronbach’s alpha is the preferred method for analyzing internal consistency, because it considers the mean of all possible splits. Split-half estimates are less satisfactory: depending on how a scale is split, a different reliability may be returned, and halving the scale tends to underestimate reliability (Streiner, 2003c).

Eyetracking tasks that measure attentional bias should be able to discriminate between people (high test–retest reliability), produce consistent scores on repeated testing (low measurement error), and have high interrelatedness of the items (high internal consistency) (Kottner & Streiner, 2011; Streiner, 2003b).

Previous research

Waechter, Nelson, Wright, Hyatt, and Oakman (2014) examined the internal consistency of eyetracking within a dot-probe paradigm in university students with high and low social anxiety. Angry, disgusted, and happy facial images were paired with calm or neutral facial images using a 5,000-ms exposure time. The reliability coefficients for early attention were low (e.g., proportion of first fixations to angry faces: α = –2.18). Conversely, the eye movement indices using the full stimulus exposure (overall attention) had excellent reliability (e.g., proportion of viewing time to angry faces 0–5,000 ms: α = .94). Waechter et al. concluded that more research was needed to establish reliable methods to assess attentional bias.

Price et al. (2015) reported the test–retest reliability of eyetracking within a dot-probe paradigm using fearful and neutral facial images and a 2,000-ms exposure time. Single and average ICC measures were reported for healthy children (aged 9–13 years) at five time points over a 14-week period. The ICC scores for all trials ranged from –.03 to .55, depending on the data-filtering process (e.g., excluding reaction times <300 ms and >2,500 ms and ±3 SDs from an individual’s session means) and the reliability statistic used to interpret the results (i.e., single or average ICC). Importantly, none of the ICCs were above a standard threshold for acceptability (i.e., ICC > .70).

Lazarov, Abend, and Bar-Haim (2016) tested the internal consistency and test–retest reliability of eyetracking within a free-viewing task using 16 simultaneously presented facial images displayed for 6,000 ms (half the faces showed disgusted expressions, and half were neutral). The participants were 20 university students with high social anxiety and 20 with low social anxiety. Measures of early attentional bias (latency to first fixation, first-fixation location, first-fixation dwell time) and overall attention (total dwell time) were reported. Cronbach’s alpha scores, representing internal consistency, for overall measures of attentional bias were high, ranging from .89 to .95. One-week test–retest reliability for overall attentional bias, assessed with Pearson’s correlation coefficients, ranged from .62 to .68. Test–retest reliability was lower for the early measures of attentional bias, ranging from .06 to .08, than for the measures of overall attention.

The data from these studies suggest that measures of early attention or measures that use less of the available stimulus presentation time may have lower internal consistency and poorer test–retest reliability than measures that use more of the available stimulus duration.

There are no published data on the reliability of using words as stimuli in attentional-bias research using eyetracking. This is important because a systematic review found that words are the most common stimuli in attentional-bias tasks (Bar-Haim, Lamy, Pergamin, Bakermans-Kranenburg, & van IJzendoorn, 2007).

There is also a lack of published data on the agreement of eyetracking when it is used to investigate attentional bias. Evidence for one measurement property of reliability does not provide evidence for another (Guyatt et al., 1987). For example, test–retest reliability and internal consistency are not suitable measures of reliability for evaluative studies—that is, those comparing within-subjects measures over time. Instead, agreement is the measurement property that indicates whether a measurement tool is appropriate for determining longitudinal changes (de Vet et al., 2006; Guyatt et al., 1987). An understanding of all three measurement properties of reliability allows researchers to decide for what purpose, between-subjects (discriminative) or within-subjects (evaluative) testing, a tool is appropriate.

Healthy participants are commonly used as a comparison group in attentional-bias studies. No studies have reported reliability data for healthy adult participants using eyetracking with words as stimuli. Reliability is known to be specific to the population in which it is tested; therefore, it is possible that the measurement properties of eyetracking vary between clinical and healthy participants (Lakes, 2013). As compared with healthy control participants, greater variation is often found in data obtained from clinical populations (Bartko, 1991). If measurement error is stable, then increased between-subjects variance will increase test–retest reliability, whereas decreased between-subjects variance may decrease test–retest reliability. For example, Farzin, Scaggs, Hervey, Berry-Kravis, and Hessl (2011) investigated the reliability of gaze aversion to different facial features in participants with Fragile X syndrome (FXS) and healthy controls. The test–retest reliability for the proportion of time spent looking at the mouth was higher in the FXS cohort (ICC = .97) than in the healthy controls (ICC = .63). Farzin et al. noted that the reduced between-subjects variance in the healthy controls may explain their lower ICC values (test–retest reliability). Accurately investigating between-group differences requires adequate reliability in both clinical populations and healthy controls.

Present study

The primary aim of this study was to assess the reliability of eyetracking when it is used to investigate attentional bias to threat-related words in healthy participants. Reliability was assessed using test–retest reliability, measured with the ICC(2,1); measurement error, measured with the standard error of measurement; and internal consistency, measured with Cronbach’s alpha.

Method

Study design

We used an observational test–retest design. Healthy participants completed identical preferential-looking tasks on two occasions. A methodological protocol for the study was published prior to the completion of data collection (Open Science Framework Project MT3K8). Deviations from the protocol are noted in this article. Ethics approval was obtained from the University of New South Wales Human Research Ethics Committee (HC14240).

Participants

Healthy adult participants were recruited from the Sydney metropolitan area. Participants were included in the study if they were 18–75 years old, had a good level of English proficiency, and had normal or corrected-to-normal vision.

English proficiency was assessed using three questions from the Language Experience and Proficiency Questionnaire. Participants were asked to rate, on a scale from 0 to 10, their levels of proficiency in speaking, understanding, and reading English. A minimum score of 7, which is regarded as “good,” was required for inclusion (Marian, Blumenfeld, & Kaushanskaya, 2007). We excluded participants with poor English proficiency because their fixations might be unrelated to the threat value of the words. Global measures of self-reported proficiency are good indicators of actual performance on specific measures of language ability (Marian et al., 2007).

Participants were also excluded if they were currently reporting pain in any body region, reported a previous pain condition that lasted more than 6 months, or reported pain in any body region that had lasted more than 72 h at any time during the past 3 months.

Materials

Apparatus

An EyeLink 1000 eyetracker (Version 4.56; SR Research, Ontario, Canada) with a remote camera upgrade, desktop mount, 16-mm lens, and target sticker was used to record monocular eye movements from the right eye at 500 Hz. Stimuli were displayed on an HP Compaq LA2205 wide LCD monitor with a 1,680 × 1,050 resolution, 32 bits per pixel, and a refresh rate of 60 Hz. The preferential-looking task was programmed with Experiment Builder (Version 1.10.1241; SR Research, Ontario, Canada). A 5-point calibration procedure was used and accepted when the average calibration error was less than 1° of visual angle. We used a 5-point calibration, instead of the default 9-point calibration, because the stimuli did not extend to the corners of the display. This is in keeping with other eyetracking studies that used remote eyetrackers without a fixed head mount (Lazarov et al., 2016). All stimuli were presented in white on a black background.

Procedure

Testing took place at Neuroscience Research Australia. Each participant attended one 90-min session. After completing the preferential-looking task (test), participants completed a demographic questionnaire along with the short-form version of the Depression Anxiety and Stress Scales (Lovibond & Lovibond, 1995) and the Pain Catastrophising Scale (Sullivan, Bishop, & Pivik, 1995). Participants were given 20 min to complete the questionnaires, followed by a compulsory 10-min washout period, during which they were seated quietly, before the task was conducted a second time (retest).

Preferential-looking task

In preferential-looking tasks, two competing stimuli are displayed and participants are free to view the stimuli as they wish. We used a preferential-looking task instead of the more traditional dot-probe task because previous research had suggested that the dot-probe task is not reliable (Rodebaugh et al., 2016; Schmukle, 2005). The test–retest reliability and agreement of eyetracking with word stimuli were unknown.

The preferential-looking task consisted of eight practice trials and 48 active trials. Each trial consisted of three sequentially presented still screens. The first screen displayed a fixation cross (font: Times New Roman normal; size: 90; location: x = 840, y = 525 [center of screen]). Participants were instructed to fix their gaze on the middle of the cross. A researcher sitting in an adjacent room monitored the participants’ gaze. After a stable fixation had been made on the cross for 2,000 ms, the researcher manually progressed the trial to the next screen. The researcher used a timer on the display screen that was automatically reset at the start of each trial. The second screen displayed two words (the stimuli), presented on the left and right sides of the screen for 4,000 ms (Tahoma normal font, size 30). One of the words was a “threat word” and the other a “neutral (control) word.” Participants were instructed to read both words and to keep reading them until the words had disappeared. The third screen, a blank screen, was displayed automatically for 1,000 ms. Prior to each trial, a drift check was performed. If the calibration error was more than 1° of visual angle, a new calibration was performed.

To avoid participant fatigue, the trials were arranged into three equal blocks of 16 trials. After each block, participants were given a self-timed break of 30 s or longer. The threat words in each block came from one of three threat categories: (1) “sensory pain,” (2) “affective pain,” or (3) “general threat.” Each block contained words from only one threat category. The eight words from each threat category (target) were paired with “neutral (control) words” matched for length and frequency in everyday language, using an English control-word search engine (Table 1; Guasch, Boada, Ferré, & Sánchez-Casas, 2013). Word pairs were presented twice within each block, with each word appearing once on the left and once on the right. The word pairs were randomized within each block, and the same word pair was not presented on consecutive trials. The order of the blocks was randomized.

Table 1 Threat (target) and matched neutral (control) words presented to participants
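The sketch below illustrates one way to build a block satisfying the constraints just described (each pair shown twice, the threat word once on each side, no pair on consecutive trials, block order randomized). The rejection-sampling shuffle and the dummy word lists are our assumptions for illustration; they are not the Experiment Builder implementation, and the real words come from Table 1.

```python
import random

def build_block(pairs):
    """pairs: eight (threat, neutral) word tuples. Returns 16 trials, each pair
    appearing twice (threat word once per side), with no pair repeated on
    consecutive trials (reshuffled until the constraint holds)."""
    trials = [(threat, neutral, side)
              for threat, neutral in pairs for side in ("left", "right")]
    while True:
        random.shuffle(trials)
        if all(trials[i][:2] != trials[i + 1][:2] for i in range(len(trials) - 1)):
            return trials

# Demo with dummy pairs standing in for the Table 1 word pairs
pairs = [(f"threat{i}", f"neutral{i}") for i in range(8)]
block = build_block(pairs)  # 16 randomized trials for one threat category
block_order = random.sample(["sensory pain", "affective pain", "general threat"], k=3)
```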

Threat word selection

The “sensory pain” words (Table 1) were selected from a study that had investigated the words that participants used to describe their back pain (Jensen, Johnson, Gertz, Galer, & Gammaitoni, 2013). The “affective pain” words (Table 1) were selected from a study that had investigated attentional bias in participants with acute low back pain (Sharpe et al., 2014). The general threat words (Table 1) had previously been used to investigate attentional bias to threat in chronic-pain patients (Dehghani, Sharpe, & Nicholas, 2003).

Statistical analysis

Outcome measures

Twelve eyetracking outcome measures commonly used to assess attentional bias were calculated from the extracted data (Table 2; Kimble, Fleming, Bandy, Kim, & Zambetti, 2010; Liossi et al., 2014; Yang et al., 2013). These outcome measures, selected a priori, were chosen to reflect the different stages of attentional bias: overall attention, early attention, and late attention. Each outcome measure was calculated as a ratio of the fixation time on the target word to that on the control word, and then converted to a percentage. A mean attentional-bias score was calculated for each participant in each word category for the test and retest sessions.

Table 2 Outcome measures and associated equations used to assess the different stages of attentional bias
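Because the Table 2 equations are not reproduced here, the sketch below shows one common way such a bias score is computed, expressing the dwell time on the threat word as a percentage of the dwell time on both words; treat the exact formula as an assumption.

```python
# Hedged sketch of an attentional-bias score: threat dwell time as a percentage
# of the dwell time on both words, so 50% indicates no bias.
def bias_percent(dwell_threat_ms: float, dwell_neutral_ms: float) -> float:
    total = dwell_threat_ms + dwell_neutral_ms
    return 100.0 * dwell_threat_ms / total if total else float("nan")

print(bias_percent(1300, 1100))  # e.g., 1,300 ms vs. 1,100 ms -> ~54.2% (toward threat)
```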

Data reduction

The raw gaze data were automatically parsed into sequences of saccades and fixations and loaded into the SR Research EyeLink Data Viewer (Version 2.3.22; Ontario, Canada). The standard cognitive configuration was used to define fixations (i.e., recording parse type: gaze; saccade velocity threshold: 30°/s; saccade acceleration threshold: 8,000°/s²; saccade motion threshold: 0.1°). A 100-pixel area of interest, dependent on the word length, was set around each word (i.e., the area was set relative to the start and end of each word). No other filters were applied to the data—for example, no merging of fixations, no minimum fixation duration, and no blink correction. An interest period was created for each respective outcome measure, and an interest-area report was extracted. The subsequent data filtering and reliability analyses were completed in Stata (Version 13.1; StataCorp, College Station, TX, USA).

We excluded trials during which the eyetracker lost and did not regain view of the eye (e.g., trials in which a blink occurred were still included if the eyetracker regained view of the eye after the blink) or during which the participant did not adhere to the instructions (i.e., participants were instructed to look directly at the middle of the fixation cross until it disappeared and then to read both words and keep reading them until the words disappeared). Three criteria, implemented in the sketch following the list, were used to exclude invalid trials:

  1. A fixation was not made in both interest areas. The absence of a fixation in both interest areas implies either that the eyetracker lost view of the eye and did not regain it, or that the participant did not read both words (Mogg et al., 2003). Because participants were instructed to read both words, a trial without a fixation on each word was considered invalid.

  2. The first-fixation latency to either interest area was less than 30 ms. Fixations occurring less than 30 ms after word presentation were unlikely to be driven by the content of the words.

  3. Less than 3,000 ms (75%) of fixation time was captured anywhere on the screen during the interest period (e.g., 0–4,000 ms). That is, trials were retained if more than 75% of fixation time was captured at any location on the screen, not just within the interest areas. If less than 75% was captured, the eyetracker might have lost tracking of the eye and not regained it, or the participant might have looked away from the screen after viewing both words (Fashler & Katz, 2014).

After these criteria had been applied, if more than 25% of a participant’s trials were excluded, then all of the participant’s data were excluded (Vervoort, Trost, Prkachin, & Mueller, 2013).
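The sketch below implements the three exclusion rules and the 25% participant rule. The trial fields (fix_left, fix_right, first_latency_ms, tracked_ms) are hypothetical names standing in for values extracted from the interest-area report.

```python
from dataclasses import dataclass

MIN_LATENCY_MS = 30    # criterion 2: earlier first fixations are not stimulus-driven
MIN_TRACKED_MS = 3000  # criterion 3: 75% of the 4,000-ms interest period

@dataclass
class Trial:
    fix_left: bool           # a fixation was detected in the left interest area
    fix_right: bool          # a fixation was detected in the right interest area
    first_latency_ms: float  # latency of the first fixation to either interest area
    tracked_ms: float        # fixation time captured anywhere on the screen

def is_valid(t: Trial) -> bool:
    return ((t.fix_left and t.fix_right)              # criterion 1: both words fixated
            and t.first_latency_ms >= MIN_LATENCY_MS  # criterion 2
            and t.tracked_ms >= MIN_TRACKED_MS)       # criterion 3

def keep_participant(trials: list) -> bool:
    # Exclude the participant when more than 25% of their trials are invalid
    return sum(is_valid(t) for t in trials) >= 0.75 * len(trials)
```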

Reliability analysis

An intraclass correlation coefficient (ICC) was calculated to assess test–retest reliability. ICCs can detect systematic differences between testing sessions and are therefore preferred over other correlation coefficients, such as Pearson’s r, which do not consider systematic differences between testing sessions (Weir, 2005).

We used a two-way random-effects model with absolute agreement (ICC(2,1); Shrout & Fleiss, 1979) as our primary outcome measure of test–retest reliability. A random-effects model is preferred because it considers systematic differences between testing sessions. The single-measure form was used because it reflects how eyetracking is normally done in experimental research; that is, participants are normally tested on one occasion. A two-way random-effects model using an average measure (ICC(2,2)) was also calculated. This average measure was included to indicate whether testing people twice and using the mean score is more reliable than using the results from one testing session (see the supplementary material, Table S1). As per our protocol, we also calculated a two-way fixed-effects model for consistency (ICC(3,1)), to investigate the consistency of the scores (supplementary material, Table S1) (Streiner, 2003b). A two-way fixed-effects model does not consider systematic differences between testing sessions (de Vet et al., 2006).
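For illustration, the same three coefficients can be computed with the open-source pingouin library; the data below are invented, and the original analysis was run in Stata, so this is a sketch rather than the authors' code.

```python
import pandas as pd
import pingouin as pg

# Long format: one row per participant x session; bias_pct values are invented
df = pd.DataFrame({
    "participant": [1, 1, 2, 2, 3, 3, 4, 4],
    "session":     ["test", "retest"] * 4,
    "bias_pct":    [55.2, 57.9, 48.1, 46.5, 60.3, 58.8, 51.0, 53.4],
})

icc = pg.intraclass_corr(data=df, targets="participant",
                         raters="session", ratings="bias_pct")
# In pingouin's output: ICC2 = two-way random, absolute agreement, single -> ICC(2,1);
# ICC2k = the average-measure version -> ICC(2,2);
# ICC3 = two-way mixed, consistency, single -> ICC(3,1)
print(icc.set_index("Type").loc[["ICC2", "ICC2k", "ICC3"], "ICC"])
```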

The standard error of measurement (SEM) was calculated as an indicator of measurement error. We deviated from our protocol (Open Science Framework MT3K8) by using the variance components, \( {SEM}_{\mathrm{agreement}}=\sqrt{\sigma_{\mathrm{retest}}^2+\sigma_{\mathrm{residual}}^2} \) (de Vet et al., 2006), instead of the standard deviation, \( SEM= SD\times \sqrt{1-{ICC}_{2,1}} \), to calculate the SEM. We did this because the variance components consider systematic differences between measurements (de Vet et al., 2006). With each outcome measure entered as the dependent variable, the participants and the test–retest sessions were treated as random factors in a mixed model in order to estimate the variance for the participants (\( \sigma_{\mathrm{p}}^2 \)), the test–retest variance (\( \sigma_{\mathrm{retest}}^2 \)), and the residual variance (\( \sigma_{\mathrm{residual}}^2 \)). These variances are reported in the supplementary material (Table S2).
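For a balanced design with one score per participant per session, the same variance components can be recovered from a two-way ANOVA decomposition; the sketch below (our reconstruction, not the Stata mixed-model code) returns both the ICC(2,1) and the SEM for agreement.

```python
import numpy as np

def agreement_stats(x):
    """x: (n_participants, k_sessions) array of bias scores (here k = 2).
    Returns (ICC(2,1), SEM_agreement) from a balanced two-way ANOVA."""
    n, k = x.shape
    grand = x.mean()
    ss_p = k * ((x.mean(axis=1) - grand) ** 2).sum()  # between participants
    ss_s = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between sessions (retest)
    ss_e = ((x - grand) ** 2).sum() - ss_p - ss_s     # residual
    ms_e = ss_e / ((n - 1) * (k - 1))
    var_retest = max((ss_s / (k - 1) - ms_e) / n, 0.0)  # sigma^2_retest
    var_p = max((ss_p / (n - 1) - ms_e) / k, 0.0)       # sigma^2_participants
    icc_21 = var_p / (var_p + var_retest + ms_e)
    sem_agreement = np.sqrt(var_retest + ms_e)          # SEM_agreement
    return icc_21, sem_agreement
```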

Internal consistency, reflecting “the interrelatedness of items on a test,” was calculated using Cronbach’s alpha for each set of words and each outcome measure, using the scores from the first testing session (Cronbach, 1951; Streiner, 2003c).
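Cronbach’s alpha has a standard closed form, \( \alpha = \frac{k}{k-1}\left(1-\frac{\sum_i \sigma_i^2}{\sigma_{\mathrm{total}}^2}\right) \); a minimal implementation is sketched below, with rows as participants and columns as the items (word pairs) within a category.

```python
import numpy as np

def cronbach_alpha(scores):
    """scores: (n_participants, k_items) array of per-item bias scores."""
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()  # sum of per-item variances
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of participants' totals
    return k / (k - 1) * (1 - item_var / total_var)
```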

Sample size

We followed the recommendations of de Vet, Terwee, Mokkink, and Knol (2011) to calculate the required sample size. Using the simulation-based power calculations of Giraudeau and Mary (2001), we estimated that 50 participants, each measured twice, would be required to estimate an ICC of .8 with a confidence interval of ±.1 and an alpha of .05.
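As a rough cross-check of that figure (using a large-sample approximation to the variance of the ICC estimate rather than the simulations of Giraudeau and Mary, 2001), a similar sample size emerges:

```python
import math

rho, k, z, half_width = 0.80, 2, 1.96, 0.10
# Asymptotic variance of an ICC estimate: 2(1-rho)^2 (1+(k-1)rho)^2 / (k(k-1)(n-1));
# solve for the n at which the 95% CI half-width equals the target.
var_num = 2 * (1 - rho) ** 2 * (1 + (k - 1) * rho) ** 2 / (k * (k - 1))
n = var_num * (z / half_width) ** 2 + 1
print(math.ceil(n))  # -> 51, in line with the 50 participants estimated above
```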

Results

Participants

We recruited and screened 50 participants from the community. Informed consent was obtained from all individual participants included in the study. After the preplanned data filtering, 49 participants were included in the final analysis (see below). The mean participant age was 27.5 years (SD = 10.0, range = 18–73), and 26 (52%) of the participants were female. Education details, psychological scales, and language information are provided in Table 3. The mean scores for depression, anxiety, stress, and catastrophizing were in the normal range (Lovibond & Lovibond, 1995; Sullivan et al., 1995).

Table 3 Education, psychological scales, and language data for participants included in the final analysis

Data reduction

We excluded 315 trials (6.56%) in accordance with our preplanned data-filtering procedure. Seventy-nine trials were excluded because a fixation was not detected in both interest areas (13 trials had no fixations in either interest area, and 66 trials had a fixation in only one interest area). A further 37 trials were excluded because a fixation was detected less than 30 ms after the words were displayed. Another 135 trials were excluded because less than 75% (3,000 ms) of fixation time was detected anywhere on the screen. The data reduction process left one participant with less than 75% of their trials remaining (i.e., <36 trials); for this participant, in addition to the previously removed trials, all of the remaining trials were excluded (64 trials across both testing sessions). In all, 49 participants (4,485 trials) were included in the final analysis.

Test–retest reliability

Test–retest reliability data are presented in Table 4. Point estimates ranged from ICC(2, 1) = –.31 to .71. The sensory pain words had a lower mean ICC (.08) than the affective pain words (.32) and the general threat words (.29). Considering only the affective and general threat words, the total dwell time (0–4,000 ms) demonstrated the highest reliability (affective words: ICC = .61; general threat words: ICC = .71). The reliability coefficients for the affective and general threat words were also higher for the total dwell time (500–4,000 ms) and the total dwell time (1,000–4,000 ms) (Table 4).

Table 4 Mean results from the two testing sessions, internal consistency as measured with Cronbach’s alpha, test–retest reliability as measured with ICC(2, 1), and measurement error as measured with the standard error of measurement (SEM)

Measurement error

The SEM results are also presented in Table 4; lower SEMs represent more stable outcome measures. Point estimates for the SEM ranged between 3.02% and 14.59% across all word groups and outcome measures. All word groups demonstrated a similar pattern of SEMs. The mean SEMs were 5.59% for the sensory pain words, 4.82% for the affective pain words, and 4.98% for the general threat words. The first-fixation duration recorded the lowest SEM scores (affective words SEM = 3.03%, general threat words SEM = 3.11%, sensory words SEM = 3.40%). The second-run dwell time demonstrated the highest SEMs, indicating less stable scores between testing sessions (affective words SEM = 13.43%, general threat words SEM = 11.21%, sensory words SEM = 14.59%).

Internal consistency

Finally, the Cronbach’s alpha scores for the first testing session are presented in Table 4. Point estimates ranged from .57 to .99 (mean = .89). Most outcome measures showed high internal consistency (e.g., total dwell time: affective words α = .94, general threat words α = .93, sensory words α = .94). The lowest Cronbach’s alpha scores were recorded for the first-fixation duration (affective words α = .57, general threat words α = .67, sensory words α = .70) and the second-run dwell time (affective words α = .72, general threat words α = .70, sensory words α = .72).

Discussion

We assessed the reliability of a preferential-looking eyetracking task used to investigate attentional bias to threat-related words in healthy participants. Test–retest reliability varied according to the threat word category (sensory pain words, general threat words, and affective pain words) and the outcome measure. Low ICCs were found for most outcome measures (e.g., first-fixation latency), indicating that they may not be appropriate for discriminative testing (comparing participant groups). The measurement-error results (SEM) suggest that the outcome measures were stable between sessions, and the internal-consistency results (Cronbach’s α) indicate a high level of interrelatedness among the word stimuli within each threat word category.

Test–retest reliability

Test–retest reliability varied according to the threat word category. Sensory pain words demonstrated the lowest test–retest reliability. Test–retest reliability considers the variance between a subject’s repeated measurements relative to the overall group variance (de Vet et al., 2006). Decreased participant variance, relative to measurement error, decreases the test–retest reliability. When we examined the variance between participants (\( {\sigma}_{\mathrm{p}}^2 \)) across all word groups, there was less variance between participants for the sensory pain words (Fig. 1). It is not clear why the sensory pain words had less variance than the affective pain words and general threat words.

Fig. 1 Participant variances (\( \sigma_{\mathrm{p}}^2 \)) for each outcome measure

The second-run dwell time demonstrated high participant variance (Fig. 1) but still recorded low ICCs. The high variance between participants was not enough to overcome the relatively high measurement error between testing sessions.

Considering the different outcome measures available to researchers, our study showed that more reliable results are likely when one uses outcome measures that utilize more of the trial duration. Outcome measures that incorporated more of the 4,000-ms trial duration, such as the total dwell time on threat words (0–4,000 ms), demonstrated higher test–retest reliability than outcome measures that used less of the trial duration, such as the total dwell time on threat words (0–500 ms).

Furthermore, the outcome measures selected to reflect early attention (probability of first fixation to target word, first-fixation latency, first-run dwell time, and first-fixation duration) had lower test–retest reliability than those selected to measure late attention [second-run dwell time, last-run dwell time, total dwell time (500–4,000 ms), and total dwell time (1,000–4,000 ms)]. Early attention outcome measures used less of the available viewing time and demonstrated less variance between participants than the late attention outcome measures (Fig. 1). This demonstrates that both the threat word group selected and the proportion of viewing time incorporated in the outcome measure are important procedural variables for the test–retest reliability of eyetracking measures.

We found higher test–retest reliability than did Price et al. (2015). In their study, using a pediatric sample, facial stimuli were presented for 2,000 ms, whereas in our study the stimuli were presented for 4,000 ms (Price et al., 2015). It may be that increased stimulus exposure time allows greater variation, thereby increasing the ICC value. In support of this, Lazarov et al. (2016) presented stimuli for 6,000 ms and reported test–retest reliabilities of more than .62 using outcome measures that made use of longer stimulus exposure times—for example, the total dwell time on threat faces. However, the improved reliability for longer stimulus durations may have a ceiling. The stimulus duration that optimizes test–retest reliability is likely related to the number and type of stimuli presented; for example, more stimuli may require longer exposure times, and pictures may require a longer presentation time than words. As was noted by Waechter et al. (2014), reliability is task and population specific.

Measurement error

The consistent and relatively low SEM values indicated stable measurements between sessions. The second-run dwell time was an exception, demonstrating higher SEM values (affective words = 13.4%, general threat words = 11.2%, sensory words = 14.6%) than the other outcome measures (all less than 6.6%). This was explained by the test–retest variance and was reflected in the standard deviations of the mean scores (Table 4). The large standard deviations of the second-run dwell time suggest that there was considerable variability in the viewing patterns between test sessions. Because the SEM values for the second-run dwell time were higher than those of all the other outcome measures, we would caution against using this outcome measure for discriminative or evaluative purposes when other, more reliable outcome measures are available. The results suggest that the remaining outcome measures are appropriate for evaluative testing.

We are not aware of any other studies that have reported measurement error for eyetracking tasks that investigated attentional bias. We would encourage future research to report measurement error alongside other indicators of reliability. Because interest is growing in using the outcomes from eyetracking in interventional studies (Todd, Sharpe, & Colagiuri, 2016; Vazquez, Blanco, Sanchez, & McNally, 2016), it is important to know whether participant change scores are greater than the measurement error of the task.

Internal consistency

Our internal-consistency results suggest that fewer test items could be used to achieve the same scores. Internal consistency measures the interrelatedness among items; as such, high Cronbach’s alpha scores suggest that using fewer stimuli might yield the same scores for participants. Waechter et al. (2014) reported similar internal-consistency results in an eyetracking dot-probe task measuring attentional bias with 72 trials. They reported Cronbach’s alpha scores of .94, .94, and .96 for the total viewing time over 5,000 ms for angry, disgusted, and happy images, respectively. This further suggests that when the more stable and reliable outcome measures (those using a longer proportion of the viewing time) are used, fewer items could potentially be presented, thus reducing the time involved in testing (Scholtes et al., 2011).

Individual variation

Researchers are commonly interested in testing for differences between groups, and test–retest reliability is the most informative reliability construct for that purpose. The nuance of test–retest reliability is that too little variance between participants will result in low reliability (an inability to distinguish participants), but unstable measurements between sessions will also produce low reliability (too much variability between measurements). These effects are highlighted by the measures of early attention: the location of the first fixation and the first-fixation duration both have low test–retest reliability, but likely for different reasons.

The poor reliability for the location of the first fixation is most likely due to low variance between participants. Waechter et al. (2014) suggested that low reliability may be due to a “look up” bias, in which participants consistently look up first if stimuli are presented vertically, or look left first if stimuli are presented horizontally. Viewing the word on the left first is consistent with the normal left-to-right reading pattern observed in English readers (Liversedge & Findlay, 2000; Rayner, 1989). Decreased variability between participants, due to normal reading patterns, is likely to reduce the test–retest reliability of the location of the first fixation.

The low reliability coefficients for the first-fixation duration on threat words are likely due to poor stability of the measurements between sessions. In this context, other factors that influence viewing patterns, such as global speed of processing, may be at play. This hypothesis also extends to the first-run and second-run dwell time outcome measures, for which individual viewing patterns influenced the between-participant variation.

It may be that outcome measures that use more of the available viewing time strike a balance, having sufficient between-participant variance but similar enough scores between testing sessions. In this study, outcome measures that used more of the stimulus duration (e.g., 0–4,000 ms) were stable between measurements and were not confounded by other individual viewing patterns, such as global speed of processing.

It must be emphasized that reliability is specific to the population and the task for which it has been evaluated. The results of our study using healthy participants, words as stimuli, and a presentation time of 4,000 ms cannot be assumed to generalize to other populations (e.g., anxiety patients) or to other stimuli (e.g., images) or presentation times (e.g., 500 ms).

What is an acceptable level of reliability?

There is no definitive benchmark for an acceptable level of reliability (Charter & Feldt, 2001). The sample size, setting (i.e., clinical or research), and purpose (e.g., clinical diagnosis of a life-threatening illness) will all contribute to the subjective assessment of what is acceptable in a specific situation. Although reliability benchmarks have not been well justified, some guidance is necessary (Streiner, 2003b). Nunnally (1994) suggested that a value of .70 may equate to modest reliability when used to compare groups, and Cicchetti (1994) suggested a tiered approach for determining acceptability (i.e., <.40 = poor, .40–.59 = fair, .60–.74 = good, .75–1.00 = excellent). We would caution against using eyetracking measures with reliability coefficients less than .60 for research purposes. Outcome measures with higher reliability may be required when investigating between-group differences with a small sample size (e.g., fewer than 20 participants). Our results suggest that most outcome measures are not reliable enough to differentiate participants when assessing attentional bias to threat words in healthy participants. Some of the outcome measures, such as the total dwell time on threat words (0–4,000 ms), may be appropriate depending on the stimulus (e.g., general threat words, but not sensory pain words).

Limitations

Although it is important that reliability be established for a healthy sample, our results may not generalize to nonhealthy samples. Reliability estimates are only valid for the sample being tested and for the stimuli and outcome measures used in an experiment. The reliability of attentional bias using eyetracking has been investigated in samples of participants with high and low social anxiety (Lazarov et al., 2016; Waechter et al., 2014). However, because these studies used facial images in nonclinical populations (participants were university students screened as having high or low social anxiety), it is unknown whether their results will generalize to other clinical samples. Further studies will be required to investigate reliability in clinical samples using a variety of stimuli (words, pictures, faces) and outcome measures, across all three measurement properties of reliability (internal consistency, agreement, and test–retest reliability).

Researcher degrees of freedom (RDoF) denote the decisions researchers make when collecting and analyzing data (Simmons, Nelson, & Simonsohn, 2011). There are many RDoF during eyetracking data filtering—for example, what constitutes a valid trial and which fixations to retain for analysis. Minimizing RDoF, by specifying in advance how data will be collected and analyzed, decreases the risk of false-positive results and may increase the reproducibility of findings (Simmons et al., 2011). We used a preplanned data-filtering process (Skinner et al., 2016). Stating in advance how and why one plans to remove trials avoids biased and subjective influences on the fixation data (i.e., individual trials were not manipulated by the investigator). There is, however, the potential for removing trials unnecessarily, thereby decreasing the power of the statistical analysis. We argue that the potential removal of some trials unnecessarily is an appropriate compromise for increased transparency in data analysis, decreased RDoF, minimized false-positive results, and potentially increased reproducibility.

Our data-filtering method excluded trials with a first-fixation latency of less than 30 ms, resulting in the exclusion of 37 trials. Previous research has used a more conservative cutoff (e.g., 80–100 ms); had we used an 80-ms cutoff, we would have excluded an additional nine trials. Rather than exclude these additional trials, we chose to preserve our a priori published data-reduction plan. The first-fixation latency cutoff is another RDoF, which highlights the many decisions that researchers must make.

Recommendations

Our results suggest that, for discriminative testing, outcome measures with a short exposure time or those using sensory pain words may be unreliable (low test–retest reliability). However, for evaluative testing, all of our outcome measures except the second-run dwell time may be appropriate (low measurement error). Given that we found high internal consistency yet low test–retest reliability, Cronbach’s alpha alone should not be used to justify the reliability of a task (Gliner, Morgan, & Harmon, 2001; Streiner, 2003a).

Our findings suggest that the outcome measures that investigate the early stages of attentional bias are unreliable. One of the proposed advantages of eyetracking is the ability to distinguish early from late stages of attention. Our results suggest that the current outcome measures used to assess the early stages of attention do not have adequate test–retest reliability and are therefore unable to reliably distinguish the different stages of attentional bias.

Comparing our results to those of other studies suggests that the test–retest reliability of eyetracking is superior to that of the dot-probe task in healthy participants. Dear, Sharpe, Nicholas, and Refshauge (2011) reported bivariate reliability coefficients of –.06, –.14, and .01 for the dot-probe task using words on two occasions in healthy participants. Schmukle (2005) reported similarly poor test–retest reliability coefficients using a word-based dot-probe task. Evidence therefore suggests that when investigating attentional bias, eyetracking may provide higher test–retest reliability than the dot-probe task. This, however, needs confirmation across different populations and with different stimuli. Any potential benefit gained from eyetracking, such as increased reliability, will need to be considered against the increased cost of eyetracking equipment and the more complex data analysis techniques.

The challenge moving forward is to use outcome measures and stimuli that are relevant to both the population and the underlying mechanism being investigated, while still providing reliable data. We suggest reporting reliability statistics for test–retest, measurement error, and internal consistency for all tasks and outcome measures used to investigate attentional bias. With rapid advances in technology and the emerging prospect of virtual reality to assess attentional bias, it is critical that reliability be reported.

Conclusion

The outcome measure and threat word category used in eyetracking experiments influence test–retest reliability. Outcome measures that use more of the stimulus exposure time show higher test–retest reliability. Measurement error in eyetracking appears to be low. These results require replication in clinical populations and with different stimuli.

Author note

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Funding

I.W.S. is supported by an NHMRC Postgraduate Scholarship (APP1093794); G.L.M. is supported by a Principal Research Fellowship from the NHMRC (ID 1061279); H.L. is supported by an NHMRC Postgraduate Scholarship (APP133828); A.C.T. is supported by an NHMRC Postgraduate Scholarship (APP1075670); S.M.G. is supported by an NHMRC Project Grant (ID 1084240) and Al & Val Rosenstraus Rebecca L. Cooper Medical Research funding; and J.H.M. is supported by NHMRC Project Grants (ID 1008003 and 1043621).

Conflicts of interest

G.L.M. has received support from Pfizer, Australian Institute of Sport, Grunenthal, Kaiser Permanente California, Return to Work SA, Agile Physiotherapy, and Results Physiotherapy; grants from National Health and Medical Research Council of Australia; speaker fees for lectures on pain and rehabilitation; royalties from Explain Pain, Painful Yarns, Graded Motor Imagery Handbook, and The Explain Pain Supercharged Handbook: Protectometer, Noigroup Publications. All other authors declare they have no conflicts of interest.