A fundamental question related to reading is: How is attention allocated for the purpose of identifying words? This question has generally been answered in one of two ways. The first is that attention is allocated in a strictly serial manner to support the lexical processing and identification of only one word at any given time (Reichle, Liversedge, Pollatsek, & Rayner, 2009). The second is that attention is allocated as a gradient to support the processing and identification of multiple words (typically three or four) at any given time (Radach & Kennedy, 2013). Both hypotheses have been instantiated as formal models that explain how attention (from their respective theoretical positions) is coordinated with lexical processing during reading, and how this in turn produces the patterns of eye movements that are observed during reading. The most developed examples of these models are the E-Z Reader (Reichle, Pollatsek, Fisher, & Rayner, 1998; Reichle, Pollatsek, & Rayner, 2012) for the serial-attention hypothesis, and SWIFT (Engbert, Nuthmann, Richter, & Kliegl, 2005) for the parallel-attention hypothesis. Both models explain a wide range of reading-related findings despite the fact that they adopt widely divergent views about attention—views that, as demonstrated in this article, might be considered extreme because they fail to address possible limitations of attentional control that result in some amount of “leakage” of information from unattended words.

Before we report this demonstration, however, it is important to be clear that, although models like E-Z Reader and SWIFT adopt extreme stances with respect to attention allocation, both models also already incorporate theoretical assumptions that suggest a complexity lacking in much of the “either/or” debate. For example, E-Z Reader assumes a preattentive visual stage of processing in which features on the printed page are propagated in parallel from across the visual field, subject to the limitations of visual acuity. This assumption was adopted to accommodate orthographic parafoveal-on-foveal effects, or the finding that the orthographic properties (e.g., irregular spelling) of word N+1 can be detected in the parafovea and influence the time spent looking at word N (e.g., Inhoff, Starr, & Shindler, 2000). Likewise, the most recent version of SWIFT has adopted the assumption that attention is like a flexible “zoom lens” that can be focused more or less tightly to permit the lexical processing of one or more than one word concurrently, effectively allowing the model to mimic a serial-attention model as necessary (Schad & Engbert, 2012).

Evidence supporting this notion of flexible attention allocation was recently reported by Reingold, Sheridan, Meadmore, Drieghe, and Liversedge (2016). Across two experiments using a novel selective-reading paradigm, participants were instructed to read sentences in which the words (rendered in blue) were separated by intervening character strings and/or words (rendered in orange). The key findings were that, although participants were able to selectively attend to and read the blue sentences (as demonstrated by, e.g., frequency effects on blue words), their allocation of attention was not perfect, because some amount of information from the orange distractors also influenced participants’ eye movements (e.g., more fixations on orange words than character strings). These findings were interpreted as providing “some support for at least sporadic lexical processing of distractors” (p. 2015).

More recently, Liu and Reichle (2018) have suggested that, rather than conceptualizing attention as a “spotlight” encompassing some number of words, it might instead be conceptualized as being object-based in the sense of Duncan’s (1984; see also Chen, 2012) classic demonstration that people are better at attending to two features of a single object than to one feature from each of two objects, even if the two objects are colocated in space (e.g., one object superimposed on the other). Liu and Reichle examined this issue across two experiments using a methodology similar to Duncan’s, but with superimposed Chinese words/sentences rather than objects (similar to those shown in Fig. 1). In their first experiment, participants made lexical decisions about one of two superimposed Chinese words/nonwords, with both attended and unattended words being either high or low frequency, and the key result was that the latencies to make “word” responses were only affected by the frequency of the attended word. Similarly, in their second experiment, participants were instructed to read one of two superimposed sentences containing colocated target words that were either high or low frequency; again, the key result was that looking times on the target words were only affected by the frequency of the attended word. These findings together suggest that attention can be focused on one of two spatially colocated words, allowing the processing of the word in the absence of any influence of the unattended word. This interpretation may be overly simplistic, however, because the stimuli were specifically designed to allow participants to focus their attention on one word/sentence (e.g., by using target and distractor sentences that differed along their entire length, thereby precluding any low-level visual cues that might otherwise alert participants to the presence of specific target words).

Fig. 1. Example sentence with English translation. (For illustrative purposes, the target and distractor words are respectively indicated by solid and dashed lines, and the stimuli are rendered on a white, rather than black, background.) (Color figure online)

The above design constraint was relaxed in the experiment reported below by using pairs of sentences that were identical except for the target words (see Fig. 1), thereby providing our participants with subtle but informative low-level orthographic cues about the locations of the target and distractor words. As in Liu and Reichle’s (2018) second experiment, our participants were instructed to read one of the two sentences for comprehension, with the frequency of both the target and distractor words being manipulated to determine if information from one or both words influenced the time spent fixating on the target word. If low-level orthographic cues indicating the presence of the target-distractor words do engage attention, then the frequencies of both words might contribute to the looking times on the target word, but possibly to varying degrees. These predictions were tested using standard fixation-based inferential statistics and survival analyses (defined below) to examine the processing time course of the target and distractor words.

Method

Participants

Thirty native Chinese-speaking undergraduate students from Sun Yat-sen University were paid 20 yuan and gave informed consent prior to their participation. All participants had normal or corrected-to-normal vision, with no reported color blindness. Participants were naïve about the purpose of this experiment.

Materials and design

This experiment used a 2 (target-word frequency: high vs. low) × 2 (distractor-word frequency: high vs. low) within-subjects design. Target words consisted of 160 pairs of high-frequency (M = 121.17 per million, SD = 98.65) and low-frequency (M = 2.17 per million, SD = 1.53) words, with one word of either type embedded (near the center) within one of 160 sentences used by Liu, Reichle, and Li (2015, 2016). These sentences were rated as being natural, and the target words were unpredictable from their preceding sentence contexts (see Liu et al., 2015, 2016). The distractor words also consisted of 160 pairs of high-frequency (M = 131.32 per million, SD = 105.29) and low-frequency (M = 2.69 per million, SD = 7.14) words, with each pair selected to fit naturally within one of the target-word sentences. (This was confirmed with a normative study using 14 additional participants; the overall rated naturalness of the target-word vs. distractor-word sentences did not differ; t < 0.09, p > .926). As Fig. 1 shows, stimuli were rendered in Song 30 font (each character ≈1° visual angle) by diagonally offsetting (to the lower right) a distractor-word sentence from a target-word sentence by approximately 0.25° of visual angle. Participants were instructed to read the target sentence, but both the color of the sentence being read (i.e., red vs. green) and the color of the superimposed sentence (i.e., red on green or vice versa) were counterbalanced across participants. To evaluate the extent to which readers might process the distractor words, 10 additional participants were given the same instructions but asked to report back what they had read after each sentence; with such instructions, participants only reported 2.31% (SD = 0.017) of the distractor words.

Apparatus

Stimuli were displayed against a black background on a 27-in. LED monitor (ASUS, PG27AQ) with a resolution of 2560 × 1440 pixels and a 144-Hz refresh rate. Stimulus presentation was controlled by an OpenGL-based Psychophysics Toolbox-3, incorporating EyeLink Toolbox extensions in MATLAB (Natick, MA, USA). Eye movements were recorded using an SR-Research EyeLink 1000-plus eye tracker (Kanata, ON, Canada) sampling at 1000 Hz. Viewing was binocular, but eye-movement data were only collected from the right eye.

Procedure

Participants were given task instructions upon arriving at the lab, and then seated approximately 58 cm from the video monitor. A chin/forehead rest was used to minimize head movements. An initial three-point calibration and validation procedure was performed until the maximal error was less than 0.4° of visual angle, with recalibration/revalidation being conducted as necessary. During the experiment, participants first read 16 practice sentences (excluded from our analyses), and then read the 160 experimental sentences in a random order. Each trial consisted of a drift check in the middle of the screen followed by a fixation box (1° × 1°, the size of a single character) at the location of the first character of the sentence. If the initial fixation did not register in the box or the drift check indicated more than a 0.4° error, then the participant was recalibrated; otherwise, a sentence appeared, which participants read silently for comprehension, terminating the trial using a button box. Participants also used the button box to answer a comprehension question that occurred after approximately one-third of the sentences and to start each new trial.
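The visual-angle values reported here can be related to on-screen sizes with a little trigonometry. The following is only a rough sketch: it assumes square pixels and that the nominal 27-in. diagonal equals the active display area (neither is stated in the text), and the function names are hypothetical.

```python
import math

def degrees_to_cm(angle_deg, distance_cm):
    """On-screen size (cm) that subtends angle_deg at a viewing distance of distance_cm."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)

def pixels_per_cm(diagonal_in, res_x, res_y):
    """Pixel density, ASSUMING square pixels and that the nominal diagonal is the active area."""
    diagonal_px = math.hypot(res_x, res_y)
    return diagonal_px / (diagonal_in * 2.54)

# Values taken from the Apparatus and Procedure sections.
DISTANCE_CM = 58.0
PX_PER_CM = pixels_per_cm(27.0, 2560, 1440)

# Approximate size of one character (about 1 degree) and of the 0.4-degree error limit.
one_degree_px = degrees_to_cm(1.0, DISTANCE_CM) * PX_PER_CM
calibration_limit_px = degrees_to_cm(0.4, DISTANCE_CM) * PX_PER_CM
```

Under these assumptions, one degree corresponds to a little over 43 pixels, so the 0.4° calibration criterion is roughly 17 pixels.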

Results

Comprehension accuracy

Mean sentence comprehension accuracy was 97% and there were no differences across conditions (all ps > .05).

Eye-movement measures

Approximately 3.5% of trials were removed because a blink occurred during a fixation on, immediately before, or immediately after the target word. Our analyses used three standard eye-movement measures: (1) first-fixation duration (FFD), or the duration of the initial fixation on the target word during first-pass reading; (2) gaze duration (GD), or the sum of all first-pass fixations on the target word; and (3) total-viewing time (TT), or the sum of all fixations on the target word. These measures were analyzed using linear mixed-effect models (LMMs) with target-word frequency, distractor-word frequency, and their interaction as predictor variables. A parsimonious random-effect structure was obtained by iteratively removing nonsignificant variance and covariance components from the maximal models.
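To make the three measures concrete, the sketch below computes them for a single trial. It is illustrative only: the input layout (a time-ordered list of word index and duration pairs) and the function name are assumptions, and first-pass reading is operationalized here as "no word to the right of the target has been fixated yet", which is a common convention rather than something the text specifies.

```python
def fixation_measures(fixations, target):
    """Compute FFD, GD, and TT (ms) for one trial.

    fixations: list of (word_index, duration_ms) tuples in temporal order
               (hypothetical layout, not the authors' data format).
    target:    index of the target word.
    """
    # Total-viewing time: every fixation on the target, first pass or not.
    total = sum(dur for word, dur in fixations if word == target)

    ffd = gd = None
    for i, (word, dur) in enumerate(fixations):
        if word > target:
            break  # target was skipped during first pass: no FFD or GD
        if word == target:
            ffd = dur  # first first-pass fixation on the target
            gd = 0
            for later_word, later_dur in fixations[i:]:
                if later_word != target:
                    break  # the eyes left the target: first pass is over
                gd += later_dur
            break
    return {"FFD": ffd, "GD": gd, "TT": total if total > 0 else None}


# Target (word 3) is fixated twice in first pass and once more after a regression.
trial = [(1, 210), (2, 190), (3, 230), (3, 180), (4, 200), (3, 150), (5, 220)]
measures = fixation_measures(trial, target=3)
```

For this made-up trial, the first-fixation duration is 230 ms, the gaze duration is 410 ms, and the total-viewing time is 560 ms.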

As can be seen by inspecting the mean fixation-duration measures (see Table 1) and the LMMs (see Table 2), all three fixation-duration measures exhibited a similar pattern. First, the fixation-duration measures were shorter on high-frequency than on low-frequency target words, FFD: b = −26.97, 95% CI [−35.97, −17.98], SE = 4.59, t = −5.88, p < .001; GD: b = −54.52, 95% CI [−72.46, −36.58], SE = 9.15, t = −5.96, p < .001; TT: b = −82.96, 95% CI [−113.59, −52.34], SE = 15.62, t = −5.31, p < .001. Although the distractor-word frequency effect was completely absent for total-viewing time (p = .261), the first-fixation durations and gaze durations were shorter on high-frequency than on low-frequency distractor words, FFD: b = −9.69, 95% CI [−18.81, −0.57], SE = 4.66, t = −2.08, p = .038; GD: b = −23.11, 95% CI [−39.82, −6.40], SE = 8.53, t = −2.71, p = .008, consistent with previous results using another selective-reading paradigm (Reingold et al., 2016). However, there was no interaction between target-word frequency and distractor-word frequency on any measure (all ps > .303). And perhaps most importantly, the target-word frequency effects (27, 52, and 86 ms for FFD, GD, and TT, respectively) were more than twice as large as the distractor-word frequency effects (9, 24, and 17 ms for FFD, GD, and TT, respectively).

Table 1 Mean first-fixation durations (FFD), gaze durations (GD), and total viewing times (TT), as a function of target-word and distractor-word frequency (standard errors are shown in parentheses)
Table 2 LMM inferential statistics for first-fixation durations (FFD), gaze durations (GD), and total-viewing times (TT)

Thus, some amount of distractor-word processing evidently did occur during reading, as evidenced by the finding that its frequency also modulated (but to a much lesser degree and independently of the target word’s frequency) the looking times on the target words. This suggests that the control of attention is imperfect; rather than being focused exclusively on the target words, some information about the distractor words “leaked in,” thereby influencing when the eyes moved from the target words. One hypothesis for why this happened is that the allocation of attention to individual “objects” (in this instance, the target words) is better described using an attenuation metaphor—the signal coming in from the distractor “channel” cannot be attenuated completely, but only to some degree. By this account, information from the target and distractor words accumulates in parallel, but more slowly for distractors than for targets. An alternative (but not mutually exclusive) hypothesis for the modest distractor-word effects is simply that low-level visual cues unique to the distractor occasionally alerted participants to their presence, thereby inducing a type of “pop-out” effect. This account is plausible because the sentences in which target and distractor words were embedded were identical except for the target and distractor words themselves (see Fig. 1). By this second account, information about the distractor might accrue only occasionally and/or perhaps later during any given fixation, after enough time has elapsed to allow participants to notice the presence of distractors. Survival analyses (described next) were therefore completed to examine when in time the frequencies of the target and distractor words exerted their influence on fixations.

Survival analyses

As their name suggests, survival analyses originate from medical studies and provide a method to quantify the survival rates of patients over time, from their initial diagnosis. In medical contexts, this method is useful for comparing the survival rates of different treatment groups over time (e.g., 5-year survival rates of cancer patients receiving different treatments). In the present context, the method extends previous work (e.g., see Reingold, Reichle, Glaholt, & Sheridan, 2012) to determine when different variables exert their influence on fixation durations, in this instance the effects of target-word versus distractor-word frequency. To do this, we first calculated the proportion of first-fixation durations within each successive 25-ms time bin over a range of 0–600 ms for each participant and as a function of target-word versus distractor-word frequency. These values were then averaged across participants in each condition to generate the distributions shown in Fig. 2a (by target-word frequency) and 2b (by distractor-word frequency). As can be seen by comparing the distributions, there is more divergence between the first-fixation duration distributions for high-frequency versus low-frequency target words than for high-frequency versus low-frequency distractor words, consistent with the results of the statistical analyses reported earlier.
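The binning step can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and two details are assumptions, namely that each bin includes its lower edge and that fixations of 600 ms or longer still count toward the denominator.

```python
def binned_proportions(durations, bin_ms=25, max_ms=600):
    """Proportion of one participant's fixations falling in each bin_ms-wide bin.

    ASSUMPTIONS: bins are [0, 25), [25, 50), ...; fixations >= max_ms fall in no
    bin but remain in the denominator.
    """
    if not durations:
        raise ValueError("at least one fixation duration is required")
    counts = [0] * (max_ms // bin_ms)
    for duration in durations:
        if 0 <= duration < max_ms:
            counts[int(duration // bin_ms)] += 1
    return [count / len(durations) for count in counts]


def average_across_participants(per_participant):
    """Average the binned distributions of several participants, bin by bin."""
    n = len(per_participant)
    return [sum(column) / n for column in zip(*per_participant)]


# Two made-up participants in one condition.
p1 = binned_proportions([100, 110, 260, 599])
p2 = binned_proportions([105, 255, 270, 300])
group = average_across_participants([p1, p2])
```

Running the same steps separately for the high-frequency and low-frequency conditions would give the two distributions compared in each panel.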

Fig. 2. Distributions of first-fixation durations as a function of (a) target-word frequency versus (b) distractor-word frequency, as well as their corresponding survival curves using the confidence interval DPA procedure (c, d) and the individual participant DPA procedure (e, f). Vertical solid lines indicate divergence point estimates, and dotted lines denote the 95% confidence intervals (c, d) and standard deviations of individual participants (e, f).

Next, we calculated the percentage of first-fixation durations longer than time t for each 1-ms time bin (using the same range of 0–600 ms). This was also done for each participant and condition and then averaged across participants. The resulting survival curves are shown in Fig. 2c (by target-word frequency) and 2d (by distractor-word frequency). To determine if these curves diverge as a function of word frequency, a bootstrap resampling procedure was used to complete a divergence point analysis (DPA). To ensure the reliability of the estimated divergence points, two DPA procedures were used: a confidence interval DPA, which computes a confidence interval for the group-level estimate, and an individual participant DPA, which computes a divergence point for each participant. Both DPA procedures have been updated from those used by Reingold et al. (2012) to improve reliability and statistical power (see Reingold & Sheridan, 2014, for a detailed introduction).
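A survival curve of this kind is simple to compute. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, and "longer than time t" is taken to be a strict inequality.

```python
from bisect import bisect_right

def survival_curve(durations, max_ms=600):
    """Percentage of fixations strictly longer than t, for t = 0..max_ms in 1-ms steps."""
    ordered = sorted(durations)
    n = len(ordered)
    # bisect_right counts the fixations with duration <= t; the rest "survive".
    return [100.0 * (n - bisect_right(ordered, t)) / n for t in range(max_ms + 1)]


def mean_curve(curves):
    """Average several survival curves bin by bin (e.g., across participants)."""
    return [sum(values) / len(curves) for values in zip(*curves)]


# One made-up participant and condition.
curve = survival_curve([100, 200, 300, 400])
```

Each curve starts at 100% and falls monotonically to 0%, so an effect of word frequency shows up as the point in time at which the low-frequency curve begins to sit above the high-frequency curve.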

To complete the confidence interval DPA procedure, we used 1000 iterations of random resampling of first-fixation durations for each participant and condition at each 1-ms bin. For each iteration, the divergence point was defined as the first 1-ms bin in a run of five consecutive bins in which the survival rate in the low-frequency condition was at least 1.5% greater than the survival rate in the high-frequency condition. The median of the 1000 sorted divergence point values was defined as the divergence point estimate for the sample, and the 95% confidence interval was defined by the 25th and 975th values in the ranked divergence point estimates across the 1000 iterations.
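The procedure just described can be sketched in a few functions. This is one reading of the description, not the authors' code. Several details are assumptions: each participant's fixations are resampled with replacement at their original sample size, curves are averaged across participants before the divergence point is located, and iterations without a divergence point are dropped. The data layout and function names are hypothetical.

```python
import random
import statistics
from bisect import bisect_right

def survival_curve(durations, max_ms=600):
    """Percentage of fixations longer than t, for t = 0..max_ms in 1-ms steps."""
    ordered = sorted(durations)
    n = len(ordered)
    return [100.0 * (n - bisect_right(ordered, t)) / n for t in range(max_ms + 1)]

def mean_curve(curves):
    """Average several survival curves bin by bin (here, across participants)."""
    return [sum(values) / len(curves) for values in zip(*curves)]

def first_divergence(high_curve, low_curve, threshold=1.5, run=5):
    """First bin of a run of `run` bins in which low exceeds high by >= threshold."""
    streak = 0
    for t, (high, low) in enumerate(zip(high_curve, low_curve)):
        if low - high >= threshold:
            streak += 1
            if streak == run:
                return t - run + 1
        else:
            streak = 0
    return None  # the curves never diverged by the criterion

def confidence_interval_dpa(data, iterations=1000, seed=0):
    """data maps participant -> {"high": [...], "low": [...]} (hypothetical layout)."""
    rng = random.Random(seed)
    points = []
    for _ in range(iterations):
        high_curves, low_curves = [], []
        for conditions in data.values():
            for name, curves in (("high", high_curves), ("low", low_curves)):
                pool = conditions[name]
                # ASSUMPTION: resample with replacement, keeping the sample size.
                resample = rng.choices(pool, k=len(pool))
                curves.append(survival_curve(resample))
        point = first_divergence(mean_curve(high_curves), mean_curve(low_curves))
        if point is not None:
            points.append(point)
    if not points:
        raise ValueError("no iteration produced a divergence point")
    points.sort()
    n = len(points)
    # With n = 1000 these are the 25th and 975th ranked values.
    lower = points[max(0, round(0.025 * n) - 1)]
    upper = points[min(n - 1, round(0.975 * n) - 1)]
    return {"estimate": statistics.median(points), "ci": (lower, upper), "n_valid": n}
```

Passing an explicit `seed` makes the bootstrap reproducible. With 1000 iterations in pure Python the loop may take a while on a full data set, so a smaller `iterations` value is convenient for a first check.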

To complete the individual participant DPA procedure, we used 1000 iterations of random resampling of first-fixation durations for individual participants. For each iteration, 1200 first-fixation durations for a given participant were randomly sampled with replacement from the respective pools of fixations corresponding to the high-frequency and low-frequency conditions. The two samples of 1200 fixations were then rank ordered and yoked according to their rank (i.e., shortest fixations for the high-frequency and low-frequency conditions were paired, then the next two shortest were paired, etc.). The differences between the low-frequency and high-frequency conditions were then computed for each pair. The divergence point was then defined as the average duration of the pair of fixations corresponding to the first rank-ordered bin in a run of 100 consecutive bins with positive differences. The median value across the successful iterations (i.e., those resulting in a divergence point estimate) was then defined as the divergence point estimate for an individual. Only those participants for whom a divergence point value was obtained in more than 50% of iterations were included in the computation of group divergence point estimates.
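The rank-and-yoke logic may be easier to follow in code. The sketch below follows the steps as described, but it is an illustration rather than the authors' implementation. The function names are hypothetical, and "positive values" is read as strictly positive low-minus-high differences.

```python
import random
import statistics

def individual_dpa(high, low, iterations=1000, n_samples=1200, run=100, seed=0):
    """Divergence point (ms) for one participant, or None if too few iterations succeed."""
    rng = random.Random(seed)
    points = []
    for _ in range(iterations):
        # Resample each condition with replacement, then rank order and yoke by rank.
        high_sorted = sorted(rng.choices(high, k=n_samples))
        low_sorted = sorted(rng.choices(low, k=n_samples))
        streak = 0
        for rank, (h, l) in enumerate(zip(high_sorted, low_sorted)):
            if l - h > 0:
                streak += 1
                if streak == run:
                    start = rank - run + 1  # first pair of the run
                    points.append((high_sorted[start] + low_sorted[start]) / 2.0)
                    break
            else:
                streak = 0
    # Keep the participant only if more than 50% of iterations gave an estimate.
    if len(points) <= 0.5 * iterations:
        return None
    return statistics.median(points)

def group_estimate(estimates):
    """Mean and SD of the divergence points of the participants who were retained."""
    kept = [e for e in estimates if e is not None]
    return statistics.mean(kept), statistics.stdev(kept)
```

For example, `individual_dpa(high, low)` could be called once per participant and the results passed to `group_estimate`, which ignores participants for whom the function returned `None`.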

As Fig. 2c and e show, the high-frequency and low-frequency survival curves for target words diverged early, irrespective of the DPA procedures used. The confidence interval DPA procedure yielded a divergence point estimate of 164 ms, with a 95% confidence interval from 115 to 172 ms. Only 4.42% and 2.83% of fixations in the high-frequency and low-frequency conditions were shorter than this divergence point. Similarly, the individual participant DPA procedure produced a mean divergence point across participants of 160 ms (SD = 43 ms). Using this procedure, only 3.65% and 2.13% of fixations in the high-frequency and low-frequency conditions were shorter than the divergence point.

However, as Fig. 2d and f show, the high-frequency and low-frequency survival curves for distractor words diverged much later, again irrespective of the DPA procedures used. The confidence interval DPA procedure yielded a divergence point estimate of 179 ms, with a 95% confidence interval from 154 to 246 ms, and 8.25% and 6.77% of fixations in the high-frequency and low-frequency conditions were shorter than this divergence point. Similarly, the individual participant DPA procedure produced a mean divergence point across participants of 215 ms (SD = 83 ms). Using this procedure, 21.66% and 18.47% of fixations in the high-frequency and low-frequency conditions were shorter than the divergence point.

Discussion

The results of our survival analyses indicate that the effect of target-word frequency on first-fixation durations emerged 15–55 ms earlier than the effect of distractor-word frequency (164 vs. 179 ms using the confidence interval DPA procedure, and 160 vs. 215 ms using the individual participant DPA procedure). This result suggests that the effect of distractor-word frequency reflects a later and/or occasional noticing of the distractor words during target-word processing rather than the concurrent processing of targets and distractors at different rates. This interpretation is in harmony with the earlier findings of Liu and Reichle’s (2018) second experiment, wherein the absence of low-level orthographic cues indicating the locations of the target and distractor words allowed the participants to attend exclusively to the target words, so that only the frequencies of those words modulated the looking times on the target word. Our results, together with those reported by Liu and Reichle (2018) and Reingold et al. (2016), thus suggest that, although readers may attempt to focus their attention on individual words during reading, the capacity to do so is not perfect, and that the distractor words do at least occasionally “grab” attention. This should not be surprising, because there is ample evidence that the maintenance of attentional control in the face of prepotent, competing responses (e.g., Stroop task; see MacLeod, 1991) is nontrivial, as evidenced by increases in both response latencies and errors.

Although neither the present experiment nor the ones reported by Liu and Reichle (2018) entail natural reading, the collective results and the object-based view of attention that they espouse have obvious ramifications for our conceptualization of attention during natural reading. For example, with languages like English, readers may have very little difficulty treating words as individual “objects” because they are well demarcated (by the blank spaces between them). In contrast, the absence of clear boundaries between Chinese words means that readers must somehow segment continuous lines of characters into their constituent words. Because any given Chinese word can contain one to four characters, it is not obvious how readers would allocate attention in a strictly object-based manner. Instead, Chinese readers may focus their “window” of attention on three to four characters, thereby allowing whatever lexical processing is engaged during word identification to segment out a word from the characters as it is identified (Li, Rayner, & Cave, 2009).

By this account, how Chinese readers allocate their attention can only approximate the serial, object-based manner that is likely employed by readers of languages like English because of an unfortunate convention of the Chinese writing system: the lack of spaces between words. The conjecture is supported by demonstrations that the introduction of clear word boundaries in Chinese (e.g., via inserting blank spaces between words; Bai, Yan, Liversedge, Zang, & Rayner, 2008) does not disrupt the overall rate of reading; despite the fact that such manipulations presumably change the highly practiced skills involved in both word segmentation/identification and saccadic targeting, whatever “cost” might otherwise be evident from these changes is presumably offset by the fact that the words can be processed and identified in a serial, object-based manner. This account is also congruent with simulations using artificial reading agents, which illustrate the computational advantage inherent in serial processing of words, despite (most people’s) intuitions that the parallel processing of words affords greater efficiency (Liu, Reichle, & Gao, 2013). Such empirical and computational demonstrations suggest that, although the control of attention is imperfect, with a variety of factors possibly resulting in the unintended “leakage” of information from other sources, readers might nonetheless try to emulate serial processing to the degree possible because it supports near optimal word identification (Reichle et al., 2009).