In the present article, we introduce the Provo Corpus, a large corpus of eye-tracking data with accompanying predictability norms. The primary purpose of the Provo Corpus is to facilitate the investigation of predictability effects in reading. Some analyses of the data available in the Provo Corpus are reported in Luke and Christianson (2016). The corpus is publicly available, and can be downloaded from the Open Science Framework at https://osf.io/sjefs.

Prediction in language processing is a topic that has received considerable attention in recent years. It has been the subject of a number of reviews (DeLong, Troyer, & Kutas, 2014; Huettig, 2015; Huettig & Mani, 2016; Kuperberg & Jaeger, 2016; Kutas, DeLong, & Smith, 2011; Staub, 2015; Van Petten & Luka, 2012) and is a significant component in many models of language processing (Christiansen & Chater, 2016; Dell & Chang, 2014; Pickering & Garrod, 2007, 2013). Predictability is known to influence how we process language, both spoken (Altmann & Kamide, 1999, 2007; Kamide, Altmann, & Haywood, 2003; Staub, Abbott, & Bogartz, 2012) and written (Ashby, Rayner, & Clifton, 2005; Balota, Pollatsek, & Rayner, 1985; Ehrlich & Rayner, 1981; Kennedy, Pynte, Murray, & Paul, 2013; Kliegl, Grabner, Rolfs, & Engbert, 2004; Rayner, Slattery, Drieghe, & Liversedge, 2011; Rayner & Well, 1996).

The most common way to establish the predictability of a given word is through the cloze procedure (Taylor, 1953). In this procedure, participants are presented with a portion of a sentence or passage up to the word of interest and then asked to produce the word most likely to follow. Traditionally, this method has been used to assess the predictability of a single word, usually a noun, in either a highly constraining or a nonconstraining sentence context. Many sets of predictability norms have been made publicly available (e.g., Bloom & Fischler, 1980; Schwanenflugel, 1986). The cloze procedure, along with the norms derived from it, has greatly facilitated research into predictive processes.

A useful method for studying prediction in reading is the collection of eye-tracking data. Participants in these studies read sentences or passages in which the predictability of one or more words is already known (Kennedy et al., 2013; Kliegl et al., 2004; Rayner & Well, 1996) while their eye movements are monitored. These types of studies have revealed much about how prediction affects reading (Staub, 2015). A few corpora of eye movement data exist (Cop, Dirix, Drieghe, & Duyck, 2017; Kennedy, Hill, & Pynte, 2003; Kennedy et al., 2013; Kliegl et al., 2004; Kliegl, Nuthmann, & Engbert, 2006), with varying degrees of availability.

The Provo Corpus consists of two parts, predictability norms and eye-tracking data. The predictability norms consist of completion norms for every word in 55 paragraphs. The eye-tracking corpus consists of eye movement data from 84 native English-speaking participants, who read all 55 paragraphs for comprehension. Below, we compare both the predictability norms and the eye-tracking corpus with existing norms and corpora. Then we discuss potential uses of the Provo Corpus. Next, we describe in detail the contents of the corpus: first the predictability ratings, and then the eye-tracking data. Finally, we describe how interested parties can access the corpus.

Comparison of the Provo predictability norms with other extant norms

A number of predictability norming studies have been published over the years. Notable among these are Bloom and Fischler (1980) and Schwanenflugel (1986). These studies are sentence completion norms: A sentence was presented, minus the final word, and participants were asked to produce the final word. More recent predictability norms have followed a similar procedure (see, e.g., Hamberger, Friedman, & Rosen, 1996; McDonald & Tamariz, 2002).

The predictability norms in the Provo Corpus differ from these other published norms in several significant ways. As we mentioned, the existing norms are all sentence completion norms, meaning that they involve single sentences in which only the final word is normed. The Provo predictability norms are paragraphs, rather than sentences, and norms are provided for each word in the paragraph, rather than just the final word. Although traditional sentence completion norms are well-suited to event-related potential (ERP) and eye-tracking experiments that manipulate the predictability of a single target word in a sentence, the Provo norms are ideal for studies in which responses (such as reading times or ERPs) are examined for every word (see, e.g., Luke & Christianson, 2016; Payne, Lee, & Federmeier, 2015; Smith & Levy, 2013). Furthermore, traditional predictability norms focus heavily on highly constraining sentences (cloze scores > .67), which turn out to be relatively rare in connected texts (Luke & Christianson, 2016); the Provo Corpus provides a more naturalistic distribution of predictability. Additionally, whereas existing predictability norms focus exclusively on content words, especially nouns, the Provo predictability norms include norms for function words as well as for a wider variety of content words (adverbs, adjectives, and verbs are better represented).

Comparison of the Provo Corpus with other eye-tracking corpora

Several other eye-tracking corpora exist. Among these, the Ghent Eye-Tracking Corpus is notable, because it is large (participants read an entire novel) and publicly available (Cop et al., 2017). However, two other well-known corpora deserve special mention, because predictability ratings are available for these corpora: the Dundee Corpus and the Potsdam Sentence Corpus.

The Dundee Corpus (Kennedy et al., 2003; Kennedy et al., 2013) is a large corpus of eye movements from ten native English speakers (and ten native French speakers) reading texts from newspaper editorials (56,212 tokens). Texts were presented on-screen in a multiline format. For a subset of the texts (16 four-line paragraphs), predictability data were obtained for each word (272 participants total, yielding approximately 25 responses per word). The Provo Corpus is similar to the Dundee Corpus in that it is a corpus of texts, but the Provo Corpus drew on both more participants and more texts for its predictability norms.

The Potsdam Sentence Corpus (Kliegl et al., 2004; Kliegl et al., 2006) is a collection of 144 German sentences, with predictability estimates (cloze scores) available for all but the first word in each sentence. These predictability norms were obtained using a cloze procedure, in which 272 native German speakers provided responses, producing a total of 83 complete predictability protocols. The eye-tracking corpus consists of data from 222 participants reading these sentences. Like the Potsdam Corpus, the Provo Corpus contains predictability norms for all words. The Provo Corpus has 134 sentences total, but it differs from the Potsdam Sentence Corpus in that these sentences were presented as part of connected multiline texts, rather than in isolation.

There is an additional, significant difference between the Provo Corpus and these other corpora with predictability ratings. In all three corpora, cloze scores are included for all normed words, and these cloze scores represent the proportions of responses provided by participants in the cloze procedure that matched the target word orthographically (e.g., if the target word was “apple” and the response was “apple,” that is a match; “turtle,” “fruit,” and “red delicious” are not matches). However, some theorists argue that prediction is a graded process (for a review, see Kuperberg & Jaeger, 2016), so even if the context is not sufficiently constraining to permit the prediction of orthography, it may still permit the prediction of morpho-syntactic or semantic information. For example, for the paragraph that begins “With schools still closed, cars still buried and streets still,” it is unlikely that most readers will form a strong prediction that the next word will be “blocked” (the cloze score for this word was only .07 in our predictability norming study). However, readers should be able to predict with some accuracy that the next word will be a verb (it follows a noun and an adverb, after all), that it will be in the past tense (the other verbs in the sentence were), and maybe even that the verb will mean something similar to “blocked,” like “closed” or “inaccessible.” Indeed, the participants in our predictability norming study produced a verb 79% of the time when given the sentence fragment above. That verb was in the past tense most of the time (72% of all responses) and was semantically related to the target word “blocked” (the two most frequent responses were “closed” and “covered”). With this in mind, the Provo Corpus contains predictability ratings for word class and (where appropriate) inflection, as well as mean semantic relatedness scores (latent semantic analysis; see Landauer & Dumais, 1997, and below for more information) that represent the semantic similarity between the target word and cloze task responses. These additional ratings quantify the predictability of the morpho-syntactic (word class, inflection) and semantic information, permitting a deeper investigation into the graded nature of prediction. See Luke and Christianson (2016) for some examples of analyses using these variables.
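To make these additional measures concrete, the following Python sketch shows how orthographic, word-class, and inflectional predictability might be computed from a set of tagged cloze responses. The responses and tags below are invented for illustration and are not drawn from the corpus, and the scoring shown is a sketch rather than the script used to build the norms.

```python
# Hypothetical cloze responses for the target word "blocked" (a past-tense
# verb), each paired with invented part-of-speech and inflection tags.
responses = [
    ("closed", "verb", "past"),
    ("covered", "verb", "past"),
    ("blocked", "verb", "past"),
    ("empty", "adjective", None),
    ("close", "verb", "present"),
]

target_word, target_class, target_inflection = "blocked", "verb", "past"
n = len(responses)

# Orthographic cloze score: proportion of responses identical to the target.
cloze = sum(word == target_word for word, _, _ in responses) / n
# Word-class predictability: proportion of responses in the target's class.
class_match = sum(cls == target_class for _, cls, _ in responses) / n
# Inflectional predictability: same word class AND same inflection.
inflection_match = sum(
    cls == target_class and infl == target_inflection
    for _, cls, infl in responses
) / n

print(cloze, class_match, inflection_match)  # 0.2 0.8 0.6
```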

Potential uses of the Provo Corpus

The Provo Corpus is primarily intended for the study of prediction in reading; however, its usefulness is not restricted to this purpose. The Provo Corpus is a large data set of the eye movements of skilled readers reading connected text. As such, it should prove useful for studying other aspects of reading behavior and for evaluating models of eye movement control in reading. The Dundee and Potsdam Corpora have already proven invaluable in this regard (see, e.g., Engbert, Nuthmann, Richter, & Kliegl, 2005; Kennedy et al., 2013; Kliegl & Engbert, 2005; Kliegl et al., 2004; Nuthmann, Engbert, & Kliegl, 2007; Pynte, New, & Kennedy, 2009; Smith & Levy, 2013).

Content of the Provo Corpus

Data collection for the Provo Corpus proceeded in two stages. In the first stage, the predictability norms were created; cloze scores were collected via a large-scale online survey for each word in 55 paragraphs taken from various sources. In the second stage, each of these 55 paragraphs was presented to a different set of participants to read while their eyes were tracked, creating a large corpus of eye movement data. Both sets of data (predictability norms and eye-tracking data) are available as part of the Provo Corpus. In the sections that follow, we first describe the predictability norms in more detail and then provide details about the eye-tracking corpus.

Predictability norms

Participants

Four hundred seventy-eight participants from Brigham Young University completed an online survey for course credit through the Psychology Department subject pool. The responses from eight participants were discarded because they were not native speakers of English or did not complete the survey. In total, data from 470 people (267 females, 203 males) were included. Participants’ ages ranged from 18 to 50 years (M = 21). All were high school graduates with at least some college experience, and approximately 10% had received some degree beyond a high school diploma.

Materials

Fifty-five short passages were taken from a variety of sources, including online news articles, popular science magazines, and public-domain works of fiction. These passages were an average of 50 words long (range: 39–62) and contained 2.5 sentences on average (range: 1–5). The sentences were on average 13.3 words long (range: 3–52). Across all texts, there were 2,689 words total, including 1,197 unique word forms.

The words were tagged for parts of speech using the Constituent Likelihood Automatic Word-Tagging System (CLAWS; Garside & Smith, 1997). Using the tags provided by CLAWS, words were then divided into nine separate classes. In total, the passages contained 227 adjectives, 169 adverbs, 196 conjunctions, 364 determiners, 682 nouns, 287 prepositions, 109 pronouns, 502 verbs, and 153 other words and symbols. Inflectional information was also coded for the words within each class, where appropriate: nouns were coded for number, and verbs were coded for tense.

Words ranged from 1 to 15 letters long (M = 4.76). A measure of the semantic association between the target word and the entire preceding passage context was obtained using latent semantic analysis (LSA; Landauer & Dumais, 1997). This LSA context score was obtained using the General Reading–Up to First Year of College topic space with 300 factors. LSA cosines typically range from 0 to 1, with larger values indicating greater meaning overlap between two terms. LSA context scores ranged from .03 to .97 (M = .53). This variable quantifies the semantic fit of the target word with the preceding context. Target word positions within the passage (sentence number) and within the sentence (word-in-sentence number) were also obtained.
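The LSA score is the standard cosine between two vectors in the topic space. As a reference for readers unfamiliar with the measure, the following sketch computes a cosine with randomly generated stand-in vectors; the real vectors come from the 300-factor topic space and are not reproduced here.

```python
import numpy as np

def lsa_cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 300-dimensional vectors standing in for real LSA vectors.
rng = np.random.default_rng(0)
target_vec = rng.normal(size=300)
context_vec = target_vec + rng.normal(scale=0.5, size=300)  # a related vector

print(round(lsa_cosine(target_vec, context_vec), 2))  # high: vectors overlap
```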

Procedure

Participants completed an online survey administered through the Qualtrics Research Suite software (Qualtrics, Provo, UT). Participants first answered a few demographic questions (gender, age, education level, and language history), then proceeded to complete the main body of the survey. For each question, participants were instructed to “Please type the word that you think will come next.” Beneath this instruction was a portion of one of the texts, with a response box for the participant to type in a word. For the first question about a text, only the first word in the text was visible, then the first two words for the second question, the first three words for the third, and so on, until for the last question about a text all words but the final word in the text were visible. Thus, participants provided responses for all but the first word in each text. Participants were required to give a response before proceeding to the next question, and within a text, all questions were presented in a fixed order, so that participants were never given a preview of the upcoming words in a text.
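The logic of this incremental presentation is simple to express in code. The following Python sketch generates the (context, target) pairs for one text; it is an illustration of the item structure only, since the survey itself was built and administered in Qualtrics.

```python
def cloze_items(text: str) -> list[tuple[str, str]]:
    """Return (visible_context, target_word) pairs for all but the first word."""
    words = text.split()
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]

for context, target in cloze_items("With schools still closed"):
    print(f"{context!r} -> predict {target!r}")
# 'With' -> predict 'schools'
# 'With schools' -> predict 'still'
# 'With schools still' -> predict 'closed'
```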

Each participant was randomly assigned to complete five texts, giving responses for an average of 227 different words. For each word in each text, an average of 40 participants provided a response (range: 19–43).

Content of predictability norms file

Responses were edited for spelling. When a response contained contractions or multiple words, the first word was coded. Each survey response was then tagged for its part of speech using CLAWS, and the responses were divided into word classes and coded for inflection, as we described previously for the target words. Responses and targets (the word that actually appeared in that position in the text) were compared to see whether they matched in three different ways: orthographically (cloze score), by word class, and (for nouns and verbs) by inflection. Responses and the target were considered to match orthographically if the two full word forms were identical; for the purposes of this comparison, all letters were converted to lowercase. A word class match was coded if the response and target belonged to the same word class, and an inflectional match was coded if the words belonged to the same word class and carried the same inflectional suffix. LSA (Landauer & Dumais, 1997) was also used to provide an estimate of the relatedness of the responses and targets for all content word targets. The LSA cosine between each response and target was obtained using the General Reading topic space via the Web-based LSA interface (http://lsa.colorado.edu). Note that this procedure, which compared the response and target words, is different from the LSA procedure described previously, in which the target words were compared to the entire preceding passage. Comparing two words provides an estimate of the semantic relatedness of those two words, whereas comparing the target word with its context estimates the contextual fit of the target word. Thus, the corpus provides measures of both the contextual fit of the target word and its semantic predictability. Most of these variables can be found in the eye-tracking corpus file, described in the next section. Table 1 lists and defines the variables in the Provo predictability norms.
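As a concrete illustration of the orthographic matching rule, the following sketch normalizes a response and tests it against a target. The helper functions are hypothetical rather than the original scoring script, and contraction handling is omitted for brevity.

```python
def normalize(response: str) -> str:
    """Keep only the first word of a multiword response and lowercase it."""
    return response.split()[0].lower()

def orthographic_match(response: str, target: str) -> bool:
    """True if the normalized response is identical to the target."""
    return normalize(response) == target.lower()

print(orthographic_match("Apple", "apple"))        # True: case-insensitive
print(orthographic_match("red delicious", "red"))  # True: first word coded
print(orthographic_match("fruit", "apple"))        # False: no orthographic match
```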

Table 1 Predictability norm variables, with descriptions

Eye-tracking data

Participants

Eighty-four participants from Brigham Young University completed the eye-tracking portion of the study. All participants were native speakers of American English with 20/20 corrected or uncorrected vision. They received course credit through the Psychology Department subject pool. None had participated in the predictability norming survey.

Apparatus

For the eye-tracking portion of the study, eye movements were recorded via an SR Research EyeLink 1000 Plus eye-tracker (spatial resolution of 0.01°) sampling at 1000 Hz. Participants were seated 60 cm away from a monitor with a display resolution of 1,600 × 900, so that approximately three characters subtended 1° of visual angle (the monitor subtended 40° × 24° of visual angle). Head movements were minimized with a chin and forehead rest. Although viewing was binocular, eye movements were recorded from the right eye. The experiment was controlled with the SR Research Experiment Builder software.
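For readers who want to verify the reported display geometry, the following sketch reproduces the arithmetic. The implied character width is inferred from the values reported above rather than measured.

```python
import math

distance_cm = 60.0    # viewing distance
monitor_deg_h = 40.0  # reported horizontal extent in degrees
px_h = 1600           # horizontal display resolution

# Pixels per degree of visual angle (roughly uniform for small angles).
px_per_deg = px_h / monitor_deg_h  # 40 px/deg
char_px = px_per_deg / 3           # ~13.3 px per character, i.e.,
                                   # three characters per degree

# Screen width implied by a 40 deg horizontal extent at 60 cm.
screen_width_cm = 2 * distance_cm * math.tan(math.radians(monitor_deg_h / 2))
print(px_per_deg, round(char_px, 1), round(screen_width_cm, 1))  # 40.0 13.3 43.7
```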

Procedure

Participants were told that they would be reading short texts on a computer screen while their eye movements were recorded. These texts were the same 55 texts that had been used in the survey. Each trial involved the following sequence. The trial began with a gaze trigger, a black circle presented in the position of the first character in the text. Once a stable fixation was detected on the gaze trigger, the text was presented. The participant read the text and pressed a button when finished. Then a new gaze trigger appeared, and the next trial began. The texts were presented in a random order for each participant. Participants had no task other than to read for comprehension.

Content of eye-tracking data file

Prior to analysis, the eye-tracking data were cleaned: fixations shorter than 80 ms or longer than 800 ms were removed (about 4% of the data). We note that this cleaning procedure does not guarantee that all of the measures will be outlier-free. Any saccade-based measure and any measure comprising the sum of several fixations (e.g., gaze duration or total reading time) may still contain outliers. We have left these in so that users may apply their own preferred cleaning criteria. There are also some missing data values in the file; these cells are denoted with “NA.” Different reading measures were computed for predefined interest areas around each word in each passage, comprising the letters of each word and half of the white space surrounding each word, both vertically and horizontally.
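The duration-based cleaning rule is easy to reproduce. The sketch below applies it to a toy fixation report; the column name is hypothetical, and the released data file already has this cleaning applied, so such a filter is needed only when imposing stricter criteria of one's own.

```python
import pandas as pd

# Toy fixation report; "fixation_duration" (ms) is a hypothetical column name.
fixations = pd.DataFrame({"fixation_duration": [45, 120, 250, 812, 300]})

# Keep fixations between 80 and 800 ms, inclusive.
clean = fixations[fixations["fixation_duration"].between(80, 800)]

print(f"removed {1 - len(clean) / len(fixations):.0%} of fixations")  # removed 40%
```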

Table 2 lists and describes the columns that appear in the Provo Eye-Tracking Corpus. First, participant and word identification variables are listed. Then come variables associated with traditional measures of predictability (cloze scores). Next are variables associated with morpho-syntactic predictability (the predictability of a word class and inflection). Variables associated with semantic relationships and predictability appear next. Finally, eye-tracking variables conclude the list. These variables are the output of the SR Research Data Viewer (SR Research Ltd., version 1.11.1), and the descriptions for these variables come from or are modifications of the descriptions found in the Data Viewer User’s Manual. Means and standard deviations for these variables can be found in Table 6 of Luke and Christianson (2016). Various analyses using these data are also described in Luke and Christianson (2016).

Table 2 Eye-tracking corpus variables, with descriptions

Availability

The Provo Corpus can be downloaded from the Open Science Framework at https://osf.io/sjefs. It consists of two files, which can be downloaded separately. The file Provo_Corpus-Predictability_Norms.csv is a comma-separated text file that contains the predictability norms, in the format described above. This file is for users who want to create predictability stimuli or to explore how different factors (e.g., word frequency, contextual constraint) influence cloze task responses (see, e.g., Luke & Christianson, 2016; Staub, Grant, Astheimer, & Cohen, 2015). Users interested in the eye-tracking corpus should download the file Provo_Corpus-Eyetracking_Data.csv, another comma-separated text file, which contains the eye-tracking data. This file also contains summary predictability values (see Table 2), so that users interested only in the eye-tracking data do not need to download both files.
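For users working in Python, a minimal loading sketch follows. It assumes the two files have been downloaded locally; pandas is used here for convenience, but any CSV reader will work.

```python
import pandas as pd

# Load the two corpus files (assumed to be in the working directory).
norms = pd.read_csv("Provo_Corpus-Predictability_Norms.csv")
eyetracking = pd.read_csv("Provo_Corpus-Eyetracking_Data.csv")

print(norms.shape, eyetracking.shape)
print(eyetracking.columns.tolist()[:10])  # inspect the first few variables
```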