Do Users’ Reading Skills and Difficulty Ratings for Texts Affect Choices and Evaluations?

Taming the Corpus

Abstract

In our contribution, we consider how corpus data can be used as a proxy for the written language environment around us in constructing offline studies of native-speaker intuition and usage. We assume a broadly emergent perspective on language: in other words, the linguistic competence of individuals is not identical or hardwired but forms gradually through exposure and coalescence of patterns of production and reaction. We hypothesize that although all users in theory have access to the same linguistic material, their actual exposure to it and their ability to interpret it may differ, resulting in differing judgments and choices. Our study looks at the interaction between corpus frequency and two possible indicators of individual difference: attitude towards reading tasks and performance on reading tasks. We find a small but consistent effect of task performance on respondents’ judgments but do not confirm any effect on their choices in production tasks.

Notes

  1.

    Fidler and Cvrček’s (2015) study of keyword analysis in Czech presidential New Year speeches uses this approach to good effect to demonstrate how different types of exposure, in the guise of reference corpora, can be used to model differing potential receptions of a text.

  2.

    Reading span tasks ask participants to read unconnected sentences, memorizing the final word of each sentence, which they then must recall later. There is some dispute about what exactly they are measuring (Hupet, Desmette, & Schelstraete, 1997), but as Conway et al. point out, they have been widely used nonetheless to assess how we tap into our working memory’s storage and processing functions: “The task is essentially a simple word span task, with the added component of the comprehending of sentences. Subjects read sentences and, in some cases, verify the logical accuracy of the sentences, while trying to remember words, one for each sentence presented” (Conway et al., 2005, p. 771).

  3.

    An example of clearly differentiated usage is the distinction between the exponents {em} and {ou} in the instr. sg.: the former is used with masc. and neut. nouns, while the latter appears with fem. nouns. The only place we get overlap, e.g., s (v)okurkem ~ s (v)okurkou ‘with cucumber’, is where the gender of the noun is unstable across dialects. When usage is not clearly differentiated, often some factors or tendencies can be identified that contribute to choice, but none that clearly demarcate it.

  4.

    A further contributory factor to the persistence of variation in Czech may be the relatively weak position of the standard, which does not function as a common speech variety across the vast majority of the country (see, e.g., Sgall, 2011, p. 183, one among many texts that could be cited in this regard). Attempts at standardizing one or another variant tend to be perceived as applying only to formal written texts.

  5.

    Compare, for example, the appearance of fleeting [e] in the fem. and neut. gen. pl. and the description of the masc. animate nom. pl. exponents {i}~{ové}~{é} in Grepl et al. (1995), pp. 248–249, 256–257. The former is described in terms of a default form and the conditions under which insertion takes place, while the latter variation is described using overlapping semantic, phonological, and suprasegmental criteria that may apply. The same approach is used in the normative Internet Language Manual (Ústav pro jazyk český, 2004–2017).

  6.

    Latinate nouns (octopi~octopuses, etc.) are another area where variation can be examined in English, but more research has been devoted to derivational morphology, where variation is more widespread (normality~normalcy, etc.). However, derivational morphology is not seen as having the same impact on our understanding of utterance structure and the creation of “grammatical” meaning as does inflectional morphology.

  7.

    On our proxies for perception and use, see the “Methodology” section below.

  8.

    This term is more often translated as “fiction,” but in the CNC corpora prior to SYN2015, it includes examples of the genre literatura faktu: creative nonfiction such as memoirs, travelogues, etc.

  9.

    The latest corpus in the series, SYN2015, is not balanced in this fashion; see inter alia Čermák, Králík, and Kučera (1997) on the research underlying the original corpora and Cvrček, Čermáková, and Křen (2016) on the composition of SYN2015.

  10.

    A programmatic explanation for this shift away from “real-world balance” towards “text-type balance” is given in Cvrček et al. (2016).

  11.

    When lemmatization succeeds, the CNC always disambiguates and resolves in favor of one assignment for each place in the tag (unlike, for example, the Russian National Corpus, where ambiguities are never resolved and all possible tags are associated with a token). This disambiguation is partially rule-based and partially the result of a heuristic correction based on manual tagging of a portion of the corpus. When lemmatization fails, typically due to a very rare or poorly formed (misspelled) word form, no morphological analysis can take place and the form is tagged as nerozpoznaný ‘unrecognized’; our searches will not have picked up such forms.

  12.

    Sometimes these sentences needed to be modified—typically shortened to remove extraneous material, but also sometimes substituting lexical items to achieve a more “neutral” effect for the trigger. This was to avoid respondent reactions directed not at the target feature but at some other aspect of the text that was irrelevant, which could confound the results. In some instances (esp. with rarer lexemes), no suitable sentence could be found, and so we looked for sentences with synonyms or other lexemes close in meaning and substituted the target word in order to create the trigger.

  13.

    SurveyMonkey did not support randomizing question order across two separate locations in a survey, so the constituent triggers of a block always had to remain in that block.

  14.

    If all respondents are at ceiling, the task will not serve to isolate relevant factors, as we cannot distinguish among the respondents based on performance.

  15.

    Read-able.com warned us, “Ooh, that’s probably a bit too complicated. Have you thought about using smaller words and shorter sentences?”

  16.

    Forms that were unrepresented in the corpus or represented only sporadically were not used, so as not to create the impression of an unnatural text. Instead, for those lexemes the common form was inserted.

  17.

    See further for information on ANOVAs. The assumptions of ANOVA include a dependent variable with interval values and a limited number of “levels” per factor. A seven-point scale such as the one we use for our ratings is considered to give ordinal values (showing order or priority but where there is no demonstrable mathematical relationship between the values) rather than interval values (showing points on a scale with a demonstrable mathematical relationship: equally spaced, each double/ten times the preceding, etc.). However, when the number of respondents exceeds 100, ordinal values such as our impressionistic seven-point scale give equally good results. We created levels for our factors by “binning” responses to get 4–6 groups for each factor. We practiced good data hygiene here by defining our bins prior to analysis rather than afterwards and by ensuring that bins with very small numbers of respondents were amalgamated with other bins.

  18.

    Reading comprehension texts and questions are available on request from the corresponding author (n.bermel@sheffield.ac.uk).

  19.

    We also report, but do not discuss, the F value, which is the ratio of between-groups variance to within-groups variance. An F value close to 1 is consistent with the null hypothesis.

  20.

    The anomalous shape of the “1 right” band has to do with the fact that only two respondents fell into this bracket, so the reactions are highly dependent on individual idiosyncrasies.
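
The binning described in note 17 and the F value described in note 19 can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the function names, the minimum bin size, and the ratings below are all hypothetical, and a real analysis would normally rely on a statistics package rather than hand-written arithmetic.

```python
from statistics import mean


def merge_small_bins(bins, min_size):
    """Fold bins with fewer than min_size ratings into a neighboring bin.

    `bins` is a list of rating lists ordered by factor level (for example,
    by the number of comprehension questions answered correctly).
    """
    merged = []
    for ratings in bins:
        if merged and len(merged[-1]) < min_size:
            # The previous bin is still too small: absorb this one into it.
            merged[-1].extend(ratings)
        else:
            merged.append(list(ratings))
    # A too-small final bin has no following neighbor, so fold it backwards.
    if len(merged) > 1 and len(merged[-1]) < min_size:
        last = merged.pop()
        merged[-1].extend(last)
    return merged


def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    F is the between-groups mean square divided by the within-groups
    mean square.
    """
    if len(groups) < 2:
        raise ValueError("need at least two groups")
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical seven-point ratings in predefined bins; not data from the study.
bins = [[2], [1, 3], [2, 3, 4], [5, 6, 7]]
groups = merge_small_bins(bins, min_size=3)
f_value, df_between, df_within = one_way_f(groups)
print(groups, f_value, df_between, df_within)
```

In this sketch the one-respondent bin is amalgamated with its neighbor before the F value is computed, mirroring the practice of not analyzing very small bins separately. An F value near 1 would indicate that the group means differ about as much as the spread within groups would lead one to expect.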

References

  • Albright, A., & Hayes, B. (2003). Rules vs. analogy in English past tenses: A computational/experimental study. Cognition, 90, 119–161.

  • Bermel, N. (2004a). V korpuse nebo v korpusu? Co nám řekne (a neřekne) ČNK o morfologické variaci v tvarech lokálu [V korpuse or v korpusu? What the Czech National Corpus will (and will not) tell us about morphological variation in locative case forms]. In Z. Hladková & P. Karlík (Eds.), Čeština – univerzália a specifika 5 (pp. 163–171). Prague, Czech Republic: Nakladatelství Lidové Noviny.

  • Bermel, N. (2004b). Jak často se vyskytují (vyskytujou) tzv. hovorové tvary 1. os. j. č. a 3 os. mn. č. v Českém národním korpusu? [How often do the so-called colloquial forms of the 1 sg. and 3 pl. occur in the Czech National Corpus?]. In P. Karlík (Ed.), Korpus jako zdroj dat o češtině (pp. 29–40). Brno, Czech Republic: Masarykova univerzita.

  • Bermel, N. (2010). Variace a frekvence variant na příkladu tvrdých neživotných maskulin [Variation and the frequency of variants in hard masculine inanimate nouns]. In S. Čmejrková, J. Hoffmannová, & E. Havlová (Eds.), Užívání a prožívání jazyka (pp. 135–140). Prague, Czech Republic: Karolinum.

  • Bermel, N., & Knittl, L. (2012a). Corpus frequency and acceptability judgments: A study of morphosyntactic variants in Czech. Corpus Linguistics and Linguistic Theory, 8, 241–275.

  • Bermel, N., & Knittl, L. (2012b). Morphosyntactic variation and syntactic environments in Czech nominal declension: Corpus frequency and native-speaker judgments. Russian Linguistics, 36, 91–119.

  • Bermel, N., Knittl, L., & Russell, J. (2015a). Morphological variation and sensitivity to frequency of forms among native speakers of Czech. Russian Linguistics, 39, 283–308.

  • Bermel, N., Knittl, L., & Russell, J. (2015b). From standard to norm through the lens of corpora and native speakers. Prace Filologiczne, 67, 21–43.

  • Bermel, N., Knittl, L., & Russell, J. (2017). Frequency data from corpora partially explain native-speaker ratings and choices in overabundant paradigm cells. Corpus Linguistics and Linguistic Theory. https://doi.org/10.1515/cllt-2016-0032.

  • Bybee, J. L., & Slobin, D. I. (1982). Rules and schemas in the development and use of the English past tense. Language, 58, 265–289.

  • Čermák, F., Doležalová-Spoustová, D., Hlaváčová, J., Hnátková, M., Jelínek, T., Kocek, J. et al. (2005). SYN2005: A genre-balanced corpus of written Czech. Prague, Czech Republic: Ústav Českého národního korpusu FF UK, from www.korpus.cz

  • Čermák, F., Králík, J., & Kučera, K. (1997). Recepce současné češtiny a reprezentativnost korpusu (Výsledky a některé souvislosti jedné orientační sondy na pozadí budování Českého národního korpusu) [The reception of contemporary Czech and corpus representativity: Results and some relevant points of a preliminary sounding done during the building of the Czech National Corpus]. Slovo a slovesnost, 58, 117–123.

  • Chandler, S. (2010). The English past tense: Analogy redux. Cognitive Linguistics, 21, 371–417.

  • Conway, A. R., Kane, M. J., Bunting, M. F., Hambrick, D. Z., Wilhelm, O., & Engle, R. W. (2005). Working memory span tasks: A methodological review and user’s guide. Psychonomic Bulletin & Review, 12, 769–786.

  • Cvrček, V., Čermáková, A., & Křen, M. (2016). Nová koncepce synchronních korpusů psané češtiny [A new conception of synchronic corpora of written Czech]. Slovo a slovesnost, 77, 83–101.

  • Dąbrowska, E. (2008). The effects of frequency and neighbourhood density on adult speakers’ productivity with Polish case inflections. Journal of Memory and Language, 58, 931–951.

  • Dąbrowska, E. (2010). Naive v. expert intuitions: An empirical study of acceptability judgments. The Linguistic Review, 27, 1–23.

  • Divjak, D. (2017). The role of lexical frequency in the acceptability of syntactic variants: Evidence from that-clauses in Polish. Cognitive Science, 41, 354–382. First published online 2016: 1–26.

  • Eddington, D. (2000). Analogy and the dual-route model of morphology. Lingua, 110, 281–298.

  • Fidler, M. U., & Cvrček, V. (2015). A data-driven analysis of reader viewpoints: Reconstructing the historical reader using keyword analysis. Journal of Slavic Linguistics, 23, 197–239.

  • Frisch, S., & Brea-Spahn, M. (2010). Metalinguistic judgments of phonotactics by monolinguals and bilinguals. Laboratory Phonology, 1, 345–360.

  • Goldberg, A. (2011). Corpus evidence of the viability of statistical preemption. Cognitive Linguistics, 22, 131–153. https://doi.org/10.1515/COGL.2011.006

  • Grepl, M., Hladká, Z., Jelínek, M., Karlík, P., Krčmová, M., Nekula, M., et al. (1995). Příruční mluvnice češtiny [A grammar handbook of Czech]. Prague, Czech Republic: Nakladatelství Lidové noviny.

  • Haber, L. R. (1976). Leaped and leapt: A theoretical account of linguistic variation. Foundations of Language, 14, 211–238.

  • Hupet, M., Desmette, D., & Schelstraete, M.-A. (1997). What does Daneman and Carpenter’s reading span really measure? Perceptual and Motor Skills, 84(2), 603–608.

  • Křen, M., Bartoň, T., Cvrček, V., Hnátková, M., Jelínek, T., Kocek, J., et al. (2010). SYN2010: A genre-balanced corpus of written Czech. Prague, Czech Republic: Ústav Českého národního korpusu FF UK, from www.korpus.cz

  • Lečić, D. (2015). Morphological doublets in Croatian: The case of the instrumental singular. Russian Linguistics, 39, 375–393.

  • Mulder, K., & Hulstijn, J. H. (2011). Linguistic skills of adult native speakers as a function of age and level of education. Applied Linguistics, 32, 475–494.

  • Prasada, S., & Pinker, S. (1993). Generalization of regular and irregular morphological patterns. Language and Cognitive Processes, 8, 1–56.

  • Sgall, P. (2011). Perspektivy standardní češtiny [Perspectives on standard Czech]. In E. Hajíčová & J. Panevová (Eds.), Jazyk, mluvení, psaní (pp. 180–204). Prague, Czech Republic: Karolinum.

  • Staum Casasanto, L., Hofmeister, P., & Sag, I. (2010). Understanding acceptability judgments: Additivity and working memory effects. In Proceedings of the 32nd annual conference of the Cognitive Science Society (pp. 224–229). Austin, TX: Cognitive Science Society.

  • Thornton, A. (2012). Reduction and maintenance of overabundance: A case study on Italian verb paradigms. Word Structure, 5, 183–207.

  • Ústav pro jazyk český. (2004–2017). Internetová jazyková příručka [The internet language manual]. Jazyková poradna: Ústavu pro jazyk český, from http://prirucka.ujc.cas.cz/

Author information

Corresponding author

Correspondence to Neil Bermel.

Copyright information

© 2018 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Bermel, N., Knittl, L., Russell, J. (2018). Do Users’ Reading Skills and Difficulty Ratings for Texts Affect Choices and Evaluations? In: Fidler, M., Cvrček, V. (eds) Taming the Corpus. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-98017-1_2
