The Effect of Lexical Facility

  • Michael Harrington


This chapter summarizes the findings from Chaps. 6, 7, 8, 9, and 10. Seven studies were reported that evaluated the sensitivity of size, speed, and consistency to differences in proficiency and performance in domains of academic English. Key findings are identified.




  • Summarize the empirical results from Chaps.  6,  7,  8,  9, and  10.

  • Highlight key findings.

11.1 Introduction

This chapter summarizes the findings from Chaps.  6,  7,  8,  9, and  10. Seven studies have evaluated the sensitivity of size, speed, and consistency to differences in proficiency and performance in domains of academic English. Throughout the book, this sensitivity has been characterized as how well the measures discriminate between the criterion levels and, more importantly, the relative magnitude of these differences, both individually and in combination. Of particular interest is the degree to which composite measures provide a more sensitive measure of the observed differences than vocabulary size alone.

11.2 Sensitivity of Lexical Facility Measures by Performance Domain

Table 11.1 summarizes the group means for the individual lexical facility measures, vocabulary size (VKsize), mean recognition time (mnRT), and coefficient of variation (CV). Also included are the hits, which are the percentage of words recognized.
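For concreteness, the three measures can be sketched in code. The snippet below is a minimal, illustrative scoring function, not the instrument used in the studies: the response format is invented for the example, VKsize is computed as the hits-minus-false-alarms correction for guessing, and the CV is assumed to be the standard deviation of a test-taker's recognition times divided by their mean.

```python
import statistics

def yes_no_scores(responses):
    """Score one test-taker's Timed Yes/No test.

    Each response is an (is_real_word, said_yes, rt_ms) tuple; the
    tuple layout is invented for this illustration.
    """
    real = [r for r in responses if r[0]]
    pseudo = [r for r in responses if not r[0]]
    # Hits: percentage of real words accepted; false alarms: percentage
    # of pseudowords accepted.
    hits = 100.0 * sum(1 for r in real if r[1]) / len(real)
    false_alarms = 100.0 * sum(1 for r in pseudo if r[1]) / len(pseudo)
    # mnRT and CV are computed here over "yes" responses to real words.
    rts = [r[2] for r in real if r[1]]
    mn_rt = statistics.mean(rts)
    cv = statistics.stdev(rts) / mn_rt
    return {"hits": hits, "FA": false_alarms,
            "VKsize": hits - false_alarms, "mnRT": mn_rt, "CV": cv}
```

Because the CV divides the spread of a test-taker's response times by their mean, it is unitless, which is what allows consistency to be compared across test-takers with different baseline speeds.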
Table 11.1

Summary of means (M) and standard deviations (SD) for VKsize, hits, mnRT, and CV measures for Studies 1–5

[Table values were not recoverable from the source. Recoverable row groups: Study 1: University groups (L2 university, L1 university); Study 2: Entry standards (including L1 English); Study 3: IELTS band scores; Study 4: Sydney (lower intermediate, upper intermediate); Study 5: Singapore.]
The differences in groups and settings, and the lack of a single, independent measure of proficiency, make direct comparisons across the studies impossible (indeed, lexical facility is proposed as just such a single measure). However, several generalizations can be made. The VKsize scores and, to a lesser extent, the hits are the most consistent in separating the proficiency levels across all the studies. The IELTS 6.5 band-score groups were nearly identical in Studies 2 (M = 57) and 3 (M = 59). Scores for the adjacent 7+ band (7 and 7.5 combined) in the respective studies were also close: Study 2 (M = 73) and Study 3 (M = 72). These are similar to the L2 university score in Study 1 (M = 71) and the Malaysian group in Study 2 (M = 70). All of this suggests that VKsize is a reliable measure of vocabulary skill.

The mnRT results are much less consistent. The 6.5 group in Study 2 had a much higher mnRT (M = 1444) than the 6.5 group in Study 3 (M = 1031). The mnRTs for the 7+ levels also differed, for Studies 2 (M = 1280) and 3 (M = 861). This variability in recognition time underscores a difficulty with its use as a measure, discussed below. The two language program groups were much slower than the university groups, despite the VKsize scores for the advanced group in Study 4 (M = 69) and the English for Academic Purposes (EAP) group in Study 5 (M = 59) being comparable to those of the IELTS 6.5 band groups in Studies 2 and 3. This is consistent with the notion that recognition speed lags behind size (Zhang and Lu 2013). Aside from Study 1, the CV scores were sensitive only where the proficiency difference between the groups was great, as between IELTS band scores of 5 and 7+.

Table 11.2 summarizes the effect sizes for the individual and composite measures in the seven studies. Only effect sizes (Cohen’s d) for the significant pairwise comparisons of means are presented. Blank cells indicate that the mean difference did not reach statistical significance. An effect size can be interpreted in the absence of statistically significant differences, but for presentation purposes, only the significant results will be discussed. See the specific chapters for effect sizes not reported here.
Table 11.2

Summary of lexical facility measures' effect sizes for individual and composite measures

[Table values were not recoverable from the source. The table reports the range of Cohen's d effect sizes for pairwise comparisons (see note a) in: Study 1: University proficiency levels (N = 110); Study 2a: University entry standards study, written test; Study 2b: University entry standards study, spoken test (N = 132); Study 3: IELTS band scores (N = 371); Study 4: Australian language program placement (N = 87); Study 5: Singapore language program levels (N = 56). It also reports the variance accounted for in hierarchical regression models (ΔR2 VKsize, ΔR2 mnRT, R2 total) for: Study 3: IELTS band scores (N = 344; all, FA < 20%, FA < 10%); Study 6: EAP grade (entry group, N = 72: all, FA < 20%; exit group, N = 68: all, FA < 20%).]
Note: *p < .10; **p < .05; ***p < .001 (two-tailed); VKsize, correction for guessing scores (hits - false alarms); mnRT, mean recognition time in milliseconds; CV, coefficient of variation; FA < 20% = only participants with false-alarm rates less than 20% included in analysis; FA < 10% = only participants with false-alarm rates less than 10% included in analysis.

a Comparisons: Study 1: L2 preuniversity – L2 university – L1 university; Study 2: IELTS 6 – IELTS 7 – Malaysian – Singaporean – L1 English; Study 3: IELTS overall band score 5 – 5.5 – 6 – 6.5 – 7+; Study 4: Elementary – lower intermediate – upper intermediate – advanced; Study 5: Elementary – preintermediate – intermediate – English for Academic Purposes (EAP; advanced).

The benchmark used throughout the book for interpreting the magnitude of the observed effect sizes is taken from Plonsky and Oswald's meta-analysis (2014, p. 889). For mean differences between groups, values around .40 are considered small, around .70 medium, and 1.0 and beyond large. The authors recommend higher values for within-group contrasts, namely .60, 1.00, and 1.40, respectively. The comparisons in Table 11.2 all represent between-group contrasts. Within-group comparisons are made as part of the test of the word frequency assumption, but the nonparametric test in that analysis uses r as the effect size.
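Once Cohen's d is computed from the group scores, the between-group benchmarks can be applied mechanically. The sketch below uses the standard pooled-standard-deviation formula for d and, purely for illustration, treats the approximate benchmarks as hard cutoffs; the function names are invented for the example.

```python
import statistics

def cohens_d(group1, group2):
    """Cohen's d for a between-group mean difference, pooled SD."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (m1 - m2) / pooled_sd

def interpret_between_groups(d):
    """Plonsky and Oswald (2014) between-group benchmarks as cutoffs:
    ~.40 small, ~.70 medium, 1.0 and beyond large."""
    d = abs(d)
    if d >= 1.0:
        return "large"
    if d >= 0.70:
        return "medium"
    if d >= 0.40:
        return "small"
    return "below small"
```

For within-group contrasts the cutoffs would instead be .60, 1.00, and 1.40, per the same source.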

The table also contains the effect sizes, in R2, for the regression model variance accounted for by the three measures. Benchmarks for interpreting r/R2 are .25/.06 for small, .40/.16 for medium, and .60/.36 for large effects (Plonsky and Oswald 2014, p. 889).
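The hierarchical regression logic behind the ΔR2 columns can be sketched as follows: fit the criterion on VKsize alone, then on VKsize plus mnRT, and report the gain in R2 at each step. This is an illustrative ordinary-least-squares sketch with invented names, not the analysis code from the studies.

```python
import numpy as np

def r_squared(predictors, criterion):
    """R^2 for an OLS fit of the criterion on the predictors (with intercept)."""
    X = np.column_stack([np.ones(len(criterion))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)
    resid = criterion - X @ beta
    ss_total = np.sum((criterion - criterion.mean()) ** 2)
    return 1.0 - np.sum(resid ** 2) / ss_total

def hierarchical_r2(vksize, mn_rt, criterion):
    """Enter VKsize at step 1, add mnRT at step 2; report each step's gain."""
    step1 = r_squared([vksize], criterion)
    step2 = r_squared([vksize, mn_rt], criterion)
    return {"dR2_VKsize": step1,
            "dR2_mnRT": step2 - step1,
            "R2_total": step2}
```

The ΔR2 for mnRT is thus the unique variance it explains over and above vocabulary size, which is the quantity at stake in the lexical facility proposal.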

The first five studies examined the sensitivity of the measures to program- and test-based group differences. The last two investigated the measures’ sensitivity to individual differences in English-medium academic performance. The results of each study are briefly summarized.

University English Group Differences (Study 1, Chap.  6)

Study 1 focused on the sensitivity of the individual and composite measures to differences between three distinct English proficiency groups. All three measures (VKsize, mnRT, and CV) were sensitive to differences between L2 preuniversity students, L2 university students, and first language (L1) university students at an Australian university. The effect sizes for all measures exceeded, most of them considerably, the benchmark of 1.0 for a strong effect. The proficiency difference between the preuniversity and L1 university groups is reflected in very high effect sizes, ranging from 2.68 for the VKsize difference to over 5 for the composite VKsize_mnRT_CV. The effect sizes for the L2 preuniversity and L2 university groups were also robust, ranging from 1.70 for the VKsize to 2.60 for the mnRT comparison. The stronger effect sizes for the VKsize_mnRT_CV comparisons (1.4–5.4) over the individual VKsize comparisons (1.1–2.7) support the proposal that a combination of size and speed provides a more sensitive measure of group differences than size alone. However, this result was largely due to the strength of the mnRT measure. This was the only study in which the mnRT effect size was stronger than that of VKsize.

The study also analyzed group performance as a function of word frequency levels. Hits (percentage of words identified) were used instead of the VKsize measure, as the latter incorporates false-alarm performance and is calculated across the entire item set. Hits and mnRT performance across the progressively lower frequency bands (e.g., 2K, 3K, 5K, and 10K) decreased uniformly in all three groups. All the adjacent-level differences were statistically significant, and r values ranged between .41 and .63 for the hits, and .38 and .64 for the mnRT, all in the medium-to-strong range. The CV was less sensitive to frequency-level differences. Although mean differences mirrored those of the hits and mnRT, only the 2K–3K difference was statistically significant, and the r-value of .17 indicates no effect.

Study 1 showed that all three measures were stable dimensions of L2 vocabulary proficiency and highly sensitive to group differences. mnRT and CV measures also accounted for additional variability beyond size alone. The study also demonstrated the validity of frequency-of-occurrence levels as indices of L2 vocabulary learning.

University English Entry Standards and IELTS (Study 2, Chap.  7 and Study 3, Chap.  8)

Studies 2 and 3 examined the sensitivity of the measures to group differences in English entry standards used in Australian universities. In Study 2, the sensitivity of the measures was examined across five groups of international students: entering students with the university minimum IELTS 6.5 overall band score, a combined group from the next two bands, IELTS 7+ (7 and 7.5 combined), Malaysian high school graduates, Singaporean high school graduates, and a baseline group of L1 students from English-speaking countries. Unlike Study 1, the four L2 groups did not represent a fixed order of proficiency, though the IELTS groups were expected to differ from each other and the baseline L1 English group was expected to outperform all the rest. The motivation for the study was to assess how well the measures can serve as an independent benchmark for comparing somewhat disparate groups, all of which are assumed to share a threshold level of English proficiency but also to differ noticeably beyond that. The study also compared performance on the lexical facility measures in written and spoken formats to assess the effect of presentation mode on test outcomes, both for group differences and for the frequency assumption.

Considering the three measures, the VKsize scores improved on a continuum of IELTS 6.5 < IELTS 7+ < Malaysian high school < Singaporean high school < L1 English. This was true for both presentation modes, though the spoken test scores were consistently lower. In the written version, the differences between the IELTS 6.5 group and the other four groups were statistically significant, with the effect size d values ranging from .86 to 1.81. The Singaporean group was also significantly different from the IELTS 7+ (d = .66) and the Malaysian groups (d = .95), but was not different from the L1 English group. In the spoken test results, the pattern was the same, except that the Singaporean group was not significantly different from the Malaysian group, but differed from the L1 English group (d = .89).

The individual mnRTs were less sensitive to group differences in both the written and spoken test data. For the written version, the groups split into two. The IELTS 6.5 and 7+ groups did not differ from each other, but were different from the other three groups (d = 1.14–1.86), who in turn did not differ from each other. In the spoken test results, the IELTS 6.5 group differed significantly from the other groups (d = .63–1.73). The L1 English group was also significantly different from all the other groups (d = .82–1.72).

The CV measure was the least sensitive of the three individual measures. In the written test results, the L1 English group was significantly different from the other groups, though effect sizes were in the medium range (d = .71–.99). The same pattern was evident for the spoken test data, though the effect sizes were slightly larger (d = .71–1.49). The difference between the IELTS 6.5 and IELTS 7+ groups was also significant (d = 1.48).

The most sensitive measure in both the written and spoken formats was the composite VKsize_mnRT. It discriminated between all groups in both versions, with the sole exception of the written Singaporean and L1 English group difference. It also yielded higher effect sizes in seven of the ten comparisons in the written test results and in all ten of the comparisons in the spoken test results. However, although the d values were higher, the confidence intervals (CIs) showed that the differences were not statistically significant. The other composite measure, VKsize_mnRT_CV, was less informative due to the inclusion of the CV.

The study also analyzed test performance as a function of word frequency levels. As was the case in Study 1, hits and mnRT performance declined uniformly across the progressively lower frequency bands (e.g., 2K, 3K, 5K, and 10K) in all the groups. All the adjacent-level differences were statistically significant, and r values ranged between .41 and .63 for the hits, and .38 and .64 for the mnRT, all in the medium-to-strong range. The CV was less sensitive to frequency-level differences, although its mean differences mirrored those of the hits and mnRT. Only the 2K–3K difference was statistically significant, and the negligible r value of .03 indicated no effect.

In summary, the VKsize and mnRT measures were sensitive to group differences, while the CV was less so. The effect size ranged from moderate to strong depending on the comparison. The composite VKsize_mnRT was the most sensitive, consistent with the proposal that the combination of size and speed was better than size alone in characterizing group differences. It was also evident that the spoken format yielded a pattern of results similar to the written, though the size scores were lower and the responses slower.

Study 3 examined the sensitivity of the measures to IELTS overall band-score differences among students in a preuniversity foundation-year course. Scores across five adjacent band-score levels (5–5.5–6–6.5–7+) were examined. The VKsize score discriminated among all the IELTS band-score differences, except for the lowest adjacent comparison, 5–5.5. The smallest effect size (d = .48) was between the adjacent 5.5 and 6 levels and the largest (2.93) between the nonadjacent 5 and 7+ levels. The results for the hits were similar. The mnRT measure discriminated between all adjacent levels, with effect sizes ranging from .78 for the lowest comparison (5–5.5) to 1.60 for the largest (5–7+). The CV was significant for only two comparisons involving the widest band-score gaps (d = .42 and .88).

The two composite measures, VKsize_mnRT and VKsize_mnRT_CV, were more sensitive than the individual VKsize measure. They discriminated between all the adjacent levels with comparable effect sizes, providing further support for the lexical facility proposal. The advantage of combining size and speed was also supported by the regression analysis, in which the mnRT results accounted for unique variance in the model beyond VKsize, though for only about 6% of the variance, compared with 42% for VKsize.

The IELTS band-score results replicate those of the first two studies: the VKsize and mnRT measures were reliable discriminators of test levels and accounted for moderate-to-strong effect sizes for these differences, and the mnRT contributed a unique amount of variance in doing this. The CV scores were again shown to be less informative than the other two measures, being sensitive only to the most extreme band-score differences.

Language Program Placement (Studies 4 and 5, Chap.  9)

Studies 4 and 5 examined the sensitivity of the measures to proficiency as characterized by placement in English language programs. Study 4 compared the three measures with in-house placement tests at a Sydney English language school. There was a strong correspondence between the lexical facility measures and the program placement tests, evident both in discriminating among the four levels and in the overall placement decisions. VKsize scores discriminated between all the program levels (elementary, lower intermediate, upper intermediate, and advanced), with strong effect sizes (.99–2.13). mnRT discriminated between the elementary and the other levels, but not among the other three. The effect sizes for the significant comparisons ranged from .80 to 1.22. The composite VKsize_mnRT matched the individual VKsize in sensitivity, with slightly higher effect sizes in most of the respective comparisons. The CV measure showed no sensitivity to any of the level differences.

Study 5 examined the lexical facility measures as correlates of placement levels (elementary, preintermediate, intermediate, and EAP) in an English language school in Singapore. The students were similar to those in Study 4 in learning goals but were of lower overall proficiency. VKsize scores discriminated the EAP group from the other three levels, with strong effect sizes (.99–1.94), but the three lower levels did not differ among themselves. The mnRTs discriminated between all three adjacent levels, with effect sizes ranging from .73 to 1.61. The CV measure was not sensitive to any level differences. The composite VKsize_mnRT also discriminated between all four levels, with effect sizes slightly stronger than those of the individual mnRT measure, though the differences were not statistically significant. The relatively greater sensitivity of the individual mnRT and composite VKsize_mnRT measures over the individual VKsize measure provides further support for the lexical facility proposal.

The results from the two language school studies are consistent with the first three studies. Vocabulary size and mnRT measures are reliable discriminators of test levels, with the differences mostly accompanied by strong effect sizes. The combination of size and speed results in a more sensitive measure than size alone.

Academic English Performance (Studies 6 and 7, Chap.  10)

Studies 1–5 concerned the sensitivity of the lexical facility measures to differences in proficiency as used for functional ends, as in evaluating university entry standards and language program placement. The last two studies examined the sensitivity of the measures to individual differences in academic English performance. Study 6 investigated the measures as predictors of course grades in an English for Academic Purposes (EAP) course, and Study 7 examined them as correlates of grade point average (GPA) in the same university preparation program.

Study 6 examined the sensitivity of the measures to individual differences in academic performance in a preuniversity foundation-year course in Australia. The measures were correlated with the semester-end grades for a year-long EAP course. The study also examined two groups that differed by when they took the Timed Yes/No test: an entry test group took the test at the beginning of the academic year, and an exit test group took it at the end, just before receiving the course grade. The entry group results reflected the strength of the measures as predictors of overall course performance, while the exit group findings provided a test of how well the measures correlated more immediately with the end-of-year course grades. There was a marginally strong correlation between VKsize performance and academic English grades (r = .60) for the exit group, and a moderate correlation (r = .44) for the entry group. The mnRT results had a small but significant correlation with grades for the exit group (r = .32), and a nonsignificant correlation for the entry group. The CV did not correlate with course grades in either group. The VKsize_mnRT correlations for both groups were the same as the respective VKsize correlations. A regression analysis examining the size and speed measures as joint predictors of course grades showed that the VKsize scores accounted for all the significant variance in academic English grades for the entry group (20%). In the exit group, VKsize also accounted for most of the variance (over 30%). The mnRT measure also accounted for a small but significant amount of variance. In the regression models on the complete data sets, it accounted for 3% of the variance, though at the more liberal p-level of .10. In the analysis in which individual false-alarm rates were trimmed at 20%, it accounted for 6% at the conventional p < .05 level.

In summary, the VKsize and mnRT measures were more sensitive for the exit group than for the entry group. A moderately strong correlation was evident for the exit group between the final grades and VKsize and, to a lesser extent, mnRT. The mnRT accounted for about 5% of the exit group grade variance, an amount comparable to the earlier entry standards, IELTS band scores, and language program studies. There was a substantial difference between the two groups in academic grades and test performance that may have had a bearing on the results.

Study 7 explored the link between test performance and program-end GPAs in the same cohort as in Study 6. Not unexpectedly, the results mirrored those of the earlier study. For the entry group, there were small correlations (in the low .3 range) for VKsize, mnRT, and the two combined. The same correlations for the exit group were in the medium range (.45).

The VKsize and mnRT measures were also considered as predictors of GPA in tertiary English-medium programs in Oman (Roche and Harrington 2013; Harrington and Roche 2014a, b). The first study compared the two measures and academic writing skill as predictors of first-semester GPAs (Roche and Harrington 2013), and the second included reading skill, along with writing and the two lexical facility measures, as predictors of GPA (Harrington and Roche 2014b). Roche and Harrington (2013) found that VKsize and mnRT accounted for unique GPA variance in a regression analysis that included only the two measures as predictors, though the amount of variance (about 10% and 8%, respectively) was relatively low. They also found that when the measures were entered into a model that included an academic English writing score, the two measures accounted for no additional variance. Similarly, Harrington and Roche (2014b) examined the combined effect of reading skill, writing skill, and the two lexical facility scores as GPA predictors and also found that academic writing skill was the best overall predictor of GPA. It accounted for most of the variance in the criterion (27%), but the other three measures also accounted for a significant amount of variance (reading, 3%; VKsize, 3%; and mnRT, 2%). When the effects of the VKsize and mnRT scores were considered independently, both accounted for a small but significant amount of GPA variance (7% and 9%, respectively). Harrington and Roche (2014a) also found that the sensitivity of the lexical facility measures as predictors of GPA varied by academic field of study.

In summary, for the Omani data, the VKsize and mnRT measures were less sensitive to individual academic grade and GPA differences than to the group differences examined earlier. This was particularly the case when they were compared with writing and reading tasks that measure more global proficiency.

11.3 Key Findings

The studies have sought to establish lexical facility as a context-independent index of L2 vocabulary skill sensitive to performance differences in various academic English domains. There were three closely related aims of the research. The research sought to
  • compare the three measures of lexical facility (VKsize, mnRT, and CV) as stable indices of L2 vocabulary skill;

  • evaluate the sensitivity of these measures individually and as composites to differences in a range of academic English domains; and, in doing so,

  • establish the degree to which the composite measures combining the VKsize measure with the mnRT and CV measures provide a more sensitive measure of L2 proficiency than the VKsize measure alone.

The main findings relative to these aims are now summarized.

Vocabulary Size Is a Sensitive Measure of Proficiency

The VKsize score was the most sensitive individual measure. It was as good as (Study 1) or better than (Studies 2–4 and 6–7) the mnRT measure in discriminating between proficiency levels. In the regression models reported in Studies 3 and 6, VKsize accounted for far greater variance than mnRT (and, of course, the CV). The effect sizes for the VKsize differences were consistently strong, whether reflected in Cohen's d or the R2 value. In the trimmed data set in Study 3, VKsize accounted for over half the total variance. This finding was not unexpected, given that previous work on vocabulary size by Laufer, Nation, and their colleagues has shown that frequency-based vocabulary size measures are a robust correlate of L2 academic performance. The findings strongly replicate the earlier research.

Mean Response Time Is Also a Sensitive Measure of Proficiency

The mnRT measure also discriminated between the groups across the studies, though it was slightly less sensitive than the VKsize measure. The d effect sizes for the significant pairwise comparisons were of at least medium strength, with most strong. In Study 1, mnRT had a larger effect size than VKsize across the L1 and two L2 groups, as well as between the two L2 groups alone. In the regression analyses, the measure accounted for 3–5% of the unique variance in the models. The measure was less informative about differences in English grades and GPAs, although even here it was sensitive to some of the group comparisons.

The CV Is Less Sensitive Than the VKsize and mnRT Measures

The lexical facility account introduced in Chap.  4 proposed that response time consistency, as measured by the CV, can be a reliable and informative index of vocabulary skill and a sensitive measure of proficiency, both by itself and in combination with VKsize and mnRT. This proposal received very limited support. In only one study (Study 1) did the CV mirror the sensitivity of the other two measures. Significant CV effects were only evident in group comparisons in which level differences were very distinct, as in the IELTS 6.5 and L1 English groups in Study 2, and the IELTS 5 and 7+ groups in Study 3. The CV can be considered as an index of proficiency, but only in somewhat crude terms.

VKsize and mnRT Together Are More Sensitive Than VKsize Alone

The proposal that size and speed together provide a more sensitive measure than size alone is at the heart of the lexical facility account. This was supported. The composite measure VKsize_mnRT was generally more sensitive than VKsize alone. This was evident in both the number of significant group comparisons and the relative effect sizes of these differences. In five of the seven studies, the composite VKsize_mnRT measure produced a larger effect size than for the VKsize measure alone, though the differences were not always statistically significant. In the regression studies, mnRT accounted for a significant amount of unique variance beyond vocabulary size, although the magnitude of the effect was small (3–6%).

The findings replicate earlier research demonstrating a reliable relationship between vocabulary size and speed (Laufer and Nation 2001; Harrington 2006), and are at odds with Miralpeix and Meara (2010), who found none. The results indicate that recognition time does provide an additional, reliable source of information about individual vocabulary skill. This is the central finding of the research reported here, and it provides a solid basis for combining size and speed as a measurement dimension, that is, for lexical facility.

A Frequency-Based Measure Provides a Valid Index of Vocabulary Knowledge

A distinctive feature of the lexical facility account and the vocabulary size literature more generally is the use of word frequency statistics to estimate vocabulary size. A basic assumption is that word frequency is a strong predictor of when a word is learned and the speed with which it is recognized. The findings here and elsewhere (e.g., Milton 2009) show that frequency levels provide a reliable and informative framework for characterizing vocabulary development that directly relates to performance. This holds for both written and spoken modes; however, it was evident that performance on the spoken version was consistently lower. Word frequency statistics provide an objective, context-independent way to benchmark L2 vocabulary development.

False-Alarm Rates Are a General Indicator of Proficiency as Well as Guessing (but Might Not Make That Much Difference)

The most distinctive feature of the Yes/No Test format is the use of pseudowords. The self-report nature of the format motivates the inclusion of these phonologically possible, but meaningless, words as a means to gauge whether the test-taker is guessing. In principle, the false-alarm rate is a measure of guessing independent of vocabulary size, as estimated from the hits. In practice, this was not the case. There was substantial variability in the false-alarm rates within and across studies, but overall the false-alarm rates were a fair reflection of proficiency levels. They were much higher for lower-proficiency groups and dropped progressively as levels improved. The mean performance by the lower-proficiency groups was 20% and higher, while for the more proficient L2 and L1 groups, it was under 10%. The differences in false-alarm rates evident across the studies here, and in other published research, raise the issue of the comparability of findings across studies. In Studies 3 and 6, secondary analyses were carried out in which the data sets were trimmed of individuals with mean false-alarm rates exceeding 20% (Studies 3 and 6) or 10% (Study 3), and the statistical tests were then run again. The results were very similar to those of the original analyses, with the trimmed data sets yielding larger effect sizes, though the differences were not significant. It was also evident that the hits by themselves yield a reasonably sensitive measure of vocabulary knowledge, though not as strong as the VKsize measure. This all suggests that false alarms may not be necessary for measuring individual performance (Harsch and Hartig 2015).
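The trimming procedure itself is straightforward to express in code. The sketch below assumes hypothetical record fields ('fa' for the false-alarm percentage, 'vksize' for the size score); it simply filters the sample at a cutoff before the group statistics are rerun.

```python
def trim_by_false_alarms(records, max_fa=20.0):
    """Keep only test-takers whose false-alarm rate (%) is below the cutoff.

    `records` is a list of dicts with hypothetical 'fa' and 'vksize' keys.
    """
    return [r for r in records if r["fa"] < max_fa]

def mean_score(records, key):
    """Group mean of one score field (e.g., 'vksize')."""
    return sum(r[key] for r in records) / len(records)
```

Rerunning the analyses on `trim_by_false_alarms(sample, 20.0)` and `trim_by_false_alarms(sample, 10.0)` corresponds to the FA < 20% and FA < 10% rows in Table 11.2.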

Recognition Time Can Be a Messy Measure

The collection of recognition time data and its use as evidence for underlying knowledge states is typically associated with the laboratory. In these controlled settings, the focus is on response time variability in largely error-free performance, in which target behaviors are narrowly defined and technical demands readily met. The research presented here has examined mean recognition time differences in error-filled performance in more everyday instructional settings. Ensuring optimum performance, that is, that the test-taker is attending to the task and working as quickly (and accurately) as possible, is a challenge. A significant threat to the reliability of the results is a systematic trade-off in how quickly and accurately a test-taker responds. Responding very quickly with many errors, or very slowly with few errors, will render the results difficult to interpret. There was little evidence of a systematic correlation in individual performance between higher accuracy and slower performance (or vice versa). It is not possible to rule out any trade-off behavior, but there was no evidence of systematic bias in any of the individuals or groups studied. All of the studies showed significant positive relationships between VKsize and the inverted mnRT, but the size of the correlations (.2–.5) indicated that other factors were also at play.
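A speed-accuracy trade-off would surface as a systematic positive correlation between test-takers' accuracy scores and their mean recognition times (higher accuracy bought with slower responding). Checking for it requires nothing more than a Pearson correlation over per-participant scores; the helper below is an illustrative sketch, not the analysis code from the studies.

```python
def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)
```

Applied to per-participant VKsize and mnRT vectors, a near-zero or negative r argues against a systematic trade-off, since a trade-off predicts that more accurate test-takers should be reliably slower.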

The variability across the studies is also a concern. The IELTS 6.5 group in Study 2 had a much higher mnRT (M = 1444) than the 6.5 group in Study 3 (M = 1031), despite the VKsize scores being nearly identical. The mnRT means for both language program groups (Studies 4 and 5) are much higher relative to their VKsize scores than those of the other groups. As noted, this may be due to a relative lag in development or may reflect different testing conditions. Both studies were administered by local, on-site staff at the Sydney and Singaporean schools, whereas all the other studies were carried out by the author or close colleagues. Recognition time performance is far more sensitive to differences in individual motivation and attention, and it is possible that the administrators in the language program studies placed less emphasis on the importance of the recognition time responses. There was also considerable variation within the other studies, which themselves varied somewhat in testing conditions, for example, data collected in a group setting versus individually.

While acknowledging these limitations, the results also show that recognition time on its own, and in combination with size, provides a reliable and informative means of characterizing L2 vocabulary knowledge that is sensitive to proficiency differences in important functional domains of academic English performance.
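How the three measures relate can be illustrated in a minimal sketch. This is not the scoring procedure used in the studies reported here: the equal-weight composite and the rescaling bounds (`rt_floor`, `rt_ceiling`) are illustrative assumptions only, chosen to show how size and speed might be combined onto a common scale.

```python
import statistics

def lexical_facility_scores(correct_rts_ms, vksize_pct):
    """Compute the three lexical facility measures for one test-taker.

    correct_rts_ms: recognition times (ms) for correct word responses.
    vksize_pct: pseudoword-adjusted vocabulary size score (0-100).
    """
    mn_rt = statistics.mean(correct_rts_ms)        # mean recognition time (mnRT)
    cv = statistics.stdev(correct_rts_ms) / mn_rt  # coefficient of variation (CV)
    return {"VKsize": vksize_pct, "mnRT": mn_rt, "CV": round(cv, 3)}

def composite(scores, rt_floor=400, rt_ceiling=2000):
    """Equal-weight composite of size and inverted, rescaled speed.

    Faster responses raise the composite; the rescaling bounds are
    hypothetical, serving only to map mnRT onto a 0-100 scale.
    """
    speed = 100 * (rt_ceiling - scores["mnRT"]) / (rt_ceiling - rt_floor)
    speed = max(0.0, min(100.0, speed))
    return (scores["VKsize"] + speed) / 2
```

For example, a test-taker with recognition times of 800, 900, and 1000 ms and a VKsize score of 70 would receive mnRT = 900, CV ≈ 0.111, and, under these assumed bounds, a composite of about 69.4.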

11.4 Conclusions

The findings support the key element of the lexical facility proposal, namely that the combination of size and speed provides a more sensitive index of differences in L2 lexical skill than size alone. This advantage is reflected in greater sensitivity to proficiency and performance differences in a range of academic English domains. In five of the seven studies, the combination of size and speed resulted in larger effect sizes than for the VKsize measure alone, whether in the regression models or in the composite scores, although the differences were not always statistically significant and further confirmation is needed. The CV was much less sensitive to proficiency differences, having a significant and strong effect for all the pairwise comparisons only in Study 1. Elsewhere, CV effects were evident only when comparing groups whose levels were highly distinct, as with the IELTS 6.5 and L1 English groups in Study 2. The usefulness of the CV as an index of proficiency remains very much an open question.

Unique to the testing format used here is the inclusion of pseudowords to assess guessing. The false-alarm rate provided a somewhat stable measure of performance, but considerable variability within and between groups was also evident. The results suggest that word performance (hits) alone can provide a reasonable measure of vocabulary knowledge without including false-alarm performance.

The validity of the testing format was also established in the analysis of performance at the word frequency-of-occurrence levels. The results from both written and spoken versions showed that word frequency statistics provide a reliable and robust predictor of outcomes.

The final chapter revisits the original lexical facility proposal in light of these findings and identifies directions for future research.


  1.

    The VKsize score is an indirect measure of the individual’s vocabulary size. A very rough estimate of what a VKsize score of 70 represents as overall vocabulary size can be calculated by taking 70% of 10,000, which is the word range sampled in almost all the tests here. That would be a minimum of 7000 words. Note this is based on the unlikely assumption that the false-alarm rate adjusts the hit rate exactly for the actual size. The individual will also know some words beyond the 10K level, but it will be a steadily diminishing percentage of these, maybe an additional 1500 words, for a total of 8500 words. This is a rough estimate of the actual size, and given that only four frequency bands are sampled, one that is closer to a guess than an estimate. For more precise estimation, the Vocabulary Size Test, which samples each level from 1K to 15K, is superior (Beglar 2010).

  2.

    Study 2 also had a much slower L1 group (M = 960) compared with the baseline L1 group in Study 1 (M = 777).

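The back-of-the-envelope projection in note 1 can be written out as a short sketch. The figures, including the 1500-word allowance for words known beyond the 10K level, are the note's own rough assumptions, closer to a guess than an estimate:

```python
# Rough vocabulary-size projection from note 1 (an approximation,
# not a validated estimator).
sampled_range = 10_000  # word frequency range sampled by almost all the tests
vksize_score = 70       # VKsize score: percent correct, adjusted for false alarms
beyond_10k = 1_500      # assumed additional words known beyond the 10K level

within_range = sampled_range * vksize_score // 100  # minimum estimate: 7000 words
total_estimate = within_range + beyond_10k          # overall estimate: 8500 words
```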

  1. Beglar, D. (2010). A Rasch-based validation of the vocabulary size test. Language Testing, 27(1), 101–118. doi:10.1177/0265532209340194
  2. Harrington, M. (2006). The lexical decision task as a measure of L2 lexical proficiency. EUROSLA Yearbook, 6(1), 147–168.
  3. Harrington, M., & Roche, T. (2014a). Word recognition skill and academic achievement across disciplines in an English-as-lingua-franca setting. In U. Knoch (Ed.), Papers in Language Testing, 16, 4.
  4. Harrington, M., & Roche, T. (2014b). Identifying academically at-risk students at an English-medium university in Oman: Post-enrolment language assessment in an English-as-a-foreign language setting. Journal of English for Academic Purposes, 15, 34–37.
  5. Harsch, C., & Hartig, J. (2015). Comparing C-tests and yes/no vocabulary size tests as predictors of receptive language skills. Language Testing, 33(4), 555–575.
  6. Laufer, B., & Nation, P. (1995). Vocabulary size and use: Lexical richness in L2 written production. Applied Linguistics, 16(3), 307–322.
  7. Laufer, B., & Nation, P. (2001). Passive vocabulary size and speed of meaning recognition: Are they related? EUROSLA Yearbook, 1(1), 7–28.
  8. Milton, J. (2009). Measuring second language vocabulary acquisition. Bristol: Multilingual Matters.
  9. Miralpeix, I., & Meara, P. (2010). The written word. Retrieved from
  10. Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes in L2 research. Language Learning, 64, 878–912. doi:10.1111/lang.12079
  11. Roche, T., & Harrington, M. (2013). Recognition vocabulary knowledge as a predictor of academic performance in an English as a foreign language setting. Language Testing in Asia, 3(1), 1–13. doi:10.1186/2229-0443-3-12
  12. Zhang, X., & Lu, X. (2013). A longitudinal study of receptive vocabulary breadth knowledge growth and fluency development. Applied Linguistics, 35(3), 283–304. doi:10.1093/applin/amt014

Copyright information

© The Author(s) 2018

Authors and Affiliations

  • Michael Harrington, University of Queensland, Brisbane, Australia
