1 Introduction

This is an investigation into the use of the notion core vocabulary in some areas of linguistics and related disciplines. The inspiration for this paper comes from my involvement in two recent research projects where this notion plays a central role.

The KELLY EU projectFootnote 1 aims at developing vocabularies corresponding to the six language learner proficiency levels of the Council of Europe’s Common European Framework of Reference (CEFR). The aim is to have 1,000–2,000 words for each level, further subdivided into thematic areas. The vocabularies are being developed for nine languages—Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian and Swedish—and translated among all language pairs by professional translators. In the KELLY approach, the notion of coreness or centrality corresponds to adjusted frequency in a very large corpus collected from the WWW using the Web-as-Corpus methodology (Baroni and Bernardini 2006). The lists for the lowest CEFR level, containing approximately 1,500 vocabulary items, are arguably candidates for core vocabularies for these languages. My involvement in this project comes from a long-standing interest in intelligent computer-assisted language learning, which I share with Lauri Carlson and on which we have collaborated at one time (Borin et al. 2002).

The goal of the Digital areal linguistics—or IDS (Intercontinental Dictionary Series)—projectFootnote 2 (funded by the Swedish Research Council) is to create a database of comparable lexical items in a number of representative South Asian languages, with a focus on the Himalayan region in India, and to use this database for investigating the Himalayas as a linguistic area. The project is a collaboration with the global IDS project (Max Planck Institute for Evolutionary Anthropology, Leipzig),Footnote 3 an international initiative for collecting comparable basic vocabulary lists in a large number of languages.

The methodological musings reported in the next section arose from my practical work in the IDS project, where I undertook the task of updating the Swedish word list (approximately 1,500 senses).Footnote 4

In order to put the discussion in Sect. 2 on a surer footing, I made some comparisons of a number of proposed core vocabularies. The results of these comparisons are presented and discussed in Sects. 3 and 4.

2 Afflicted with Ontological Angst: Words—Senses—Concepts

As linguists, we have intuitions about which lexical units are more central in our own language, and which more peripheral. At any rate, there are fairly clear cases, such as dog vs. cep ‘a mushroom of the species Boletus edulis’, or mortal vs. fraught. In other cases it may be more difficult to determine relative centrality. An interesting question in this connection, which has occupied linguists for centuries, is this: If we determine the most central lexical items for a large number of diverse languages, how much overlap will we see in the resulting sets?

The notion of a core vocabulary crops up in more than one linguistic subdiscipline, but with different meanings and based on different theoretical and methodological premises:

  • In lexicology and lexicography, the core vocabulary is equivalent to a defining vocabulary, a set of words in which all definitions in a dictionary must be expressed, directly or indirectly, and which itself will not be defined—only described—in the dictionary. This is in principle a language-specific notion; different languages could have different core vocabularies.

  • In semantics, as a language-independent extension of the foregoing, the core vocabulary is a set of senses—universal lexical-semantic primitives—in terms of which all vocabulary items in all languages can be expressed. On some construals, however, these primitives need not actually correspond to lexical items in any language.

  • In genetic linguistics, a core vocabulary—often referred to as a Swadesh list (Swadesh 1950, 1955)—is a list of supposedly universal senses together with their lexical realizations in many languages, which are used in an endeavor referred to as lexicostatistics, the purpose of which is to investigate genetic and areal relationships among languages. This kind of core vocabulary must be resistant to replacement over time in order to work as intended.

  • In lexical typology, a core vocabulary serves as the basis for investigating questions such as the universality of lexical expression, borrowability, etc.

  • In applied linguistics, corresponding to the notion that vocabulary growth in language learners is not random or spurious, the core vocabulary is the first, most basic set of words that a language learner will need to master in a foreign language in order to fulfill some minimal requirement of competence in the language.

In recent years, the field of computational linguistics has arguably been added to the above. Here, core vocabularies enter the stage in the form of ontologies, or formally organized concept systems.Footnote 5 These systems are not explicitly referred to as consisting of lexical or even linguistic items, but the most reasonable interpretation is in fact that they are a kind of lexical structure. Here I am quite in agreement with the view expressed by Wilks (2009: 4), namely that

items in ontologies and taxonomies are and remain words in natural languages—the very ones they seem to be, in fact—and that this fact places strong constraints on the degree of formalisation that can ever be achieved by the use of such structures. The word “drink” has many meanings (e.g., the sea) and attempts to restrict it, within structures, and by rules, constraints or the domain used, can only have limited success. […] Those who continue to maintain that ‘universal words’ are not the English words they look most like, must at least tell us which of the senses closest to the ‘universal word’ they intend it to bear under formalisation.

Wilks’s remark puts its finger squarely on a practical-methodological difficulty which arises in connection with the comparison of core vocabularies: How do we determine that two vocabulary items are the “same”, in one language and—in particular—across languages? In other words: Do we know how to compare vocabulary items?

Actually, it turns out that we do not; there is even some confusion about what a vocabulary item is. If we look for clarification to the field of lexicography—which is where we would expect to find the expert views par excellence on this issue—we will soon learn that there is actually no consensus on what the basic vocabulary unit is.

Lexicographers as a rule do not involve themselves in linguistic discussions about the meaning of the term “word”; this is something that they leave to theoretical linguists. Nevertheless, their actual practices in structuring dictionaries provide ample empirical evidence that there is more than one notion of word involved, if by this we refer to the basic structural unit of a dictionary, the dictionary entry.

Some of the contenders that we encounter in actual dictionaries are:Footnote 6

  • the etymological word

  • the lemma

  • the lempos

  • the lemgram

  • the sense

  • the synset

The etymological word is often the basic unit in older dictionaries, at least in Scandinavia. Figure 1 shows the three entries with the headword mál in an Icelandic–Swedish dictionary first published in the 1940s.Footnote 7 Note that there is no difference in inflection among the three entries; they are all strong neuter nouns. They are given different entries in the dictionary by dint of each having a different historical origin, i.e., a different etymology.

Fig. 1 Sven BF Jansson: Isländsk-svensk ordbok, s.v. mál

The lemma, in the sense ‘citation form’, is often used as the basic organizing unit in dictionaries.Footnote 8 See Fig. 2, taken from one of the English dictionaries available online at http://dictionary.reference.com. The subdivision into parts of speech is made inside the basic entry.

Fig. 2 One of the online English dictionaries at dictionary.reference.com, s.v. moan

In our own work on an infrastructure for lexical resources for language technology (Borin et al. 2010, 2012), we have felt the need for a more precise terminology in reference to the various linguistic units appearing in our lexical resources. The terms lempos and lemgram were coined in this connection, but they reflect already existing ways of organizing dictionaries.

The lempos is the combination of lemma and part of speech. Figure 3 shows this with the help of another of the English dictionaries available online at http://dictionary.reference.com. Note that the entry hang contains two verbs with different inflectional paradigms—with the past tense forms hung and hanged—and also verbs with two different syntactic behaviors—transitive and intransitive—although note further that only one of the intransitive usages is explicitly labeled as such (sense 6 in Fig. 3).

Fig. 3 One of the online English dictionaries at dictionary.reference.com, s.v. hang

The lemgram is an extension of the lempos idea. This is a combination of a lemma with a formally definable behavior, such as part of speech, inflectional paradigm, pronunciation, word-formation structure and syntactic behavior. In Fig. 4 we see three entries from the online Swedish dictionary available (to subscribers) at http://www.ne.se. The formal characteristics which determine the subdivision into entries in this dictionary are part of speech, pronunciation and inflectional paradigm. In Fig. 4 there are one noun lemgram and two verb lemgrams. The formal distinction between the two Swedish verbs is exactly the same as in the case of English hang cited above: We have a verb with strong inflection (past stack, supine stuckit) meaning ‘to stick/prick/stab (tr.); to leave/get lost (itr.; colloquial)’ and one with weak inflection (past stickade, supine stickat), meaning ‘to knit’. The two different syntactic behaviors of the strong verb do not play a role in defining lemgrams in this dictionary, even though they arguably influence the inflectional paradigms, since 2 sticka in the sense ‘to leave/get lost (itr.)’ does not form a past participle.

Fig. 4 The online Swedish dictionary at http://www.ne.se, s.v. sticka
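The difference between lemma, lempos and lemgram can be made concrete as three successively finer grouping keys over the same entries. The following is only a minimal sketch: the `Entry` record, the `group` helper and the paradigm labels are hypothetical names introduced for illustration, pronunciation is left out as a criterion, and the three entries are modeled on the description of sticka above rather than taken from the dictionary itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    lemma: str     # citation form
    pos: str       # part of speech
    paradigm: str  # hypothetical label for the inflectional paradigm
    gloss: str     # English gloss as given in the text (empty if not given)


# The three "sticka" entries described in the text; labels are illustrative.
ENTRIES = [
    Entry("sticka", "noun", "noun", ""),
    Entry("sticka", "verb", "strong", "to stick/prick/stab; to leave/get lost"),
    Entry("sticka", "verb", "weak", "to knit"),
]


def group(entries, key):
    """Group entries by the value that `key` returns for each entry."""
    groups = defaultdict(list)
    for entry in entries:
        groups[key(entry)].append(entry)
    return dict(groups)


by_lemma = group(ENTRIES, lambda e: e.lemma)
by_lempos = group(ENTRIES, lambda e: (e.lemma, e.pos))
by_lemgram = group(ENTRIES, lambda e: (e.lemma, e.pos, e.paradigm))

# One lemma, two lempos units, three lemgrams.
print(len(by_lemma), len(by_lempos), len(by_lemgram))
```

Which unit a dictionary uses then amounts to the choice of key; adding syntactic behavior to the key would split the strong verb further.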

The sense is of course a dictionary ‘atom’ with a long and venerable tradition. Our central computational lexical resource for modern Swedish, SALDO (Borin et al. 2008),Footnote 9 is organized primarily into word and multi-word senses. Figure 5 shows the senses recognized (at present) in SALDO for the lemma sticka (five verbs and two nouns).

Fig. 5 SALDO, s.v. sticka

One basic lexical unit that cannot be left out of this discussion is the synset. The synset is generally defined as consisting of words which are interchangeable in at least one (sentence) context without changing the (truth-conditional) meaning of the sentence in question. This is claimed to be the basic lexical unit in Princeton WordNet (PWN; Fellbaum 1998).Footnote 10 I say “claimed to be”, since a count of the cardinalities of the synsets in PWN 3.0 reveals that 54 % of the synsets are singleton sets, i.e., have only one member. This makes one suspect that senses are logically prior to synsets in PWN, too, as they are in SALDO and many other lexical resources. In Fig. 6 we see the 9 PWN noun synsets for the lemma stick (there are also 16 verb synsets for this lemma). Note that six out of these are singleton synsets.

Fig. 6 Princeton WordNet: noun synsets s.v. stick
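The count of synset cardinalities mentioned above is a simple computation once each synset is available as a set of member words. A minimal sketch follows; the function name `singleton_share` and the toy inventory are hypothetical and are not taken from PWN, so the printed figure says nothing about the real resource.

```python
def singleton_share(synsets):
    """Return the fraction of synsets that have exactly one member."""
    synsets = list(synsets)
    if not synsets:
        raise ValueError("no synsets given")
    singletons = sum(1 for members in synsets if len(members) == 1)
    return singletons / len(synsets)


# Hypothetical toy inventory for illustration only, NOT actual PWN data.
TOY_SYNSETS = [
    frozenset({"stick"}),
    frozenset({"stick", "joystick"}),
    frozenset({"twig"}),
    frozenset({"pin", "peg"}),
]

share = singleton_share(TOY_SYNSETS)
print(f"{share:.0%} singleton synsets")
```

Applied to a full wordnet, the input would be one member set per synset, however the resource at hand exposes them.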

Finally, in the ontologies mentioned earlier, the basic entities are concepts. The interesting question—which is mostly consigned to silence in the literature on ontologies—is how concepts in ontologies and lexical units in dictionaries are ontologically, as it were, interconnected. One—perhaps too facile—answer is this (from Pease and Fellbaum 2010):

The basic unit of WordNet is a set of cognitively equivalent synonyms, or synset. […] Each synset represents a concept, and each member of a synset encodes the same concept. (p. 25)

[…]

[…] the concept-word mappings of any given language are to some extent accidental; existing words do not fully reflect the inventory of concepts that is available. (p. 27)

[…]

In a formal ontology, the meaning of the terms only consists of the formal mathematics used to define those terms. The names of the terms could be replaced by arbitrary unique character strings and their meaning would still be the same. This independence from language gives some confidence in SUMO as a starting point for a true interlingua. (p. 30)

As already stated, with Yorick Wilks and others I believe this to be a delusion; either the “concepts” in formal ontologies actually are mathematical objects, and have at most a very indirect and incomplete connection to language—the less likely alternative, in my view—or they are simply words in English, in which case they have at most a very tenuous and incomplete connection to mathematics, as well as a largely unexplored connection to a putative set of universal concepts.

It is far from clear exactly what a “concept” is in this literature. One possible interpretation of the quote just given is that the “inventory of concepts that is available” would be the union of those concepts which find lexical expression in some—at least one—language out of the world’s approximately 7,000 languages (Lewis 2009). Another interpretation would be that the inventory of concepts is independent of language, so that there will be concepts that never receive linguistic expression in any language. Logically, this independence must be one-way, however; it seems that everything that gets at least lexical expression—even if this notion is somewhat unclear—in some language, will also necessarily be a concept. The literature on concepts—in linguistics, philosophy and psychology—is actually too vague to be of much use to us in our ontological quandary. It is sometimes proposed in the literature that concepts have compositional structure, a bit like linguistic units. Does this then mean that an utterance like After work last Tuesday, I took the way by the pub because I felt that I needed some company, but there was nobody there that I knew, so I went to the bookstore and looked for a good book instead corresponds to a concept? Nobody knows, it seems, or rather, nobody has spent much effort thinking about questions such as where concepts end, as it were, if there is a syntax with composition rules for concepts, etc.Footnote 11

Somewhat in contrast to the quote above, in the introduction to Fellbaum (1998) we read that “[t]he majority of lexicalized concepts are shared among languages” (p. 8). The findings reported in recent work in linguistic typology and language universals clearly run counter to this claim:

languages do differ almost without limit as to which meanings they choose to lexicalize (von Fintel and Matthewson 2008: 151)

languages differ enormously in the concepts that they provide ready-coded in grammar and lexicon [… and] many languages make semantic distinctions that we certainly would never think of making (Evans and Levinson 2009: 435)

There is no sense of “broad” under which “the grammars and lexicons of all languages are broadly similar.” (Levinson 2003: 28)

From even a small sample of languages it is clear that many impressionistically “basic” items of English vocabulary (such as go, water and eat) lack exact equivalents in other languages. (Goddard 2001: 57)

The conclusions of the preceding seem to be that (1) there is a real danger of becoming confused and inconsistent about which kind of lexical unit one is working with, and (2) attempting to keep concepts and lexical senses theoretically distinct is hardly ever worth the effort. However, since even lexicographers working on a single language—arguably the foremost experts in this area—do not seem to come to an agreement about the basic lexical units of a language, (3) until this issue has been resolved, there is absolutely no need to posit language-independent lexical units of any kind, neither the “universal words” of the formal ontology community (see above), nor “comparative concepts” of the kind proposed by Haspelmath (2010).

Tentatively, then—even though, as we have seen, we do not actually know, in a deeper sense, what we are comparing—we may still compare core vocabularies simply using the English glosses and assume that they in the normal case—especially for the small vocabularies that we will be concerned with hereFootnote 12—reflect the same or at least comparable senses.

3 Comparing Core Vocabularies

To make things more concrete, in Table 1 we show four different ‘core’ (or ‘basic’) vocabulary lists (ordered alphabetically):

  1. The Automated Similarity Judgement Program (ASJP)Footnote 13 40-item vocabulary for genetic and areal linguistics research (Holman et al. 2008) (referred to as A40 below)

  2. The first 40 items of the Leipzig–Jakarta (LJ) vocabulary from the loanword typology projectFootnote 14 (Haspelmath and Tadmor 2009) (L40)

  3. Goddard’s 42 universal lexical items (Goddard 2001) (G42)

  4. All items common to at least eight of the nine KELLY languages (40 items) (K-8)

Table 1 Four different small “core vocabularies”

The vocabulary that sticks out in Table 1 is K-8. For some reason, the translation methodology and selection criteria used in the KELLY project conspire to yield a list conspicuously different from the other core vocabularies. If we instead compare the other five short vocabulary lists with the most basic (A1-level) monolingual English KELLY list (1,119 items), we get on the order of 60–80 % correspondences (percentages in terms of the short 40–100 item lists). This is more like what we would expect from a basic vocabulary list, and comparable to what we find in a comparison of the same five lists with the 800-item Basic English list (Ogden 1930) or the first 2,000 items in the SUBTLEXus word-form frequency list derived from a corpus of film subtitles.Footnote 15 Still, it is noteworthy that at most 80 % of the A40 list and 75 % or less of the A100 list (see below) are found in these longer basic vocabulary lists for English, i.e., at least a fifth to a quarter of the ASJP items are missing. ASJP sense labels are English words, after all, so somehow one would expect a figure much closer to 100 %.

The K-8 list is thus not comparable to the ASJP or LJ lists, but since it reflects a set of common words that have emerged from a combination of corpus processing and manual translation among all 72 language pairs, a comparison of K-8 with G42 should arguably be meaningful.Footnote 16
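The figure of 72 follows if the pairs are read as ordered (source, target) pairs, since nine languages give 9 × 8 = 72 translation directions. The selection behind K-7, K-8 and K-9 can likewise be read as a threshold on the number of languages in which an item occurs. The sketch below illustrates both points under these assumptions; `common_items` and the toy data are hypothetical, and the actual KELLY procedure (adjusted frequencies, manual translation) is not modeled.

```python
from collections import Counter
from itertools import permutations

LANGUAGES = ["Arabic", "Chinese", "English", "Greek", "Italian",
             "Norwegian", "Polish", "Russian", "Swedish"]

# Ordered (source, target) pairs: 9 * 8 = 72 translation directions.
pairs = list(permutations(LANGUAGES, 2))


def common_items(lists_by_language, min_languages):
    """Return the items that occur in at least `min_languages` lists."""
    counts = Counter()
    for items in lists_by_language.values():
        counts.update(set(items))  # count each item once per language
    return {item for item, n in counts.items() if n >= min_languages}


# Hypothetical toy data: English glosses per (anonymous) language.
TOY = {
    "lang_a": {"big", "time", "two"},
    "lang_b": {"big", "time"},
    "lang_c": {"big", "two"},
}

print(len(pairs))                    # 72
print(sorted(common_items(TOY, 3)))  # ['big']
print(sorted(common_items(TOY, 2)))  # ['big', 'time', 'two']
```

Lowering the threshold can only enlarge the resulting list, which matches the relation between the K-9, K-8 and K-7 list sizes reported here.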

In Table 2 some detailed comparisons among several vocabulary lists are presented. In addition to the lists shown in Table 1, the following are also given:

  • A long (100-item) ASJP list, which is a modified Swadesh list (Swadesh 1950, 1955), where items have been reordered on the basis of empirically measured retention rates in a large lexical material (A100)

  • The first 100 items of the LJ list (L100)

  • Two additional KELLY lists: KELLY-9 (K-9; 5 items common to all 9 KELLY languages) and KELLY-7 (K-7; 271 items common to at least 7 KELLY languages out of which one is English)

Table 2 The number of shared items in the vocabularies. Three data items are given in each cell (if not empty or 0) in the form n (r/c %): n = the number of shared items; r = the percentage of shared items in relation to the list named at the beginning of the current row; and c = the percentage of shared items in relation to the list named at the top of the current column

Table 2 lists the number of items shared between the lists and their percentages relative to the sizes of the whole lists.
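How a single cell of Table 2 is derived can be sketched as follows, assuming that items are compared by exact match of their English glosses, as proposed at the end of Sect. 2. The function name `table_cell` and the two toy lists are hypothetical and do not reproduce any of the actual vocabularies; rendering a zero overlap as an empty cell is a simplification.

```python
def table_cell(row_list, col_list):
    """Format one cell as 'n (r/c %)' for a row list and a column list.

    n = number of shared items, r = n as a percentage of the row list,
    c = n as a percentage of the column list.
    """
    row, col = set(row_list), set(col_list)
    n = len(row & col)
    if n == 0:
        return ""
    r = 100 * n / len(row)
    c = 100 * n / len(col)
    return f"{n} ({r:.0f}/{c:.0f} %)"


# Hypothetical toy lists, not any of the vocabularies compared in the text.
ROW = {"I", "you", "one", "dog", "fish"}
COL = {"I", "you", "one", "big"}

print(table_cell(ROW, COL))  # 3 (60/75 %)
```

The two percentages differ whenever the lists differ in size, which is why both are reported: a short list can be largely contained in a long one while covering only a small part of it.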

Further, A40, L40 and G42 have three items in common—I, one and you—and A100 and L100 share 12 items with G42. G42 has no common items with the K-8/K-9 lists, and shares only three items with the longest (271-item) K-7 list: big, time and two.

4 Conclusions

It is perhaps not surprising that there should be so little overlap among different kinds of ‘core vocabularies’, since they aim at capturing different aspects of ‘coreness’:

  • Item stability:

    • The ASJP vocabulary consists of maximally stable form–meaning pairs; it can be seen as a refinement of traditional Swadesh lists, based on a broad range of empirical data.

    • The Leipzig–Jakarta list reflects resistance to borrowing.

  • Sense inventory:

    • Goddard’s list contains senses that receive lexical expression in all languages.Footnote 17

    • Basic learner vocabularies contain high-frequency—much-used and consequently highly useful—lexical items.

Still, we would expect the ASJP list to be a subset of the LJ list, because resistance to borrowing is a special case of resistance to vocabulary item replacement. Another common form of replacement is the substitution of one native vocabulary item for another (‘semantic change’). What we actually find, however, is only a modest overlap between the two lists.

There is no logical need for universal word-meanings, or for high-frequency items, to be resistant to replacement. In this sense Goddard’s criteria and those used in compiling learner vocabulary lists are orthogonal to the criterion of replacement. There is also no logical reason for universal word-meanings to be highly frequent, although there may be good pragmatic reasons: Highly technical vocabulary could be expected to behave in a way that could make it universal in Goddard’s sense—because such vocabulary items would mean the same wherever they occurred—were it not for the simple fact that this vocabulary will be confined to a small fraction of the world’s languages in each individual case. Hence, Goddard’s universal word-meanings will by pragmatic necessity belong to everyday language.

High-frequency senses may or may not undergo the linguistic equivalent of an extreme makeover. From experience we know that, e.g., sentence adverbs and indefinite pronouns—arguably central and in part universal vocabulary items—often are non-cognate even in closely related languages.

On the other hand, it is often mentioned in works on historical linguistics that high-frequency items tend to preserve older inflectional patterns as irregularities (seen from the point of view of the present-day inflectional system), which can consequently be used in internal reconstruction for inferring the older system. Intuitively, the older inflectional patterns should be accompanied by the corresponding older lexical items, i.e., this would lead to the conclusion that high-frequency vocabulary should be more stable than the above comparisons with the Basic English and SUBTLEXus lists show.

Why the ASJP and LJ lists do not show the expected inclusion relationship and why high-frequency central vocabulary is less stable than expected are two mysterious aspects of core vocabularies which will need further investigation.