1 Introduction

Knowledge resources, such as thesauri, taxonomies, and, more recently, ontologies, are of high importance in today's information technology applications, especially for information retrieval, information extraction, and knowledge discovery. With the enormous development of the Web, the role of knowledge resources has increased drastically. However, one of the real barriers to their wide use is the multilinguality of information resources - according to Pimienta et al. (2009), about 65 % of the Internet is non-English content.

In the context of information retrieval, attempts at solving multilinguality problems go back to Salton's work in the early 1970s (see e.g. Salton 1970). Since then, several methods of using multilingual dictionaries for improving information retrieval in multilingual text databases have been developed (Hull and Grefenstette 1996; Ballesteros and Croft 1997; Pirkola 1998). On the other hand, a lot of time and effort has been invested in building and maintaining multilingual thesauri and/or flat dictionaries to be used for enhancing information retrieval in multilingual databases. Many of them (e.g. Eurovoc, GEMET, Agrovoc, INIS) are multilingual and domain oriented. Their translation possibilities are very limited, restricted to the concepts (main descriptors in the thesaurus languages); nevertheless, they are extensively used for information retrieval in international, usually multilingual, databases.

Recently, in the context of the Semantic Web, the problems with multilinguality of web resources have become even more challenging. A vision of a multilingual Semantic Web has been presented in Gracia et al. (2012). In this vision, a special role is given to multilingual mappings and linguistic information. In general, it fits well with the idea presented e.g. in Hotho et al. (2003), which consists in seeing an ontology as a conceptual layer surrounded by lexical layers. In such an approach, each language is represented by a lexical layer that maps the lexical units of that language to the language-independent nucleus of the ontology (concepts and their instances). Essentially, the lexical layers should cover all the language-dependent relationships, including synonymy and polysemy. A more detailed proposal for such an ontology structure is given in Wróblewska et al. (2012), and the potential of using such a structure for resolving polysemy is presented in Protaziuk et al. (2012).

In order to cope with multilinguality problems, multilingual components of knowledge resources are of the highest importance. Although novel methods based on Wikipedia cross-language links have recently been presented by Sorg and Cimiano (2008a), it seems that specialized multilingual domain-oriented knowledge resources will still play an important role.

In Krajewski et al. (2014) we presented an approach for translating a lexical layer to a target language. The method, based on mining subject-similar repositories in the source and target languages, first discovers a seed dictionary, and then uses the seed for finding semantically similar terms. Therefore, it was named Seed Based Dictionary Builder (SBDB). Its main feature is that it works without any explicitly predefined relationships between the source and target languages. It is a knowledge-poor approach – it uses merely two text repositories, in the source and target languages respectively. To a large extent it is independent of the target and source languages. However, the main drawback of the SBDB method was that the translation was term-to-term rather than meaning-to-term. As a result, the output dictionary had no references to the meanings of the translated terms.

Protaziuk et al. (2012) proposed incorporating discriminants into the structure of the ontology lexical layer, called LEXO. It was shown that such a structure is very useful when the ontology is used in text analysis for word sense disambiguation. The representation of term meanings by discriminants is also important in the process of translating dictionaries. In this paper we enhance our method (Krajewski et al. 2014) in such a way that, instead of translating terms, it first discovers the meanings of the terms, represented by discriminants, then builds context vectors for the meanings, and finally translates the meanings. We denote the modified method as SBDB+. By assigning translations to the meanings rather than to the terms we obtain much better precision in the semantic translation phase. The paper is organized as follows: in Section 2 we discuss related work, then in Section 3 we define basic concepts and formally state the problem. Section 4 presents the algorithm in detail. Then we provide experimental results in Section 5. Section 6 concludes the paper.

2 Related work

Following Grefenstette (1993), text processing methods can be classified as knowledge-rich or knowledge-poor. The former either have language knowledge embedded within the text processing algorithms, or rely on advanced predefined knowledge bases (ontologies, thesauri, semantic dictionaries, etc.). In contrast, knowledge-poor approaches do not use semantic knowledge bases, which are difficult and costly to build, and their text processing algorithms do not embed deep language-dependent knowledge. This classification also applies to automatic translation methods, including the research on building multilingual dictionaries.

Research on automatic translation has been going on for a long time and covers a wide range of issues. Several directions can be distinguished, but the main goal, namely to reduce human involvement in the translation process and to speed it up, remains unchanged.

In particular, one important research area in which the quality of translation between languages is essential is multi-lingual information retrieval (MLIR), where the main objective is to search a multilingual set of documents and collect the relevant multilingual documents. The main difficulties, resulting from cross-lingual translation ambiguities, were already discussed by Salton (1970).

As a matter of fact, there are many domain-oriented multilingual thesauri (such as Agrovoc or Eurovoc) which could be used for multilingual information retrieval (Soergel 1997). They are rich in concepts specific to the domains they serve; however, there are serious limitations to using them efficiently:

  1. The maintenance of multilingual thesauri is quite costly; in addition, they are not flexible enough to follow the domain developments (and the linguistic changes in the domain);

  2. The lexical granularity is quite sparse and not sufficient for high quality MLIR (usually they contain no more than some 100 000 entries);

  3. The semantic relationships between concepts are usually rough, and there is no space for polysemy;

  4. They require quite good knowledge from the indexers and end-users to obtain good results.

Hull and Grefenstette (1996) propose using specialized bilingual machine readable dictionaries (MRD). The problem that remains open is the disambiguation of such translations. Ballesteros and Croft (1997) also consider using MRDs for translating queries. As an additional means of improving the retrieval parameters they propose using pre-translation and post-translation feedback.

A representative method within CLIR is the approach named Cross Lingual Latent Semantic Analysis (LSA), described in Dumais et al. (1997). The method consists in performing SVD on vectors built from term frequencies in documents and their translated versions. Similarity can then be computed between vectors in the resulting latent space. In particular, a new vector (representing a document or a query) can be folded into the latent space, and its similarity to the other vectors in the space can be measured.
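To make the computation concrete, a minimal sketch of the cross-lingual LSA idea is given below (the function names and the toy matrix are illustrative, not taken from Dumais et al. (1997)): a term-document matrix built over bilingual training documents is reduced with a truncated SVD, new documents are folded into the latent space, and relatedness is measured by cosine similarity.

```python
import numpy as np

def build_latent_space(term_doc_matrix, k=100):
    """Truncated SVD of a term-document matrix (terms x documents).
    For cross-lingual LSA, each column is a training document concatenated
    with its translation, so the vocabulary mixes both languages."""
    U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    return U[:, :k], s[:k]                  # latent term space, singular values

def fold_in(doc_vector, U_k, s_k):
    """Project a new document (term-frequency vector) into the latent space."""
    return doc_vector @ U_k / s_k

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# toy usage: 5 terms (mixing both languages) x 4 bilingual training documents
X = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 3, 1],
              [1, 0, 0, 2]], dtype=float)
U_k, s_k = build_latent_space(X, k=2)
query = np.array([1, 1, 0, 0, 0], dtype=float)   # a new document or query
doc = np.array([0, 1, 1, 0, 0], dtype=float)
print(cosine(fold_in(query, U_k, s_k), fold_in(doc, U_k, s_k)))
```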

An interesting approach, called CL-ESA (Cross Lingual Explicit Semantic Analysis), has been presented in Sorg and Cimiano (2008a). The approach is an extension of Explicit Semantic Analysis (ESA), the idea presented in Gabrilovich and Markovitch (2007). Both ESA and CL-ESA can definitely be classified as knowledge-rich, as they rely on the extensive use of Wikipedia as a text repository (articles), but also as a provider of semantic knowledge (ESA) and of structured cross-lingual relationships between Wikipedia articles. Given a document, CL-ESA uses a document-aligned cross-lingual reference collection in Wikipedia to represent the document as a language-independent concept vector (usually expressed with tf-idf). In CL-ESA, the relatedness of two documents in different languages is assessed by the cosine similarity between the corresponding vector representations. Cimiano et al. (2009) have shown that in the field of CLIR the CL-ESA approach outperforms latent concept models not only in result quality but also in performance. A good example of applying a knowledge-rich approach is BabelNet, presented by Navigli and Ponzetto (2012). BabelNet is a multilingual semantic network obtained from the automatic integration of WordNet, Wikipedia, Wiktionary and WikiData.

Other semantic support for translating dictionaries can be taken from existing ontologies or thesauri. A method combining statistical and semantic approaches for translating thesauri and ontologies is presented in McCrae et al. (2011).

Unfortunately, these methods have limitations. Namely, they cannot be used when the semantic resources (Wikipedia, thesauri or ontologies) are insufficient. For Wikipedia, even for well-represented languages, it may turn out that some domain-specific parts are missing or under-represented. On the other hand, many multilingual thesauri (such as Agrovoc or Eurovoc) are rich in domain semantics, and their semantic content can be a good starting point, especially for building a lexical layer for a domain ontology; however, the lexical volume of such thesauri is quite limited (usually they contain no more than some 50 000 entries, and they ignore polysemy).

Among other knowledge-rich approaches, there are also syntax-based methods exploiting the grammatical structure of sentences (Yamada and Knight 2001). This kind of approach, in contrast to the methods presented above, belongs to the family of deep analysis methods. Such methods are characterized by employing knowledge about the syntactic, or even semantic, structure of the given texts. Their main problem is that they are strongly language dependent and cannot be easily adapted to another pair of languages.

Another translation problem refers to the task of translating ontologies or thesauri themselves. In this case, where the data are semi-structured, the translation can be supported by the semantic relationships contained in the structure, as in the already mentioned approach of McCrae et al. (2011). Recently, in the context of linked data, the problem of language-independent reasoning in the Semantic Web was analyzed in Gracia et al. (2012). The authors discuss various layouts of ontologies and their lexical surroundings. Also in this case, lexicons for translating between the languages are shown to be indispensable.

In the area of machine translation there are numerous knowledge-poor algorithms which are based on statistical methods and can be used for word-to-word or word-to-phrase translation between two languages. One of the first systems of this kind was IBM Model 1, presented by Brown et al. (1990). It was based on the EM algorithm and provided word-to-word translation. The approach was then extended by Koehn et al. (2003) to word-to-phrase translation. The idea was further developed by Vogel et al. (1996) and Deng and Byrne (2008) for word-to-phrase translation with the use of Hidden Markov Models. Based on features of IBM Model 4, Deng and Byrne (2008) substantially improve the translation quality by using HMM models. The translation precision of these methods can reach a very high level. Unfortunately, their main problem is that they need large parallel repositories, whereas for most language pairs parallel texts are hard to find. Even texts derived from official multilingual proceedings (e.g. of the EU Parliament) may not be appropriate for a particular subject domain.
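For illustration, a minimal sketch of the EM estimation behind IBM Model 1 is shown below (word-to-word translation probabilities only, without a NULL alignment; the toy sentence pairs and the function name are ours, not taken from Brown et al. (1990)):

```python
from collections import defaultdict

def ibm_model1(sentence_pairs, iterations=10):
    """Minimal sketch of IBM Model 1: estimate t(f | e) with EM from a
    (toy) parallel corpus of tokenized sentence pairs (english, foreign)."""
    f_vocab = {f for _, fs in sentence_pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))      # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for es, fs in sentence_pairs:                # E-step
            for f in fs:
                z = sum(t[(f, e)] for e in es)       # normalization for f
                for e in es:
                    c = t[(f, e)] / z                # expected alignment count
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():              # M-step
            t[(f, e)] = c / total[e]
    return dict(t)

pairs = [(["the", "house"], ["das", "haus"]),
         (["the", "book"], ["das", "buch"]),
         (["a", "book"], ["ein", "buch"])]
probs = ibm_model1(pairs)
print(round(probs[("das", "the")], 2))   # 'the' -> 'das' gets a high probability
```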

Especially in the context of domain ontologies for dynamically developing domains, there is high interest in lexicon translation methods that are knowledge-poor and do not rely on parallel repositories. The idea of lexicon translation with non-parallel corpora was presented in Rapp (1995). The method is based on the observation that if the words a and b often collocate in one language, then their translations should also collocate in the repository in the target language. Additionally, for large corpora the frequencies of the collocations should be similar. However conceptually correct, this solution turns out to be computationally very expensive. The problem of how to obtain a starting seed lexicon is not considered in the paper.

An interesting solution for obtaining the seed lexicon has been proposed by Koehn and Knight (2002). Namely, the authors postulate building the seed translation dictionary from the source and target repositories, based on the existence of words that have the same, or similarly spelled, forms in both languages.

Our approach to building bilingual dictionaries has been influenced by Rapp (1999) and Koehn and Knight (2002). In our approach we also build a seed dictionary in the first phase of building the bilingual dictionary, but in contrast to Koehn and Knight (2002), we show that knowledge-poor data mining methods can be used successfully even for languages belonging to different families (English and Polish). Having built the seed, we also build vectors for the source terms and translate them with the seed dictionary, but the way the vectors are constructed is different. The main difference consists in finding the various meanings of the terms to be translated, and then finding translations for the meanings rather than for the terms. In Sections 3 and 4 we present our approach in more detail.

3 Basic concepts and definitions

In this Section we state formally the problem of building a bi-lingual dictionary. Given a repository R in a language L we can extract a dictionary D, composed of simple terms (one-word terms) and compound terms (multi-word terms). As in (Rybinski et al. 2008), any set X ⊆ D of terms will be called a termset. The significance of a termset X, called support and denoted by sup(X), is expressed by the number of paragraphs of R in which X appears. Termset X is called frequent if it occurs in more than ε paragraphs in R, where ε > 0 is a user-defined support threshold. We say that X is a context of term t if sup(tX) > 0 in R, and it is a frequent context of t if sup(tX) > ε in R. Distinct meanings of terms are indicated by the distinct contexts in which they appear frequently. This assumption is based on the distributional hypothesis (Harris 1981), where the underlying idea is that "a word is characterized by the company it keeps". The rule is very intuitive; the problem is, however, how the notion of a context is defined. To distinguish the various meanings of a term t we introduce the concept of discriminants. Discriminants play the role of contexts that define the various "meanings" of the lexical unit t. We also distinguish a set of meanings \(\mathfrak {M}\), and treat a discriminant as a representation of a meaning \(m \in \mathfrak {M}\). Below we present an idea of how the meanings of t can be represented. Given term t and the termsets X and Y, we define relative support as:

$$ relSup(t,X,Y)=\min\{sup(tXY)/sup(tX), sup(tXY)/sup(tY)\} $$
(1)

The relative support measures how "distinct", in terms of support, the collocations tX and tY are. Provided both tX and tY are frequent, if X and Y belong to different domains, the support of the combination of t, X and Y together is usually much lower. For example, having the frequent termsets {apple, fruit} and {apple, motherboard}, we can expect that the support of {apple, fruit, motherboard} will be much lower than that of at least one of the original termsets.
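A minimal sketch of how sup(X) and relSup(t, X, Y) from Eq. (1) can be computed over a paragraph-segmented repository is given below (the function names and the toy paragraphs are illustrative):

```python
def support(termset, paragraphs):
    """sup(X): the number of paragraphs containing every term of X."""
    ts = set(termset)
    return sum(1 for p in paragraphs if ts <= p)

def rel_sup(t, X, Y, paragraphs):
    """relSup(t, X, Y) as in Eq. (1); returns inf if tX or tY never occurs."""
    sup_tx = support({t} | X, paragraphs)
    sup_ty = support({t} | Y, paragraphs)
    sup_txy = support({t} | X | Y, paragraphs)
    if sup_tx == 0 or sup_ty == 0:
        return float("inf")
    return min(sup_txy / sup_tx, sup_txy / sup_ty)

# toy repository: each paragraph reduced to its set of terms
paras = [{"apple", "fruit", "tree"},
         {"apple", "iphone", "company"},
         {"apple", "fruit", "juice"},
         {"apple", "motherboard", "company"}]
print(rel_sup("apple", {"fruit"}, {"iphone"}, paras))   # 0.0: candidate discriminants
```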

Definition 1

The termset X is called a δ-discriminant for t if for a given δ > 0 the following conditions are satisfied:

  1. the termset tX is frequent;

  2. there exists a termset Y for which tY is also frequent in R;

  3. relSup(t, X, Y) ≤ δ.

We say that for the term t the termsets X and Y are δ-discriminants against each other.

We say that X discriminates a meaning of t against Y, or that X and Y are against each other for term t. We are interested in minimal discriminants (i.e. discriminants that cannot be reduced). A term t may have many discriminants; we denote the set of them by Disc(t). In other words, for a given t we have d_1 ∈ Disc(t) iff there is d_2 ∈ Disc(t) such that d_1 is a discriminant against d_2 for t. Not every pair of discriminants from Disc(t) is against each other. For example, for the term apple the termset processing unit is a discriminant against juice, but it is not a discriminant against motherboard (which is another discriminant of the term apple, and is against juice). Bearing this in mind, we can cluster the discriminants from Disc(t) in such a way that within each cluster the discriminants are not against each other. For d_1, d_2 ∈ Disc(t) such that d_1 is against d_2, we write d_1 # d_2. So, we can cluster all the discriminants from Disc(t) in the following way:

$$\begin{aligned} C_{t} &= \{C_{t}^{1},\dots,C_{t}^{k}\}, \text{ where } \\ C_{t}^{i} &= \{d \in Disc(t) : \forall_{d^{\prime} \in C_{t}^{i}}\, \neg(d \,\#\, d^{\prime}) \;\wedge\; \forall_{j \neq i,\, d^{\prime} \in C_{t}^{j}}\, d \,\#\, d^{\prime}\}, \quad i \in \{1,\dots,k\} \end{aligned} $$
(2)

For a term t we thus have k_t clusters. The discriminants within each cluster are compatible, whereas any two discriminants selected from two different clusters of C_t are against each other. We assign each \({C_{t}^{i}} \in C_{t}\) a meaning from \(\mathfrak {M}\).
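Continuing the previous sketch (and reusing its rel_sup and paras), the grouping of Disc(t) into compatible clusters can be approximated as follows. This is only an illustration of condition (2), assuming the compatibility relation partitions Disc(t) cleanly; the actual meaning induction in SBDB+ is performed by SnS (Section 4.3.1).

```python
def against(d1, d2, t, paragraphs, delta=0.01):
    """d1 # d2: the two discriminants are against each other for term t."""
    return rel_sup(t, set(d1), set(d2), paragraphs) <= delta

def cluster_discriminants(t, discriminants, paragraphs, delta=0.01):
    """Group Disc(t) into clusters of mutually compatible discriminants
    (greedy sketch: join the first cluster whose members are all compatible)."""
    clusters = []
    for d in discriminants:
        compatible = [c for c in clusters
                      if all(not against(d, other, t, paragraphs, delta) for other in c)]
        if compatible:
            compatible[0].append(d)      # join an existing meaning cluster
        else:
            clusters.append([d])         # start a new meaning cluster
    return clusters

# extended toy corpus: "iphone" and "motherboard" co-occur (same meaning),
# while both are against "fruit"
paras2 = paras + [{"apple", "iphone", "motherboard", "company"}] * 3
discs = [("fruit",), ("iphone",), ("motherboard",)]
print(cluster_discriminants("apple", discs, paras2))
# [[('fruit',)], [('iphone',), ('motherboard',)]]
```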

Let us denote by C the set of all clusters of discriminants for all the terms in D:

$$\mathbf{C} = \bigcup_{t \in D} C_{t} $$

Let us illustrate the introduced notions with the following example.

Example 1

Let us consider the word apple with its two meanings: a fruit hanging on a tree, or the well-known tech company. In order to illustrate how discriminants can be found for these meanings we use the English version of Wikipedia. Wikipedia includes 22276 articles containing the word apple. For the purpose of this example we consider the set X = {computer, company, fruit, iphone, tree}, potentially containing discriminants of the two meanings. In Table 1 the supports of termsets combining apple with terms from X are presented.

Table 1 The supports of termsets combining apple with terms from X

Now, in order to find the discriminants we have to measure relSup for {apple, x, y} for x, y ∈ X. Table 2 shows the supports of the triples {apple, x, y}, so that we can calculate the needed relSup(apple, x, y) measures.

Table 2 Supports of {apple, x, y}

With the values of sup(apple, iphone) and sup(apple, fruit) from Table 1, and sup(apple, iphone, fruit) from Table 2, we obtain relSup(apple, iphone, fruit) = 0.005149. Hence, for the threshold δ = 0.01 and for t = apple, the terms iphone and fruit are discriminants against each other. Let us now verify whether the term company further splits either of these two meanings: from Tables 1 and 2 we can evaluate relSup(apple, iphone, company) = 0.098 and relSup(apple, fruit, company) = 0.105, which means that for t = apple the term company is a discriminant neither against fruit nor against iphone. In other words, it can appear in the contexts of both meanings – apple as a fruit and apple as a company name.

We can now define the semantics of the terms from dictionary D by a mapping that assigns a meaning to each pair \((t, {C_{t}^{i}})\), i.e. \(Sem\colon D \times \mathbf {C} \to \mathfrak {M}\), so that for any t ∈ D and \({C_{t}^{i}} \in C_{t}\) there is a meaning \(m \in \mathfrak {M}\). Given Sem(t, C) = m, t ∈ D, \(C \in \mathbf{C}\), we say that m is the meaning of t in the context of C. We do not impose any restrictions that would forbid the mapping Sem from assigning the same meaning to two different arguments, so we can have two terms p and q, p ≠ q, such that for certain \({C_{p}^{i}}, {C_{q}^{j}} \in \mathbf {C}\) we have \(Sem(p, {C_{p}^{i}}) = Sem(q,{C_{q}^{j}})\), which means that p and q are synonymous in some contexts, though they do not have to be synonymous in other contexts.

Now we introduce our primary goal, which can be expressed as follows: we are given two repositories R_S and R_T of texts in the source and target languages L_S and L_T, respectively. The repositories determine flat dictionaries D_S and D_T, specific for L_S and L_T, and resulting from the (possibly independent) repositories R_S and R_T. The dictionaries may contain single words as well as compound terms (e.g. proper names).

Problem

Given D_S and D_T, find a translation function

$$\varGamma\colon D_{S} \times \mathbf{C} \to 2^{D_{T}},$$

such that for p ∈ D_S and \({C_{p}^{i}} \in C_{p}\) we have \(\varGamma (p, {C_{p}^{i}}) = \{t_{1},\dots , t_{k}\}\), where t_i ∈ D_T is a translation of the meaning of p in the context \({C_{p}^{i}}\) (i.e. \(Sem(p, {C_{p}^{i}})=m\)) to the target language L_T. The set {t_1, …, t_k} is a set of synonyms in L_T corresponding to the meaning m.

Below, we will propose a solution that supports building the translation function Γ.

4 The SBDB+ algorithm

4.1 Brief description of the algorithm

As mentioned in Section 2, our approach has been influenced by Rapp (1999) and Koehn and Knight (2002). However, our goal differs from the one presented in Koehn and Knight (2002). As specified in Section 3, we aim to find translations for the meanings expressed in the source language rather than for the words. Importantly, the proposed method uses knowledge-poor text mining algorithms. It consists of four phases:

  1. First, for the repositories R_S and R_T the monolingual dictionaries D_S and D_T are extracted.

  2. Given D_S and D_T, we build a bilingual seed dictionary.

  3. In the third phase, for each source term t ∈ D_S we mine the repository R_S for the discriminant sets C_t.

  4. The fourth phase is devoted to building "context vectors"; for the source language we use the R_S repository and build them for the pairs \((t,{C^{i}_{t}})\), whereas for the repository R_T the context vectors are built for the terms s ∈ D_T; then, with the use of the seed dictionary built in Phase 2, we look for the most similar translation candidates.

A conceptual diagram of the method is presented in Fig. 1. The first phase, extracting the dictionaries, is a standard procedure in which we reject stop words and select only specific parts of speech.

Fig. 1 Conceptual diagram of the method

The second phase is devoted to generating a seed dictionary. The idea is that words from the two dictionaries extracted from the source and target repositories, denoted by D_S and D_T respectively, are compared, and the similarity translation function (a bijection) \(\gamma : D^{\prime }_{S} \to D^{\prime }_{T}\), \(D^{\prime }_{S} \subseteq D_{S}\), \(D^{\prime }_{T} \subseteq D_{T}\), is built by using an edit distance measure. The phase is similar to the one presented in Koehn and Knight (2002), but in contrast to it, no translation rules have to be predefined. Instead, during the comparison process the algorithm mines the transformation rules that are specific for the source and target languages, S and T respectively. The rules are dynamically added to the set of rules, which are then reused in the continued comparison process. This means that no extra language-dependent knowledge about the rules for transcription from the source language to the target one is needed. We call this phase syntactic translation, as the pairs are identified by syntactic similarities.

The third phase is crucial for the quality of the context vectors built in Phase 4. It is used to identify the discriminants for each t ∈ D_S, so that a set of meanings \(\mathfrak {M}\) is built for the terms in the source language.

In Phase 4, the results of the two previous phases (2 and 3) are used for building the translation function Γ, as defined in Section 3. In order to reach this goal, for every pair \((s, {C_{s}^{i}})\), \(s \in D_{S}, {C_{s}^{i}} \in C_{s}\), a context vector is built, and then the semantically close context vectors of terms from D_T are identified. The semantic relatedness can be assessed because the seed dictionary provides a bridge between the two vocabularies. As a result of this phase, for each pair \((s, {C_{s}^{i}})\) the k best candidates are provided, so that the final verification can easily be performed manually. This phase is called semantic translation, as it mines contexts in the source and target languages and looks for semantic similarities.

In the consecutive subsections the particular phases of the algorithm are presented in detail.

4.2 Lexical similarity

The idea of using lexical similarity is based on the assumption that some words exist in a similar form in the source and target languages. Actually, the similarities between languages are a subject of intensive research by linguists. McMahon (1994) discusses how languages belonging to the same family diverge, whereas Kranich et al. (2011) give an interesting example of how languages belonging to different families influence each other. It therefore seems reasonable to exploit such similarities for building seed dictionaries, and not only for languages belonging to the same family.

Linguists indicate various reasons for the similarities between the dictionaries of different languages. For languages belonging to the same group, the similarities largely result from their common historical roots. Additionally, the influence of one language on another may result from various factors, such as technology transfer, cultural influence, etc. This also applies to languages belonging to different groups, e.g. German and Polish.

Words having a similar form (and meaning) are called cognates by linguists, and it turns out that they appear quite often across modern languages. The problem with cognates is that they usually occur in slightly different forms (due to the linguistic rules of the source and target languages). To perform the transcription from one form to another, Koehn and Knight (2002) used linguistic knowledge and defined a set of rules. Instead, we propose to mine the rules automatically while searching for cognates, so that no language-specific knowledge is needed.

To measure the similarity between words, an edit distance algorithm can be employed. We use a slightly modified Damerau-Levenshtein edit distance measure. Usually the distance between the terms p and q is defined as d(p, q) = n/max{l(p), l(q)}, where n is the number of changes (in characters) which have to be performed to obtain q from p, and l(t) is the length (in characters) of term t. In our case we reduce the measure by the number of frequent rules that are already applicable to the term translation (see Definition 2).
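A minimal sketch of the base measure, i.e. a Damerau-Levenshtein edit count (in its common restricted, OSA-style form) normalized by the length of the longer word, could look as follows (the function name is ours):

```python
def dl_distance(p, q):
    """Normalized Damerau-Levenshtein distance: number of character edits
    (insert, delete, substitute, transpose) divided by max(l(p), l(q))."""
    lp, lq = len(p), len(q)
    if lp == 0 or lq == 0:
        return 0.0 if lp == lq else 1.0
    d = [[0] * (lq + 1) for _ in range(lp + 1)]
    for i in range(lp + 1):
        d[i][0] = i
    for j in range(lq + 1):
        d[0][j] = j
    for i in range(1, lp + 1):
        for j in range(1, lq + 1):
            cost = 0 if p[i - 1] == q[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and p[i - 1] == q[j - 2] and p[i - 2] == q[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # transposition
    return d[lp][lq] / max(lp, lq)

print(dl_distance("comic", "komik"))   # 2 edits / 5 characters = 0.4
```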

The proposed algorithm differs from the ones known from the literature in that, along with the process of looking for similar words, it additionally mines the rules. So, if a given type of difference between similar words appears often enough, the algorithm adds new translation rules to the set of rules and continues building the dictionary with the use of the mined rules.

The method for extracting the seed lexicon is designed as an iterative process of incrementally building a bijection \(\gamma \colon D^{\prime }_{S} \to D^{\prime }_{T}\), \(D^{\prime }_{S} \subseteq D_{S}\), \(D^{\prime }_{T} \subseteq D_{T}\). It is shown schematically in the form of pseudocode as Algorithm 1. Each iteration consists of the following steps:

  1. Given D_S and D_T find two sets:

    (a) Δγ, containing all pairs (s, t) ∈ D_S × D_T such that t can be reached from s by applying the translation rules from the set \(\mathcal {R}\) (empty at the beginning);

    (b) σ, containing all candidate pairs, i.e. pairs for which the distance between s and t is minimal and not higher than a given threshold.

    Add Δγ to γ and remove the matched words from D_S and D_T, so that they do not participate in the next iterations.

  2. For every pair (s, t) ∈ σ find all differences, and for each difference build a candidate translation rule.

    (a) If the rule exists in the set of rules \(\mathcal {R}\), increase the rule support; otherwise, check whether the rule exists in the set of candidate rules \(\mathcal {R}_{c}\): if so, increase its support, if not, add the rule to \(\mathcal {R}_{c}\).

    (b) Having passed all the pairs from σ, for every candidate translation rule \(r \in \mathcal {R}_{c}\) such that sup(r) > δ, add the rule to the set of frequent rules \(\mathcal {R}\).

  3. If no new rule has been added to the set of rules \(\mathcal {R}\), terminate; otherwise go to Step 1.

First, let us define the notion of a transformation rule r. A rule has the form x → y, \(x \in {\Sigma }^{*}_{S}\), \(y \in {\Sigma }^{*}_{T}\), where Σ_S and Σ_T are the alphabets of the languages S and T.

Algorithm 1 LEXICAL-SIM (pseudocode)

Given a set of rules ρ and the terms p and q, we say that p can be transformed to q with ρ iff every rule x → y from ρ is applied at least once in substituting a substring x of p by y, and the final term obtained is q. In this case we write \(p \xrightarrow {\rho } q\) and say that ρ is the edit difference set of rules for the pair (p, q). In order to indicate that ρ refers to the pair (p, q), whenever needed we will write ρ_pq. For example, we say that

  1. orthography can be transformed to ortografia with the set of rules ρ = {th → t, ph → f, y → ia}

  2. comic can be transformed to komik with the set of rules ρ = {c → k}

We call the rule r frequent if, in the process of translating from D_S to D_T, it is applied more than δ times, where δ is a user-predefined threshold, i.e. sup(r) > δ. At the i-th iteration of computing the translation function γ, the set \(\mathcal {R}\) of frequent transformation rules is available. Now, let us present in detail how the procedure TRANSLATE works. The procedure uses a modified edit distance measure between the terms t and s. Actually, we define the distance measure as a function of the terms t and s and a set of rules \(\mathcal {R}\).

Definition 2

Given a pair of terms (s, t) and the edit difference set of rules ρ for this pair, we define the distance between s and t as \(d_{\mathcal {R}}(s,t) = d(s,t) - ||\rho \cap \mathcal {R}||\), where d(s, t) is the standard Damerau-Levenshtein measure, and \(||\rho \cap \mathcal {R}||=|\rho \cap \mathcal {R}|/max\{l(t),l(s)\}\).
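Reusing dl_distance from the previous sketch, the rule-aware distance of Definition 2 can be sketched as below; here the edit difference set ρ_st is approximated by the contiguous difference blocks reported by Python's difflib, which is an implementation choice of ours, not necessarily the one used in SBDB+.

```python
from difflib import SequenceMatcher

def edit_difference_rules(s, t):
    """The edit difference set of rules rho_st: one rule x -> y for every
    contiguous block in which s and t differ."""
    rules = set()
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, s, t).get_opcodes():
        if tag != 'equal':
            rules.add((s[i1:i2], t[j1:j2]))
    return rules

def d_R(s, t, frequent_rules):
    """Definition 2: d_R(s, t) = d(s, t) - |rho_st ∩ R| / max(l(s), l(t)),
    where d is the normalized Damerau-Levenshtein measure from above."""
    rho = edit_difference_rules(s, t)
    return dl_distance(s, t) - len(rho & frequent_rules) / max(len(s), len(t))

R = {("c", "k")}
print(edit_difference_rules("copy", "kopia"))   # {('c', 'k'), ('y', 'ia')} (any order)
print(d_R("copy", "kopia", R))                  # 0.6 - 0.2 (up to floating point)
```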

The pseudocode of the procedure TRANSLATE is presented as Algorithm 2. The procedure can work in two modes, namely as a procedure discovering translations, and as a procedure discovering translation candidates. In the first case (the parameter μ = 0), given s ∈ D_S and \(\mathcal {R}\), the procedure TRANSLATE looks for t ∈ D_T such that \(d_{\mathcal {R}}(s,t)=0\), i.e. \(\rho _{st} \subseteq \mathcal {R}\). If there is more than one possible translation of s, the procedure selects t_0, which maximizes the total support of ρ in \(\mathcal {R}\), i.e.:

$$ \forall t \in D_{T} \sum\limits_{r\in \rho_{st}} sup(r) \leq \sum\limits_{r\in \rho_{st_{0}}} sup(r) $$
(3)
Algorithm 2 TRANSLATE (pseudocode)

As we can see, for each s ∈ D_S we look for t_0 ∈ D_T such that the distance \(d_{\mathcal {R}}(s, t_0) = 0\) and condition (3) is satisfied for t_0. If t_0 is found, the pair (s, t_0) is added to the result function γ, and s and t_0 are removed from D_S and D_T respectively, which guarantees that γ is a bijection.

In the second case (μ > 0), for each s ∈ D_S the procedure TRANSLATE finds candidates t ∈ D_T such that the distance between s and t is less than μ. The candidates may become translations if at a later iteration the corresponding candidate translation rules from \(\mathcal {R}_{c}\) become frequent.

The process of extracting rules by comparing the source and target terms copy and kopia is visualized in Fig. 2. There are two contiguous difference sequences generating candidate rules, namely (c, k) and (y, ia), so that we have ρ_copy,kopia = {c → k, y → ia}. In the case when \(\rho _{copy,kopia} \subseteq \mathcal {R}\), the pair (copy, kopia) is added to γ. Otherwise, if \(d_{\mathcal {R}} (copy, kopia) \leq \mu \), the pair (copy, kopia) is added to the set σ of candidate pairs, and the support of the rules from ρ_copy,kopia is recalculated.

Fig. 2 An example of the LEXICAL-SIM algorithm for S = {computer, astronomy, copy} and T = {komputer, astronomia, kopia}
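Putting the pieces together, a compact sketch of the whole LEXICAL-SIM loop (reusing d_R and edit_difference_rules from the previous sketches) could look as follows; the thresholds and toy word lists are illustrative, and the acceptance test follows the condition ρ_st ⊆ R described above.

```python
def lexical_sim(D_S, D_T, mu=0.25, delta=1):
    """Sketch of the LEXICAL-SIM iteration: grow the seed bijection gamma
    while mining rewrite rules; mu and delta are illustrative thresholds."""
    gamma, R, R_c = {}, set(), {}        # seed, frequent rules, candidate rules
    while True:
        new_rule, sigma = False, []
        for s in sorted(set(D_S) - set(gamma)):
            candidates = [t for t in D_T if t not in gamma.values()]
            if not candidates:
                break
            t0 = min(candidates, key=lambda t: d_R(s, t, R))
            if edit_difference_rules(s, t0) <= R:    # rho_st subset of R: accept
                gamma[s] = t0
            elif d_R(s, t0, R) <= mu:                # near miss: candidate pair
                sigma.append((s, t0))
        for s, t in sigma:                           # mine candidate rewrite rules
            for rule in edit_difference_rules(s, t):
                R_c[rule] = R_c.get(rule, 0) + 1
                if R_c[rule] > delta and rule not in R:
                    R.add(rule)
                    new_rule = True
        if not new_rule:                             # no new frequent rule: stop
            return gamma, R

gamma, rules = lexical_sim(["computer", "camera", "copy", "astronomy", "economy"],
                           ["komputer", "kamera", "kopia", "astronomia", "ekonomia"])
print(gamma)   # all five cognate pairs are eventually matched
print(rules)   # {('c', 'k'), ('y', 'ia')}
```

Note how the rule c → k, mined from the easy pairs (computer, camera), later makes harder pairs such as copy/kopia reachable, mirroring the mechanics described above.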

As a result of the function LEXICAL-SIM we receive a seed dictionary in the form of the translation function γ. Details concerning the quality of this translation will be discussed in Section 5. In the next subsection we present how the semantic translation can be performed with a seed dictionary.

4.3 Semantic similarity

Within this phase of the method, a semantic similarity analysis is performed for the terms to be translated, so that the previously created seed dictionary is extended. The general idea, originated by Rapp (1999), is that in similar target and source repositories the statistical distributions of collocations for corresponding terms (a source term and its translation) should be very similar to each other. Therefore the algorithm is based on the contexts found for each term extracted from the source and target repositories, R_S and R_T respectively. In order to find semantically similar terms, one constructs context vectors for the terms in D_S and D_T, reflecting the relationships between these terms and the terms from the seed dictionary γ, and then looks for vectors having similar distribution characteristics. Hence, given the contexts (termsets) for all the terms x ∈ D_S, we limit them to those termsets that contain some t ∈ Dom(γ), and replace t by γ(t) in these termsets. Then, limiting the termsets only to the terms from Dom(γ), we can build the context vectors, where the i-th position is the frequency of t_i. In a similar way we proceed with the terms y ∈ D_T, limiting the context vectors to the terms t ∈ CoDom(γ). Now, having vectors for the source and target languages, for each x we can search for the closest y by means of the cosine pseudometric.
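A minimal sketch of this baseline (term-level, without meanings yet) is given below; the seed is assumed to be a dict mapping source words to target words, paragraphs are represented as sets of terms, and all names are ours.

```python
import math
from collections import Counter

def context_vector(term, paragraphs, seed):
    """Frequency vector of a term over the seed dimensions: every co-occurrence
    with a seed word s is counted under seed[s], so source and target vectors
    share the same (target-language) axes."""
    vec = Counter()
    for p in paragraphs:
        if term in p:
            for w in p:
                if w in seed:
                    vec[seed[w]] += 1
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def translation_candidates(x, source_paras, target_terms, target_paras, seed, k=3):
    """Rank target terms by cosine similarity of context vectors."""
    identity = {t: t for t in seed.values()}     # target side keeps its own words
    vx = context_vector(x, source_paras, seed)
    scored = [(cosine(vx, context_vector(y, target_paras, identity)), y)
              for y in target_terms]
    return sorted(scored, reverse=True)[:k]

# toy usage (assumed data): gamma maps English seed words to Polish
seed = {"tree": "drzewo", "company": "firma"}
en = [{"apple", "tree", "juice"}, {"apple", "company", "iphone"}]
pl = [{"jablko", "drzewo", "sok"}, {"jablko", "firma", "telefon"}]
print(translation_candidates("apple", en, ["jablko", "sok", "telefon"], pl, seed))
# 'jablko' ranks first
```

In SBDB+, the same construction is applied to the pairs (t, m), restricting the paragraphs of t to those matching a given meaning's discriminants, as explained below.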

However, the main problem with this approach is that the context vector of a term from the source dictionary may accumulate dimensions specific to various meanings of this term. To illustrate this, let us consider the word vessel. The term can be used as a ship, a blood vessel, or a dish. In each case the accompanying context terms are totally different, so if we do not split the meanings, all three characteristics will mix up in one context vector. Depending on the repository, some of the meanings represented in the context vector may dominate. On the other hand, in the target language the three meanings may have different word representations, so we can expect that separate vectors, free of the other meanings' components, will be created there. The relatedness between the vector of vessel and its three corresponding target vectors may then become too difficult to discover, if possible at all.

The importance of this problem becomes much clearer when we look at the statistics for natural languages. As stated in Miller et al. (1994), about 73 % of words in common English are polysemous, and the average number of senses per word found in English texts is about 6.5 (Mihalcea and Moldovan 2001). To this end, in our approach, instead of building vectors for terms x ∈ D_S, we build them for the pairs (x, m), \(m \in \mathfrak {M}\). As a result, we obtain context vectors specific for the meanings.

More formally, this part of the algorithm, responsible for semantic translation, can be sketched by means of the following steps:

  1. Given the dictionary D_S, we mine for the set of meanings \(\mathfrak {M}\), i.e. to each t ∈ D_S we assign a set of meanings \(\{m_{1},...,m_{k_{t}}\}\), \(m_{i} \in \mathfrak {M}\). If no discriminants are identified for t, it represents one meaning;

  2. For each pair \((t,m), t \in D_{S}, m \in \mathfrak {M}\), the termsets are identified from the repository R_S. The termsets are built from the paragraphs containing t, or from 2w-word windows around t. By W_tm we denote the set of all termsets built for (t, m);

  3. For each pair (t, m) we build \(W_{tm}^{\gamma }\), i.e. we keep from W_tm only those termsets that contain terms from the seed γ, as calculated in Phase 2 (see Section 4.2);

  4. Based on the sets \(W_{tm}^{\gamma }\), for each pair (t, m) we aggregate the semantic context vectors, building the vector space;

  5. In order to reduce noise in the vectors, we keep only the top l positive dimensions, setting the remaining ones to zero;

  6. In a similar way we build the vector space for the target repository, except that it is constructed for terms instead of the pairs (t, m), so no meaning set is detected there;

  7. For each vector from V_S we look for the closest k vectors v from V_T.

Below we discuss all the steps in detail.

4.3.1 Finding meanings

As mentioned above, the first step of this phase is crucial for the quality of the semantic translation. Compared to the original seed based methods (Rapp (1999), Koehn and Knight (2002), as well as Krajewski et al. (2014)), having split the meanings of each t and built the context vectors for (t, m) instead of for t, we obtain context vectors much more specific for the particular meanings, drastically reducing the noise usually accumulated in the vectors.

In order to perform Step 1 we apply the method called SenseSearcher, in short SnS (Kozlowski 2014; Kozlowski and Rybinski 2014). It is a knowledge-poor word sense induction algorithm based on closed frequent termsets. For the reasons listed below it is especially well suited for our purposes:

  • it is able to find infrequent, dominated senses;

  • it creates a structure of senses, in which coarse-grained senses contain related sub-senses (fine-grained senses), rather than a flat list of concepts;

  • the number of discovered senses does not have to be predefined by the user; it is determined solely by the content of the corpus;

  • as an output it provides the meaning discriminants, which can be directly used for building the representations of the pairs (t, m) in the consecutive steps of the translation;

  • its quality outperforms the existing state-of-the-art methods in most cases.

Below, we describe the algorithm in more detail. The pseudocode of the whole algorithm is presented as Algorithm 3. It consists of five phases:

  • Phase I is devoted to building an inverted index for the corpus. Each document is split into paragraphs (the paragraphs are used to build frequent termsets);

  • in Phase II, for each term a query with the given term is performed on the index, so that the paragraphs containing the term are found. The retrieved paragraphs are converted into context representations. The algorithm forms as many contexts as there are paragraphs being analyzed;

  • Phase III is devoted to generating contextual patterns from the contexts generated in the previous phase. The contextual patterns are closed frequent termsets identified in the context space;

  • Phase IV is devoted to forming the contextual patterns into sense frames, i.e. building multi-hierarchical structures corresponding to senses. In some exceptional cases a few sense frames may refer to the same sense; this is connected with the size of the input document corpus (lack of representativeness and high synonymy similarity against the descriptiveness of terms);

  • in Phase V, sense frames are clustered. The clusters of sense frames are called senses. Optionally, senses can be labeled with some descriptive terms.

Algorithm 3 SnS (pseudocode)

Having completed the last phase of the algorithm, each group of similar sense frames can be treated as a coherent meaning, and the input paragraphs are partially assigned to the matching senses. Given a term t, we can now identify its discriminants Disc(t) and group them into clusters \(C_{t} = \{{C_{t}^{1}},\dots ,{C_{t}^{k}}\}\), according to condition (2), i.e. each \({C_{t}^{i}}\) contains only discriminants compatible with each other, whereas for any two discriminants d_1, d_2 such that \(d_{1} \in {C_{t}^{i}}\), \(d_{2} \in {C_{t}^{j}}, i \neq j\), we have d_1 # d_2.

4.3.2 Context vector spaces

The general idea of semantic translation, originated by Rapp (1999), consists in applying the rule that in similar target and source repositories the statistical distributions of collocations for corresponding terms (a source term and its translation) should be similar to each other.

In order to explain how we build the vector space for the source repository, let us recall that a meaning \(m \in \mathfrak {M}\) is represented by a set of compatible discriminants \({C_{t}^{i}} = \{d_{1},\dots ,d_{l}\}\). Note that, given t ∈ D_S and \({C_{t}^{i}}\), we get the set W_tm of all the termsets from R_S for the given meaning by searching the repository index with the queries td_j, for each \(d_{j} \in {C_{t}^{i}} \). So, \(W_{tm}= \{X: t \in X, d \in X, d \in {C_{t}^{i}} \}\). Having W_tm, we keep only those termsets that contain some terms from the seed γ, and build the set \(W_{tm}^{\gamma }\) by translating each such seed term x to γ(x). So, we have the space \(W^{\gamma } = \{W_{tm}^{\gamma }: t \in D_{S} \land m \in \mathfrak {M} \}\). For the target repository we proceed in a similar way, although, as we do not induce senses there, for each t ∈ D_T a set of termsets V_t is built, giving rise to the space V = {V_t : t ∈ D_T}, which is then filtered to the space V^γ in such a way that only the termsets containing some terms from CoDom(γ) remain in \(V_{t}^{\gamma }\).

Now, from W^γ and V^γ two vector spaces, S_S and S_T, can be built for the source and target termset spaces respectively. Clearly, the dimension of each space is limited to the seed size (i.e. |Dom(γ)|). Having the two spaces, for each x ∈ S_S we look for the n most similar y ∈ S_T. As the similarity measure we use the cosine pseudometric (justified by the fact that the sets of termsets may differ in size), thus obtaining the translation candidates.

The question is, though, how the vector spaces should be built so that the similarity measure between x and y properly identifies y as the translation of x. In the paper by Koehn and Knight (2002) the vectors are built based on the frequencies of "neighboring terms", i.e. in the vector corresponding to t, the i-th position represents the frequency of x_i in all the windows containing t. In our case the frequency vectors in S_S are built based on the frequencies of context terms in \(W_{tm}^{\gamma }\).

However, the frequency measure works "locally", i.e. it does not take into account how often given context terms appear in the termset spaces of other pairs (t, m). Actually, one can expect that if a context term x_i appears often within the frequent termsets of \(W_{tm}^{\gamma }\) but not too often within the other sets of frequent termsets, it plays a special role in identifying the pair (t, m), and a similar situation should repeat in the target language. And the other way around, if x_i appears in too many sets \(W_{t_{p}m_{q}}^{\gamma }\), its role in distinguishing the given pair (t, m) decreases in both languages. This is actually an adaptation of the well known tf-idf measure, which takes into account the distinctive value of particular context terms. Therefore, in addition to the experiments with the frequency based vector spaces, we have decided to also build vector spaces based on the tf-idf measure. Below, the two methods of building vector spaces, one based on frequencies and the other on tf-idf, are called local and global, respectively. In the next section we compare the two approaches.
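A minimal sketch of the global variant is shown below: each pair (t, m) is treated as a "document" whose dimensions are seed terms, and the raw frequencies are rescaled with an idf-like factor. The exact weighting scheme used in our experiments may differ; this is only an illustration with names of our choosing.

```python
import math

def tfidf_reweight(freq_vectors):
    """Global reweighting of context vectors: each vector (one per pair (t, m),
    keyed by seed dimensions) is rescaled by an idf-like factor that penalizes
    dimensions occurring in many vectors."""
    n = len(freq_vectors)
    doc_freq = {}
    for vec in freq_vectors.values():
        for dim in vec:
            doc_freq[dim] = doc_freq.get(dim, 0) + 1
    return {key: {dim: tf * math.log(n / doc_freq[dim]) for dim, tf in vec.items()}
            for key, vec in freq_vectors.items()}

# toy usage: 'drzewo' occurs in both vectors, so its idf factor drops to zero
vectors = {("apple", "m1"): {"drzewo": 4, "sok": 2},
           ("apple", "m2"): {"drzewo": 1, "firma": 3}}
print(tfidf_reweight(vectors))
```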

5 Experiments

The experiments have been performed for English (source) and Polish (target). We used two text corpora, built from the English and Polish Wikipedia versions respectively. The wikilinks between articles, descriptive labels and interlingual links have been ignored, so the Wikipedia versions were used only as sets of paragraphs. First, by applying the algorithm for finding translations based on lexical similarities, we built a seed; then we performed experiments with semantic translations. Below we present and discuss the experimental results.

Lexical similarity experiments

The lexical similarity experiments are summarized in Figs. 3 and 4. We have selected the 5000 most frequent words from each corpus. Figure 3 shows the most frequent examples of the rules induced from D_S and D_T. As one can see (Fig. 4), from these starting dictionaries 1507 translations have been properly detected, giving a precision of 66.9 % (Fig. 4a). The precision can be increased by eliminating some "dummy rules", at the cost of slightly reducing the recall. A good example of a "dummy rule" for English-to-Polish translation is \(s \rightarrow \varnothing \), resulting from plural forms of English nouns translated into singular Polish equivalents. Just removing this one rule increases the precision to 78.1 % (Fig. 4b). Even without this manual intervention the results outperform those obtained by Koehn and Knight (2002): with automatically generated rules we are able to find a larger seed, in spite of the fact that the source and target languages are from different language families, and the corpora used, although comparable, are not parallel as they were in (Koehn and Knight 2002). This results from the fact that the procedure of building the rewriting rules discovers the ones which are statistically most important for the corpora.

Fig. 3 Top frequent rules generated by lexical similarity translation

Fig. 4 Precision and recall of lexical similarity translation

Semantic similarity experiments

Given the seed, we performed a series of experiments with semantic translations. The first experiments were focused on finding ways of reducing the noise typical for vector representations. We have tested two possibilities:

  1. reducing the influence of the noisy dimensions of the vector space;

  2. boosting the role of the most distinctive dimensions by applying a tf-idf based measure for building the vector space.

For (1) we have tested the possibility of reducing the noise by zeroing the less important dimensions in the vectors. In particular, we process each vector in such a way that only the top n dimensions (i.e. the ones having the highest weights) are kept for the similarity calculations, whereas all the other dimensions are set to zero. Below, this procedure of zeroing the less valued dimensions is called context vector noise reduction, and we say that n is the context size of the vector space. The procedure does not reduce the dimensionality of the vector spaces. Obviously, the lower the value of the context size, the faster the translation process. More importantly, the experiments have shown that the noise reduction procedure not only reduces the computation time but, up to a certain value of n, also provides better results. One of our goals was to find the optimal values of the context size for the particular kinds of vector spaces.
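A sketch of the noise reduction step (assuming context vectors stored as dictionaries keyed by seed dimensions; the function name is ours) is given below:

```python
def reduce_context(vector, n):
    """Context vector noise reduction: keep the n highest-weighted dimensions
    and zero the rest; the dimensionality itself is unchanged."""
    if n >= len(vector):
        return dict(vector)
    kept = sorted(vector, key=vector.get, reverse=True)[:n]
    return {dim: (w if dim in kept else 0.0) for dim, w in vector.items()}

print(reduce_context({"drzewo": 4, "sok": 2, "firma": 1}, n=2))
# {'drzewo': 4, 'sok': 2, 'firma': 0.0}
```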

For (2), we have compared the quality of the semantic similarity for vector spaces with locally evaluated dimensions versus vector spaces with globally evaluated dimensions, i.e. frequency based measures versus tf-idf based ones.

From the English Wikipedia we have manually selected 1000 terms having good translations in the Polish Wikipedia and single, well defined meanings, i.e. terms for which the SnS algorithm finds only one meaning. Figure 5 shows the results for the selected subset. A translation is considered correct if the proper translation is among the first 10 translation candidates from D_T, i.e. the 10 candidates having the highest similarity measure to the source term. From the experimental results the following observations can be made:

  1. processing the vector spaces with the noise reduction procedure essentially reduces the computational cost of translation; up to a certain value n of the vector context size it also improves the translation quality of both the local and the global method;

  2. vector spaces with globally evaluated dimensions provide better translation quality than the spaces with local dimensions; globally calculated vector spaces reach good translation quality already for the context size n ∈ [100..200];

  3. a similar translation quality for the local method can be reached only for much higher values of the context size (n ∈ [1000..2000]).

Fig. 5 Comparison of the local and the global versions with context vector noise reduction

In our previous paper (Krajewski et al. 2014), the vector space was built in a similar way as in (Koehn and Knight 2002), i.e. without the phase of identifying the set of meanings, so the context vectors were built just for t instead of for the pairs (t, m). The next series of experiments was devoted to the question of how multiple meanings of terms may deteriorate the quality of translations, and how discovering meanings and assigning them to the terms can improve the quality. To this end, in the second series of experiments we manually selected a sample of 200 polysemic terms from D_S, for which the number of meanings in the English Wikipedia was at least 5, and performed the following two tests:

  1. for the sample of highly polysemic terms we built the vector space and then tested the quality of translation without taking into account the meanings identified by the SnS phase;

  2. for the same sample we identified the meanings with SnS and then built the vector space for the pairs (t, m).

Figure 6 illustrates the results of the experiments. In both experiments the context vector noise reduction phase has been performed, so that the quality of translations can be seen as a function of the context size parameter. As one can see, the SnS phase of finding the meanings and then building the context vectors for the pairs (t, m) improves the quality of translations essentially. We can also see that for the vector space without meanings the optimal value of the context size is much higher than for the vector space with meanings (about 1000 versus 100).

Fig. 6 Comparison of the translation quality for the polysemic terms

Additionally, we have performed experiments aimed at checking how the size of the seed dictionary influences the semantic translation quality. The results are shown in Figs. 7 and 8. The experiments included the SnS phase for meaning detection. Figure 7 illustrates the experiments with the vector context size n = 800, whereas Fig. 8 shows the results for n = 80. As one can see, for the vectors built for the pairs (t, m), comparable results can be obtained with a smaller context size, which means that the algorithm quality is determined by the meaning injection. Still, there are limitations to obtaining very high precision. From the figures one can see, though, that by increasing the number k of best candidates we can reach reasonable results.

Fig. 7 Seed size influence on the translation quality of the top k = 1, 5, 10 translation results for the vector context size n = 800

Fig. 8 Seed size influence on the translation quality of the top k = 1, 5, 10 translation results for the vector context size n = 80

6 Conclusion

In this work we proposed a novel approach to identifying word translations from non-parallel or even unrelated texts. Compared to the original seed based translation approach (Koehn and Knight 2002), the novel elements introduced in SBDB+ are: (1) the phase of inducing lexicographic translation rules; and (2) the phase of finding the meanings of the terms from the dictionary.

The phase of finding meanings is performed with the SnS method (see Kozlowski and Rybinski 2014).

According to our experiments, the size of the seed dictionary influences the quality of the semantic translation phase. Therefore the proposed technique can be used iteratively, i.e. having discovered proper translations of the meanings in the semantic translation phase, we can add these translations to the seed and continue the translation with the expanded seed for those meanings that have not been positively translated earlier.

The proposed translation method is knowledge-poor and language independent. It is therefore applicable for maintaining multilingual ontologies devoted to continuously and dynamically changing domains. As shown, the method works well even for languages from different language families. We have also shown that the noise reduction procedure for the context vector space improves the translation precision as well as the computational efficiency.

The referred evaluations (Rapp (1999), Koehn and Knight (2002)) were performed on small datasets (100-1000 words) extracted from a radio news corpus or from existing lexicons, and the results reported there vary from 40 % to 72 % of correct translations. In contrast, we performed the evaluation on all words from Wikipedia texts, used as a text repository. Our method reports 50 % of correct translations found as the first candidates, and up to 80 % among the top 10 candidates in the case of polysemic terms.

As future work we plan to develop the syntactic part of our method by verifying the meanings of the syntactic translations with the SnS algorithm. We also intend to incorporate the method into our knowledge base software Ω-ΨR (Koperwas et al. 2014) for balancing indexes for multilingual full text information retrieval in English and Polish, i.e., giving similar results for queries regardless of the query language.