
1 Introduction

Nowadays, most semantically structured data, i.e. ontologies or taxonomies, have labels stored in English only. Although the increasing number of ontologies offers an excellent opportunity to link this knowledge together, non-English users may encounter difficulties when using ontological knowledge represented in English only [1]. Furthermore, applications in information retrieval or knowledge management that use monolingual ontologies are limited to the language in which the ontology labels are stored. Therefore, to make ontological knowledge accessible beyond language borders, these monolingual resources need to be enhanced with multilingual information [2].

Another important reason to translate ontologies is that they may already exist in different languages, but without aligning the concepts across languages we are not able to combine, compare or extend them. Furthermore, government institutions may be obliged to publish their ontologies or other structured data in their native language, e.g. financial reports need to be written in the language in which the financial institution operates. Therefore, ontology-based data analytics would fail to provide reports in German, Spanish, or other official European languages if we used only existing ontologies in English. Additionally, medical ontologies, e.g. the ICD Ontology, can be used to standardize medical reports, but physicians, researchers or patient organizations will access these reports in their native language; therefore, only a cross-lingually aligned ontology may give an appropriate overview. Another example is the Europeana project in the heritage domain, where the preservation of Europe's cultural treasures shows the need for cross-lingual alignment of different resources.

Since manual multilingual enhancement of domain-specific ontologies is very time consuming and expensive, we engage a domain-aware statistical machine translation (SMT) system to automatically translate the ontology labels. As ontologies may change over time, having in place an SMT system adaptable to an ontology can be very beneficial. Nevertheless, the quality of the SMT-generated translations relies strongly on the translation model learned from the information stored in parallel corpora. Translation candidates often cannot be learned accurately when specific vocabulary, like ontology labels, appears infrequently in a parallel corpus. Additionally, ambiguous labels built out of only a few words do not always express enough semantic information to guide the SMT system in translating a label correctly with regard to the targeted domain. This can be observed in domain-unadapted SMT systems, e.g. Microsoft Translator, where an ambiguous expression like vessel, stored in a medical ontology, is translated into German as Schiff (en. ship) rather than as Gefäß, the correct term in the targeted medical domain.

In this paper, we present ESSOT, a collaborative knowledge management platform with a domain-aware SMT system for supporting language experts in the task of translating ontologies. The benefits of such a platform are (i) the possibility of having an all-in-one solution containing an environment for modelling ontologies, which enables collaboration between different types of experts, and (ii) a pluggable domain-adaptable service for supporting ontology translations. The proposed solution has been validated in two different settings: (i) in a real-world context, namely Organic.Lingua, from quantitative and qualitative points of view, and, quantitatively only, (ii) on a set of ontologies aiming to evaluate the effectiveness of the SMT service in different domains.

The paper is structured as follows. Section 2 introduces the context of our main use case, the Organic.Lingua project. Section 3 provides an overview of the ESSOT architecture, while Sect. 4 focuses on the user facilities implemented for supporting the ontology translation task. In Sect. 5 we report the evaluation conducted on both the translations suggested by the service and the usability of the platform. Finally, Sect. 6 provides a general overview of ontology translation and knowledge management tools, while Sect. 7 concludes the paper.

2 The Organic.Lingua Project

Organic.Lingua is an EU-funded project that aims at providing automated multilingual services and tools facilitating the discovery, retrieval, exploitation and extension of digital educational content related to Organic Agriculture and AgroEcology. More concretely, the project aims at providing, on top of a web portal, cross-lingual facility services enabling users to (i) find resources in languages different from the ones in which the query has been formulated and/or the resource described (e.g., providing services for cross-lingual retrieval); (ii) manage meta-data information for resources in different languages (e.g., offering automated meta-data translation services); and (iii) contribute to evolving content (e.g., providing services supporting the users in content generation).

These objectives are reached in the Organic.Lingua project by means of two components: on the one hand, a web portal offering software components and linguistic resources able to provide multilingual services and, on the other hand, a conceptual model (formalized in the “Organic.Lingua ontology”) used for managing information associated with the resources provided to the final users and shared with other components deployed on the Organic.Lingua platform. In a nutshell, the usage of the Organic.Lingua ontology is twofold:

  • Resource annotation: each time a content provider inserts a resource in the repository, the resource is annotated with one or more concepts extracted from the ontology. The list of available concepts is retrieved by using an ontology service deployed in the ontology management component. Then, this list is exploited for annotating the learning resources published on the Web portal.

  • Resource retrieval: when web users perform queries on the system, the ontology is used, by the back-end information retrieval system, to perform advanced searches based on semantic techniques. Moreover, the Web portal is equipped with a graphical semantic tree that exploits the content of the ontology for facilitating the browsing of the resource repository classification. Finally, the ontology is used also by the Cross-Language Information Retrieval component for query expansion purposes.

Due to this intensive use of the ontology in the entire Organic.Lingua portal, the accuracy of the linguistic layer, represented by the set of translated labels, is crucial for supporting the annotation and retrieval functionalities. Maintaining such accuracy requires a precise methodology and dedicated tools, so that the components deployed on the platform do not lose effectiveness.

3 Platform Architecture

In this Section, we present a general overview of the platform for managing the life-cycle of translating ontological entities. Figure 1 shows the architecture diagram, where we distinguish two main blocks:

  • the service-side: containing the components for the machine translation models used for suggesting translations when requests are performed by users; and,

  • the user-side: containing the facilities implemented for supporting experts in managing the multilingual layer of ontologies.

Fig. 1. Diagram of the overall architecture of the ESSOT platform

The service side contains the components used for creating and updating the model used by the domain-aware machine translation service. Such components are described below, while in Sect. 4 we provide a description of the facilities implemented for supporting users in managing ontologies.

Statistical Machine Translation. Our approach is based on statistical machine translation, where we wish to find the best translation \(\mathbf {e}\) of a source string \(\mathbf {f}\), given by a log-linear model combining a set of features. The translation that maximizes the score of the log-linear model is obtained by searching over all possible translation candidates. The decoder, which functions as a search procedure, provides the most probable translation based on a statistical translation model learned from sentence-aligned corpora.
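
As a sketch of this decision rule in the usual log-linear notation (the feature functions \(h_i\), their weights \(\lambda_i\) and the number of features \(M\) are notation assumed here for illustration; the concrete feature set is not listed in this section), the selected translation \(\hat{\mathbf {e}}\) can be written as:

\[
\hat{\mathbf {e}} = \operatorname*{arg\,max}_{\mathbf {e}} \sum_{i=1}^{M} \lambda_i \, h_i(\mathbf {f}, \mathbf {e})
\]

Typical feature functions in phrase-based systems include translation model and language model scores, with the weights tuned on held-out data.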

For a broader domain coverage of the datasets necessary to train an SMT system, we merged several parallel corpora, e.g. JRC-Acquis [3], Europarl [4], DGT (translation memories generated by the Directorate-General for Translation) [5], MultiUN corpus [6] and TED talks [7] among others, into one parallel dataset. For the translation approach, we engage the widely used Moses toolkit [8]. Word alignments were built with GIZA++ [9] and a 5-gram language model was built with KenLM [10].

Query Expansion for Sentence Selection. Due to the shortness of ontology labels, there is a lack of contextual information, which can otherwise help disambiguating short or ambiguous expressions. Therefore, our goal is to translate the identified ontology labels within the textual context of the targeted domain, rather than in isolation. With this selection approach, we aim to retain relevant sentences, where the English label vessel or injection belongs to the medical domain, but not to the technical domain. This process reduces the semantic noise in the translation process, since we try to avoid contextual information that does not belong to the domain of the targeted ontology.

Due to the specificity of the ontology labels, an n-gram overlap approach alone is not sufficient to select all the useful sentences. For this reason, we follow the idea of [11], where the authors extend the semantic information of ontology labels using Word2Vec for computing distributed representations of words. The technique is based on a neural network that analyses the textual data provided as input and outputs a list of semantically related words [12]. Each input string, in our experiment ontology labels or source sentences, is vectorized using the surrounding context and compared to other vectorized sets of words in a multi-dimensional vector space. Word relatedness is measured through the cosine similarity between two word vectors. A score of 1 would represent a perfect word similarity, e.g. cholera equals cholera, while the medical expression medicine has a cosine similarity of 0.678 to cholera. Since words that occur in similar contexts tend to have similar meanings [13], this approach makes it possible to group related words together.

The usage of the ontology hierarchy allows us to further improve the disambiguation of short labels, i.e., the related words of a label are concatenated with the related words of its direct parent. Given a label and a source sentence from the used concatenated corpus, related words and their weights are extracted from both of them, and used as entries of the vectors to calculate the cosine similarity. Finally, the most similar source sentence and the label should share the largest number of related words.
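
The selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in ESSOT: the related-word lists and their weights are invented toy values standing in for Word2Vec output, the parent label "cardiovascular system" is hypothetical, tokenization is plain whitespace splitting, and keeping the maximum weight when a related word occurs more than once is our own choice.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as word -> weight dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def expand(words, related):
    """Build a sparse vector from the words themselves plus their related words."""
    vec = {}
    for word in words:
        vec[word] = 1.0  # a word is maximally related to itself
        for rel, weight in related.get(word, {}).items():
            vec[rel] = max(vec.get(rel, 0.0), weight)
    return vec

def rank_sentences(label, parent_label, sentences, related):
    """Rank source sentences by similarity to a label concatenated with its direct parent."""
    label_vec = expand((label + " " + parent_label).lower().split(), related)
    scored = [(cosine(label_vec, expand(s.lower().split(), related)), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Toy related-word weights (assumed values, not real Word2Vec scores).
related = {
    "vessel": {"artery": 0.7, "blood": 0.6, "ship": 0.6},
    "cardiovascular": {"heart": 0.8, "blood": 0.7, "artery": 0.6},
    "blood": {"artery": 0.7, "heart": 0.6},
}
sentences = [
    "the vessel left the harbour with cargo",
    "the blood vessel was damaged",
]
ranked = rank_sentences("vessel", "cardiovascular system", sentences, related)
for score, sentence in ranked:
    print(f"{score:.3f}  {sentence}")
```

With these toy weights the medical sentence ranks above the nautical one, because the parent label contributes related words such as heart and blood that only the medical sentence shares.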

4 Supporting Users in the Ontology Translation Activity

The ESSOT system has been equipped with facilities supporting the collaborative translation of domain-specific ontologies in order to satisfy the requirements of the ontology translation task from a user perspective.

Concerning users, we identified two distinct groups: the Domain Experts and the Language Experts. Domain Experts are in charge of the modelling aspect of ontologies (i.e. creation of concepts, individuals, properties, and the relationships between them), while Language Experts are responsible for managing the labels associated with each entity by evaluating their correctness and, where needed, by providing a more fine-grained adaptation with respect to the domain described by the ontology.

Below, we present the list of the implemented facilities specifically designed for supporting the management of the multilingual layer of ontologies. Here, we focus on the interfaces that have been specifically implemented for managing the Organic.Lingua ontology. However, such facilities can be adopted, in general, for managing the multilingual aspect of any ontology.

Domain And Language Experts View. The page dedicated to the management of an ontology label, specifically designed for the Domain and Language Experts, has been equipped with functionalities that permit revisions of the linguistic layer. This set of functionalities allows experts to revise the translations of the names and descriptions of each entity (concepts, individuals, and properties).

To facilitate the browsing and the editing of the translations, a quick view box has been inserted into the mask (as shown in Fig. 2). In this way, language experts are able to navigate through the available translations and, if needed, either invoke the translation service to retrieve a suggestion or edit the translation themselves (Fig. 3).

Fig. 2. Multilingual box for facilitating the entity translation

Fig. 3. Quick translation box for editing label translations

Approval And Discussion Facilities. Given the complexity of translating domain-specific ontologies, translations often need to be checked and agreed upon by a community of experts. This is especially true when ontologies are used to represent terminological standards which need to be carefully discussed and evaluated. To support this collaborative activity we foresee the usage of a wiki-style paradigm [14], expanded with the possibility of assigning specific translations of ontology labels to specific experts who need to monitor, check, and approve the suggested translations. This customization supports the management of the changes carried out on the ontology (in both layers) by providing the facilities necessary to manage the life-cycle of each change.

These facilities may be split into two different sets of features. The first set may be considered as a monitor of the activities performed on each entity page. When changes are committed, approval requests are created. They contain the identification of the expert in charge of approving the change, the date on which the change has been performed, and a natural language description of the change. Moreover, a mechanism for managing the approvals and for maintaining the history of all approval requests for each entity is provided. The second set contains the facilities for managing the discussions associated with each entity page. A user interface for creating the discussions has been implemented together with a notification procedure that alerts users when new topics or replies related to the discussions that they are following have been posted.
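
To make the shape of an approval request concrete, a hypothetical data model could look like the sketch below. All class, field and identifier names are assumptions introduced for illustration; the text only states that a request carries the approving expert, the date of the change and a natural language description, and that a per-entity history is kept.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"

@dataclass
class ApprovalRequest:
    approver: str        # expert in charge of approving the change
    changed_on: date     # date on which the change was performed
    description: str     # natural language description of the change
    status: Status = Status.PENDING

    def approve(self):
        self.status = Status.APPROVED

@dataclass
class EntityPage:
    entity_id: str
    history: list = field(default_factory=list)  # all requests ever created

    def commit_change(self, approver, description, changed_on):
        """Committing a change creates an approval request and records it."""
        request = ApprovalRequest(approver, changed_on, description)
        self.history.append(request)
        return request

    def pending_for(self, approver):
        """Requests still waiting for action by the given expert."""
        return [r for r in self.history
                if r.approver == approver and r.status is Status.PENDING]

# Hypothetical usage: one change committed, then approved.
page = EntityPage("concept:compost")
request = page.commit_change("expert_it_01", "Revised the Italian label", date(2015, 1, 15))
pending_before = list(page.pending_for("expert_it_01"))
request.approve()
pending_after = page.pending_for("expert_it_01")
```

Approved requests stay in the history, so the full sequence of changes on an entity remains available after approval.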

“Quick” Translation Feature. To facilitate the work of language experts, we have implemented the possibility of comparing two lists of translations side by side. This way, the language expert in charge of revising the translations can speed up the revision process, since there is no need to navigate among the entity pages.

Figure 4 shows such a view, presenting the list of English concepts with their translations into Italian. To the right of each element of the table, a link opens a quick translation box (as shown in Fig. 3) that gives the opportunity to quickly modify information without opening the entity page. Finally, in the last column, a flag indicates that changes have been performed on that concept and that a revision/approval is requested.

Fig. 4. View for comparing label translations

5 Evaluation

Our goal is to evaluate the usage and the usefulness of the ESSOT user facilities and of the underlying service for suggesting domain-adapted translations.

In detail, we are interested in answering two main research questions:

  • RQ1 Does the proposed system provide an effective support, in terms of the quality of suggested translations, to the management of multilingual ontologies?

  • RQ2 Do the ESSOT functionalities provide an effective support to the collaborative management of a multilingual ontology?

In order to answer these questions, we performed two types of analysis:

  1. Qualitative: the tool has been validated in the context of the Organic.Lingua project, where we collected subjective judgements from the language experts. They have been involved in evaluating the general usability of the components of the tool and in providing feedback for future improvements.

  2. Quantitative: besides the user evaluation, we collected objective measures concerning the effectiveness of the translations suggested by the embedded machine translation service. This information allows us to estimate the effort needed by the language experts for adapting all translations.

5.1 User Evaluation Context

Eleven language experts have been involved in the evaluation of the proposed platform for translating the Organic.Lingua ontology into three different languages: German, Spanish, and Italian. They were all experts in the agricultural domain; therefore, the labels used by them have to be considered as a gold standard from the domain point of view. From the mother tongue perspective, the evaluation was performed by three German, four Spanish and four Italian native-speaking experts. Most of them had no previous knowledge of the tool, hence an initial phase of training was necessary. The training was organized according to the following steps:

  • A one-day overall introduction to the tool.

  • A few short, on-line, training sessions with the ESSOT tool guided by ontology and tool experts, targeted to help domain experts to better understand the capabilities of the tool.

  • Hands-on usage of the tool: language experts were left to “play” with ESSOT in order to become familiar with the functionalities that they would use during the revision process. This exercise also had the secondary objective to collect doubts and problems encountered by experts.

After the initial training, experts were asked to translate the ontology into the three languages mentioned above. The experts used the ESSOT facilities for completing the translation task and, at the end, they provided feedback on the tool support for accomplishing the task. A summary of these findings and lessons learned is presented in Sect. 5.2.

5.2 Qualitative Evaluation Results

To investigate the subjective perception of the eleven experts about the support provided for translating ontologies, we analysed the data collected through a questionnaire. For each functionality described in Sect. 4, we report how often each aspect has been raised by the language experts (number of experts in parentheses).

Language Experts View

Pros: Easy to use for managing translations (9)

Usable interface for showing concept translations (3)

Approval And Discussion

Pros: Pending approvals give a clear situation about concept status (4)

Cons: Discussion masks are not very useful (8)

Quick Translation Feature

Pros: Best facility for translating concepts (8)

Cons: Interface design improvable (3)

The results show, in general, a good perception of the implemented functionalities, in particular concerning the procedure of translating a concept by exploiting the quick translation feature. Indeed, 9 out of 11 experts reported advantages in using this capability. Similar opinions have been collected about the language expert view, which the users perceived as a usable reference for getting the big picture of the status of concept translations.

Results concerning the approval and discussion facility are mixed. On the one hand, the experts perceived positively the solution of listing approval requests on top of each concept page. This fact is connected with a personalization that we embedded into the ESSOT home page. Indeed, after the login, users are able to see the list of pending approvals that require their action. This way, it is easier for them to locate the translations that have to be evaluated and then to approve or modify them. On the other hand, we received negative opinions from most experts (8 out of 11) about the usability of the discussion forms. This result suggests focusing future effort on improving this aspect of the tool.

Finally, concerning the “quick” translation facility, 8 out of 11 experts judged this facility as the most usable way for translating a concept. The main characteristic that has been highlighted is the possibility of performing a “mass-translation” activity without opening the page of each concept, with the positive consequence of saving a lot of time.

5.3 Quantitative Evaluation Results

The automatic evaluation on label translations provided by ESSOT is based on the correspondence between the automatically generated output and reference translations (gold standard), provided by domain and language experts. For the automatic evaluation we used the BLEU [15], METEOR [16] and TER [17] algorithms.

BLEU is calculated for individual translated segments (n-grams) by comparing them with reference translations. Those scores, between 0 and 100 (perfect translation), are then averaged over the whole evaluation dataset to reach an estimate of the overall quality of the automatically generated translations. METEOR is based on the harmonic mean of precision and recall, whereby recall is weighted higher than precision. Along with standard exact word (or phrase) matching it has additional features, i.e. stemming, paraphrasing and synonymy matching. Unlike BLEU, the metric produces good correlation with human judgement at the sentence or segment level. TER is an error metric (lower scores are better) for machine translation, measuring the number of edits required to change a system output into one of the references.
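
To illustrate the idea behind an edit-based error metric, the following minimal sketch computes a word-level edit rate: the number of insertions, deletions and substitutions needed to turn a system output into a reference, divided by the reference length. It is a deliberate simplification and not the TER implementation used in our evaluation: TER additionally allows block shifts and supports multiple references, both of which are omitted here, and the function names are our own.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over word lists (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def edit_rate(hypothesis, reference):
    """Edits per reference word, scaled to 0-100+ (lower is better)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * word_edit_distance(hyp, ref) / len(ref)

# A non-compound output against a one-word German compound reference.
print(edit_rate("Übernahme der Firma", "Unternehmensübernahme"))
```

Because the score is normalized by the reference length, a multi-word output compared against a single compound reference can exceed 100, which is one reason non-compound German translations are penalized heavily by edit-based metrics.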

Datasets. To demonstrate the performance of the proposed framework, we use several ontologies coming from different domains:

  • The Organic.Lingua ontology contains 291 concepts in the agricultural domain. All concepts within the ontology have been translated into 16 languages. In addition, mappings to Agrovoc and Eurovoc have also been defined.

  • The DOAP (Description of a Project) Ontology defines the vocabulary to describe software projects. It was created to convey semantic information associated with free and open source software projects. It holds translations of labels into 6 languages, whereby we use German and Spanish translations as the gold standard, which is compared with the automatically generated labels.

  • The GeoSkills ontology holds the competencies, topics and educational contexts in five different languages, i.e., English, German, French, Spanish and Dutch.

  • The STW Thesaurus for Economics [18] provides the vocabulary of more than 6,000 standardized subject headings (in English and German) and 20,000 additional entry terms (keywords) belonging to the economics domain. In addition, the entries are richly interconnected by 16,000 broader/narrower and 10,000 related relations.

  • The Thesaurus for the Social Sciences (TheSoz) [19] enables indexing documents and research information in the social sciences. Overall, it stores about 8,000 standardized subject headings in English, German and French.

Table 1. Automatic translation evaluation of the targeted ontologies by the Microsoft Translator API and our proposed system (bold results = best performance)

We evaluate the automatically generated translations into German, Italian and Spanish provided within ESSOT and the Microsoft Translator API. Since reference translations are needed to evaluate automatically generated translations, we use the translated labels provided by the domain experts as the gold standard.

The Organic.Lingua ontology provides 274 German, 354 Italian and 355 Spanish existing translations out of 404 English labels. As seen in Table 1, ESSOT, which uses contextual information for label translation, significantly outperforms the Microsoft Translator API. When translating English labels into German we gain a 51.3 % averaged improvement over the commercial system, and 51.7 % for Spanish. For Italian, the two systems produce comparable results (10.5 % improvement).

Besides the evaluation on the Organic.Lingua ontology labels, the multilingual gold standard within the DOAP and GeoSkills ontologies enables the evaluation of automatically translated ontology labels into German and Spanish. In detail, the results for the DOAP ontology show similar performance between both translation systems. On the other hand, the results for the GeoSkills ontology labels show statistically significant (p-value < 0.05) improvements over the Microsoft Translator API. For the STW and TheSoz ontologies, which enable automatic evaluation for German only, only the translations of the TheSoz labels show significant improvements of our system. Although the evaluation metrics show slight improvements when translating the STW ontology, the improvements are not significant.

As a final evaluation, we manually analysed the translated TheSoz labels regarding the most frequent errors of both translation systems. The first observation is related to compound words, a frequent error class when translating into German. We observed that Microsoft often provided a non-compound translation in German. As an example, labels like company takeover, working week or crime fighting were translated word by word into German, i.e., Übernahme der Firma, wöchentliche Arbeitszeit or Bekämpfung von Kriminalität. Although these translations can be seen as correct, the gold standard provided in the ontology preferred the German compounds Unternehmensübernahme, Wochenarbeitszeit and Verbrechensbekämpfung. Besides a small number of wrong translations (partnership into marriege, translator into translator, young worker into Junge Arbeitnehmer), Microsoft’s system showed the expected problems in disambiguating short expressions. Due to the shortness of the labels, the ontology label driver was translated as Treiber, which is correct in the IT domain (as hardware driver), but not in the targeted domain. Similarly, stroke, without contextual information, was translated as Strich (en. line, dash), although Schlaganfall would be the correct translation into German. For these ambiguous labels, our proposed system, which uses disambiguated contextual information, provided correct translations, i.e. Fahrer for driver and Schlaganfall for the English label stroke. On the other hand, our system did not always perform best either. The largest observed error class was out-of-vocabulary issues, i.e. alignments between source and target language that were not learned during the SMT training step. For example, the TheSoz labels bonapartism, shamanism, patriciate and praxeology, which are not stored in our translation models, were left untranslated on the target side.

5.4 Findings and Lessons Learned

The quantitative and qualitative results demonstrate the viability of the proposed platform in real-world scenarios and, in particular, its effectiveness in the proposed use case. Therefore, we can answer both research questions positively. RQ1: the back-end component provides helpful suggestions for performing the ontology translation task. RQ2: the provided interfaces are usable and useful for supporting the language experts in the translation activity.

Besides these, other insights, both positive and negative, emerged during the subjective evaluation that we conducted.

The main positive aspect highlighted by the experts was the easy and quick way of translating a concept compared with other available knowledge management tools (see details in Sect. 6), which do not offer specific support for translation. The suggestion-based service provided effective suggestions and reduced the effort required for finalizing the translation of the ontology. As an example, we may consider the Organic.Lingua use case, where the time for translating the ontology was reduced from 3.5 h (completely manual translation) to 2.1 h (translation performed with ESSOT), a 40 % reduction. This point confirms the capability of the domain-aware translation service to provide translations adapted to the specific topic of the ontology that experts are going to model. However, while the experts perceived such a service as very helpful from the point of view of domain experts (i.e. experts that are generally in charge of modelling ontologies but that might not have enough linguistic expertise for translating labels properly with respect to the domain), the facilities supporting direct interaction with language experts (i.e. the discussion form) should be more intuitive, for instance like the approval one.

The criticism concerning the interface design was reported also about the quick translation feature, where some of the experts commented that the comparative view might be improved from the graphical point of view. In particular, they suggested (i) to highlight translations that have to be revised, instead of using a flag, and (ii) to publish only the concept label instead of putting also the full description in order to avoid misalignments in the visualization of information.

Concerning the quick translation facility, experts judged it to be the easiest way of executing a first round of translations. Indeed, by using the provided translation box, experts are able to translate concept information without navigating to the concept page and without reloading the concept list after each change to a translation is stored.

Finally, we can judge the proposed platform as a useful service for supporting the ontology translation task, especially in a collaborative environment where the multilingual ontology is created by two different types of experts: domain experts and language experts. Future work in this direction will focus on the usability aspects of the tool and on the improvement of the semantic model used for suggesting translations, in order to further reduce the effort of the language experts. We also plan to extend the evaluation to other use cases.

6 Related Work

In this section, we summarize approaches related to the pure ontology translation task and present a brief review of the best-known ontology management tools currently available, emphasizing their capabilities in supporting language experts in translating ontologies.

The task of ontology translation involves generating an appropriate translation for the lexical layer, i.e. labels stored in the ontology. Most of the previous related work focused on accessing existing multilingual lexical resources, like EuroWordNet or IATE [21, 22]. This work focused on the identification of the lexical overlap between the ontology and the multilingual resources, which guarantees a high precision but a low recall. Consequently, external translation services like BabelFish, SDL FreeTranslation tool or Google Translate were used to overcome this issue [23, 24]. Additionally, [23, 25] performed ontology label disambiguation, where the ontology structure is used to annotate the labels with their semantic senses. Similarly, [26] show a positive effect of different domain adaptation techniques (i.e., using web resources as additional bilingual knowledge, re-scoring translations with Explicit Semantic Analysis, and language model adaptation) for automatic ontology translation. Unlike the aforementioned approaches, which rely on external knowledge or services, the machinery implemented in ESSOT is supported by a domain-aware SMT system, which provides adequate translations using the ontology hierarchy and the contextual information of labels in domain-relevant text data. Current frameworks for ontology label translation directly access commercial systems, such as Google Translate or Microsoft Translator, both of which are unable to detect the domain when translating short ambiguous expressions, e.g. vessel, injection, track, head, equity. In this paper, we demonstrate a platform supporting a machine translation system to translate ontology labels in a domain-specific context.

Skimming the systems available for ontology management, we identified four that may be compared with the capabilities provided by ESSOT: Neon [27], VocBench [28], Protégé [29], and Knoodl. However, they do not fully support experts in the specific task of translating ontologies. The first two, Neon and VocBench, are more oriented towards supporting the management of multilinguality in ontologies, as they include dedicated mechanisms for modelling the multilingual aspects of each concept, while the support for multilinguality provided by Protégé and Knoodl is restricted to the sole description of the labels. Finally, none of them implements the capability of connecting the tool to an external machine translation system for suggesting translations automatically.

7 Conclusions

This paper presents ESSOT, an Expert Supporting System for Ontology Translation, implementing an automatic translation approach based on the enrichment of the text to translate with semantically structured data, i.e. ontologies or taxonomies. The ESSOT system integrates a domain-adaptable semantic translation component and collaborative knowledge management facilities for supporting language experts in the ontology translation activity. The platform has been used in the context of the Organic.Lingua EU project and on a set of multilingual ontologies coming from different domains, demonstrating its effectiveness in terms of the quality of the suggested translations and its usefulness from the language experts' point of view.