1 Introduction

In the past few years, knowledge bases (KBs) have been successfully used in many AI-related areas such as the Semantic Web, question answering and Web mining. Various KBs cover a broad range of domains and store rich, structured real-world facts. In a KB, each fact is stated as a triple of the form (entity, property, value), in which value can be either a literal or an entity. The sets of entities, properties, literals and triples are denoted by E, P, L and T, respectively. Blank nodes are ignored for simplicity. There are two types of properties, relationships (R) and attributes (A), and correspondingly two types of triples, namely relationship triples and attribute triples. A relationship triple \(tr\in E\times R\times E\) describes the relationship between two entities, e.g. (Texas, hasCapital, Austin), while an attribute triple \(tr\in E\times A\times L\) gives a literal attribute value to an entity, e.g. (Texas, areaTotal, “696241.00”).

As widely noted, KBs often suffer from two problems: (i) Low coverage. Different KBs are constructed by different parties using different data sources. They contain complementary facts, which makes it imperative to integrate multiple KBs. (ii) Multi-linguality gap. To support multi-lingual applications, a growing number of multi-lingual KBs and language-specific KBs have been built. This makes it both necessary and beneficial to integrate cross-lingual KBs.

Entity alignment is the task of finding entities in two KBs that refer to the same real-world object. It plays a vital role in automatically integrating multiple KBs. This paper focuses on cross-lingual entity alignment. It can help construct a coherent KB and deal with different expressions of knowledge across diverse natural languages. Conventional cross-lingual entity alignment methods rely on machine translation, of which the accuracy is still far from perfect. Spohr et al. [21] argued that the quality of alignment in cross-lingual scenarios heavily depends on the quality of translations between multiple languages.

Following the popular translation-based embedding models [1, 15, 22], a few studies leveraged KB embeddings for entity alignment and achieved promising results [5, 11]. Embedding techniques learn low-dimensional vector representations (i.e., embeddings) of entities and encode various semantics (e.g. types) into them. Focusing on KB structures, the embedding-based methods provide an alternative for cross-lingual entity alignment without considering their natural language labels.

There remain several challenges in applying embedding methods to cross-lingual entity alignment. First, to the best of our knowledge, most existing KB embedding models learn embeddings based solely on relationship triples. However, we observe that attribute triples account for a significant portion of KBs. For example, we count triples of infobox facts from English DBpedia (2016-04) and find 58,181,947 attribute triples, about three times as many as the relationship triples (18,598,409). For the task of entity alignment, attribute triples can provide additional information to embed entities, but how to incorporate them into cross-lingual embedding models remains largely unexplored. Second, thanks to the Linking Open Data initiative, there exist some aligned entities and properties between KBs, which can serve as a bridge between them. However, as discovered in [5], the existing alignment between cross-lingual KBs usually accounts for a small proportion. So how to make the best use of it is crucial for embedding cross-lingual KBs.

To deal with the above challenges, we introduce a joint attribute-preserving embedding model for cross-lingual entity alignment. It employs two modules, namely structure embedding (SE) and attribute embedding (AE), to learn embeddings based on two facets of knowledge (relationship triples and attribute triples) in two KBs, respectively. SE focuses on modeling relationship structures of two KBs and leverages existing alignment given beforehand as a bridge to overlap their structures. AE captures the correlations of attributes (i.e. whether these attributes are commonly used together to describe an entity) and clusters entities based on attribute correlations. Finally, it combines SE and AE to jointly embed all the entities in the two KBs into a unified vector space \(\mathbb {R}^d\), where d denotes the dimension of the vectors. The aim of our approach is to find latent cross-lingual target entities (i.e. truly-aligned entities that we want to discover) for a source entity by searching its nearest neighbors in \(\mathbb {R}^d\). We expect the embeddings of latent aligned cross-lingual entities to be close to each other.

In summary, the main contributions of this paper are as follows:

  • We propose an embedding-based approach to cross-lingual entity alignment, which does not depend on machine translation between cross-lingual KBs.

  • We jointly embed the relationship triples of two KBs with structure embedding and further refine the embeddings by leveraging attribute triples of KBs with attribute embedding. To the best of our knowledge, there is no prior work learning embeddings of cross-lingual KBs while preserving their attribute information.

  • We evaluated our approach on real-world cross-lingual datasets from DBpedia. The experimental results show that our approach largely outperformed two state-of-the-art embedding-based methods for cross-lingual entity alignment. Moreover, it could be complemented with conventional methods based on machine translation.

The rest of this paper is organized as follows. We discuss the related work on KB embedding and cross-lingual KB alignment in Sect. 2. We describe our approach in detail in Sect. 3, and report experimental results in Sect. 4. Finally, we conclude this paper with future work in Sect. 5.

2 Related Work

We divide the related work into two subfields: KB embedding and cross-lingual KB alignment. We discuss them in the rest of this section.

2.1 KB Embedding

In recent years, significant efforts have been made towards learning embeddings of KBs. TransE [1], the pioneer of translation-based methods, interprets a relationship vector as the translation from the head entity vector to its tail entity vector. In other words, if a relationship triple (h, r, t) holds, \(\mathbf {h}+\mathbf {r}\approx \mathbf {t}\) is expected. TransE has shown its great capability of modeling 1-to-1 relations and achieved promising results for KB completion. To further improve TransE, later work including TransH [22] and TransR [15] was proposed. Additionally, there exist a few non-translation-based approaches to KB embedding [2, 18, 20].

Besides, several studies take advantage of knowledge in KBs to improve embeddings. Krompaß et al. [13] added type constraints to KB embedding models and enhanced their performance on link prediction. KR-EAR [14] embeds attributes additionally by modeling attribute correlations and obtains good results on predicting entities, relationships and attributes. But it only learns attribute embeddings in a single KB, which hinders its application to cross-lingual cases. Besides, KR-EAR focuses on the attributes whose values are from a small set of entries, e.g. values of “gender” are {Female, Male}. It may fail to model attributes whose values are very sparse and heterogeneous, e.g. “name”, “label” and “coordinate”. RDF2Vec [19] uses local information of KB structures to generate sequences of entities and employs language modeling approaches to learn entity embeddings for machine learning tasks. For cross-lingual tasks, [12] extends NTNKBC [4] for cross-lingual KB completion. [7] uses a neural network approach that translates English KBs into Chinese to expand Chinese KBs.

2.2 Cross-Lingual KB Alignment

Existing work on cross-lingual KB alignment generally falls into two categories: cross-lingual ontology matching and cross-lingual entity alignment. For cross-lingual ontology matching, Fu et al. [8, 9] presented a generic framework, which utilizes machine translation tools to translate labels to the same language and uses monolingual ontology matching methods to find mappings. Spohr et al. [21] leveraged translation-based label similarities and ontology structures as features for learning cross-lingual mapping functions by machine learning techniques (e.g. SVM). In all these works, machine translation is an integral component.

For cross-lingual entity alignment, MTransE [5] incorporates TransE to encode KB structures into language-specific vector spaces and designs five alignment models to learn translation between KBs in different languages with seed alignment. JE [11] utilizes TransE to embed different KBs into a unified space with the aim that each seed alignment has similar embeddings, which is extensible to the cross-lingual scenario. Wang et al. [23] proposed a graph model, which only leverages language-independent features (e.g. out-/inlinks) to find cross-lingual links between Wiki knowledge bases. Gentile et al. [10] exploited embedding-based methods for aligning entities in Web tables. Different from them, our approach jointly embeds two KBs together and leverages attribute embedding for improvement.

3 Cross-Lingual Entity Alignment via KB Embedding

In this section, we first introduce notations and the general framework of our joint attribute-preserving embedding model. Then, we elaborate on the technical details of the model and discuss several key design issues.

We use lower-case bold-face letters to denote the vector representations of the corresponding terms, e.g., \((\mathbf {h},\mathbf {r},\mathbf {t})\) denotes the vector representation of triple (h, r, t). We use capital bold-face letters to denote matrices, and we use superscripts to denote different KBs. For example, \(\mathbf {E}^{(1)}\) denotes the representation matrix for entities in \(KB_1\) in which each row is an entity vector \(\mathbf {e}^{(1)}\).

3.1 Overview

The framework of our joint attribute-preserving embedding model is depicted in Fig. 1. Given two KBs, denoted by \(KB_1\) and \(KB_2\), in different natural languages and some pre-aligned entity or property pairs (called seed alignment, denoted by superscript \(^{(1,2)}\)), our model learns the vector representations of \(KB_1\) and \(KB_2\) and expects the latent aligned entities to be embedded closely.

Fig. 1. Framework of the joint attribute-preserving embedding model

Following TransE [1], we interpret a relationship as the translation from the head entity to the tail entity, to characterize the structure information of KBs. We let each pair in the seed alignment share the same representation to serve as a bridge between \(KB_1\) and \(KB_2\) to build an overlay relationship graph, and learn representations of all the entities jointly under a unified vector space via structure embedding (SE). The intuition is that two alignable KBs are likely to have a number of aligned triples, e.g. (Washington, capitalOf, America) in English and its correspondence \((Washington,capitaleDes,\acute{E}tats\text {-}Unis)\) in French. Based on this, SE aims at learning approximate representations for the latent aligned triples between the two KBs.

However, SE only constrains the learned representations to be compatible within each relationship triple, which causes a disorganized distribution of some entities due to the sparsity of their relationship triples. To alleviate this incoherent distribution, we leverage attribute triples to help embed entities, based on the observation that latent aligned entities usually have a high degree of similarity in attribute values. Technically, we overlook specific attribute values because of their complexity, heterogeneity and cross-linguality. Instead, we abstract attribute values to their range types, e.g. \((Tom,age,``12'')\) to (Tom, age, Integer), where Integer is the abstract range type of value “12”. Then, we carry out attribute embedding (AE) on abstract attribute triples to capture the correlations of cross-lingual and mono-lingual attributes, and calculate the similarities of entities based on them. Finally, the attribute similarity constraints are combined with SE to refine representations by clustering entities with high attribute correlations. In this way, our joint model preserves both relationship and attribute information of the two KBs.
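The abstraction of attribute values to range types can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the parsing order and the two date formats are assumptions, and the four type names are the ones distinguished in Sect. 3.3.

```python
from datetime import datetime


def abstract_range_type(value: str) -> str:
    """Map a literal value to Integer, Double, Datetime or String (default).

    The parsing order and the accepted date formats are assumptions made
    for this sketch; the paper does not specify them.
    """
    text = value.strip()
    try:
        int(text)
        return "Integer"
    except ValueError:
        pass
    try:
        float(text)
        return "Double"
    except ValueError:
        pass
    for fmt in ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S"):
        try:
            datetime.strptime(text, fmt)
            return "Datetime"
        except ValueError:
            pass
    return "String"


def abstract_triple(triple):
    """Turn (entity, attribute, literal) into (entity, attribute, range type)."""
    entity, attribute, value = triple
    return (entity, attribute, abstract_range_type(value))


print(abstract_triple(("Tom", "age", "12")))
print(abstract_triple(("Texas", "areaTotal", "696241.00")))
```

Only the abstract triples are passed on to attribute embedding, so literal values in different languages never need to be compared directly.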

With entities represented as vectors in a unified embedding space, the alignment of latent cross-lingual target entities for a source entity can be conducted by searching the nearest cross-lingual neighbors in this space.

3.2 Structure Embedding

The aim of SE is to model the geometric structures of two KBs and learn approximate representations for latent aligned triples. Formally, given a relationship triple \(tr=(h,r,t)\), we expect \(\mathbf {h}+\mathbf {r}=\mathbf {t}\). To measure the plausibility of tr, we define the score function \(f(tr)={\Vert \mathbf {h}+\mathbf {r}-\mathbf {t}\Vert }_2^2\). We prefer a lower value of f(tr) and want to minimize it for each relationship triple.
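As a minimal sketch of the score function, the following plain-Python snippet evaluates \(f(tr)\) on hypothetical two-dimensional unit vectors; the vector values are invented for illustration only.

```python
import math


def score(h, r, t):
    """f(tr) = ||h + r - t||_2^2 for one relationship triple."""
    return sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t))


# Toy unit-length vectors (hypothetical values) chosen so that h + r = t.
h = [1.0, 0.0]
t = [0.5, math.sqrt(3) / 2]
r = [t[0] - h[0], t[1] - h[1]]  # (-0.5, sqrt(3)/2), also of unit length

positive = score(h, r, t)            # a triple that holds
negative = score(h, r, [-1.0, 0.0])  # same head and relation, different tail

print(positive, negative)
```

The triple that holds gets a score of zero, while replacing the tail by an unrelated unit vector yields a clearly larger score.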

Figure 2 gives an example of how SE models the geometric structures of two KBs with seed alignment. In Phase (1), we initialize all the vectors randomly and let each pair in the seed alignment overlap to build the overlay relationship graph. In order to show the triples intuitively in the figure, we regard an entity as a point in the vector space and move relationship vectors to start from their head entities. Note that, at this point, entities and relationships are distributed randomly. In Phase (2), we minimize the scores of triples and make the vector representations compatible within each relationship triple. For example, the relationship capitalOf would tend to be close to capitaleDes because they share the same head entity and tail entity. In the meantime, the entity America and its correspondence \(\acute{E}tats\text {-}Unis\) would move close to each other due to their common head entity and approximate relationships. Therefore, SE is a dynamic spreading process. The ideal state after training is shown as Phase (3). We can see that the latent aligned entities America and \(\acute{E}tats\text {-}Unis\) lie together.

Fig. 2. An example of structure embedding

Furthermore, we find that negative triples (a.k.a. corrupted triples), which have been widely used in translation-based embedding models [1, 15, 22], are also valuable to SE. Suppose that another English entity China and its latent aligned French one Chine happen to lie close to America. SE may then mistake Chine for a candidate for America because of their short distance. Negative triples help reduce the occurrence of this coincidence. If we generate a negative triple \(tr' = (Washington,capitalOf,China)\) and learn a high score for \(tr'\), China would keep a distance away from America. As we enforce the length of any embedding vector to 1, the score function f has a constant maximum. Thus, we would like to minimize \( -f(tr') \) to learn a high score for \( tr' \).
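The constant maximum of f can be made explicit with the triangle inequality. The following short derivation assumes, as stated, that \(\mathbf {h}\), \(\mathbf {r}\) and \(\mathbf {t}\) all have unit length:

$$\begin{aligned} f(tr')={\Vert \mathbf {h}+\mathbf {r}-\mathbf {t}\Vert }_2^2\le \big ({\Vert \mathbf {h}\Vert }_2+{\Vert \mathbf {r}\Vert }_2+{\Vert \mathbf {t}\Vert }_2\big )^2=9, \end{aligned}$$

with equality when \(\mathbf {h}=\mathbf {r}=-\mathbf {t}\). Hence \(-f(tr')\) is bounded below by \(-9\), and minimizing it cannot diverge.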

In summary, we prefer lower scores for existing triples (positives) and higher scores for negatives, which leads us to minimize the following objective function:

$$\begin{aligned} \mathcal {O}_{SE}=\sum _{tr\in T}\sum _{tr'\in T'_{tr}} \big (f(tr)-\alpha f(tr')\big ), \end{aligned}$$
(1)

where T denotes the set of all positive triples and \(T'_{tr}\) denotes the associated negative triples for tr generated by replacing either its head or tail by a random entity (but not both at the same time). \(\alpha \) is a ratio hyper-parameter that weights positive and negative triples and its range is [0, 1]. It is important to remember that each pair in the seed alignment shares the same embedding during training, in order to bridge the two KBs.
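A minimal sketch of evaluating Eq. (1) with randomly corrupted triples is given below. It is not the authors' code: the helper names, the toy embeddings (which are not normalized) and the choice to exclude the replaced entity from the candidates are assumptions made for illustration.

```python
import random


def score(h, r, t):
    """f(tr) = ||h + r - t||_2^2."""
    return sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t))


def corrupt(triple, entities, rng):
    """Replace either the head or the tail (never both) by a random entity.

    Excluding the replaced entity itself is an assumption of this sketch.
    """
    h, r, t = triple
    if rng.random() < 0.5:
        return (rng.choice([e for e in entities if e != h]), r, t)
    return (h, r, rng.choice([e for e in entities if e != t]))


def se_objective(triples, entities, emb_e, emb_r, alpha=0.1, n_neg=1, seed=0):
    """Sum of f(tr) - alpha * f(tr') over positives and sampled negatives."""
    rng = random.Random(seed)
    total = 0.0
    for tr in triples:
        h, r, t = tr
        pos = score(emb_e[h], emb_r[r], emb_e[t])
        for _ in range(n_neg):
            nh, nr, nt = corrupt(tr, entities, rng)
            neg = score(emb_e[nh], emb_r[nr], emb_e[nt])
            total += pos - alpha * neg
    return total


# Hypothetical toy embeddings in which the positive triple holds exactly.
emb_e = {"Washington": [1.0, 0.0], "America": [0.0, 1.0], "China": [-1.0, 0.0]}
emb_r = {"capitalOf": [-1.0, 1.0]}
triples = [("Washington", "capitalOf", "America")]
print(se_objective(triples, list(emb_e), emb_e, emb_r, alpha=0.1))
```

Because the positive triple scores zero here, the objective is negative: it is driven entirely by the reward for scoring the corrupted triple highly.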

3.3 Attribute Embedding and Entity Similarity Calculation

Attribute Embedding. We call a set of attributes correlated if they are commonly used together to describe an entity. For example, attributes longitude, latitude and \(place\_name\) are correlated because they are widely used together to describe a place. Moreover, we want to assign a higher correlation to the pair of longitude and latitude because they have the same range type. We use seed entity pairs to establish correlations between cross-lingual attributes. Given an aligned entity pair \((e^{(1)},e^{(2)})\), we regard the attributes of \(e^{(1)}\) as correlated ones for each attribute of \(e^{(2)}\), and vice versa. We expect attributes with high correlations to be embedded closely.

To capture the correlations of attributes, AE borrows the idea from Skip-gram [16], a very popular model that learns word embeddings by predicting the context of a word given the word itself. Similarly, given an attribute, AE wants to predict its correlated attributes. In order to leverage the range type information, AE minimizes the following objective function:

$$\begin{aligned} \mathcal {O}_{AE}=-\sum _{(a,c)\in H}w_{a,c}\cdot \log p(c|a), \end{aligned}$$
(2)

where H denotes the set of positive (a, c) pairs, i.e., c is actually a correlated attribute of a, and the term p(c|a) denotes the probability. To prevent all the vectors from having the same value, we adopt the negative sampling approach [17] to efficiently parameterize Eq. (2), and \(\log p(c|a)\) is replaced with the term as follows:

$$\begin{aligned} \log \sigma (\mathbf {a}\cdot \mathbf {c})+\sum _{(a,c')\in H_a'}\log \sigma (\mathbf {-a}\cdot \mathbf {c}'), \end{aligned}$$
(3)

where \(\sigma (x)=\frac{1}{1+e^{-x}}\). \(H_a'\) is the set of negative pairs for attribute a generated according to a log-uniform base distribution, assuming that they are all incorrect.

We set \(w_{a,c}=1\) if a and c have different range types, and \(w_{a,c}=2\) otherwise, to increase the likelihood that attributes with the same range type are embedded similarly. In this paper, we distinguish four kinds of abstract range types, i.e., Integer, Double, Datetime and String (as default). Note that it is easy to extend to more types.
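The weighted loss for a single (a, c) pair, i.e., one summand of Eq. (2) with \(\log p(c|a)\) replaced by Eq. (3), can be sketched as follows. The function names and toy vectors are hypothetical, and the negatives are passed in directly rather than drawn from a log-uniform distribution.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def pair_weight(type_a, type_c):
    """w_{a,c}: 2 for attributes with the same range type, 1 otherwise."""
    return 2.0 if type_a == type_c else 1.0


def ae_pair_loss(a, c, negatives, weight):
    """-w_{a,c} * [log sigma(a.c) + sum over c' of log sigma(-a.c')]."""
    log_p = math.log(sigmoid(dot(a, c)))
    log_p += sum(math.log(sigmoid(-dot(a, c_neg))) for c_neg in negatives)
    return -weight * log_p


# Hypothetical unit vectors: c points in the same direction as a,
# while the sampled negative points in the opposite direction.
a, c, negatives = [1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]]
print(ae_pair_loss(a, c, negatives, pair_weight("Double", "Double")))
print(ae_pair_loss(a, c, negatives, pair_weight("Double", "String")))
```

Doubling the weight doubles the pair's contribution to the loss, so pairs with the same range type influence the learned attribute embeddings more strongly.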

Entity Similarity Calculation. Given attribute embeddings, we take the representation of an entity to be the normalized average of its attribute vectors, i.e., \(\mathbf {e}={[\sum _{a\in A_e}\mathbf {a}]}_1\), where \(A_e\) is the set of attributes of e and \([\mathbf {.}]_1\) denotes the normalized vector. We have two matrices of vector representations for entities in two KBs, \(\mathbf {E}_{AE}^{(1)}\in \mathbb {R}^{n_e^{(1)}\times d}\) for \(KB_1\) and \(\mathbf {E}_{AE}^{(2)}\in \mathbb {R}^{n_e^{(2)}\times d}\) for \(KB_2\), where each row is an entity vector, and \(n_e^{(1)},\) \(n_e^{(2)}\) are the numbers of entities in \(KB_1,KB_2\), respectively.
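The entity representation used here can be sketched in a few lines. Note that normalizing the sum of the attribute vectors gives the same unit vector as normalizing their average, since the two differ only by a positive factor. The attribute names and vectors below are hypothetical.

```python
import math


def normalize(v):
    """Scale v to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]


def entity_from_attributes(attr_names, attr_emb):
    """e = [sum of the entity's attribute vectors]_1."""
    dim = len(next(iter(attr_emb.values())))
    total = [0.0] * dim
    for name in attr_names:
        for i, x in enumerate(attr_emb[name]):
            total[i] += x
    return normalize(total)


# Hypothetical attribute embeddings.
attr_emb = {"longitude": [1.0, 0.0], "latitude": [0.0, 1.0], "name": [-1.0, 0.0]}
place = entity_from_attributes(["longitude", "latitude"], attr_emb)
print(place)
```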

We use the cosine similarity to measure the similarities between entities. For two entities \(e,e'\), we have \( \mathrm{sim}(e,e')={\cos } (\mathbf {e},\mathbf {e'})=\frac{\mathbf {e}\cdot \mathbf {e'}}{||\mathbf {e}||||\mathbf {e'}||}=\mathbf {e}\cdot \mathbf {e'}\), as the length of any embedding vector is enforced to 1. The cross-KB similarity matrix \(\mathbf {S}^{(1,2)}\in \mathbb {R}^{n_e^{(1)}\times n_e^{(2)}}\) between \(KB_1\) and \(KB_2\), as well as the inner similarity matrices \(\mathbf {S}^{(1)}\in \mathbb {R}^{n_e^{(1)}\times n_e^{(1)}}\) for \(KB_1\) and \(\mathbf {S}^{(2)}\in \mathbb {R}^{n_e^{(2)}\times n_e^{(2)}}\) for \(KB_2\), are defined as follows:

$$\begin{aligned} \mathbf {S}^{(1,2)}=\mathbf {E}_{AE}^{(1)}{\mathbf {E}_{AE}^{(2)\top }},\quad \mathbf {S}^{(1)}=\mathbf {E}_{AE}^{(1)}{\mathbf {E}_{AE}^{(1)\top }},\quad \mathbf {S}^{(2)}=\mathbf {E}_{AE}^{(2)}{\mathbf {E}_{AE}^{(2)\top }}. \end{aligned}$$
(4)

A similarity matrix \(\mathbf {S}\) holds the cosine similarities among entities and \(\mathbf {S}_{i,j}\) is the similarity between the i-th entity in one KB and the j-th entity in the same or the other KB. We discard lower values of \(\mathbf {S}\) because a low similarity of two entities indicates that they are likely to be different. So, we set the entry \(\mathbf {S}_{i,j}=0\) if \(\mathbf {S}_{i,j}<\tau \), where \(\tau \) is a threshold and can be set based on the average similarity of seed entity pairs. In this paper, we fix \(\tau =0.95\) for inner similarity matrices and 0.9 for cross-KB similarity matrix, to achieve high accuracy.
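As an illustration of Eq. (4) combined with the threshold \(\tau \), the following sketch builds a thresholded similarity matrix from rows that are assumed to be unit-length entity vectors. A dense list-of-lists is used for readability and the entity vectors are hypothetical.

```python
def similarity_matrix(E1, E2, tau):
    """S = E1 E2^T, with every entry below tau set to 0.

    Rows of E1 and E2 are assumed to be unit-length entity vectors,
    so each dot product is a cosine similarity.
    """
    S = []
    for e1 in E1:
        row = []
        for e2 in E2:
            s = sum(x * y for x, y in zip(e1, e2))
            row.append(s if s >= tau else 0.0)
        S.append(row)
    return S


# Hypothetical attribute-based entity vectors of two KBs.
E1 = [[1.0, 0.0], [0.0, 1.0]]
E2 = [[1.0, 0.0], [0.6, 0.8]]
print(similarity_matrix(E1, E2, tau=0.9))  # cross-KB threshold used in the paper
```

With a high threshold most entries become zero, which is what makes a sparse representation of the similarity matrices attractive.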

3.4 Joint Attribute-Preserving Embedding

We want similar entities across KBs to be clustered to refine their vector representations. Inspired by [25], we use the matrices of pairwise similarities between entities as supervised information and minimize the following objective function:

$$\begin{aligned} \mathcal {O}_{S}= & {} {\Vert \mathbf {E}_{SE}^{(1)}-\mathbf {S}^{(1,2)}\mathbf {E}_{SE}^{(2)}\Vert }_F^2 \nonumber \\&+\,\beta ({\Vert \mathbf {E}_{SE}^{(1)}-\mathbf {S}^{(1)}\mathbf {E}_{SE}^{(1)}\Vert }_F^2+{\Vert \mathbf {E}_{SE}^{(2)}-\mathbf {S}^{(2)}\mathbf {E}_{SE}^{(2)}\Vert }_F^2), \end{aligned}$$
(5)

where \(\beta \) is a hyper-parameter that balances similarities between KBs and their inner similarities. \(\mathbf {E}_{SE} \in \mathbb {R}^{n_e \times d}\) denotes the matrix of entity vectors for one KB in SE with each row an entity vector. \(\mathbf {S}^{(1,2)}\mathbf {E}_{SE}^{(2)}\) calculates latent vectors of entities in \(KB_1\) by accumulating vectors of entities in \(KB_2\) based on their similarities. By minimizing \({\Vert \mathbf {E}_{SE}^{(1)}-\mathbf {S}^{(1,2)}\mathbf {E}_{SE}^{(2)}\Vert }_F^2\), we expect similar entities across KBs to be embedded closely. The two inner similarity matrices work in the same way.
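The value of \(\mathcal {O}_{S}\) in Eq. (5) can be computed directly from the definitions, as in the following sketch. The helper names and the toy matrices are hypothetical, and dense pure-Python matrices are used only for readability.

```python
def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def frob_sq(A, B):
    """Squared Frobenius norm of A - B."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def similarity_objective(E1, E2, S12, S1, S2, beta):
    """O_S = ||E1 - S12 E2||_F^2 + beta (||E1 - S1 E1||_F^2 + ||E2 - S2 E2||_F^2)."""
    cross = frob_sq(E1, matmul(S12, E2))
    inner = frob_sq(E1, matmul(S1, E1)) + frob_sq(E2, matmul(S2, E2))
    return cross + beta * inner


# Toy case: both KBs hold the same two entity vectors.
E = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

print(similarity_objective(E, E, identity, identity, identity, beta=0.05))
print(similarity_objective(E, E, zeros, identity, identity, beta=0.05))
```

In the first call every entity is reconstructed exactly from its similar entities, so the objective vanishes; in the second call the cross-KB similarities carry no information and the cross-KB term equals \({\Vert \mathbf {E}_{SE}^{(1)}\Vert }_F^2\).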

To preserve both the structure and attribute information of two KBs, we jointly minimize the following objective function:

$$\begin{aligned} \mathcal {O}_{joint}=\mathcal {O}_{SE}+\delta \mathcal {O}_{S}, \end{aligned}$$
(6)

where \(\delta \) is a hyper-parameter weighting \(\mathcal {O}_{S}\).

3.5 Discussions

We discuss and analyze our joint attribute-preserving embedding model in the following aspects:

Objective Function for Structure Embedding. SE is a translation-based embedding model, but its objective function (see Eq. (1)) does not follow the margin-based ranking loss function below, which is used by many previous KB embedding models [1]:

$$\begin{aligned} \mathcal {O}\,=\,\sum _{tr\in T}\sum _{tr'\in T'_{tr}} \max [\gamma +f(tr)-f(tr'),0]. \end{aligned}$$
(7)

Equation (7) aims at distinguishing positive and negative triples, and expects that their scores can be separated by a large margin. However, for the cross-lingual entity alignment task, in addition to the large margin between their scores, we also want to assign lower scores to positive triples and higher scores to negative triples. Therefore, we choose Eq. (1) instead of Eq. (7).
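The difference between the two objectives can be seen on a small numeric example. The scores below are hypothetical, with \(\gamma =1\) chosen only for illustration and \(\alpha =0.1\) as in our experiments.

```python
def margin_term(f_pos, f_neg, gamma):
    """One summand of Eq. (7): max(gamma + f(tr) - f(tr'), 0)."""
    return max(gamma + f_pos - f_neg, 0.0)


def se_term(f_pos, f_neg, alpha):
    """One summand of Eq. (1): f(tr) - alpha * f(tr')."""
    return f_pos - alpha * f_neg


# Two hypothetical states with the same negative score; in both, the
# positive and negative scores are already separated by more than gamma.
loose = (1.0, 3.0)   # positive triple still has a fairly high score
tight = (0.1, 3.0)   # positive triple is almost perfectly satisfied

print(margin_term(*loose, gamma=1.0), margin_term(*tight, gamma=1.0))
print(se_term(*loose, alpha=0.1), se_term(*tight, alpha=0.1))
```

Once the margin is satisfied, Eq. (7) is zero in both states and no longer pushes the positive score down, whereas Eq. (1) still prefers the state in which the positive triple has the lower score.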

In contrast, JE [11] uses the margin-based ranking loss from TransE [1], while MTransE [5] does not have this as it does not use negative triples. However, as explained in Sect. 3.2, we argue that negative triples are effective in distinguishing the relations between entities. Our experimental results reported in Sect. 4.4 also demonstrate the effectiveness of negative triples.

Training. We initialize parameters such as vectors of entities, relations and attributes randomly based on a truncated normal distribution, and then optimize Eqs. (2) and (6) with a gradient descent optimization algorithm called AdaGrad [6]. Instead of directly optimizing \(\mathcal {O}_{joint}\), our training process involves two optimizers to minimize \(\mathcal {O}_{SE}\) and \(\delta \mathcal {O}_{S}\) independently. At each epoch, the two optimizers are executed alternately. When minimizing \(\mathcal {O}_{SE}\), f(tr) and \( -\alpha f(tr') \) can also be optimized alternately.

The length of any embedding vector is enforced to 1 for the following reasons: (i) this constraint prevents the training process from trivially minimizing the objective function by increasing the embedding norms and shaping the embeddings, (ii) it limits the randomness of entity and relationship distribution in the training process, and (iii) it fixes the mismatch between the inner product in Eq. (3) and the cosine similarity to measure embeddings [24].

Our model is also scalable in training. The structure embedding belongs to the translation-based embedding models, which have already been proved to be capable of learning embeddings at large scale [1]. We use sparse representations for matrices in Eq. (5) for saving memory. Additionally, the memory cost to compute Eq. (4) can be reduced using a divide-and-conquer strategy.

Parameter Complexity. The parameter complexity of our joint model is \(O\big (d(n_e+n_r+n_a)\big )\), where \(n_e,n_r,n_a\) are the numbers of entities, relationships and attributes, respectively. d is the dimension of the embeddings. Considering that \( n_r,n_a\ll n_e \) in practice and the seed alignment share vectors in training, the complexity of the model is roughly linear to the number of total entities.

Searching Latent Aligned Entities. Because the length of each vector always equals 1, the cosine similarity between entities of the two KBs can be calculated as \(\mathbf {D} = \mathbf {E}_{SE}^{(1)} {\mathbf {E}_{SE}^{(2)\top }}\). Thus, the nearest entities can be obtained by simply sorting each row of \( \mathbf {D} \) in descending order. For each source entity, we expect its truly-aligned target entity to rank among the first few.
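The search step can be sketched as follows, assuming the rows of both matrices are unit-length entity vectors. The function name and the toy vectors are hypothetical.

```python
def rank_targets(E1, E2):
    """For each source entity (row of E1), return target indices of E2
    sorted by descending dot product, i.e., nearest neighbors first."""
    ranked = []
    for e1 in E1:
        sims = [sum(x * y for x, y in zip(e1, e2)) for e2 in E2]
        order = sorted(range(len(E2)), key=lambda j: sims[j], reverse=True)
        ranked.append(order)
    return ranked


# Hypothetical embeddings: two source entities and three target entities.
E1 = [[1.0, 0.0], [0.0, 1.0]]
E2 = [[0.0, 1.0], [0.8, 0.6], [1.0, 0.0]]
print(rank_targets(E1, E2))
```

The position of the truly-aligned target in each returned list is the rank used by the evaluation metrics.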

4 Evaluation

In this section, we report our experiments and results on real-world cross-lingual datasets. We developed our approach, called JAPE, using TensorFlow, a very popular open-source software library for numerical computation. Our experiments were conducted on a personal workstation with an Intel Xeon E3 3.3 GHz CPU and 128 GB memory. The datasets, source code and experimental results are accessible online.

4.1 Datasets

We selected DBpedia (2016-04) to build three cross-lingual datasets. DBpedia is a large-scale multi-lingual KB including inter-language links (ILLs) from entities of the English version to those in other languages. In our experiments, we extracted 15 thousand ILLs with popular entities from English to Chinese, Japanese and French respectively, and considered them as our reference alignment (i.e., gold standards). To extract the datasets, we randomly selected ILL pairs such that the involved entities have at least 4 relationship triples, and then extracted relationship and attribute infobox triples for the selected entities. The statistics of the three datasets are listed in Table 1, which indicate that the number of involved entities in each language is much larger than 15 thousand, and that attribute triples account for a significant portion of the datasets.

Table 1. Statistics of the datasets

4.2 Comparative Approaches

As aforementioned, JE [11] and MTransE [5] are two representative embedding-based methods for entity alignment. In our experiments, we used our best effort to implement the two models as they do not release any source code or software currently. We conducted them on the above datasets as comparative approaches. Specifically, MTransE has five variants in its alignment model, where the fourth performs best according to the experiments of its authors. Thus, we chose this variant to represent MTransE. We followed the implementation details reported in [5, 11] and complemented other unreported details with careful consideration. For example, we added a strong orthogonality constraint for the linear transformation matrix in MTransE to ensure the invertibility, because we found it leads to better results. For JAPE, we tuned various parameter values and set \(d=75,\alpha =0.1,\beta =0.05,\delta =0.05\) for the best performance. The learning rates of SE and AE were empirically set to 0.01 and 0.1, respectively.

4.3 Evaluation Metrics

Following the conventions [1, 5, 11], we used Hits@k and Mean to assess the performance of the three approaches. Hits@k measures the proportion of correctly aligned entities ranked in the top k, while Mean calculates the mean of these ranks. A higher Hits@k and a lower Mean indicate better performance. It is worth noting that the optimal Hits@k and Mean usually do not occur at the same epoch in any of the three approaches. For fair comparison, we did not fix the number of epochs but used early stopping to avoid overtraining. The training process is stopped as soon as the change ratio of Mean is less than 0.0005. Besides, the training of AE on each dataset takes 100 epochs.
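The two metrics and the stopping rule can be sketched as follows, assuming 1-based ranks of the truly-aligned targets. The rank values are invented, and the exact definition of the change ratio (relative change of Mean between two consecutive evaluations) is an assumption of this sketch.

```python
def hits_at_k(ranks, k):
    """Proportion of test entities whose true target is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_rank(ranks):
    """Mean of the (1-based) ranks of the true targets."""
    return sum(ranks) / len(ranks)


def mean_converged(prev_mean, curr_mean, tol=0.0005):
    """Early-stopping test; the relative-change definition is an assumption."""
    return abs(prev_mean - curr_mean) / prev_mean < tol


ranks = [1, 3, 12, 60]  # hypothetical ranks of four test entities
print(hits_at_k(ranks, 1), hits_at_k(ranks, 10), hits_at_k(ranks, 50))
print(mean_rank(ranks))
```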

4.4 Experimental Results

Results on DBP15K. We used a certain proportion of the gold standards as seed alignment and left the rest as testing data, i.e., the latent aligned entities to discover. We tested the proportion from 10% to 50% with step 10%, and Table 2 lists the results using 30% of the gold standards. The variation of Hits@k with different proportions will be shown shortly. For relationships and attributes, we simply extracted the property pairs with exactly the same labels, which only account for a small portion of the seed alignment.

Table 2 indicates that JAPE largely outperformed JE and MTransE, since it captures both structure and attribute information of KBs. JE employs TransE as its basic model, which is not suitable to be directly applied to entity alignment as discussed in Sect. 3.5. Besides, JE does not impose a mandatory constraint on the length of vectors. Instead, it only minimizes \(\Vert \mathbf {v}\Vert _2^2-1\) to restrain vector length, which has an adverse effect. MTransE models the structures of KBs in different vector spaces, and information loss happens when learning the translation between vector spaces.

Table 2. Result comparison and ablation study

Additionally, we divided JAPE into three variants for ablation study, and the results are shown in Table 2 as well. We found that involving negative triples in structure embedding reduces the random distribution of entities, and involving attribute embedding as a constraint further refines the distribution of entities. The two improvements demonstrate that a systematic distribution of entities benefits the cross-lingual entity alignment task.

It is worth noting that the alignment direction (e.g. \(\mathrm {ZH}\rightarrow \mathrm {EN}\) vs. \(\mathrm {EN}\rightarrow \mathrm {ZH}\)) also causes a performance difference. As shown in Table 1, the relationship triples in a non-English KB are much sparser than those in an English KB, so the approaches based on relationship triples cannot learn good representations to model the structures of non-English KBs, as the restraints on entities are relatively insufficient. When performing alignment from an English KB to a non-English KB, we search for the nearest non-English entity as the aligned one for an English entity. The sparsity of the non-English KB leads to a disorganized distribution of its entities, which has negative effects on the task. However, the performance difference becomes narrower when involving attribute embedding, because the attribute triples provide additional information to embed entities, especially for sparse KBs.

Figure 3 provides the visualization of sample results for entity alignment and attribute correlations. We projected the embeddings of aligned entity pairs and involved attribute embeddings to two dimensions using PCA. The left part indicates that universities, countries, cities and cellphones were divided widely while aligned entities from Chinese to English were laid closely, which met our expectation of JAPE. The right part shows our attribute embedding clustered three groups of monolingual attributes (about cellphones, cities and universities) and one group of cross-lingual ones (about countries).

Fig. 3. Visualization of results on DBP15K\(_\text {ZH-EN}\)

Sensitivity to Proportion of Seed Alignment. Figure 4 illustrates the change of Hits@k with varied proportion of seed alignment. In accordance with our expectation, the results on all the datasets become better with the increase of the proportion, because more seed alignment can provide more information to overlay the two KBs. It can be seen that, when using half of the gold standards as seed alignment, JAPE performed encouragingly, e.g. Hits@1 and Hits@10 on DBP15K\(_\text {ZH-EN}\) are 53.27% and 82.91%, respectively. Moreover, even with a very small proportion of seed alignment like \(10\%\), JAPE still achieved promising results, e.g. Hits@10 on DBP15K\(_\text {ZH-EN}\) reaches 55.04% and on DBP15K\(_\text {JA-EN}\) reaches 44.69%. Therefore, it is feasible to deploy JAPE to various entity alignment tasks, even with limited seed alignment.

Fig. 4. Hits@k w.r.t. proportion of seed alignment

Combination with Machine Translation. Since machine translation is often used in cross-lingual ontology matching [9, 21], we designed a machine translation based approach that employs Google Translate to translate the labels of entities in one KB and computes similarities between the translations and the labels of entities in the other KB. For similarity measurement, we chose Levenshtein distance because of its popularity in ontology matching [3].
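The similarity step can be sketched as follows, assuming the labels have already been translated (the translation call itself is omitted). The normalization by the longer label length, the lowercasing and the function names are assumptions for illustration; the text above only states that Levenshtein distance was used.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) via row-wise DP."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def label_similarity(a, b):
    """Similarity in [0, 1]; normalizing by the longer label is an assumption."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest

def rank_candidates(translated_label, target_labels):
    """Order the labels of the other KB from most to least similar."""
    return sorted(target_labels,
                  key=lambda t: label_similarity(translated_label, t),
                  reverse=True)
```

For instance, `rank_candidates("Texas", ["Austin", "Texas", "Taxes"])` places `"Texas"` first, and the position of the correct entity in such a list is the rank used for Hits@k.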

We chose DBP15K\(_\text {ZH-EN}\) and DBP15K\(_\text {JA-EN}\), which pose large linguistic barriers. As depicted in Table 3, machine translation achieves satisfying results, especially for Hits@1, which we attribute to the high accuracy of Google Translate. However, the gap between machine translation and JAPE becomes smaller for Hits@10 and Hits@50, for the following reason. When Google Translate misunderstands the meaning of a label (e.g. due to polysemy), the top-ranked entities are all very likely to be wrong. In contrast, JAPE relies on the structure information of KBs, so the correct entities often appear only slightly further down the ranking. Besides, we found that translating from Chinese (or Japanese) to English is more accurate than the reverse direction.

To further investigate the possibility of combination, for each pair of latent aligned entities, we took the lower (i.e. better) of the two ranks produced by the two approaches as the combined rank. It is surprising to find that the combined results are significantly better, which reveals the mutual complementarity between JAPE and machine translation. We believe that, when aligning entities between cross-lingual KBs where the quality of machine translation is difficult to guarantee, or where many entities lack meaningful labels, JAPE can be a practical alternative.
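The combination rule can be sketched in Python as follows. Reading "lower rank" as the numerically smaller, and hence better, 1-based rank is an assumption; it is the reading consistent with the combined results improving on both inputs. The function name and the toy ranks are made up for illustration.

```python
def combine_ranks(ranks_mt, ranks_jape):
    """Per entity, keep the better (numerically smaller) of the two 1-based ranks."""
    shared = ranks_mt.keys() & ranks_jape.keys()
    return {e: min(ranks_mt[e], ranks_jape[e]) for e in shared}

# Made-up ranks of the correct counterpart for three test entities.
ranks_mt = {"e1": 1, "e2": 40, "e3": 200}   # translation-based ranking
ranks_jape = {"e1": 5, "e2": 3, "e3": 8}    # embedding-based ranking

combined = combine_ranks(ranks_mt, ranks_jape)
# Number of entities whose correct counterpart is in the combined top 10.
top10 = sum(1 for r in combined.values() if r <= 10)
```

In this toy case translation alone places one entity in the top 10, whereas the combined ranking places all three there, because each approach recovers entities the other ranks poorly.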

Table 3. Combination of machine translation and JAPE

Results at Larger Scale. To test the scalability of JAPE, we built three larger datasets by choosing 100 thousand ILLs between English and Chinese, Japanese and French in the same way as for DBP15K. The threshold of relationship triples for selecting ILLs was set to 2. Each dataset contains several hundred thousand entities and several million triples. We set \(d=100, \beta =0.1\) and kept the other parameters the same as for DBP15K. For JE, the training took 2000 epochs, as reported in its paper. The results on DBP100K are listed in Table 4. Due to lack of space, only Hits@10 is reported. We found that results and conclusions similar to those for DBP15K hold for DBP100K, which indicates the scalability and stability of JAPE.

Table 4. Hits@10 comparison on DBP100K

Furthermore, the performance of all the methods decreases to some extent on DBP100K. We think that the reasons are twofold: (i) DBP100K contains quite a few "sparse" entities involved in a very limited number of triples, which hampers embedding the structure information of KBs; and (ii) as the number of latent aligned entities in DBP100K is several times larger than that in DBP15K, the TransE-based models suffer from the increased occurrence of multi-mapping relations, as explained in [22]. Nevertheless, JAPE still outperformed JE and MTransE.

5 Conclusion and Future Work

In this paper, we introduced a joint attribute-preserving embedding model for cross-lingual entity alignment. We proposed structure embedding and attribute embedding to represent the relationship structures and attribute correlations of KBs and to learn approximate embeddings for latent aligned entities. Our experiments on real-world datasets demonstrated that our approach achieved better results than two state-of-the-art embedding approaches and could be complemented with conventional methods based on machine translation.

In future work, we look forward to improving our approach in several aspects. First, the structure embedding suffered from multi-mapping relations, so we plan to extend it with cross-lingual hyperplane projection. Second, our attribute embedding discarded attribute values due to their diversity and cross-linguality; we want to incorporate them using cross-lingual word embedding techniques. Third, we would like to evaluate our approach on more heterogeneous KBs developed by different parties, such as DBpedia and Wikidata.