
1 Introduction

Learning to predict relationships between entities plays a vital role in recommendation systems, knowledge base population, question answering systems, etc. In the big data era, we aim to infer the relation between two homogeneous or heterogeneous entities from the latent knowledge behind the data. For example, in the business mining field, we attempt to predict the relationship between two companies, which can help an enterprise search for its potential customers or providers. The relation prediction (RP) task differs from the relation extraction (RE) task: the entity pairs in the RP task do not appear in the same sentence, and they may instead share many common attributes. For instance, a pair of companies share multiple common attributes such as company name, company address, company profile, business scope, etc. Hence, a pressing requirement for advancing RP is to develop a novel model that can support entity relation prediction over such multi-attribute data.

For the RP task, most traditional methods are kernel-based [1,2,3,4] and require complicated feature engineering. Moreover, those methods are limited in capturing latent semantic information and cannot easily extract new effective features from relation examples. In recent years, a variety of neural network models [5,6,7,8,9,10,11,12,13,14,15] have been widely applied to relation prediction and have achieved remarkable success. These models are mainly based on distributed representation architectures, which learn a scalar-output representation of an entity relation. They can be divided into three categories: CNN-based methods [10, 11], RNN-based methods [12, 13] and Transformer-based methods [14, 15]. Although the above methods can capture latent semantic information thanks to deep learning, some disadvantages remain: 1) they heavily rely on the quality of the entity semantic representation, and using a single scalar-output to represent an entity relation is limited because attribute information is abundant and diverse; 2) they fail to retain the precise spatial relationships between high-level parts, while structural relationships such as the homogeneous information across attributes are valuable. To address these problems, capsule network methods have been proposed [16,17,18]; they encapsulate multiple attributes into groups of neurons and replace scalar-output feature detectors with vector-output capsules to preserve additional information such as position and correlation [19]. Recently, capsule networks have achieved competitive results in classification tasks [18, 20, 21] and relation extraction [22], especially in structural information learning tasks [19, 23]. In the RP task, the entity relationship usually correlates with the entities' multiple attributes. Capsule networks can represent these multiple attributes as individual capsules, preserving the structural, correlation and relationship information between attributes to drive entity relation prediction.

In this paper, we propose a novel Attribute-driven Capsule Network (ACNet) for the entity relation prediction task, which retains structural and correlation information by using capsule networks. ACNet makes attribute capsules generate relation capsules through a self-attention routing method, which assigns weights to different attribute capsules and improves relation prediction performance. Furthermore, we adopt a k-max pooling method to improve representation robustness and training efficiency. The major contributions of this paper are briefly summarized as follows.

  1. We propose a novel entity relation prediction approach based on capsule networks with a self-attention routing method for multi-attribute entity datasets. To the best of our knowledge, this is the first work in which a capsule network has been empirically applied to multi-attribute entity relation prediction.

  2. We devise a self-attention routing method between the attribute capsule layer and the relation capsule layer to improve the ability to capture relational semantic information. We conduct extensive experiments on a new real-world CompanyRelationCollection (CRC) dataset that we constructed from the web and on the public BlurbGenreCollection (BGC) dataset. The results demonstrate that our ACNet consistently outperforms the state-of-the-art baselines.

The rest of the paper is organized as follows: Sect. 2 describes our proposed model; Sect. 3 evaluates the approach; Sect. 4 concludes the paper.

Fig. 1. The whole framework of the ACNet method. It consists of four layers: 1) the Attribute Embedding Layer converts each common attribute into attribute-key and attribute-value vectors; 2) the Attribute Capsule Layer extracts features from the attribute-value vector and generates attribute capsules; 3) the Relation Capsule Layer uses the self-attention routing method to aggregate attribute capsules and attribute-key vectors into a set of relation feature capsules; 4) the Class Capsule Layer produces classification capsules that represent each relation category.

2 Model

2.1 Definitions

Definition 1 (Relation prediction)

Let D=\(\{e^1, e^2,...\,e^n\}\) represent a set of entities, where \(|D|=n\). The ordered pair \({<}e^s, e^o{>}\) in D denotes that the two entities have a relevant link, represented by \(r{<}e^s, e^o{>}\), which is called a relation. The learning task is to learn the target predictive function for the relation, \(f_r{<}e^s, e^o{>}\).

Definition 2 (Attribute information)

Let \(e^s=\{k_1:v^{s}_{1},...\,,k_N:v^{s}_{N}\}\) represent the multi-attribute information of an entity, where \(|e^s|=N\) means that each entity has N attributes. The attributes are expressed as key-value (k : v) pairs. We use \(E^k\) (attribute-key) and \(E^v\) (attribute-value) to represent the key and value information of one attribute, respectively.
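For illustration, a multi-attribute entity can be written as a plain key-value mapping. The sketch below uses the CRC attribute keys listed in Sect. 3.1; the values are hypothetical placeholders, not data from the dataset.

```python
# Illustrative only: a CRC company entity as key-value attribute pairs
# (attribute keys from Sect. 3.1; the values are hypothetical placeholders).
entity_s = {
    "company_name": "...",
    "company_address": "...",
    "company_type": "...",
    "industry_category": "...",
    "business_scope": "...",
    "abstract": "...",
}
```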

2.2 Attribute Embedding Layer

The whole framework of our method is shown in Fig. 1. The purpose of the attribute embedding layer is to turn each common attribute into two separate embedding vectors. Let \(M^{E} \in \mathbb {R}^{d_w\times |V|}\) represent the attribute information embedding matrix, where \(d_w\) is the dimension of the word vectors and |V| is the vocabulary size. The embedding of the j-th attribute is separated into two vectors: \(E^{k_j} \in \mathbb {R}^{1\times d_w}\) represents the attribute-key embedding vector and \(E^{v_j}=\{x_1,...\,,x_i,...\,,x_L\} \in \mathbb {R}^{L\times d_w}\) represents the attribute-value embedding vector, where \(x_i\) is the i-th word in the attribute-value sentence and L is the sentence length. For a relation pair \({<}e^s, e^o{>}\), since the relationship is strongly related to their common attribute information, we combine their attribute-value vectors for each attribute, which can be formulated as \({<}e^s, e^o{>}=\{E^{k_1}:E^{v_{1}^{s}}+E^{v_{1}^{o}},...\,,E^{k_N}:E^{v_{N}^{s}}+E^{v_{N}^{o}}\}=\{E^{k_1}:E^{v_1},...\,,E^{k_N}:E^{v_N}\}\).
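The following is a minimal PyTorch sketch of this layer, assuming a single shared word-embedding table \(M^E\). The module name and interface are illustrative rather than the authors' implementation; the two entities' value embeddings are fused by element-wise addition, matching the formula above.

```python
import torch
import torch.nn as nn

class AttributeEmbedding(nn.Module):
    """Illustrative sketch: maps one attribute of an entity pair to E^{k_j} and a fused E^{v_j}."""

    def __init__(self, vocab_size, d_w=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_w)  # shared table M^E of size |V| x d_w

    def forward(self, key_id, value_ids_s, value_ids_o):
        # key_id: (1,) token id of the attribute-key
        # value_ids_s / value_ids_o: (L,) token ids of the two entities' attribute-value sentences
        E_k = self.embed(key_id)          # (1, d_w)  attribute-key embedding E^{k_j}
        E_v_s = self.embed(value_ids_s)   # (L, d_w)  value embedding of entity e^s
        E_v_o = self.embed(value_ids_o)   # (L, d_w)  value embedding of entity e^o
        E_v = E_v_s + E_v_o               # (L, d_w)  element-wise fusion, as in the formula above
        return E_k, E_v
```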

2.3 Attribute Capsule Layer

This layer uses multiple convolution operations to extract n-gram features from the attribute-value embedding, which contain local semantic information about the attribute-key within a fixed window. In general, let \(x_{i:i+j}\) refer to the concatenation of words \(x_i,x_{i+1},...\,,x_{i+j}\). The multiple convolution operations involve a kernel group \(F \in \mathbb {R}^{d_p\times (d_w\times h)}\), which is applied to a window of h words to generate multiple h-gram features. \(d_w\times h\) is the size of one convolutional kernel, h is the n-gram size and \(d_p\) is the dimension of one attribute capsule. For the j-th attribute, a feature \(c_i\) is produced from a window of words \(x_{i:i+h-1}\) by

$$\begin{aligned} c_i=F\odot E^{v_j}_{i:i+h-1}+b \end{aligned}$$
(1)

where \(\odot \) denotes component-wise multiplication and \( b \in \mathbb {R}\) is a bias term. Thus, we obtain a set of attribute capsules \(c \in \mathbb {R}^{d_p\times (L-h+1)}\), which encapsulate n-gram features extracted from the whole textual information of an attribute-value.

In fact, different kernel groups F can capture different categories of semantic meaning. We repeat the above procedure B times with different kernel groups and obtain multiple channels of features representing B categories of semantic meaning. The final output of this layer is arranged as \(C \in \mathbb {R}^{B\times d_p\times (L-h+1)}\), where \(C=[c_1,c_2,...\,c_B]\).
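A hedged PyTorch sketch of Eq. 1 with the shapes defined above. The B kernel groups are folded into one convolution with \(B\times d_p\) output channels, which is an equivalent but assumed arrangement, not necessarily the authors' implementation.

```python
import torch
import torch.nn as nn

class AttributeCapsuleLayer(nn.Module):
    """Illustrative sketch of Eq. 1: B channels of d_p-dimensional n-gram capsules."""

    def __init__(self, d_w=200, d_p=16, h=3, B=64):
        super().__init__()
        # The B kernel groups F are folded into one convolution with B*d_p output channels.
        self.conv = nn.Conv1d(in_channels=d_w, out_channels=B * d_p, kernel_size=h)
        self.B, self.d_p = B, d_p

    def forward(self, E_v):
        # E_v: (L, d_w) attribute-value embedding
        x = E_v.t().unsqueeze(0)              # (1, d_w, L) layout expected by Conv1d
        feats = self.conv(x)                  # (1, B*d_p, L-h+1) sliding-window features
        C = feats.view(self.B, self.d_p, -1)  # (B, d_p, L-h+1) attribute capsules
        return C
```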

2.4 Relation Capsule Layer

Self-Attention Routing Approach. For a common attribute, different n-gram features are extracted from the attribute-value and have different effects on the attribute-key. Moreover, the n-gram features contain different semantic information about the entity relation. To address this problem, and following the self-attention mechanism [24, 25], we propose a novel attention routing approach to compute the weights for the n-gram features of h-sized windows in \(E^{v_j}\).

First, we apply a fusing convolution operation to the embedding \(E^{v_j}\) with a kernel \(V_s \in \mathbb {R}{^{d_w\times h}}\), obtaining a set of feature vectors \(F_s \in \mathbb {R}^{1\times (L-h+1)}\). Second, we take \(E^{k_j}\) as the query input; a simple linear projection is used to construct the query \(Q =E^{k_j}W^q\) and the key \(K=F_sW^k\). Then, we use the query Q to perform scaled dot-product attention over K. The returned scores are put through a softmax function to produce a set of weights.

$$\begin{aligned} a_i=softmax \left( \frac{E^{k_j}W^q(F_s{W^k})^T}{\sqrt{d_w}}\right) \end{aligned}$$
(2)

where \(a_i \in \mathbb {R}{^{1\times (L-h+1)}}\), and \(W^q \in \mathbb {R}{^{d_w\times u}}\) and \(W^k \in \mathbb {R}{^{(L-h+1)\times u}}\) are weighted parameter matrices. The generated attention weight \(a_i \in [0,1]\) carries information with respect to the attribute-value and controls how much information in the current n-gram feature is transmitted to the next layer. If \(a_i\) is zero, the feature capsule is totally blocked. Since the previous layer produces B different channels of attribute capsules, we repeat the above computation B times to get the whole set of attention routing weights \(A \in \mathbb {R}{^{B\times 1\times (L-h+1)}}\), where \(A=[a_1,a_2,...\,,a_B]\). Finally, the attribute capsules are routed using these weights:

$$\begin{aligned} S=C\odot A \end{aligned}$$
(3)

where \(S \in \mathbb {R}{^{B\times d_p\times (L-h+1)}}\) denotes the attribute-customized feature capsules and \(\odot \) denotes element-wise multiplication.
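Below is an illustrative PyTorch sketch of Eqs. 2-3. Since the stated shapes of \(F_s\) and \(W^k\) do not compose directly, the sketch adopts one consistent reading: each of the \(L-h+1\) window features is projected to a u-dimensional key and the attribute-key to a u-dimensional query. For brevity a single set of weights is shown, whereas the text repeats the computation for each of the B channels.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionRouting(nn.Module):
    """Illustrative sketch of Eqs. 2-3: attribute-key attends over fused window features."""

    def __init__(self, d_w=200, h=3, u=16):
        super().__init__()
        self.fuse = nn.Conv1d(d_w, 1, kernel_size=h)  # fusing convolution V_s -> F_s
        self.W_q = nn.Linear(d_w, u, bias=False)      # query projection of E^{k_j}
        self.W_k = nn.Linear(1, u, bias=False)        # key projection of each window feature
        self.scale = math.sqrt(d_w)

    def forward(self, E_k, E_v, C):
        # E_k: (1, d_w), E_v: (L, d_w), C: (B, d_p, L-h+1) attribute capsules
        F_s = self.fuse(E_v.t().unsqueeze(0)).squeeze(0).t()  # (L-h+1, 1) window features
        Q = self.W_q(E_k)                                     # (1, u)
        K = self.W_k(F_s)                                     # (L-h+1, u)
        a = F.softmax(Q @ K.t() / self.scale, dim=-1)         # (1, L-h+1) routing weights, Eq. 2
        S = C * a.unsqueeze(0)                                # (B, d_p, L-h+1) gated capsules, Eq. 3
        return S
```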

Relation Capsule Generation. The above S is transformed from the attribute capsules. Although it encodes key-related information, S still contains many unrelated capsules. Moreover, the large number of capsules in S may prevent the next layer from learning robust representations. Using a method similar to max pooling would lose capsules carrying structural or positional information. Hence, we adopt a compromise and apply a k-max method to S, aggregating all attribute capsules horizontally along the third dimension.

$$\begin{aligned} P=max(k,[S_1,S_2,..,S_{L-h+1}]) \end{aligned}$$
(4)

where \(P \in \mathbb {R}{^{B\times k\times d_p}}\), k is a constant indicating that the top k capsules are preserved, and max denotes a descending sort operation. Through Eq. 4, the network can filter out many unimportant capsules and retain the more important relation capsules.
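A short sketch of the k-max selection in Eq. 4. The ranking criterion is left implicit in the text, so capsules are ranked by their L2 norm here as an assumption.

```python
import torch

def k_max_capsules(S, k):
    """One reading of Eq. 4: keep the k most salient capsules along the position axis.

    S: (B, d_p, L-h+1) attribute-customized capsules; capsules are ranked by L2 norm
    (an assumption, since the ranking criterion is left implicit in the text).
    Returns P of shape (B, k, d_p).
    """
    norms = S.norm(dim=1)                               # (B, L-h+1) capsule lengths
    idx = norms.topk(k, dim=-1).indices                 # (B, k) positions of the top-k capsules
    idx = idx.unsqueeze(1).expand(-1, S.size(1), -1)    # (B, d_p, k)
    P = S.gather(dim=2, index=idx)                      # (B, d_p, k) selected capsules
    return P.transpose(1, 2)                            # (B, k, d_p)
```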

Through Eqs. 1 to 4, one attribute generates a set of capsules P; we process the N attributes in the same way to obtain the set of all attribute capsules u. Then we combine all attribute capsules together to produce the relation capsules:

$$\begin{aligned} u=[P_1,P_2,...,P_N] \end{aligned}$$
(5)

where \(u \in \mathbb {R}{^{(B\times k\times N)\times d_p}}\). Then, we apply the non-linear “squash” function [16], so that the length of each relation feature capsule \(u_i\) represents the probability that the feature it encodes is present in the current input.

$$\begin{aligned} u_i \leftarrow \frac{{\left\| u_i \right\| }^2}{1+{\left\| u_i \right\| }^2}\frac{u_i}{\left\| u_i \right\| } \end{aligned}$$
(6)
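A small sketch of Eqs. 5-6: the per-attribute capsule sets are stacked across the N attributes and squashed so that each capsule's length behaves as a probability. Shapes follow the definitions above; the helper names are illustrative.

```python
import torch

def squash(u, dim=-1, eps=1e-8):
    """Eq. 6: keep each capsule's direction, map its length into [0, 1)."""
    sq_norm = (u ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * u / torch.sqrt(sq_norm + eps)

def build_relation_capsules(per_attribute_caps):
    """Eq. 5: stack the N attributes' capsule sets, then squash each relation capsule.

    per_attribute_caps: list of N tensors of shape (B, k, d_p).
    Returns u of shape (B*k*N, d_p).
    """
    u = torch.cat([P.reshape(-1, P.size(-1)) for P in per_attribute_caps], dim=0)
    return squash(u, dim=-1)
```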

2.5 Class Capsules Layer

The capsule network uses class capsules to represent entity relation categories, which means the number of class capsules must be consistent with the number of relation categories. Let the number of relation categories be m; then m class capsules are learned in this layer. Each class capsule is used to calculate the classification probability of the corresponding relation in the entity relation prediction task. Hence, each class capsule has its own routing weights to adaptively aggregate relation capsules from the previous layer.

A prediction vector \(\hat{u}_{j|i}\) is generated by multiplying the output \(u_i\) of relation capsule i by a weight matrix \(W_{ij} \in \mathbb {R}{^{d_r\times d_p}}\), where \(d_r\) and \(d_p\) are the dimensions of class capsule j and relation capsule i, respectively.

$$\begin{aligned} \hat{u}_{j|i}=W_{ij}u_i \end{aligned}$$
(7)

Then all prediction vectors generated by relation capsules are summed up with weights \(c_{ij}\) to obtain the vector representation \(s_j\) of class capsule j:

$$\begin{aligned} \mathbf {s_j}=\sum _i{c_{ij}}\hat{u}_{j|i} \end{aligned}$$
(8)

where \(c_{ij}\) is a coupling coefficient that determines the contribution of each relation capsule’s output to a class capsule. The coefficients are calculated using a dynamic routing heuristic [16] and defined by a “routing softmax”:

$$\begin{aligned} c_{ij}=\frac{exp(b_{ij})}{\sum _k exp(b_{ik}) } \end{aligned}$$
(9)

where each \(b_{ij}\) is the log prior probability that relation capsule i should be routed to class capsule j. It is computed using the dynamic routing approach given in Algorithm 1 [16].

After that, we apply the non-linear “squash” function of Eq. 6 again to \(\mathbf {s_j}\) to obtain the final representation \(\mathbf {v_j}\) of class capsule j.

$$\begin{aligned} \mathbf {v_j}=squash(\mathbf {s_j}) \end{aligned}$$
(10)
Algorithm 1. Dynamic routing between relation capsules and class capsules [16].
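Since Algorithm 1 is only referenced above, the following sketch reproduces the standard dynamic routing procedure of [16], adapted to the shapes used in this section (relation capsules \(u_i\), class capsules \(\mathbf{v_j}\)); it is illustrative rather than the authors' exact pseudocode.

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    """Eq. 6 / Eq. 10: non-linear squashing."""
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

def dynamic_routing(u, W, num_iters=3):
    """Standard dynamic routing of [16] between relation and class capsules (Eqs. 7-10).

    u: (I, d_p) relation capsules; W: (J, I, d_r, d_p) transformation matrices W_ij.
    Returns v: (J, d_r) class capsules.
    """
    u_hat = torch.einsum('jird,id->jir', W, u)        # prediction vectors \hat{u}_{j|i}, Eq. 7
    b = torch.zeros(u_hat.size(0), u_hat.size(1))     # routing logits b_ij
    for _ in range(num_iters):
        c = F.softmax(b, dim=0)                       # coupling coefficients c_ij, Eq. 9
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)      # weighted sum s_j, Eq. 8
        v = squash(s, dim=-1)                         # class capsules v_j, Eq. 10
        b = b + torch.einsum('jir,jr->ji', u_hat, v)  # update logits by agreement
    return v
```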

2.6 Margin Loss

We use the length of a class capsule vector to represent the probability of the corresponding relationship between entities. The capsule length of the active relation should be larger than those of the others. We adopt a separate margin loss \(L_j\) for each class capsule j in our task:

$$\begin{aligned} L_j=T_j max(0,m^+ -{\left\| \mathbf {v_j} \right\| })^2 +\lambda (1-T_j)max(0,{\left\| \mathbf {v_j} \right\| }-m^-)^2 \end{aligned}$$
(11)

where \(T_j=1\) if the entity relation is present in class capsule j. We set \(m^+=0.9\), \(m^-=0.1\) and \(\lambda =0.5\), following [16]. The total loss is simply the sum of the losses of all class capsules, \(L_T=\sum _{j=1}^J L_j\).
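For concreteness, a short sketch of the margin loss in Eq. 11 with the constants stated above; the function name and tensor layout are illustrative.

```python
import torch

def margin_loss(v, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Eq. 11: separate margin loss per class capsule, summed over all classes.

    v: (J, d_r) class capsules; targets: (J,) one-hot indicator T_j (float).
    """
    lengths = v.norm(dim=-1)                                          # ||v_j||
    loss_pos = targets * torch.clamp(m_pos - lengths, min=0) ** 2
    loss_neg = lam * (1.0 - targets) * torch.clamp(lengths - m_neg, min=0) ** 2
    return (loss_pos + loss_neg).sum()                                # L_T = sum_j L_j
```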

3 Experiments

3.1 Datasets

We evaluate our model on two datasets: CompanyRelationCollection (CRC), which refers to company entities, and BlurbGenreCollection (BGC), which refers to book entities. CRC is a Chinese company entity dataset collected from the Internet. It consists of 58,013 company entities, each with 6 common attributes: company_name, company_address, company_type, industry_category, business_scope and abstract. For a pair of company entities, there exist three major link types: customer (C), provider (P) and rival (R). Figure 2 gives an example of a relationship in the CRC dataset, which shows that \({Entity}\_O\) is a customer of \({Entity}\_S\). In our task, we retain only one relationship between two company entities. BGC is a public English dataset [23]. It includes 91,892 book entities, and each entity has 3 common attributes. We define three kinds of link according to book categories: similar (S), presumably-similar (P) and dissimilar (D). Table 1 lists some important quantitative characteristics of both datasets.

Fig. 2. An example of one relationship in the CRC dataset.

Table 1. Quantitative characteristics of both datasets

3.2 Experimental Setting

Since attribute-values differ in text length, we fix the text length of all attribute-values to 200. We train word embeddings [26] on the two datasets respectively.

For all the experiments below, we set the following parameters: \(d_w=200\) for the word embedding dimensionality, and \(d_p=16\) and \(d_r=24\) for the dimensions of a relation capsule and a class capsule, respectively. We tune the remaining parameters by grid search on the validation set; the optimal learning rate for the Adam optimizer is 0.001 and the number of filters B is 64. The significant parameter k is tested with 12 different values in our experiments; a detailed discussion is presented in Sect. 3.5. Table 2 lists the hyper-parameters of the optimum model, selected according to the evaluation results on the validation set. For other parameters, we use empirical settings because they have little influence on the performance of our model. Three widely used evaluation metrics are applied in the experiments: precision, recall and F1. We select F1 as the main evaluation indicator.
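For reference, the hyper-parameters reported in this subsection can be grouped into a single configuration; the dictionary below is an illustrative grouping with values taken from the text, not the authors' configuration file.

```python
# Illustrative grouping of the hyper-parameters reported in Sect. 3.2
# (values from the text; not the authors' configuration file).
ACNET_CONFIG = {
    "max_value_length": 200,  # fixed attribute-value text length
    "d_w": 200,               # word embedding dimension
    "d_p": 16,                # attribute/relation capsule dimension
    "d_r": 24,                # class capsule dimension
    "B": 64,                  # number of kernel groups (filters)
    "learning_rate": 0.001,   # Adam optimizer
    # k is tuned over 12 values; see Sect. 3.5
}
```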

3.3 Baselines

To demonstrate the superiority of ACNet on the relation prediction task between multi-attribute entities, we compare it with the following six baselines. CNN proposes a convolutional deep neural network model for entity relation mining. PCNN puts forward a piecewise CNN model for distantly supervised relation mining. BLSTM proposes a bidirectional LSTM model for relation mining. ATT-BLSTM is a bidirectional LSTM model with an attention mechanism. BERT is a pre-trained bidirectional Transformer model for relation mining. All of the above methods are based on the idea of classification and are therefore suitable for our relation prediction task. Basic-Caps is the original capsule network model without the self-attention routing mechanism.

Table 2. List of hyper-parameters
Table 3. Comparison results of different methods. Best scores are in bold.

3.4 Main Results

We conduct experiments on the seven methods mentioned above. The comparison results of all models are shown in Table 3. It is clear that our attribute-driven capsule network achieves the highest F1 scores, as well as the highest recall, outperforming the other deep network-based methods. Our ACNet also has the smallest fluctuation between precision and recall on the two datasets. Both observations demonstrate that our method is superior to the other models for relation prediction between multi-attribute entities and that its learning on the RP task is more robust.

The fact that ACNet significantly outperforms Basic-Caps and the other baselines on the F1 indicator shows that the self-attention routing approach is effective: it captures more relational semantic information and enhances the representation of class capsules through the self-attention mechanism. Moreover, capsule networks are shown to identify and combine relational information from common attributes more accurately than the baselines.

BERT achieves the best precision among all baselines, which demonstrates that a powerful pre-trained model can obtain rich representation information. We can also observe that the self-attention routing approach is beneficial to relation prediction between multi-attribute entities. Among the CNN-based methods, PCNN achieves higher performance than CNN since it uses a dynamic max pooling method and preserves some level of information about attributes. However, BLSTM is better than ATT-BLSTM, which is beyond our expectations. ATT-BLSTM performs the worst of all baselines, perhaps because its attention model cannot extract more effective attribute features.

3.5 Parameter Analysis

Stability of Training. To investigate the difference between our model and other models in the training process, we conduct a statistical analysis of the training loss, where we set the number of epochs to 10 and select the lowest training loss at each epoch for comparison. The results are shown in Fig. 3. We can observe that the capsule networks converge much faster than the other methods, with a more stable training process and less fluctuation. The likely reason is that the capsule network discards pooling and retains all the feature information.

Fig. 3. The training loss of all models on the two datasets.

Fig. 4. Training results under different k values.

k-max Pooling. To explore whether discarding many features through pooling affects the prediction results, we test the k-max pooling method in ACNet. We set 12 different k values, where \(k=0\) means no pooling is used. The results are shown in Fig. 4. They show that the k-max pooling method does not decrease the precision of entity relationship prediction. Conversely, using a smaller k can greatly reduce training time and increase training efficiency. This also illustrates that it is effective to retain the main attribute capsules and filter out many unimportant capsules.

4 Conclusion

We present an attribute-driven capsule network model for relation prediction between multi-attribute entities. To capture relational information from common attributes and improve relation prediction, we develop a self-attention routing method based on capsule networks to generate relation capsules and link them with class capsules. The experimental results demonstrate the effectiveness of our model on the RP task. Our future work includes: (i) enriching the representation of the capsule network with more attributes, (ii) considering entity relation prediction with different attributes, and (iii) applying methods from the recommendation domain to entity relation prediction.