
1 Introduction

It has been shown that knowledge bases (KBs) such as DBpedia [1], Freebase [2], and WordNet [3] are effective sources of knowledge for applications such as hypernym detection [4], machine translation [5], and question answering [6,7,8]. Among these applications, knowledge-based question answering (KBQA), which answers natural language questions by querying a KB, is the most direct way to access KB knowledge. Knowledge graph beliefs are commonly represented as discrete relational triples, each expressing one relation between two entities, such as LocatedIn(NewOrleans, Louisiana); the two entities are nodes and their relation is the edge between them in the knowledge graph. Answering a question can thus be cast as a traversal that starts from the question entity and searches for the path that reaches the answer entity. KBQA can therefore be divided into two major tasks: entity linking and relation extraction. The former identifies the topic entity in the question and locates it in the KB; the latter discovers the relations on the path connecting the topic entity to the answer entity. These two tasks are illustrated in Fig. 1. Given the question “What is the name of Justin Bieber brother?”, entity linking identifies “Justin Bieber” as the topic entity and begins the search from it; relation extraction identifies “sibling_s” and “sibling” as the relation path between the topic entity and the answer entity. In this paper, we focus on relation extraction.
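To make the traversal view concrete, the following minimal Python sketch (ours, not the authors' code; the toy triples are illustrative) stores a KB as relational triples and enumerates all relation paths of length at most two from a topic entity:

```python
from collections import defaultdict

# Toy KB as (subject, relation, object) triples -- illustrative only.
triples = [
    ("NewOrleans", "LocatedIn", "Louisiana"),
    ("JustinBieber", "sibling_s", "SiblingNode1"),
    ("SiblingNode1", "sibling", "JaxonBieber"),
]

# Index outgoing edges: entity -> [(relation, neighbor), ...]
edges = defaultdict(list)
for subj, rel, obj in triples:
    edges[subj].append((rel, obj))

def relation_paths(topic, max_hops=2):
    """Yield (relation path, reached entity) pairs within max_hops."""
    frontier = [((), topic)]
    for _ in range(max_hops):
        next_frontier = []
        for path, node in frontier:
            for rel, neighbor in edges[node]:
                yield path + (rel,), neighbor
                next_frontier.append((path + (rel,), neighbor))
        frontier = next_frontier

for path, entity in relation_paths("JustinBieber"):
    print(path, "->", entity)
```

For the Justin Bieber question, such a traversal yields the path ("sibling_s", "sibling") ending at the answer entity; relation extraction amounts to choosing this path among all candidates.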

Fig. 1. Two KBQA tasks

Most previous work [6, 9] treats the relation extraction task of KBQA as a ranking problem. In contrast, we attempt to solve this task in a more straightforward fashion by regarding it as a multi-class classification task. The challenge then lies in optimizing a model over a relatively large search space. Fortunately, some candidate relations can be filtered out in advance by traversing the KB to remove unlikely relations and thus reduce the search space; we propose a masking mechanism to prevent our model from selecting these impossible relations. We develop and discuss three models: (1) a convolutional neural network-based model (CNN), (2) a convolutional neural network-based model with a masking layer (CNN + masking), and (3) a hybrid convolutional neural network/recurrent neural network model (CNN + RNN). We evaluate the performance of these models on the WebQuestions Semantic Parses Dataset (WebQSP) [10].

In addition to providing effective models, we hope to offer a strong baseline for companies and government agencies who seek to build their own knowledge-based question-answering systems efficiently. Therefore, we describe the construction of the KB for the catering industry for which we apply the proposed models to demonstrate adaptation from the general domain to a specific domain.

The major contributions of this paper are: (1) We propose an effective relation extraction model to extract relations in both general KBs and domain-specific KBs. (2) We demonstrate strong performance on the catering KB and describe details when adapting this general model to a specific domain.

2 Related Work

Determining the relation between two entities is critical for natural language understanding. With the text evidence and the given two mentioned entities, the aim of text-based relation extraction is to predict the relation indicated by the text evidence. Previous methods include labor-intensive feature engineering with SVM [11] and clustering words before relation classification [12] for more sophisticated management of semantics. With the advance of deep learning techniques, relation extraction models have evolved from machine learning models based on word embeddings [13, 14] to deep neural networks such as CNNs or LSTMs [15,16,17] and even more complicated models [18, 19]. One of the assumptions in the text-based relation extraction task is that a fixed set of candidate relations is given, with a relatively small size. However, in the KBQA relation extraction task, thousands of relations are included in the KB and all are to be considered for each question. As a result, approaches for text-based relation extraction cannot be directly applied to KBQA relation extraction. Instead, KBQA relation extraction usually takes the question and a candidate relation as the input for its decision process in a pairwise fashion. That is, each time, a question and a candidate relation are given to compute a score, which eliminates the need to consider all relations at the same time.

Traditional relation extraction in KBQA started with the naive Bayes method considering rich linguistic features [7] and learning-to-rank mechanism with hierarchical relations [20]. These were followed by neural network models [6, 21, 22] with the attention mechanism [18, 23] and the residual network [9] in recent years. Though these models are able to handle all relations in a KB sequentially, they must repeat the same process for each candidate relation many times for the final decision. Moreover, when processing one relation, other relations are neglected. This is clearly a drawback for a selection process. To address this problem, we propose a model that combines the advantages of both text-based and KBQA relation extraction, and that considers the large KB relation set using only one forward pass for each question.

3 Method

We first implemented a CNN-based multi-class classifier whose output layer gives the probability of each relation seen in the training data. After analyzing the errors of this first model, we observed that the major error types were domain errors and semantic errors.

Each relation in Freebase is composed of the following three fragments (tokens): Domain.Type.Property. A domain error indicates a wrong domain field in the predicted relation. For example, the predicted relation is “film.performance.actor” for the question “Who plays ken barlow in coronation street?” when the correct relation is “tv.regular_tv_appearance.actor”: the domain field “tv” is misidentified as “film”.

From our observations, domain errors arise from considering all of the relations in the KB, a relatively large relation set, regardless of the topic entity. In the above example, “coronation street” is obviously a TV show and will not connect to a relation whose domain is “film”. Therefore, we attempt to reduce domain errors by discarding highly unlikely relations in advance: we propose a masking mechanism to filter out relations that are not connected to the topic entity within two hops, i.e., relations not on any relation path of length at most two.

Semantic errors, in turn, indicate semantic mismatches between the question and the relation. For example, for the question “Where did they find Jenni Rivera’s body?” our model predicts “people.place_lived.location” while the correct answer is “people.deceased_person.place_of_death”. In this example, the model did not learn that “find the body” is related more to “death” than to “live”. This may be due to the design in which the model considers each relation independently as one class but ignores the semantic meaning of its name. To solve this problem, we propose the third model, the hybrid CNN/RNN model (CNN + RNN).

For this hybrid model, we collect the relation paths connected to the topic entity whose lengths are at most two (within two hops) and treat these relation paths as the candidate answers. We then perform a binary classification on all the candidate paths.

3.1 CNN

The first proposed model is a CNN-based multi-class classifier. To fully utilize the deep neural network and to achieve better language understanding, we use several different features as the input of this model. In addition, we use two channels to offer features from the raw question and its dependency-parsed question. The architecture of the proposed CNN model is illustrated in Fig. 2.

Fig. 2. Architecture of proposed CNN model

With the first channel we consider the following types of features:

  • Lexical:

    We use public pretrained GloVe word embeddings to turn words into fixed-length vectors. Given the \(D \times d\) word embedding matrix W, the i-th row indicates the embedding of the i-th word, yielding a d-dimensional vector \(x_{glove}\).

  • WordNet:

    To gain more information on the semantic type, we use WordNet to generate another set of word embeddings as features. Each word, together with its hypernyms, forms a sequence in their hierarchical order: this is termed the hypernym path. Ten hypernym paths are generated by a random walk for each word in WordNet (see the sketch after this feature list). All the generated paths are then treated as sentences for GloVe to train the embedding for each word in WordNet, yielding a d-dimensional vector \(x_{wordnet}\).

  • POS:

    We randomly initialize an embedding matrix for the POS tag vocabulary. The weights in the matrix are updated during the training process. For each POS the embedding matrix yields a d-dimensional vector \(x_{pos}\).

  • Distance:

    For each word, we compute two distances: its distance to the question word (e.g., Who, What, How) and its distance to the topic entity. For example, in the question “What is the name of Justin Bieber’s brother?”, the question entity distance, i.e., the distance from the word “name” to “What”, is 3, and the topic entity distance, i.e., that from “name” to “Justin Bieber”, is 2. After the two distances are computed, a randomly initialized embedding matrix turns them into the d-dimensional vectors \(x_{ques}\) and \(x_{topic}\), respectively. The weights of this embedding matrix are also updated during training.
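As noted in the WordNet feature above, here is a hedged sketch of the hypernym-path generation; it assumes NLTK's WordNet interface (the paper does not name a toolkit), and the resulting "sentences" would then be handed to GloVe training:

```python
import random
from nltk.corpus import wordnet as wn  # requires the NLTK wordnet data

def hypernym_walk(synset, max_len=10):
    """One random walk up the hypernym hierarchy, returned as a token list."""
    path = [synset.lemma_names()[0]]
    current = synset
    for _ in range(max_len):
        hypernyms = current.hypernyms()
        if not hypernyms:
            break  # reached a root such as 'entity'
        current = random.choice(hypernyms)  # random step toward the root
        path.append(current.lemma_names()[0])
    return path

# Ten walks per synset; the resulting "sentences" form a corpus for GloVe.
corpus = []
for synset in wn.all_synsets():
    for _ in range(10):
        corpus.append(" ".join(hypernym_walk(synset)))
```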

Once these features are extracted, we concatenate and feed them into the first channel of the proposed CNN model. These concatenations are shown as follows:

$$\begin{aligned} X&= X_{glove} \oplus X_{wordnet} \oplus X_{pos} \oplus X_{ques} \oplus X_{topic} \\ X_{glove}&= x^0_{glove} \oplus x^1_{glove} \oplus x^2_{glove} \oplus \dots \oplus x^{n-1}_{glove} \oplus x^n_{glove} \\ X_{wordnet}&= x^0_{wordnet} \oplus x^1_{wordnet} \oplus \dots \oplus x^{n-1}_{wordnet} \oplus x^n_{wordnet} \\ X_{pos}&= x^0_{pos} \oplus x^1_{pos} \oplus x^2_{pos} \oplus \dots \oplus x^{n-1}_{pos} \oplus x^n_{pos} \\ X_{ques}&= x^0_{ques} \oplus x^1_{ques} \oplus x^2_{ques} \oplus \dots \oplus x^{n-1}_{ques} \oplus x^n_{ques} \\ X_{topic}&= x^0_{topic} \oplus x^1_{topic} \oplus x^2_{topic} \oplus \dots \oplus x^{n-1}_{topic} \oplus x^n_{topic} \end{aligned}$$

where \(\oplus \) is the concatenation operator, X is the input of channel one, \(x^i_k\) denotes the i-th vector of feature k, and n is the number of words in the question.
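The distance features above can be computed as simple token offsets. The sketch below is an assumption about the exact scheme (the paper does not specify how distances are measured inside a multi-word topic entity); the tokenization is illustrative:

```python
# Hypothetical question-word list; the paper only gives examples (Who, What, How).
QUESTION_WORDS = {"what", "who", "where", "when", "how", "which"}

def distance_features(tokens, topic_span):
    """Return (question-word distance, topic-entity distance) per token."""
    q_idx = next(i for i, t in enumerate(tokens) if t.lower() in QUESTION_WORDS)
    start, end = topic_span  # inclusive token indices of the topic entity
    feats = []
    for i, _ in enumerate(tokens):
        d_topic = 0 if start <= i <= end else min(abs(i - start), abs(i - end))
        feats.append((abs(i - q_idx), d_topic))
    return feats

tokens = "What is the name of Justin Bieber 's brother ?".split()
print(distance_features(tokens, (5, 6)))  # "name" -> (3, 2), as in the text
```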

For the second channel, we use the Stanford CoreNLP [24] dependency parser to generate the question’s dependency parse tree. From the parse tree, we extract the shortest path from the topic entity to the question word, and then use words on the shortest path to generate the following types of features:

  • Lexical:

    The pretrained word embedding matrix for channel one is used here as well. For each word in the shortest path, from the matrix we extract a d-dimensional vector \(s_{glove}\).

  • POS:

    For POS tags, we use the embedding matrix from channel one to transform the POS vocabulary. Each POS tag in the shortest path yields a d-dimensional vector \(s_{pos}\).

  • Distance:

    As in channel one, we extract the question entity distance and the topic entity distance: for each word on the shortest path, both distances are computed with respect to its position in the original question. Again, the distance embedding matrix from channel one is used to turn the distances into d-dimensional vectors \(s_{ques}\) and \(s_{topic}\).

  • Dependency tag:

    Words in the dependency parsed tree are connected by dependency tags, which indicate their mutual relationship. We randomly initialize a dependency tag embedding matrix for later training, and turn each dependency tag appearing in the shortest path into a d-dimensional vector \(s_{dep}\).

  • Reversed dependency tag:

    Finally, we reverse the dependency tag feature above, indicating a traversal of the dependency parse tree from the topic entity to the question entity. The same dependency embedding matrix is used to generate d-dimensional vectors \(s_{rdep}\).

Given these features, we concatenate them and feed them into channel two of our models.

$$\begin{aligned} S&= S_{glove} \oplus S_{pos} \oplus S_{ques} \oplus S_{topic} \oplus S_{dep} \oplus S_{rdep} \\ S_{glove}&= s^0_{glove} \oplus s^1_{glove} \oplus s^2_{glove} \oplus \dots \oplus s^{m-1}_{glove} \oplus s^m_{glove} \\ S_{pos}&= s^0_{pos} \oplus s^1_{pos} \oplus s^2_{pos} \oplus \dots \oplus s^{m-1}_{pos} \oplus s^m_{pos} \\ S_{ques}&= s^0_{ques} \oplus s^1_{ques} \oplus s^2_{ques} \oplus \dots \oplus s^{m-1}_{ques} \oplus s^m_{ques} \\ S_{topic}&= s^0_{topic} \oplus s^1_{topic} \oplus s^2_{topic} \oplus \dots \oplus s^{m-1}_{topic} \oplus s^m_{topic} \\ S_{dep}&= s^0_{dep} \oplus s^1_{dep} \oplus s^2_{dep} \oplus \dots \oplus s^{m-1}_{dep} \\ S_{rdep}&= s^0_{rdep} \oplus s^1_{rdep} \oplus s^2_{rdep} \oplus \dots \oplus s^{m-1}_{rdep} \end{aligned}$$

where S is the input of channel two of our model, \(s^i_k\) denotes the i-th vector of feature k, and m is the number of words in the dependency shortest path.
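As a hedged sketch of the channel-two preprocessing, given dependency edges from a parser such as Stanford CoreNLP, the shortest path between the topic entity and the question word can be recovered with networkx (our choice of graph library, not the paper's); the example edges are illustrative:

```python
import networkx as nx

def shortest_dep_path(dep_edges, q_idx, topic_idx):
    """Return (word indices, dep tags, reversed dep tags) on the shortest path."""
    g = nx.Graph()
    for head, dep, tag in dep_edges:
        g.add_edge(head, dep, tag=tag)
    nodes = nx.shortest_path(g, source=topic_idx, target=q_idx)
    tags = [g.edges[a, b]["tag"] for a, b in zip(nodes, nodes[1:])]
    return nodes, tags, list(reversed(tags))

# Illustrative edges for "What is the name of Justin Bieber's brother?"
# as (head index, dependent index, dependency tag).
edges = [(3, 0, "nsubj"), (3, 1, "cop"), (3, 2, "det"),
         (3, 8, "nmod"), (8, 6, "nmod:poss"), (6, 5, "compound")]
print(shortest_dep_path(edges, q_idx=0, topic_idx=5))
```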

After providing our model with the feature vectors, for each channel, we use three filters of size 1, 2, and 3 to capture features using different window sizes. Assume a sequence of feature vectors fed into channel k is represented as

$$\begin{aligned} V^k = v^k_1 \oplus v^k_2 \oplus \dots \oplus v^k_{n-1} \oplus v^k_{n}, \end{aligned}$$

and \(v^k_{i:i+j}\) refers to the concatenation of \(v^k_i, v^k_{i+1}, \ldots ,v^k_{i+j}\). The filter \(w \in R^{hd}\) in our model is applied to each window of h words to produce a new feature, where h is either 1, 2, or 3. For instance, feature \(c_i\) is generated from a window of words \(v_{i:i+h-1}\):

$$\begin{aligned} c_i = f(\mathbf w \cdot v_{i:i+h-1}), \end{aligned}$$
(1)

where f is a linear activation function. This filter is applied to each possible window in the sentence {\(v_{1:h}, v_{2:h+1},\ldots , v_{n-h+1:n}\)} to produce a feature map \(\mathbf c \):

$$\begin{aligned} \mathbf c = [c_1, c_2,\dots ,c_{n-h+1}]. \end{aligned}$$
(2)

Two max-pooling layers are then applied to these feature maps. The first max-pooling layer has a pool size of 3, and the second reduces each feature map to a single value. After this convolution process, six feature maps are generated from the two channels and three filter sizes. We concatenate these feature maps and pass them through a dense layer and then a softmax layer to calculate the class probabilities. The model is optimized by minimizing the categorical cross-entropy loss.
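The following PyTorch sketch (a simplification, not the authors' code) captures the overall architecture: per channel, parallel convolutions of widths 1, 2, and 3, pooling of each feature map down to a single value, concatenation of the six resulting vectors, and a dense output layer trained with cross-entropy. For brevity, the two-stage pooling is collapsed into a single global max pooling:

```python
import torch
import torch.nn as nn

class ChannelCNN(nn.Module):
    """One input channel: parallel convolutions of widths 1, 2, and 3."""
    def __init__(self, in_dim, num_filters):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(in_dim, num_filters, kernel_size=h, padding=h - 1)
             for h in (1, 2, 3)])

    def forward(self, x):                        # x: (batch, in_dim, seq_len)
        maps = [conv(x) for conv in self.convs]  # one feature map per width
        # global max pooling: keep the maximum of each feature map
        return torch.cat([m.max(dim=2).values for m in maps], dim=1)

class RelationClassifier(nn.Module):
    """Two channels -> concatenated feature maps -> dense -> class logits."""
    def __init__(self, dim1, dim2, num_filters, num_relations):
        super().__init__()
        self.ch1 = ChannelCNN(dim1, num_filters)
        self.ch2 = ChannelCNN(dim2, num_filters)
        self.out = nn.Linear(6 * num_filters, num_relations)

    def forward(self, x1, x2):
        feats = torch.cat([self.ch1(x1), self.ch2(x2)], dim=1)
        return self.out(feats)  # train with nn.CrossEntropyLoss
```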

3.2 CNN + Masking

Further observations indicate that some candidate relations can be filtered out in advance: in the WebQSP dataset, there are 5,210 different relations in the training data, while on average only 141.6 relations are connected to a given topic entity. This substantial difference indicates that the proposed CNN model wastes effort scoring relations that are not even connected to the topic entity.

To take into account only connected relations, we propose a masking mechanism. The masking layer is added before the output layer in the proposed CNN model to drop disconnected relations from the final prediction. Given a question and its topic entity, we retrieve these possible relations from the KB. In this work, we enumerate the possible relations \(R^{'}\) by traversing the KB from the topic entity and recording relations within two hops. With the total relation set R, we define the masking layer M as

$$\begin{aligned} M = (m_r)_{r \in R}, \quad m_r = {\left\{ \begin{array}{ll} 1 &{} \text {if } r \in R^{'} \\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(3)

As illustrated in Fig. 3, the output vector of the probabilities of relation classes is then multiplied element-wise with the masking layer to yield only the probabilities of the connected relations.
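A minimal sketch of the masking step, assuming integer relation ids and a precomputed two-hop relation set R' (both hypothetical here):

```python
import torch

def masked_probs(probs, connected_ids, num_relations):
    """Zero out the probabilities of relations not reachable within two hops."""
    mask = torch.zeros(num_relations)
    mask[list(connected_ids)] = 1.0  # Eq. (3): 1 if r in R', else 0
    return probs * mask              # element-wise product with the output

probs = torch.softmax(torch.randn(5210), dim=0)  # 5,210 WebQSP relations
connected = {12, 87, 341}                        # hypothetical R'
print(masked_probs(probs, connected, 5210).nonzero().squeeze())
```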

Fig. 3. Proposed CNN + masking model

3.3 CNN + RNN

Since relations are organized hierarchically as domain, type, and property, we propose a third model, which not only encodes information from the question but also considers information from candidate relations using a recurrent neural network (RNN). In this model, we treat the relation extraction task as a ranking problem. As illustrated in Fig. 4, the CNN described above encodes the question. For candidate relations, we first segment them into tokens according to their hierarchy, and then apply a gated recurrent unit (GRU) [25] to encode these tokens sequentially.

Fig. 4. Proposed CNN + RNN model

Consider \(t_i\) as the i-th relation token: we start by passing it through a randomly initialized embedding matrix to generate its embedding representation \(y_i\). Then the whole y sequence is processed using the GRU layer, which is formulated as

$$\begin{aligned} h_t&= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \end{aligned}$$
(4)
$$\begin{aligned} z_t&= \sigma (W_z y_t + U_z h_{t-1}) \end{aligned}$$
(5)
$$\begin{aligned} \tilde{h}_t&= \tanh (W y_t + U (r_t \odot h_{t-1})) \end{aligned}$$
(6)
$$\begin{aligned} r_t&= \sigma (W_r y_t + U_r h_{t-1}), \end{aligned}$$
(7)

where the W and U matrices are trainable parameters, \(\sigma \) is the sigmoid activation function, \(\odot \) denotes element-wise multiplication, and \(y_t\) is the embedding of the t-th relation token.
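A hedged PyTorch sketch of this relation encoder: tokens are embedded with a randomly initialized matrix and encoded by a GRU whose final hidden state represents the relation. The dimensions and token ids are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RelationEncoder(nn.Module):
    """Embed relation tokens (domain, type, property) and encode with a GRU."""
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # randomly initialized
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):         # token_ids: (batch, num_tokens)
        y = self.embed(token_ids)         # (batch, num_tokens, emb_dim)
        _, h_n = self.gru(y)              # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)             # final hidden state per relation

# e.g. "tv.regular_tv_appearance.actor" segmented into three tokens
encoder = RelationEncoder(vocab_size=10000, emb_dim=100, hidden_dim=128)
relation = torch.tensor([[17, 42, 301]])  # hypothetical token ids
print(encoder(relation).shape)            # torch.Size([1, 128])
```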

The hidden vectors of the question and the candidate relation are then concatenated and passed to a ReLU-activated [26] feed-forward neural network (FFNN), which outputs a single scalar score indicating the fitness of the question and the candidate relation. As in previous work that treats relation extraction as a ranking task, hinge loss is applied to optimize the model, formulated as

$$ loss = max(0, -(score^+ - score^-) + margin), $$

where margin is a hyperparameter and \(score^+\) and \(score^-\) stand for the output scores of the correct and incorrect relations, respectively.
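In PyTorch, this loss can be sketched as follows (the margin value is an assumption, as the paper leaves it unspecified):

```python
import torch

def hinge_loss(score_pos, score_neg, margin=0.5):
    """max(0, -(score+ - score-) + margin), averaged over a batch."""
    return torch.clamp(margin - (score_pos - score_neg), min=0).mean()

# FFNN scores for a correct and an incorrect candidate relation
print(hinge_loss(torch.tensor([0.9]), torch.tensor([0.7])))  # tensor(0.3000)
```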

As this model considers relations semantically, it has the potential to match relations to questions more accurately. Moreover, with hinge loss, we can train the model not only with positive samples but also with negative ones.

4 Experiment

4.1 The WebQSP Dataset

For experiments, we use WebQSP [10], a public QA dataset that annotates WebQuestions [27], another public QA dataset containing question entity and answer entity pairs grounded in the Freebase KB. Questions in WebQuestions were generated based on suggestions from the Google Search Suggestions API; in WebQSP, they are labeled with semantic parses by experts familiar with Freebase. In addition to each question and its linked entity with the corresponding Freebase MID, WebQSP provides an annotated inferential chain, which we refer to as the relations (relation path) in this paper. WebQSP specifies 3,098 questions for training and 1,639 questions for testing; from the 3,098 training questions we further set aside 305 for validation.

4.2 Results and Discussion

We evaluated the three proposed models in terms of accuracy. As shown in Table 1, the proposed models achieve performance comparable to previous work while using a smaller search space. The state-of-the-art HR-BiLSTM [9] achieves 83% accuracy by incorporating both relation tokens and relation words.

Table 1. Accuracy of proposed models

Comparing the three proposed models, it is surprising that the CNN + masking model outperforms the others. The architecture of the CNN + RNN model is similar to that of related work; moreover, the CNN + RNN model, the BiCNN model, and the BiLSTM + relation_names model all use the same features, so we expected comparable performance. Instead, the CNN + RNN model performed worst in our experiments, far behind both the BiCNN and the BiLSTM + relation_names models. This suggests that the recurrent component of the CNN + RNN model cannot extract information on relations effectively.

However, we found that masking benefits the proposed model. To the best of our knowledge, this is the first attempt to treat the relation extraction task as a classification problem. The main challenge in making it a classification problem is the large search space for selecting possible relations. Results show that simply adding a masking layer can solve this problem efficiently.

4.3 Error Analysis

We further analyzed the errors found in the results from the CNN + masking model. The first major type of error observed was ambiguous relations. For example, consider the question “What are the major languages spoken in Greece?”: whereas the correct relation for this question is “location.country.official_language”, our model predicts the “location.country.languages_spoken” relation. About 16% of the errors are of this type.

The second major error type arises when the model understands the question not as a whole but in pieces. For example, the question “What state is Mount St. Helens in?” is classified as the “geography.mountain.mountain_type” relation instead of the gold relation “location.location.contained_by”: the model understands the concept “mount” but wrongly decides that the question asks for the type of mountain instead of its location. Another example is the question “What town did Justin Bieber grow up in?” being classified as the relation “people.person.place_of_birth”, while the correct relation is “people.person.places_lived people.place_lived.location”. To avoid this kind of error, a more sophisticated language encoding model may help, such as a deeper CNN [28, 29], an attention mechanism [30, 31], or a residual network [32].

Table 2. Relations of CaterKB
Fig. 5. CaterKB sample

5 KBQA on a Catering Knowledge Base

5.1 Constructing CaterKB from iPeen

CaterKB is a Freebase-style knowledge base generated from iPeen, the biggest Taiwanese restaurant ranking website. Each restaurant has its own webpage on iPeen, from which most of the information can be collected. Each relation is represented as an RDF triple in the form <Subject> <Relation> <Object>. A total of 11 relations are defined in CaterKB, as shown in Table 2. We collected a total of 2,371,397 triples from 147,868 restaurants. Figure 5 shows a sample partial CaterKB knowledge graph.

5.2 Generating Questions for Catering

Because people usually search the Internet with keywords rather than complete questions, especially for catering information, we could not collect enough questions via the Google Search Suggestions API. We therefore recruited 15 native Chinese speakers to generate 200 questions about restaurants and foods. As it is challenging to generate high-quality questions for all types of relations, six restaurant relations were selected for the experiments: restaurant type, recommended dish, customer comment, opening time, price, and location.

5.3 Experimental Settings

We applied the CNN + masking model to the 200 generated questions. For data preparation, we performed ten-fold cross-validation to ensure low variance across settings; the reported result is the average testing accuracy over the ten folds. The learning rate was set to 0.00005, the batch size to 4, and the hidden layer size to 128. We adopted the Stanford CoreNLP parser [24] for word segmentation, POS tagging, and dependency parsing. To utilize the parser, we used the OpenCC toolkit to translate sentences from traditional Chinese to simplified Chinese; after parsing, we translated the parsed results back from simplified to traditional Chinese. For pretrained word embeddings, we trained Skip-Gram [33] on the Chinese Gigaword Second Edition (CG2) [34].
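The conversion round-trip can be sketched as follows, assuming the OpenCC Python bindings and their standard 't2s'/'s2t' configurations (the paper does not specify the exact invocation; the example question is illustrative):

```python
from opencc import OpenCC  # assumed Python bindings for the OpenCC toolkit

t2s = OpenCC("t2s")  # traditional -> simplified, before CoreNLP parsing
s2t = OpenCC("s2t")  # simplified -> traditional, after parsing

question = "鼎泰豐開在哪裡"           # illustrative traditional-Chinese question
simplified = t2s.convert(question)   # fed to the Stanford CoreNLP parser
# ... word segmentation / POS tagging / dependency parsing happen here ...
restored = s2t.convert(simplified)   # map the parsed result back
print(simplified, restored)
```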

5.4 Results and Discussion

Results are shown in Table 3. The accuracy is unchanged across three different dropout rates and reaches only 75%. Investigating the results and errors, we find that the “restaurant type” relation is not easily identified by the model; questions of this type introduce much noise. These questions include terms as varied as “brunch”, “Japanese”, “secret place”, “historic”, and “Taipei”, which are too diverse to learn. Hence, in terms of KB construction principles, “restaurant type” may not be a good relation.

To evaluate the impact of the error-prone “restaurant type” relation, we excluded questions of this type, keeping the 167 questions with the other five relation types, and conducted the experiment again with 147 questions for training, 10 for validation, and 10 for testing. The performance improved considerably, with a best accuracy of 89%, from which we conclude that (1) carefully designing the relation types in a domain-specific knowledge base may considerably improve performance, and (2) the 89% accuracy indicates that the proposed CNN + masking model can be used in real-world domain-specific KBQA applications.

Overall, the best performance, an accuracy of 89%, is observed when the dropout rate is set to a relatively high value of 0.4.

Table 3. CNN + masking accuracy on CaterKB questions

5.5 Error Analysis

From the errors of the CNN + masking model with a 0.4 dropout rate, we noticed that questions for the “location” relation containing the polysemous Chinese word 開 (open/locate/drive), such as “Where is DingTaiFeng located?”, tended to be classified as questions for the “opening time” relation. As we did not perform word sense disambiguation before relation extraction, the most commonly used sense, “open”, is always adopted by the model. Other questions using this word include “Is DingTaiFeng open every day?” and “Is DingTaiFeng open at nine o’clock?”; 18 of the 19 appearances of the word carry the sense “open”. This shows that word sense disambiguation is relatively important for domain-specific KBQA.

Questions for the “recommended dish” relation, such as “How to make the best order in DingTaiFeng?”, are also challenging for the proposed model. These questions involve the question word “how”, which is also a challenging question type in conventional question answering. The results show that the model confuses these questions with those for the “customer comment” relation, as the latter are commonly connected with the question word “how”. Usually, when one word appears in questions for different relations, its context can aid disambiguation. However, as how-type questions are expressed with a wide variety of words, the context of the question word “how” may not help in relation extraction.

Finally, we note that the vocabulary of a specific domain is small; hence some words recur across questions. For example, in catering questions, words such as “dish”, “price”, “open”, “comment”, and “where” appear frequently and are strong features for the relations “recommended dish”, “price”, “opening time”, “customer comment”, and “location”, respectively. We believe this is a worthy direction for improving domain-specific relation extraction.

6 Conclusion and Future Work

Relation extraction plays an important role in KBQA; the greatest challenge comes from the large search space of relations. In this paper, we propose three models to extract relations for KBQA, together with a masking mechanism to reduce the search space. Results show that our approach is comparable to the state of the art in the general domain and yields strong performance on a domain-specific KB.

In the future, we will investigate automatic question collection for relations in domain-specific KBs as well as useful features for domain adaptation. We believe the proposed model can serve as a simple but strong tool for real-world applications.