
1 Introduction

Relation classification is the task of recognizing the semantic relation between a pair of nominals. It is a crucial component in natural language processing and can be defined as follows: given a sentence S with an annotated pair of nominals e1 and e2, we aim to identify the relation between e1 and e2. For example: “The [singer]e1, who performed three of the nominated songs, also caused a [commotion]e2 on the red carpet.” Our goal is to determine the relation between the marked entities singer and commotion, which in this example is the Cause-Effect(e1, e2) relation.

Traditional relation classifiers generally focus on feature representations or kernel-based approaches that rely on full-fledged NLP tools, such as POS tagging, dependency parsing and semantic analysis [13, 14]. Although these approaches are able to exploit the symbolic structures in sentences, they suffer from the weakness of using handcrafted features. In recent years, deep learning models, which extract features automatically, have achieved substantial improvements on this task. Commonly used models include convolutional neural networks (CNNs), recurrent neural networks (RNNs) and other hybrid networks [7, 8]. More recently, some researchers have combined feature representations with neural network models to utilize additional characteristics, such as the shortest dependency path [2].

Although deep neural network architectures have achieved state-of-the-art performance, training an optimized model relies on a large amount of labeled data; otherwise the model is prone to overfitting. Due to the high cost of manual annotation, labeled data for many specific tasks is scarce and may not fully sustain the training of a deep supervised learning model. For relation classification in particular, the standard dataset contains only 10,717 annotated sentences. To prevent overfitting, strategies such as dropout [16] and adding random noise [17, 18] have been proposed, but their effectiveness is limited.

To address this problem, we adopt an adversarial training framework for classifying the relations between nominals. We generate adversarial examples [11, 12] for labeled data by making small perturbations on the word embeddings of the input that significantly increase the loss incurred by our model. We then regularize our classifier with the adversarial training technique, i.e., training the model to correctly classify both unmodified and perturbed examples. This strategy not only improves robustness to adversarial examples, but also promotes generalization on the original examples. In this work, we construct a bidirectional LSTM model as the relation classifier. Beyond the basic model, we use a word-level attention mechanism [6] on the input sentence to capture its most important semantic information. The framework is end-to-end and requires no extra knowledge or external NLP systems.

In our experiments, we run our model and ten typical comparative methods on the SemEval-2010 Task 8 dataset [13]. Our model achieves an F1-score of 88.7% and outperforms the other methods in the literature, which demonstrates the effectiveness of adversarial training.

2 Related Work

Traditional methods for relation classification are mainly based on feature representations or kernel-based approaches that rely on mature NLP tools, such as POS tagging, dependency parsing and semantic analysis. [21] propose a shortest path dependency kernel for relation classification, the main idea of which is that the relation strongly depends on the dependency path between the two given entities. Beyond structural information, [20] introduce semantic information into kernel methods. In these approaches, the use of features extracted by NLP tools results in cascaded errors. Moreover, handcrafted features generalize poorly to other tasks.

To extract features automatically, recent research has focused on deep learning models for this task and achieved substantial improvements. [9] proposed a convolutional neural network (CNN) that uses word embeddings and position features as input. [5, 7] observed that recurrent neural networks (RNNs) with long short-term memory (LSTM) units further improve performance on this problem. Recently, [6] proposed a CNN with two levels of attention in order to better discern patterns in heterogeneous contexts, which achieved the best results to date. Moreover, some researchers combined feature representations with neural networks to utilize more linguistic information; typical examples are the shortest dependency path-based CNN [2] and the SDP-LSTM model [5]. Existing studies reveal that deep and rich neural network architectures are more capable of information integration and abstraction, but the available annotated data may not be sufficient for further performance gains.

Adversarial training was originally introduced for image classification [12]. It was later adapted to text classification and extended to semi-supervised tasks by [10]. Prior work demonstrated that input representations learned with adversarial training improve in quality, which alleviates overfitting to some extent. With a similar intuition, [18] added random noise to the input and hidden layers during training; however, the effectiveness of randomly added noise is limited. As another strategy for preventing overfitting, dropout [16] is a regularization method widely used in many tasks. We conduct a dedicated experiment to compare adversarial training with these methods.

3 Our Model

Given a sentence s with a pair of entities e1 and e2 annotated, the task of relation classification is to identify the semantic relation between e1 and e2 from a set of predefined relation types (all types are listed in Sect. 4). Figure 1 shows the overall architecture of our adversarial neural relation classification (ANRC) model.

Fig. 1. Overall architecture for adversarial neural relation classification

The input of the architecture is encoded with vector representations including word embeddings and position embeddings. Word-level attention is then used to capture the relevance of words with respect to the target entities. To enhance the robustness of the model, adversarial examples are generated from the input embeddings. A bidirectional recurrent neural network then captures information at different levels of abstraction, and the last layer is a softmax classifier that produces the classification results.

3.1 Input Representation with Word-Level Attention

Given a sentence s, each word \( w_i \) is converted into a real-valued vector \( r^{w_i} \). The position of \( w_i \) is mapped to a vector of dimension \( d_{wpe} \), called a WPE (word position embedding) as proposed by [9]. The word embedding and the word position embedding of each word \( w_i \) are concatenated to form the input, \( emb_x = \{[r^{w_1}, wpe^{w_1}], [r^{w_2}, wpe^{w_2}], \ldots, [r^{w_N}, wpe^{w_N}]\} \). Afterwards, a convolutional operation is applied to each window of \( k \) successive words in \( emb_x \); we define the vector \( z_n \) as the concatenation of the \( k \) embeddings centered on the n-th word:

$$ z_{n} = \left( r^{w_{n-(k-1)/2}}, \ldots, r^{w_{n+(k-1)/2}} \right)^{T} $$
(1)
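As a concrete illustration, the sliding-window concatenation of Eq. (1) can be sketched in a few lines of NumPy; the zero-padding behaviour and the dimensions below are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def window_concat(emb, k=3):
    """Concatenate the k embeddings centered on each word (Eq. 1).

    emb: array of shape (N, d) holding one embedding per word.
    Returns an array of shape (N, k * d).
    """
    N, d = emb.shape
    pad = (k - 1) // 2
    # Zero-pad both ends so border words also get a full window (assumption).
    padded = np.pad(emb, ((pad, pad), (0, 0)), mode="constant")
    return np.stack([padded[n:n + k].reshape(-1) for n in range(N)])

# Hypothetical example: 5 words, 200-dim word + 20-dim position embeddings.
emb_x = np.random.randn(5, 220).astype(np.float32)
z = window_concat(emb_x, k=3)   # shape (5, 660)
```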

Word-Level Attention.

The attention mechanism allows the neural network to look back at the key parts of the source text when predicting the next token of a sequence, and attentive neural networks have been applied successfully to sequence-to-sequence learning tasks. To fully capture the relevance of specific words to the target nominals, we design a model that automatically learns this relevance for relation classification, following [6].

Contextual Relevance Matrices.

Considering the example in Fig. 2, we can easily observe that the non-entity word “cause” is of great significance for determining the relation of the entity pair. To characterize the contextual correlations between entity mention \( e_j \) and non-entity word \( w_i \), we leverage two diagonal attention matrices \( A^j \) with values \( A^j_{i,i} = f(e_j, w_i) \), computed as the inner product between the embeddings of entity \( e_j \) and word \( w_i \). Based on these diagonal attention matrices, the relevance of the i-th word with respect to the j-th entity (\( j \in \{1, 2\} \)) is calculated as Eq. (2):

$$ \alpha^{j}_{i} = \frac{\exp\left(A^{j}_{i,i}\right)}{\sum_{i'=1}^{n} \exp\left(A^{j}_{i',i'}\right)} $$
(2)
Fig. 2. Word-level attention on the input

Input Attention Composition.

Next, we combine the two relevance factors \( \alpha^1_i \) and \( \alpha^2_i \) with the composed word embedding \( z_i \) defined above via a simple average:

$$ r_{i} = z_{i} \cdot \frac{\alpha^{1}_{i} + \alpha^{2}_{i}}{2} $$
(3)

Finally, the output of the word-level attention mechanism is a matrix \( R = [r_1, r_2, \ldots, r_n] \), where n is the sentence length, which serves as the input to the neural network we construct.
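The following NumPy sketch shows one possible reading of Eqs. (2)-(3): attention scores are inner products between word and entity embeddings, normalized with a softmax and averaged over the two entities. The shapes, helper names and example data are assumptions for illustration only:

```python
import numpy as np

def word_level_attention(word_emb, e1_emb, e2_emb, z):
    """Weight each composed vector z_i by its relevance to the two entities.

    word_emb: (n, d) word embeddings; e1_emb, e2_emb: (d,) entity embeddings;
    z: (n, dz) windowed vectors from Eq. (1).  Returns R of shape (n, dz).
    """
    alphas = []
    for e in (e1_emb, e2_emb):
        scores = word_emb @ e                 # A^j_{i,i}: inner products with entity j
        scores = scores - scores.max()        # shift for stability (softmax-invariant)
        alphas.append(np.exp(scores) / np.exp(scores).sum())   # Eq. (2)
    weights = (alphas[0] + alphas[1]) / 2.0   # Eq. (3): simple average of both factors
    return z * weights[:, None]

# Hypothetical usage: 5 words with 220-dim embeddings, 660-dim windowed vectors.
word_emb = np.random.randn(5, 220)
R = word_level_attention(word_emb, word_emb[1], word_emb[4], np.random.randn(5, 660))
```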

3.2 Bi-LSTM Network for Classification

Bi-LSTM Network.

As the text classification model, we use an LSTM-based neural network that has been used in state-of-the-art works [1, 7], and the experimental results show its effectiveness for this problem. Beyond the basic model, we adopt a variant introduced by [15]. The LSTM unit consists of four components: an input gate, a forget gate, an output gate, and a memory cell.

Fig. 3. The model of Bi-LSTMs and perturbed embeddings

We employ a bidirectional recurrent neural network in this part to better capture textual information from both ends of the sentence, since a standard RNN is a biased model in which later inputs are more dominant than earlier ones.

Softmax Layer.

The softmax layer is a commonly used classifier, which can be regarded as a multiclass generalization of binary logistic regression (LR). We use it to predict the label y for a sentence from a discrete set of classes Y. We denote by s the input sentence and by \( \theta \) the parameters of the classifier. The output h of the Bi-LSTM is the input of the classifier (Eq. (4)). Summing the negative log probabilities over the labels yields the final loss function in Eq. (5).

$$ p(y \mid s; \theta) = \mathrm{softmax}(W_{y} h + b_{y}) $$
(4)
$$ L(s; \theta) = -\sum_{i=1}^{|Y|} \log p(y_{i} \mid s; \theta) $$
(5)
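A compact sketch of the Bi-LSTM classifier and softmax layer (Eqs. (4)-(5)) in Keras; the hidden size and the 10-way label space are our assumptions rather than settings reported in the paper:

```python
import tensorflow as tf

NUM_CLASSES = 10   # nine relations plus "Other"; direction handling omitted (assumption)
INPUT_DIM = 660    # dimension of the attention-weighted vectors r_i (illustrative)

# Bi-LSTM over the attention-weighted input matrix R, followed by a softmax layer (Eq. 4).
inputs = tf.keras.Input(shape=(None, INPUT_DIM))
h = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128))(inputs)  # hidden size assumed
h = tf.keras.layers.Dropout(0.5)(h)                                   # ratio from Sect. 4.3
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(h)

model = tf.keras.Model(inputs, outputs)
# The negative log-likelihood of the gold labels (Eq. 5) is the cross-entropy loss.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```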

3.3 Adversarial Training

Adversarial examples are generated by making small perturbations to the input that are designed to significantly increase the loss incurred by a machine learning model. Adversarial training is a way of regularizing supervised learning algorithms to improve robustness to small, approximately worst-case perturbations: the model is trained to correctly classify both unmodified examples and adversarial examples.

As shown in Fig. 3, we apply the adversarial perturbation to the word embeddings rather than directly to the raw input, similar to [10]. We denote the concatenation of the sequence of word embedding vectors [z(1), z(2), …, z(T)] as s′. We then define the adversarial perturbation \( e_{adv} \) on s′ as in Eq. (6), where e is a perturbation on the input and \( \hat{\theta} \) denotes a fixed copy of the current value of θ.

Fig. 4. Training progress of ANRC and ANRC without adversarial training across iterations

$$ e_{adv} = \arg\min_{\|e\| \le \epsilon} -L\left(s' + e; \hat{\theta}\right) $$
(6)

When applied to a classifier, adversarial training uses the cost in Eq. (7) in place of Eq. (5), where N denotes the number of labeled examples. Training is carried out to minimize the negative log-likelihood plus \( L_{adv} \) with stochastic gradient descent.

$$ L_{adv}(s'; \theta) = -\frac{1}{N}\sum_{n=1}^{N} \log p\left(y_{n} \mid s'_{n} + e_{adv,n}; \theta\right) $$
(7)

At each step of training, we identify the worst perturbation \( e_{adv} \) against the current model \( p(y \mid s'; \hat{\theta}) \), and train the model to be robust to such perturbations by minimizing Eq. (7) with respect to θ. However, Eq. (6) is computationally intractable for neural networks. Inspired by [11], we approximate this value by linearizing \( L(s'; \hat{\theta}) \) around s′, as in Eq. (8).

$$ e_{adv} = \frac{\epsilon g}{\|g\|}, \quad \text{where}\; g = \nabla_{s'} L\left(s'; \hat{\theta}\right) $$
(8)
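A hedged TensorFlow sketch of one adversarial training step built from Eqs. (6)-(8): the loss gradient with respect to the embedded input s′ is computed with the parameters held fixed, scaled to norm ε, and the model is then trained on both the clean and the perturbed embeddings. The classifier and optimizer objects are assumed to exist; this is an illustration, not the authors' exact implementation:

```python
import tensorflow as tf

EPSILON = 0.02  # perturbation norm, as set in Sect. 4.3

def adversarial_step(classifier, s_prime, labels, optimizer):
    """One training step on clean plus adversarial embeddings (Eqs. 6-8)."""
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

    # Gradient of the loss with respect to the embedded input s'.
    with tf.GradientTape() as tape:
        tape.watch(s_prime)
        clean_loss = loss_fn(labels, classifier(s_prime, training=True))
    # stop_gradient mirrors using a fixed copy of the parameters when building e_adv.
    g = tf.stop_gradient(tape.gradient(clean_loss, s_prime))

    # Linearized worst-case perturbation under the norm constraint (Eq. 8).
    e_adv = EPSILON * g / (tf.norm(g) + 1e-12)

    # Minimize the clean loss (Eq. 5) plus the adversarial loss (Eq. 7).
    with tf.GradientTape() as tape:
        clean_loss = loss_fn(labels, classifier(s_prime, training=True))
        adv_loss = loss_fn(labels, classifier(s_prime + e_adv, training=True))
        total_loss = clean_loss + adv_loss
    grads = tape.gradient(total_loss, classifier.trainable_variables)
    optimizer.apply_gradients(zip(grads, classifier.trainable_variables))
    return total_loss
```

Treating g as a constant via tf.stop_gradient plays the role of the fixed copy \( \hat{\theta} \): the perturbation is not differentiated through when the parameters are updated.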

4 Experiments and Results

4.1 Datasets

Our experiments are conducted on the SemEval-2010 Task 8 dataset, which is widely used for relation classification [13]. The dataset contains 10,717 annotated examples, including 8,000 sentences for training and 2,717 for testing. The relationships between nominals in the corpus fall into 10 categories, the nine actual relations plus Other; the nine relations and examples are listed in Table 1. We adopt the official evaluation metric, the macro-averaged F1-score over the nine actual relations.

Table 1. 9 relationships and examples in our dataset
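For reference, the macro-averaged F1 over the nine actual relations (excluding Other) can be approximated with scikit-learn as sketched below; the label indexing is a hypothetical assumption, and this simplified version ignores the direction checking performed by the official scorer:

```python
from sklearn.metrics import f1_score

# Hypothetical label encoding: 0 = "Other", 1..9 = the nine actual relations.
y_true = [1, 2, 3, 0, 4, 9]
y_pred = [1, 2, 5, 0, 4, 9]

# Macro-averaged F1 restricted to the nine actual relations (label 0 excluded).
# In this toy example, labels absent from the data score 0 and raise a warning;
# on the real test set all nine relations are present.
macro_f1 = f1_score(y_true, y_pred, labels=list(range(1, 10)), average="macro")
print(f"macro-F1 over the nine relations: {macro_f1:.3f}")
```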

4.2 Comparative Methods

To evaluate the effectiveness of our model, we compare its performance with notable traditional machine learning approaches and deep learning models, including CNN, RNN and other neural network architectures. The comparative methods are introduced below.

  • Traditional machine learning algorithms: As a traditional handcrafted-feature-based approach, [19] fed features extracted from many external corpora into an SVM classifier and achieved an F1-score of 82.2%.

  • RNN based models: MV-RNN is a recursive neural network built on the constituency tree and achieves performance comparable to the SVM [22]. SDP-LSTM is a gated recurrent neural network; it was the first attempt to use LSTMs on this task and raised the F1-score to 83.7% [5].

  • CNN based models: [9] constructed a CNN over the word sequence with integrated word position embeddings, making a breakthrough on the task. CR-CNN extended the basic CNN by replacing the common softmax cost function with a ranking-based cost function [3], and achieved an F1-score of 84.1%. Using a simple negative sampling method, depLCNN+NS introduced additional samples from other corpora such as the NYT dataset, which improved performance to an F1-score of 85.6% [4]. Att-Pooling-CNN appended multi-level attention to the basic CNN model and achieved the previous state-of-the-art F1-score on this task [6].

  • RNN combined with CNN: DepNN is a convolutional neural network combined with a recursive neural network designed to model the subtrees, and it achieves an F1-score of 83.6% [2].

4.3 Experimental Setup

We utilize the 200-dimensional word embeddings released by Stanford. For model parameters, we set the dimension of the entity position feature vector to 20. We use the Adam optimizer with a batch size of 64, an initial learning rate of 0.001 and a learning rate exponential decay factor of 0.99 at each training step. The word window size on the convolutional layer is fixed to 3. We also apply dropout with a ratio of 0.5 when training the neural network. For adversarial training, we empirically choose ε = 0.02. We train each method for 50,000 steps in the comparative experiments.
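A sketch of this optimizer configuration using the Keras exponential-decay schedule; the hyperparameter values follow the text above, while the choice of schedule API is our assumption:

```python
import tensorflow as tf

# Learning rate starts at 0.001 and decays by a factor of 0.99 at every training step.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001, decay_steps=1, decay_rate=0.99)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

BATCH_SIZE = 64        # batch size
WINDOW_SIZE = 3        # convolution window k
POSITION_DIM = 20      # entity position feature dimension
DROPOUT_RATE = 0.5     # dropout ratio
EPSILON = 0.02         # adversarial perturbation norm
TRAIN_STEPS = 50_000   # training steps per method
```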

We run all experiments using TensorFlow on two Tesla V100 GPUs. Our model took about 8 min per epoch on average.

4.4 Results Analysis

Comparison with Other Models.

Table 2 presents the best results achieved by our adversarial-training-based model (ANRC) and the comparative methods. We observe that our model achieves an F1-score of 88.7%, outperforming the state-of-the-art models.

Table 2. Results of our model and comparative methods

From the results in Table 2, we can also see that among the end-to-end frameworks, CNN architectures achieve better performance than RNN ones. The negative sampling employed in depLCNN+NS pushes the F1-score above 85%, and the attention mechanism introduced in the Att-Pooling-CNN model significantly improves the effectiveness of relation classification. Although we use a Bi-LSTM as the basic classification model, our approach still improves on these results, which demonstrates the effectiveness of the adversarial training framework.

Robustness of Adversarial Training.

To test the robustness of our model, we remove half of the training data and evaluate each model's precision on the training data and test data respectively. Using the same Bi-LSTM model with attention as the relation classifier, we adopt three different strategies to prevent overfitting: adversarial training plus dropout, random noise plus dropout, and dropout alone. Comparative results are shown in Table 3. Although the adversarial training plus dropout method loses a little precision on the training data, it achieves a precision on the test data that prominently outperforms the other strategies. This demonstrates that training with adversarial perturbations effectively alleviates overfitting when training data is scarce, and that our model is more robust to small, approximately worst-case perturbations.

Table 3. F1-score in the case of halving training data

Convergence of Adversarial Training.

We compare the convergence behavior of our method using adversarial training with that of the baseline Bi-LSTM model with attention. We plot the performance of the two models at each iteration in Fig. 4. From this figure, we find that training with adversarial examples converges more slowly, while the final F1-score is higher. This suggests that we could pre-train the model without adversarial training to speed up the process.

5 Conclusion and Future Work

In this paper, we proposed an adversarial training framework for relation classification, named ANRC, to improve the performance and robustness of relation classification. Experimental results demonstrate that training with adversarial perturbations outperforms methods using random perturbations and dropout in terms of reducing overfitting, and that our model, which uses a Bi-LSTM relation classifier with word-level attention, outperforms previous models. In future work, we will construct various relation classifier models and apply the adversarial training framework to other tasks.