1 Introduction

Chinese implicit discourse relation recognition has drawn increasing attention because it is crucial for Chinese discourse understanding. Recently, the Chinese Discourse Treebank (CDTB) was released [1]. Although the Chinese discourse corpora share a similar annotation framework with the Penn Discourse Treebank (PDTB) for English, the statistical differences are obvious and significant. First, connectives occur much less frequently in Chinese than in English [2]. Second, the relation distribution in Chinese is more unbalanced than that in English. Third, the relation annotation for Chinese implicit cases is more semantic, due to the essential characteristics of the language [3]. This evidence indicates that the implicit discourse relation recognition task for Chinese differs from that for English.

Unfortunately, there is little existing work on the Chinese discourse relation problem [4, 7], so our work is mainly inspired by studies on English. Conventional approaches to identifying English discourse relations rely on handcrafted features extracted from the two arguments, including word pairs [8], VerbNet classes [10], Brown clusters [24], production rules [15] and dependency rules [9]. These features indeed capture some correlation with the discourse relation and achieve considerable performance in explicit cases. However, implicit discourse relation recognition is much harder, due to the absence of connectives.Footnote 1 Moreover, these hand-crafted features usually suffer from the data sparsity problem [19] and are weak at capturing the deep semantic features of discourse [22].

To tackle this problem, deep learning methods have been introduced to this area. They learn dense real-valued vector representations of the arguments, which capture the semantics to some extent and alleviate the data sparsity problem at the same time. Recently, a variety of neural network architectures have been explored for this task, such as convolutional neural networks [32], recursive networks [22], feed-forward networks [26], recurrent networks [25], attentional networks [23] and hybrid feature models [5, 6]. These studies show that deep learning can achieve comparable or even better performance than conventional approaches with complicated hand-crafted features.

More recently, there has been growing interest in memory augmented neural architectures. The advantage of an external memory is that it captures and preserves information useful for the task: the core idea is to keep this information in independent memory slots, and to trigger and retrieve the relevant slots to support inference. This design has proven effective in many works, including the Neural Turing Machine [17], memory networks [28], dynamic memory networks [21], matching networks [29], etc.

Therefore, in this paper, we propose a memory augmented attention model (MAAM) to handle the Chinese implicit discourse relation recognition task. It represents the arguments with an attention-based neural network, retrieves relation inference support information from an external memory, and then combines the representation with this memory support information to complete the classification.

More specifically, the procedure of our model can be divided into five steps: (1) A general encoder module transforms the input arguments from word sequences into dense vectors. (2) An attention module scores the importance of each word based on the given context, and the weighted sum of the words is used as the argument representation. (3) An external memory produces an output based on this argument representation. (4) A memory gate combines the memory output with the attention representation to generate a refined representation of the arguments. (5) Finally, we stack a feed-forward network as the classification layer to predict the discourse relation. Extensive experiments and analysis show that our proposed method achieves new state-of-the-art results on the Chinese Discourse Treebank (CDTB).

2 Memory Augmented Attention Model

In this section, we first give an overview of the modules that build up memory augmented attention model (MAAM). We then introduce each module in detail and give intuitions about its formulation. A high-level illustration of the MAAM is shown in Fig. 1.

Fig. 1. The basic framework of our model, including (1) General Encoder Module, (2) Content-based Attention Module, (3) External Memory Module, (4) Memory Gate and (5) Classification Module.

As shown in Fig. 1, our framework consists of five modules: (1) general encoder module; (2) content-based attention module; (3) external memory module; (4) memory gate; (5) classification module.

The General Encoder Module encodes the word sequences of the two arguments into distributed vector representations. It is implemented with a bidirectional recurrent neural network.

The Attention Module is proposed to capture the importance (attention) of each word in the two arguments. We score the weight of each word in an argument based on its inner context and generate a weighted sum as the argument representation.

The External Memory Module consists of a fixed number of memory slots. The external memory computes a match score between the argument representation and each memory slot, yielding a probability distribution over the slots, and then generates a weighted sum of the slots as the memory output.

The Memory Gate is a learnable controller component that computes a convex combination of the original argument representation and the memory output to generate a refined representation.

The Classification Module stacks on the refined representation of the arguments and outputs the final discourse relation. We implement this module with a two-layer feed-forward network which can capture the interaction between two arguments implicitly.

2.1 General Encoder Module

In implicit discourse relation recognition, the input is the word sequences of the two arguments Arg1 and Arg2. We choose a recurrent neural network [16] to encode the arguments. Word embeddings are given as input to the recurrent network. At each time step t, the network updates its hidden state \(h_{t}= RNN(x_{t},h_{t-1})\), where \(x_{t}\) is the embedding vector of the t-th word of the input argument. In our model, we use a gated recurrent unit (GRU) to replace the normal RNN unit [12]. The GRU is a variant of the RNN which works much better than the original one and suffers less from the vanishing gradient problem by introducing a gate structure similar to that of the Long Short-Term Memory (LSTM) [18]. Assume each time step t has an input \(x_{t}\) and a hidden state \(h_{t}\). The GRU is formulated as follows:

$$\begin{aligned} z_{t}&= \sigma (W_{z}x_{t}+U_{z}h_{t-1}+b_{z}) \end{aligned}$$
(1)
$$\begin{aligned} r_{t}&= \sigma (W_{r}x_{t}+U_{r}h_{t-1}+b_{r}) \end{aligned}$$
(2)
$$\begin{aligned} \tilde{h_{t}}&= tanh(Wx_{t}+r_{t}\circ Uh_{t-1}+b_{h}) \end{aligned}$$
(3)
$$\begin{aligned} h_{t}&= z_{t}\circ h_{t-1}+(1-z_{t})\circ \tilde{h_{t}} \end{aligned}$$
(4)

In brief, the simple version of the GRU is \(h_{t} = GRU(x_{t},h_{t-1})\). The RNN and its variant described above read an input sequence x in order, from the first word to the last. However, we expect the representation of each word to summarize not only the preceding words but also the following words. Thus, we propose to use a bidirectional RNN [27]. A Bi-RNN consists of a forward and a backward RNN: the forward RNN reads the input sequence from left to right, while the backward RNN reads the sequence in the reverse order.

$$\begin{aligned} \overrightarrow{h_{t}} = \overrightarrow{GRU}(x_{t},\overrightarrow{h_{t-1}}) \end{aligned}$$
(5)
$$\begin{aligned} \overleftarrow{h_{t}} = \overleftarrow{GRU}(x_{t},\overleftarrow{h_{t-1}}) \end{aligned}$$
(6)

We obtain the representation of each word by concatenating the two hidden state sequences generated by the forward and backward RNNs.

$$\begin{aligned} h_{t} = [\overrightarrow{h_{t}};\overleftarrow{h_{t}}] \end{aligned}$$
(7)

In this way, the representation \(h_{t}\) of each word contains the summary of both the preceding words and the following words.
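For concreteness, a minimal PyTorch sketch of this bidirectional GRU encoder is given below; the layer sizes, variable names and the use of nn.GRU are illustrative assumptions rather than our exact implementation.

    import torch
    import torch.nn as nn

    class BiGRUEncoder(nn.Module):
        def __init__(self, vocab_size, emb_dim=300, hidden_dim=128):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim)
            # bidirectional=True runs a forward and a backward GRU and
            # concatenates their hidden states at each step, as in Eq. (7)
            self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)

        def forward(self, word_ids):
            # word_ids: (batch, seq_len) -> h: (batch, seq_len, 2 * hidden_dim)
            x = self.embedding(word_ids)
            h, _ = self.gru(x)
            return h

    # Each argument is encoded independently.
    encoder = BiGRUEncoder(vocab_size=10000)
    arg1_ids = torch.randint(0, 10000, (2, 20))   # a batch of 2 arguments, 20 words each
    h_arg1 = encoder(arg1_ids)                    # (2, 20, 256)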

2.2 Attention Module

After obtaining the representations of the arguments by treating each word equally in the general encoder module, we now apply the content-based attention module to score the importance of each word in the arguments. We evaluate the weight of each word based only on its inner context. The motivation is that, since the connective is absent in implicit samples, we can utilize the context of the arguments to generate an appropriate representation. Obviously, the contribution of each word in the context is not the same, and it is natural to capture the correlation between the context-dependent word features and the discourse relation using an attention mechanism. In our case, we use a multilayer perceptron to implement the attention module:

$$\begin{aligned} e_{t}=u_{a}^{T}tanh(W_{a}h_{t}+b_{a}) \end{aligned}$$
(8)

Notice that \(h_{t}\) is generated by the general encoder module. The weight of each word representation \(h_{t}\) is computed using the softmax function:

$$\begin{aligned} a_{t}=\frac{exp(e_{t})}{\sum _{j=1}^{T}exp(e_{j})} \end{aligned}$$
(9)

We then compute the vector \(v_{Arg1}\) as the weighted sum of the word representations of Arg1:

$$\begin{aligned} v_{Arg1}=\sum _{t=1}^{T}a_{t}h_{t} \end{aligned}$$
(10)

We generate the vector of Arg2 in the same way. Then we directly concatenate the two vectors as the representation of the arguments:

$$\begin{aligned} v_{Args} = [v_{Arg1};v_{Arg2}] \end{aligned}$$
(11)
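A minimal sketch of this content-based attention (Eqs. (8)-(11)) in PyTorch is shown below; the attention dimension and names are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ContentAttention(nn.Module):
        def __init__(self, dim, att_dim=100):
            super().__init__()
            self.W_a = nn.Linear(dim, att_dim)            # W_a h_t + b_a
            self.u_a = nn.Linear(att_dim, 1, bias=False)  # u_a^T tanh(.)

        def forward(self, h):
            e = self.u_a(torch.tanh(self.W_a(h)))   # scores e_t, Eq. (8); (batch, seq_len, 1)
            a = torch.softmax(e, dim=1)             # weights a_t, Eq. (9)
            return (a * h).sum(dim=1)               # weighted sum, Eq. (10)

    attend = ContentAttention(dim=256)
    v_arg1 = attend(torch.randn(2, 20, 256))        # (2, 256)
    v_arg2 = attend(torch.randn(2, 25, 256))
    v_args = torch.cat([v_arg1, v_arg2], dim=-1)    # Eq. (11): (2, 512)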

2.3 External Memory Module

Once we have the semantic representation of the arguments, we use it to interact with our augmented memory. The external memory consists of memory slots, which are activated by particular patterns of the arguments and generate a corresponding output as the response. This memory output is used in the following step to refine the original argument representation. Concretely, we first compute the similarity score between \(v_{Args}\) and each memory slot \(m_{i}\) and produce a normalized weight \(w_{i}\) using the similarity measure \(K[\cdot ,\cdot ]\). In order to sharpen the focus, a sharpening factor \(\beta \) is applied.

$$\begin{aligned} w_{i}\leftarrow \frac{exp(\beta K[v_{Args},m_{i}])}{\sum _{j}exp(\beta K[v_{Args},m_{j}])} \end{aligned}$$
(12)

In our case, we use the cosine similarity as our metric.

$$\begin{aligned} K[u,v] = \frac{u\cdot v}{\left\| u \right\| \cdot \left\| v \right\| } \end{aligned}$$
(13)

Then, we generate the output from memory according to the weights.

$$\begin{aligned} m = \sum _{i} w_{i}m_{i} \end{aligned}$$
(14)

The memory design is mainly inspired by the Neural Turing Machine [17]. The memory captures the common patterns of the discourse relation distribution during training. For example, when an input relation sample accesses the external memory, the memory responds with an output vector that contains the information most related to the similar samples it has seen before. Intuitively, samples with similar representations usually belong to the same discourse relation. In summary, the memory implicitly holds discourse relation clustering information for the subsequent classification. The external memory component is randomly initialized and optimized during training.
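A minimal sketch of this memory read operation (Eqs. (12)-(14)) is given below; the slot count, dimensions and the explicit cosine normalization are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ExternalMemory(nn.Module):
        def __init__(self, n_slots=20, dim=512, beta=1.0):
            super().__init__()
            # memory slots are learned parameters, randomly initialized
            self.slots = nn.Parameter(0.01 * torch.randn(n_slots, dim))
            self.beta = beta

        def forward(self, v_args):
            v_norm = F.normalize(v_args, dim=-1)        # (batch, dim)
            m_norm = F.normalize(self.slots, dim=-1)    # (n_slots, dim)
            sim = v_norm @ m_norm.t()                   # cosine similarity K, Eq. (13)
            w = torch.softmax(self.beta * sim, dim=-1)  # sharpened addressing weights, Eq. (12)
            m = w @ self.slots                          # weighted read, Eq. (14)
            return m, w                                 # w is the "activation" inspected in Sect. 3.4

    memory = ExternalMemory(n_slots=20, dim=512)
    m, w = memory(torch.randn(2, 512))                  # m: (2, 512), w: (2, 20)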

2.4 Memory Gate

Once we have access to the output information m from the memory, we can use it, along with the original argument representation \(v_{Args}\), to generate the refined representation \(\widetilde{v}\). We propose an interpolation strategy to combine these two vectors and employ a sigmoid function, called the memory gate, to control the final output.

$$\begin{aligned} \alpha = \sigma (W_{g}[v_{Args};m] + b_{g}) \end{aligned}$$
(15)

where \(\sigma \) is a sigmoid function. We then compute a convex combination of the memory output and the original argument representation:

$$\begin{aligned} \widetilde{v} = \alpha \cdot v_{Args} + (1-\alpha ) \cdot m \end{aligned}$$
(16)

The memory gate is a learnable neural layer. The idea behind it is that, although the memory can return potentially useful clustering structure information, its contribution should not be fixed; we therefore build a gate mechanism to control the memory output and mix it with the original argument representation.
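A minimal sketch of the gate (Eqs. (15)-(16)) is shown below; it uses an element-wise gate vector, although a scalar gate is an equally valid reading of Eq. (15), and the names are illustrative.

    import torch
    import torch.nn as nn

    class MemoryGate(nn.Module):
        def __init__(self, dim=512):
            super().__init__()
            self.gate = nn.Linear(2 * dim, dim)   # W_g [v_args; m] + b_g

        def forward(self, v_args, m):
            alpha = torch.sigmoid(self.gate(torch.cat([v_args, m], dim=-1)))  # Eq. (15)
            return alpha * v_args + (1 - alpha) * m                           # Eq. (16)

    gate = MemoryGate(dim=512)
    v_refined = gate(torch.randn(2, 512), torch.randn(2, 512))                # (2, 512)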

2.5 Classification Module

Given the refined representation vector \(\widetilde{v}\) of the arguments, we implement the classification module using a two-layer feed-forward network which is followed by a standard softmax layer.

$$\begin{aligned} \tilde{y} =softmax(tanh(W_{c}\widetilde{v}+b_{c})) \end{aligned}$$
(17)

where \(\tilde{y}\) is the predicted label. During training, we optimize the network parameters by minimizing the cross-entropy loss between the true and predicted labels.
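A minimal sketch of this classification module is given below. Eq. (17) abbreviates it as a single affine layer followed by softmax; here we sketch the two-layer reading described in the text, and the hidden size, dropout rate and the ten CDTB relation classes are assumptions.

    import torch
    import torch.nn as nn

    class Classifier(nn.Module):
        def __init__(self, dim=512, hidden=128, n_classes=10, dropout=0.5):
            super().__init__()
            self.ffn = nn.Sequential(
                nn.Linear(dim, hidden),
                nn.Tanh(),
                nn.Dropout(dropout),          # dropout before the softmax layer (see Sect. 3.2)
                nn.Linear(hidden, n_classes)  # logits; softmax is applied afterwards
            )

        def forward(self, v_refined):
            return self.ffn(v_refined)

    clf = Classifier()
    logits = clf(torch.randn(2, 512))          # (2, 10)
    probs = torch.softmax(logits, dim=-1)      # \tilde{y} of Eq. (17)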

3 Experiments

3.1 Corpora

We evaluate our model on the Chinese Discourse Treebank (CDTB) [1,2,3, 25], which was published as the standard corpus of the CoNLL 2016 shared task. In our work, we experiment on the ten relations in this corpus, following the setup suggested by the shared task. We directly adopt the standard training set, development set, test set and blind test set. We also use the word embeddings provided by the CoNLL 2016 shared task.

3.2 Training Details

To train our model, the objective function is defined as the cross-entropy loss between the outputs of the softmax layer and the ground-truth class labels. We use the Adadelta algorithm to optimize the whole network. To avoid over-fitting, dropout is applied to the layer before the softmax.
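A minimal sketch of such a training loop is given below; model, train_loader and the number of epochs are placeholders, not our exact settings.

    import torch
    import torch.nn as nn

    def train(model, train_loader, n_epochs=10):
        optimizer = torch.optim.Adadelta(model.parameters())
        criterion = nn.CrossEntropyLoss()      # softmax + negative log-likelihood
        model.train()                          # enables the dropout layer
        for epoch in range(n_epochs):
            for arg1, arg2, label in train_loader:
                optimizer.zero_grad()
                logits = model(arg1, arg2)     # forward pass through the five modules
                loss = criterion(logits, label)
                loss.backward()
                optimizer.step()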

3.3 Experimental Results

To exhibit the effectiveness of our model, our experimental results consist of three parts: baselines, MAAM variants and MAAMs.

Baselines: We adopt two baselines for our experiments: one is “Conjunction” and the other is “Focused RNN”, which achieved the best result in the CoNLL 2016 shared task.

We implement the first baseline, “Conjunction”, which simply labels every test sample as “Conjunction”. Due to the imbalance of the corpus (see Table 1), this baseline is very strong: according to the CoNLL report by Xue et al. [25], many participating systems could not beat it.

The “Focused RNN” is proposed by Weiss and Bajec [31] and is implemented with a focused recurrent neural network that can selectively react to different contexts. Its result is taken directly from the CoNLL 2016 report.

MAAM variants: Since there are few published results on CDTB, it is necessary to evaluate variants of our model. These variants help to reveal the contribution of each module, since each variant differs only slightly from our final model. The details of the MAAM variants are given below.

MAAM+0memslot+no Encoder: This variant uses no encoder module at all; it applies the same attention layer directly to the word embedding sequences of the arguments. This model explores the effectiveness of embedding features that lack context-dependent information.

MAAM+0memslot+GRU Encoder: This system uses only a single (unidirectional) GRU as the encoder; it is used to assess the effectiveness of the bidirectional encoder.

MAAM+0memslot+Mean (no Attention): Instead of using the attention mechanism, this system directly represents each argument as the mean of all hidden states of the Bi-GRU, treating every word in the argument equally.

We can see from Table 2 that the proposed MAAM is better than all the variants. It is obvious that both the context and the attention are beneficial for the distributed argument representation in discourse relation recognition.

Table 1. The experimental results on the CoNLL 2016 shared task

MAAMs: Now we compare our memory augmented attention model (MAAM) with other approaches in the closed track. Our memory models (containing different numbers of slots: 1, 20, 50, 100 and 150) all outperform the two baselines, and the one with 20 slots achieves the best result, which is the new state-of-the-art on CDTB. We also observe an interesting phenomenon in our memory models: as the number of memory slots grows, the performance first improves (from 0 to 20 slots) but then gradually decreases (from 20 to 50, 100 and 150 slots). We speculate that under-fitting (the lack of adequate training samples) is the main reason. Compared with MAAM+0memslot, all memory settings obtain better results, demonstrating the effectiveness of the proposed external memory component.

3.4 Discussion and Analysis

The experimental results demonstrate the superiority of our memory augmented attention model. In this section, we discuss the behavior of the external memory and the attention module in the network.

Fig. 2. Memory activations for different relation samples. The horizontal axis corresponds to the 10 memory slots; the vertical axis corresponds to different implicit discourse relation samples (Conjunction-Conj; Expansion-Exp; EntRel-EntR). Each row shows the activations of the memory slots for one input sample; a deeper color indicates a higher score.

Memory Analysis: The results show that the external memory component is significantly helpful for performance. In order to understand how the memory component works, we examine a memory component with 10 memory slots in Fig. 2. As mentioned above, a memory slot is triggered when relevant input arguments query the memory component: the memory computes a score for each slot based on the input arguments, and we call these scores activations. We feed 13 argument pairs belonging to different discourse relations into the memory component. The activations of the 10 memory slots triggered by the different relation samples are shown in Fig. 2; a deeper color means the slot receives a higher activation, and each row exhibits the activations of the memory slots for one input relation sample. As we can see, arguments belonging to the same relation always trigger the same slots (locations) in the memory component. For instance, the “EntRel” samples always focus on the 2nd slot (in the horizontal direction) and the “Conjunction” samples trigger the 8th slot.

Fig. 3. t-SNE visualization of the Chinese discourse relation distribution. Note the clustering of each relation; “Expansion” is shown in blue. Conjunction-0; Expansion-1; EntRel-2; AltLex-3; Causation-4; Contrast-5; Purpose-6; Conditional-7; Temporal-8; Progression-9. As we can see, the “Conjunction” relation acts as a background for the rest of the relations.

Representation Analysis: In order to understand the discourse relation distribution (representation) in our model, we show a t-SNE visualization of the Chinese implicit discourse relation samples in Fig. 3 (using the feature space of the classification module). As we can see, the “Conjunction” relation samples mostly act as a background for every other relation. This may be caused by the definition of “Conjunction”.Footnote 2 Meanwhile, the other relation samples are hard to distinguish from the “Conjunction” samples. This also indicates that Chinese implicit relation recognition is a difficult task.
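For reference, a sketch of such a t-SNE visualization is shown below, assuming features holds the refined representations fed to the classification module and labels the gold relation ids; both arrays here are random placeholders.

    import numpy as np
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    features = np.random.randn(500, 512)        # placeholder for the refined representations
    labels = np.random.randint(0, 10, 500)      # placeholder relation ids (0-9, as in Fig. 3)

    coords = TSNE(n_components=2, random_state=0).fit_transform(features)
    sc = plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=8)
    plt.colorbar(sc, label="relation id")
    plt.show()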

Attention Analysis: Our attention module scores each word based on its inner context. It captures the correlation between the content and the discourse relation, unlike independent word embeddings, which cannot access the surrounding context. In Fig. 4, a “Causation” example extracted from the corpus shows that our model pays more attention to content words than to function words. We annotate the alignment between the Chinese relation sample and its English translation. The attention module focuses on “international; steady; expansion” in Arg1 and “for China’s export; provides; international environment” in Arg2, which can roughly be considered a simple summarization of the two arguments. This example demonstrates the effect of the proposed attention module. The attention results also make us wonder whether words should be scored differently when dealing with different relations.

Fig. 4. Attention for a “Causation” sample. The attention module focuses on “international, steady, expansion” in Arg1 and “for China’s export, provides, international environment” in Arg2.

Discussion: Another issue we observed is the ambiguity and data imbalance of Chinese implicit discourse relations. Compared with English, Chinese contains far fewer explicit connectives, which is the main source of the Chinese implicit relation recognition problem. Therefore, many relation samples are hard to distinguish from “Conjunction” unless the relation is quite obvious to the annotator. Our approach is actually based on the assumption that every relation has a prototype sample, so we hoped that our memory component could capture each discourse relation prototype and identify it in unseen samples. However, we did not observe positive results to support this assumption.

4 Related Work

Implicit discourse relation recognition has been a hot topic in recent years. However, most of the approaches focus on English. There are mainly two directions related to our study: (1) English implicit discourse relation recognition using neural networks, and (2) memory augmented networks.

Conventional implicit relation recognition approaches rely on various hand-crafted features [8, 11, 24]; these surface features usually suffer from the sparsity problem. Neural network based approaches were then proposed. To alleviate the feature sparsity problem, Ji and Eisenstein [19] first transformed the surface features of the arguments into low-dimensional distributed representations to boost performance. A discourse document usually covers units of different scales, from words and sentences to paragraphs. To model this kind of structure, Li [22] and Ji [20] both introduced recursive networks to represent arguments and facilitate discourse parsing. Treating discourse relation recognition as a text classification problem, Liu et al. [23] propose a convolutional neural network (CNN) to detect sequence features in the arguments and predict the relation. Rutherford et al. [25] conduct experiments to explore the effectiveness of feed-forward and recurrent neural networks. Liu and Li [23] use an attention mechanism to refine the representation of the arguments by reweighting the importance of different parts of each argument. Braud and Denis [13, 14] utilize word representations to improve implicit discourse relation classification; their method investigates the correlation between word embeddings and discourse relations.

The memory model is inspired by recently proposed memory augmented networks. The Neural Turing Machine (NTM) [17] builds an external memory component to explicitly preserve various subsequence patterns, which makes the NTM more effective at learning from training samples. Another type of memory augmented network is the memory network [28], which differs from the NTM and works more like a cache for particular data: it saves sentences in memory to support multi-step question answering inference. More recently, the matching network was proposed by Vinyals et al. [29]; its memory component caches common patterns of the representations and corresponding labels of the training samples, and it predicts a label by matching the input sample against the memory cache and generating the weighted sum of labels (under the matching distribution) as the final output. Since such a memory can capture particular patterns of samples and be optimized during training, we extend it in our framework to maintain the information crucial for Chinese implicit relation recognition. The experimental results verify the efficacy of the proposed memory component, and the memory augmented model achieves the best performance on CDTB.

5 Conclusion

In this paper, we have proposed a memory augmented attention model for Chinese implicit discourse relation recognition. The attention network is employed to learn the semantic representations of the two arguments Arg1 and Arg2, and the memory network is introduced to capture the underlying clustering structure of the samples. Extensive experiments show that our proposed method achieves new state-of-the-art results on CDTB.