1 Introduction

A long-term goal in the field of artificial intelligence is to build an intelligent human-machine dialogue system that is capable of understanding human language and giving smooth and correct responses. Especially with the success of speech-based human-computer interfaces, there is a great need for effective dialogue agents, such as digital personal assistants, that can handle everyday tasks such as booking flights. Spoken language understanding (SLU) and dialogue management (DM) are two essential parts of building a spoken dialogue system [1].

A typical dialogue system executes the following components: (i) automatic speech recognition converts a spoken query into a transcription; (ii) the SLU component analyzes the transcription to extract semantic representations; (iii) the dialogue manager interprets the semantic information and decides the best system action, according to which the system response is generated either as a natural language output or a result page.

SLU aims to obtain semantic representations from user utterances. It comprises two main tasks: slot filling and intent detection. Slot filling assigns a semantic concept to each word in an utterance, while intent detection identifies the intent that the user expresses. DM is responsible for controlling the dialogue flow, tracking the dialogue state, and deciding what actions the system should take to handle the interaction between the user and the system. For DM, we focus on system action prediction (SAP) in this work.

Traditional approaches train the SLU model and the SAP model separately, which restricts knowledge sharing. To take full advantage of all supervised signals and utilize information from both tasks, joint models have been explored [2, 3]. However, traditional joint learning simply combines the loss functions of slot filling and intent detection, which limits how effectively information can be used and transmitted. We consider that intent labels, slot tags and actions are correlated, and that intent information is helpful for both slot filling and SAP. We therefore use intent information as a condition, integrating it with the semantic representations for slot filling and SAP.

In this paper, we propose a conditional joint model that performs SLU and SAP. In our model, we obtain semantic representations from a shared Bi-LSTM layer. Intent information is provided to predict slot tags and is used as a condition to predict system actions. Moreover, knowledge among the three supervised signals is shared implicitly by joint learning. We evaluate our model on the popular DSTC4 benchmark. The results show that our model performs strongly and outperforms other popular methods significantly.

The rest of the paper is structured as follows: Sect. 2 discusses related work, Sect. 3 gives a detailed description of our model, Sect. 4 presents experimental results and analysis, and Sect. 5 summarizes this work and future directions.

2 Related Work

In this section, we introduce some previous work on SLU and SAP.

SLU consists of two tasks: slot filling and intent detection. Traditionally, slot filling is viewed as a sequence labeling task and intent detection as an utterance classification task, and the two tasks are usually handled by separate models.

Machine learning methods such as hidden Markov models (HMMs) [4] and conditional random fields (CRFs) [5] have been widely employed for slot filling, but they require complicated feature engineering. Models with neural network architectures show advantages in feature generation and perform well on the slot filling task: RNNs are applied in [6,7,8], [9] utilized LSTMs to generate context-aware distributions for capturing temporal dependencies, and [10] enhanced LSTM-based slot filling to model label dependencies.

Several classifiers, such as SVMs [11], AdaBoost [12] and maximum entropy [13], have been employed for intent detection. With the development of deep learning, deep belief networks (DBNs) have been applied [14], and [15] proposed an RNN architecture to improve intent detection.

In recent years, joint learning of slot filling and intent detection has been explored to utilize shared knowledge. [16] used a CNN-based triangular CRF to extract features for jointly learning slot filling and intent detection. [17, 18] adapted RNNs for joint learning of the two tasks. [19] presented a contextual method to exploit possible correlations between intent detection and slot filling. [20] utilized explicit alignment information in attention-based encoder-decoder models.

For SAP, [21] explored a partially observable Markov decision process (POMDP) to control system actions. RNN-based dialogue state tracking models for monitoring the dialogue progress were proposed [22]. [2] provided conjoint representations among utterances, slot-value pairs and knowledge graph representations to overcome current obstacles in deploying dialogue systems. [23] implemented an online learning framework to jointly train actions and the reward model with a Gaussian process model. [24] employed a value iteration method in a reinforcement learning framework. [25] described a novel framework using a genetic algorithm to optimize system actions. [3] proposed an end-to-end deep recurrent neural network with limited contextual dialogue history to train SLU and SAP jointly.

3 Model

The structure of our conditional joint model is shown in Fig. 1. It consists of an SLU model and a SAP model. First, the SLU model takes user utterances as inputs and obtains a context-aware distribution for each word through a shared Bi-LSTM layer; it then performs slot filling and intent detection through task-specific output layers. Using the hidden outputs from the SLU model, the SAP model produces a sentence-level distribution for each utterance and combines it with intent information to predict system actions.

Fig. 1. Conditional joint model

3.1 SLU Model

As shown in Fig. 2, the SLU model consists of an embedding layer, a shared Bi-LSTM layer, and task-specific output layers for slot filling and intent detection.

Fig. 2. SLU model

Embedding Layer. Given a sequence of words \(w_1, w_2, ..., w_T\) as input, we map them into a vector space to produce embeddings \(x=\{x_1, x_2, ..., x_T\}\), where \(x_t\) denotes the embedding of the t-th word.

Shared Bi-LSTM Layer. We employ a Bi-LSTM network to obtain the context-aware distribution of each word. Given the t-th word embedding \(x_t\), we calculate the forward and backward hidden states \(\overrightarrow{h_t}\) and \(\overleftarrow{h_t}\) respectively by the following equations,

$$\begin{aligned} \begin{aligned}&i_t = \sigma (W_{xi} x_t + W_{hi}h_{t-1}+ b_i) \\&f_t = \sigma (W_{xf}x_t + W_{hf} h_{t-1} + b_f) \\&o_t = \sigma (W_{xo}x_t + W_{ho}h_{t-1} + b_o) \\&\hat{c}_t = \tanh (W_{xc}x_t + W_{hc}h_{t-1} +b_c) \\&c_t = f_t \odot c_{t-1} + i_t \odot \hat{c}_t \\&h_t = o_t \odot \tanh (c_t) \\ \end{aligned} \end{aligned}$$
(1)

where \(\sigma \) is the sigmoid function, \(\odot \) is element-wise multiplication, and \(i\), \(f\), \(o\) and \(c\) represent the input gate, forget gate, output gate and cell state respectively. The \(W\) matrices and \(b\) vectors are trainable parameters; \(c_{t-1}\) and \(h_{t-1}\) are the previous cell state and hidden state. Finally, we obtain the final state \(h_t\) by concatenating the forward and backward hidden states, so that context information is integrated from both directions:

$$\begin{aligned} h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}] \end{aligned}$$
(2)
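To make the shared encoder concrete, the following is a minimal PyTorch sketch of the embedding layer and shared Bi-LSTM of Eqs. (1)-(2). The framework choice, class names, and hyperparameter values are illustrative assumptions on our part, not details specified by the model description.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Embedding layer plus shared Bi-LSTM; dimensions are illustrative."""
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # bidirectional=True concatenates forward and backward states,
        # matching the concatenation in Eq. (2)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, word_ids):        # word_ids: (batch, T)
        x = self.embedding(word_ids)    # (batch, T, emb_dim)
        h, _ = self.bilstm(x)           # h: (batch, T, 2 * hidden_dim)
        return h
```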

Intent Detection Layer. We stack another LSTM layer \(LSTM_{int}\) on top of the shared Bi-LSTM layer for intent detection,

$$\begin{aligned} h_{t}^{int} = LSTM_{int}(h_{t-1}^{int}, h_t) \end{aligned}$$
(3)

where \(h_t\) is the hidden state at time step t. We take the last hidden state \(h_T^{int}\) for intent detection. A user utterance may express more than one intent, so we use a sigmoid function to calculate the probability over all intent labels,

$$\begin{aligned} p^{int} = sigmoid(W_T^{int} h_T^{int}) \end{aligned}$$
(4)

where \(W_T^{int}\) is a weight matrix.

We then apply a threshold: an intent label is predicted if its probability is no less than the threshold,

$$\begin{aligned} y_n^{int} = {\left\{ \begin{array}{ll} 1, &{} p_n^{int} \ge threshold \\ 0, &{} otherwise \end{array}\right. } \end{aligned}$$
(5)

where \(n \in [1, N]\) is the index over intent labels.
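Under the same caveat, the intent detection layer of Eqs. (3)-(5) can be sketched as follows; the threshold value of 0.5 is an assumption, as the text does not state the value used.

```python
import torch
import torch.nn as nn

class IntentHead(nn.Module):
    """Stacked LSTM over shared encoder states with a sigmoid
    multi-label output, following Eqs. (3)-(5)."""
    def __init__(self, enc_dim, int_dim, num_intents, threshold=0.5):
        super().__init__()
        self.lstm = nn.LSTM(enc_dim, int_dim, batch_first=True)  # LSTM_int
        self.out = nn.Linear(int_dim, num_intents, bias=False)   # W_T^int
        self.threshold = threshold

    def forward(self, h):                         # h: (batch, T, enc_dim)
        states, _ = self.lstm(h)                  # Eq. (3)
        v_int = states[:, -1, :]                  # last hidden state h_T^int
        p_int = torch.sigmoid(self.out(v_int))    # Eq. (4)
        y_int = (p_int >= self.threshold).long()  # Eq. (5)
        return v_int, p_int, y_int
```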

Conditional Slot Filling Layer. We use the last hidden state \(h_T^{int}\) from the layer \(LSTM_{int}\) as an intent vector \(v^{int}\). The probability \(p_t^s\) is calculated as an attention weight that evaluates the contribution of the intent vector \(v^{int}\) to each hidden state \(h_t\) from the shared Bi-LSTM layer.

$$\begin{aligned} p_t^s = softmax(h_t \odot v^{int}) \end{aligned}$$
(6)

Then, we add the hidden state and weighted intent vector together for predicting slot labels.

$$\begin{aligned} h_t^s = h_t + v^{int} * p_t^s \end{aligned}$$
(7)

Finally, we take the label with the maximum probability as the predicted slot label,

$$\begin{aligned} y_t^s = argmax(softmax(W_t^s h_t^s + b^s)) \end{aligned}$$
(8)

where \(W_t^s\) is a weight matrix.
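A possible sketch of the conditional slot filling layer (Eqs. (6)-(8)) follows. We assume the intent LSTM dimension matches the encoder output dimension so that the element-wise product in Eq. (6) is well-defined, and we take the softmax over the hidden dimensions, which is one plausible reading of Eq. (6).

```python
import torch
import torch.nn as nn

class ConditionalSlotHead(nn.Module):
    """Intent-conditioned slot filling, following Eqs. (6)-(8)."""
    def __init__(self, enc_dim, num_slots):
        super().__init__()
        self.out = nn.Linear(enc_dim, num_slots)  # W_t^s and b^s

    def forward(self, h, v_int):   # h: (batch, T, d); v_int: (batch, d)
        v = v_int.unsqueeze(1)                 # (batch, 1, d)
        p_s = torch.softmax(h * v, dim=-1)     # Eq. (6), attention weight
        h_s = h + v * p_s                      # Eq. (7)
        logits = self.out(h_s)                 # inside the softmax of Eq. (8)
        return logits, logits.argmax(dim=-1)   # Eq. (8)
```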

3.2 Conditional Joint Model

To predict system actions, we join the SLU model and the SAP model so that each can make full use of information from the other. In multi-turn dialogues, history utterances play an important role in predicting system actions, so we group the utterances within a window \(u = \{u^1, u^2, ..., u^K\}\), where \(u^k\) is the k-th utterance in the window, and feed them into the SLU model for slot filling and intent detection. For each utterance, we obtain the hidden outputs \(h_t^k\ (t=1, ..., T;\ k=1, ..., K)\) and the intent vector \(v_k^{int}\) from the SLU model. We then take \(h_t^k\) as inputs to an LSTM layer \(LSTM_{joint}\) and use its last hidden state \(H_T^k\) to produce a sentence-level distribution.

$$\begin{aligned} H_t^k = LSTM_{joint} (H_{t-1}^k, h_t^k) \end{aligned}$$
(9)

We concatenate the sentence-level distribution \(H_T^k\) with the intent vector \(v_k^{int}\) to utilize intent information.

$$\begin{aligned} I^k = [H_T^k, v_k^{int}] \end{aligned}$$
(10)

Then the concatenated vector \(I^k\) is used as the input to the top Bi-LSTM layer to compute the system action representation \(h_k^{act}\),

$$\begin{aligned} {\left\{ \begin{array}{ll} &{} \overrightarrow{h_k^{act}} = LSTM_{act}^{fw} (\overrightarrow{h_{k-1}^{act}}, I^k) \\ &{} \overleftarrow{h_k^{act}} = LSTM_{act}^{bw} (\overleftarrow{h_{k+1}^{act}}, I^k) \\ &{} h_k^{act} = [\overrightarrow{h_k^{act}}, \overleftarrow{h_k^{act}}] \end{array}\right. } \end{aligned}$$
(11)

where \(LSTM_{act}^{fw}\) and \(LSTM_{act}^{bw}\) stand for the forward and backward LSTM networks for SAP respectively.

The last hidden state \(h_K^{act}\) is used for predicting system actions. The system can take more than one action for a user utterance, so we use a sigmoid function to calculate the probability over all system action labels,

$$\begin{aligned} p^{act} = sigmoid(W_K^{act} h_K^{act}) \end{aligned}$$
(12)

where \(W_K^{act} \) is a weight matrix.

Similar to the intent detection layer, we apply a threshold: a system action label is predicted if its probability is no less than the threshold,

$$\begin{aligned} y_m^{act} = {\left\{ \begin{array}{ll} 1, &{} p_m^{act} \ge threshold \\ 0, &{} otherwise \end{array}\right. } \end{aligned}$$
(13)

where \(m \in [1, M]\) is the index of system action labels.
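The SAP side of the joint model (Eqs. (9)-(13)) might be sketched as below, again with illustrative names, dimensions, and threshold.

```python
import torch
import torch.nn as nn

class ActionPredictor(nn.Module):
    """Sentence-level LSTM plus top Bi-LSTM over a window of K
    utterances, following Eqs. (9)-(13)."""
    def __init__(self, enc_dim, int_dim, sent_dim, act_dim,
                 num_actions, threshold=0.5):
        super().__init__()
        self.sent_lstm = nn.LSTM(enc_dim, sent_dim, batch_first=True)   # LSTM_joint
        self.act_bilstm = nn.LSTM(sent_dim + int_dim, act_dim,
                                  batch_first=True, bidirectional=True) # LSTM_act
        self.out = nn.Linear(2 * act_dim, num_actions, bias=False)      # W_K^act
        self.threshold = threshold

    def forward(self, h_per_utt, v_int_per_utt):
        # h_per_utt: list of K tensors (batch, T_k, enc_dim)
        # v_int_per_utt: list of K tensors (batch, int_dim)
        inputs = []
        for h_k, v_k in zip(h_per_utt, v_int_per_utt):
            H_k, _ = self.sent_lstm(h_k)                        # Eq. (9)
            inputs.append(torch.cat([H_k[:, -1, :], v_k], -1))  # Eq. (10): I^k
        I = torch.stack(inputs, dim=1)     # (batch, K, sent_dim + int_dim)
        acts, _ = self.act_bilstm(I)       # Eq. (11)
        p_act = torch.sigmoid(self.out(acts[:, -1, :]))  # Eq. (12)
        return p_act, (p_act >= self.threshold).long()   # Eq. (13)
```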

The loss function for SAP is defined as:

$$\begin{aligned} \mathcal {L}_{act} = - \sum _{m=1}^M a_m^{act} \log p_m^{act} \end{aligned}$$
(14)

where \(a_m^{act}\) is the ground-truth label of the m-th system action.

In this joint model, the losses for slot filling and intent detection are defined as,

$$\begin{aligned} \begin{aligned}&\mathcal {L}_{int} = - \sum _{n=1}^{N} g_n^{int} \log p_n^{int} \\&\mathcal {L}_{slot} = - \sum _{t=1}^T s_t^s \log p_t^{slot} \end{aligned} \end{aligned}$$
(15)

where N is the number of intent labels, \(g_n^{int}\) and \(s_t^s\) are the ground-truth intent and slot labels, and \(p_t^{slot} = softmax(W_t^s h_t^s + b^s)\) is the predicted slot distribution from Eq. (8). For joint learning of the SLU model and SAP model, we add the three losses together. The total loss is as follows.

$$\begin{aligned} \mathcal {L}_{total} = \sum _{\mathcal {D}} (\mathcal {L}_{act} + \mathcal {L}_{int} + \mathcal {L}_{slot}) \end{aligned}$$
(16)

where \(\mathcal {D}\) denotes the set of sequences in the dataset and the sum runs over its elements. Via joint learning with the united loss function, the shared hidden states tie the three tasks together, so their correlations can be learned and the tasks can promote each other.
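As an illustration of how the three losses could be combined in practice, here is a hedged sketch; the use of library cross-entropy functions and their default reductions is our assumption.

```python
import torch.nn.functional as F

def joint_loss(slot_logits, slot_gold, p_int, int_gold, p_act, act_gold):
    """Sum of the slot, intent, and action losses, following Eqs. (14)-(16).
    slot_logits: (batch, T, num_slots); slot_gold: (batch, T) class indices;
    p_int, p_act: sigmoid probabilities; int_gold, act_gold: multi-hot targets."""
    l_slot = F.cross_entropy(slot_logits.transpose(1, 2), slot_gold)
    l_int = F.binary_cross_entropy(p_int, int_gold.float())
    l_act = F.binary_cross_entropy(p_act, act_gold.float())
    return l_slot + l_int + l_act
```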

4 Experiment

In this section, we conduct experiments on the DSTC4 benchmark and present experimental results and analysis.

4.1 Corpus

The DSTC4 corpus contains multi-turn dialogues collected from Skype calls between tour guides and tourists. It covers tourist information about Singapore in five domains: accommodation, attraction, food, shopping, and transportation. In this paper, we use the DSTC4 corpus setting following [3]. The training set contains 5648 utterances, the validation set 1939 utterances, and the test set 3178 utterances. The numbers of slot labels, intent labels and system action labels are 87, 68, and 66, respectively. The statistics of DSTC4 are shown in Table 1.

Table 1. DSTC4 corpus setup in this work

4.2 Training Details

For fair comparison, we use the same training configuration as [3]. The model is trained on all of the training data with the learning rate initialized to 0.01. To regularize the model, we set the maximum gradient-clipping norm to 5 and the dropout rate to 0.5.

Table 2. Results for slot filling and intent detection

4.3 Metrics

Following [3], the performance of slot filling, intent detection and system action prediction is measured by token-level micro-averaged F1-score and frame-level accuracy (a frame is counted as correct only when every label in it is correct).
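For concreteness, minimal implementations of both metrics under a set-of-labels reading are shown below; they are illustrative sketches, not the official DSTC4 scoring scripts.

```python
def micro_f1(pred_sets, gold_sets):
    """Token-level micro-averaged F1 over per-frame label sets."""
    tp = sum(len(p & g) for p, g in zip(pred_sets, gold_sets))
    fp = sum(len(p - g) for p, g in zip(pred_sets, gold_sets))
    fn = sum(len(g - p) for p, g in zip(pred_sets, gold_sets))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def frame_accuracy(pred_sets, gold_sets):
    """Frame-level accuracy: a frame counts only if all labels match."""
    correct = sum(1 for p, g in zip(pred_sets, gold_sets) if p == g)
    return correct / len(gold_sets)
```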

4.4 Experiment Results and Analysis

We compare our model with the results from [3]. Table 2 shows the results for slot filling and intent detection. The previous methods for SLU on DSTC4 are:

  • CRF+SVMs: A CRF for slot filling and a linear SVM for intent detection, trained separately.

  • BiLSTMs: A shared Bi-LSTM layer for joint learning of slot filling and intent detection.

  • JointModel: A SAP model stacked directly on top of a history of SLU models.

From Table 2, we can see that our model achieves a substantial improvement on the slot filling task. In terms of F1 score, it outperforms the previous best result (BiLSTMs) by 3.17%. In terms of frame-level accuracy, it achieves a 1.1% improvement over the previous best result (CRF+SVMs). On the intent detection task, our model also performs well: it outperforms the previous best result (CRF+SVMs) by 0.06% in token-level F1 score and matches it in frame-level accuracy. For slot filling and intent detection combined, our model outperforms the previous best result (JointModel) by 0.63% in frame-level accuracy.

From all the results above, we conclude that, conditioned on intent information, our joint model performs well on SLU. This can be explained by the fact that intent labels provide effective information for predicting slot tags. Table 3 gives the results for SAP. The models in the table are as follows.

  • SVMs: A linear SVM with one-hot features of aggregated slots and intents.

  • BiLSTMs: A Bi-LSTM layer that takes the predicted slot and intent labels from the SLU model as input for system action prediction.

  • OraSAP (SVMs): A linear SVM with human-annotated slot tags and user intents.

  • OraSAP (biLSTM): A Bi-LSTM layer whose inputs are the same as OraSAP (SVMs).

Our conditional joint model outperforms all the other models in token-level F1 score, most notably in recall. Compared with the best baseline (SVMs), our model obtains a 2.54% improvement in F1 score and a 10.05% improvement in recall. By combining intent information for SAP, the model identifies the correct action labels more reliably, which accounts for the marked increase in recall.

Table 3. Results for system action prediction

We found that most user utterances in the dataset have more than one action label (the maximum is 20), which makes it difficult to predict all the actions correctly. To pursue a high F1 score, we trade off token-level F1 score against frame-level accuracy. It is therefore reasonable that our model ranks slightly lower in terms of frame-level accuracy.

Above all, our conditional joint model performs well on both SLU and SAP. This can be attributed to the fact that slot tags, intent labels and actions share knowledge with one another and promote each other via joint learning.

5 Conclusion

In this paper, we proposed a conditional joint model for spoken language understanding and dialogue management. The model achieves knowledge sharing among slot tags, intents and system actions by utilizing intent information. Experiments on the DSTC4 dataset demonstrate that it performs strongly and significantly outperforms other popular methods. In future work, we intend to explore how to integrate information from the three tasks explicitly for an enhanced joint model. In addition, we plan to extend our work to the spoken language generation task toward a more complete spoken dialogue system.