1 Introduction

Community Question Answering (cQA) sites such as Yahoo! Answers, Stack Overflow, Quora, WikiAnswers, and Google Ejabat give people the ability to post their questions and get them answered by other users. Interestingly, users can directly obtain short and precise answers rather than a list of potentially relevant documents. Community sites are growing rapidly over time, building up very large archives of previous questions and their answers. However, multiple questions with the same meaning can make information seekers spend more time searching for the best answer to their question. Therefore, retrieving similar questions could greatly improve the QA system and benefit the community. Detecting similar previous questions that best match a new user's query is a crucial and challenging task in cQA, known as Question Retrieval (QR). Using the existing answers to similar previous questions could avoid the lag time incurred by waiting for new answers, thus enhancing user satisfaction. Owing to its importance, the question retrieval task has received wide attention over the last decade [14, 17, 18]. One critical challenge for this task is the word mismatch between newly posted questions and the existing ones in the archives, as similar questions can be formulated using different but related words. For instance, the questions "How can we relieve stress naturally?" and "What are some home remedies to help reduce feelings of anxiety?" have nearly the same meaning but include different words and may therefore be regarded as dissimilar. This constitutes a barrier to traditional Information Retrieval (IR) and Natural Language Processing (NLP) models since users can phrase the same query using different wording. Furthermore, community questions are mostly short, have different lengths, and usually have sparse representations with little word overlap.
Although numerous attempts have been made to tackle this problem, most existing methods rely on bag-of-words (BOW) representations, which discard word order and ignore semantic and syntactic relationships. Recent advances in question retrieval have been achieved using Neural Networks (NNs) [5, 6, 8, 12], which provide powerful tools for modeling language, processing sequential data and predicting text similarity.

In this paper, we propose an approach based on NNs to detect the semantic similarity between the questions. The deep learning approach is based on a Siamese architecture with LSTM networks, augmented with an attention mechanism. We tested different similarity measures to compare the final hidden states of the LSTM layers.

2 Related Work

The question retrieval task has been intensively studied over the past decade. Early works were based on the vector space model (VSM) to calculate the cosine similarity between a query and archived questions [2]. However, the major limitation of VSM is that it favors short questions, while cQA services can handle a wide variety of questions not limited to factoid questions. Language Models (LMs) [3] have also been used to model queries as sequences of terms instead of sets of terms. LMs estimate the relative likelihood of each possible successor term, taking into account the relative positions of terms. Nevertheless, such models might not be effective when there are only a few common words between the questions. Further methods exploited the available category information of questions, such as in [2]. Wang et al. [15] used a parser to build syntactic trees of questions, and ranked them based on the similarity between their syntactic trees. Nonetheless, such an approach requires large training data, and existing parsers are still not well trained to parse informally written questions. Recent works focused on representation learning for questions, relying on the Word Embedding model for learning distributed representations of words in a low-dimensional vector space. Along with the popularization of word embeddings and their capacity to produce distributed representations of words, advanced NN architectures such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) and LSTM have proven effective in extracting higher-level features from the constituent word embeddings. For instance, Dos Santos et al. [5] employed CNN and bag-of-words (BOW) representations of the questions to calculate the similarity scores. Within the same context, Mohtarami et al. [8] developed a bag-of-vectors approach and used CNN and attention-based LSTMs to capture the semantic similarity between the community questions and rank them accordingly.
An LSTM model was also used in [12], where the weights learned by the attention mechanism were exploited for selecting important segments and enhancing syntactic tree-kernel models. More recently, the question retrieval task was modeled as a binary classification problem in [6] using a combination of LSTM and a contrastive loss function to effectively memorize long-term dependencies. In our work, we use a Siamese adaptation of LSTM [9] for pairs of variable-length sentences, named Siamese LSTM. It is worth noting that work on cQA has mostly been carried out for languages other than Arabic, mainly due to a lack of resources. Recent works in Arabic mainly rely on word embeddings and parse trees to analyze the context and syntactic structure of the questions [1, 7, 8, 13].

3 Description of the Proposed ASLSTM Approach

In order to improve the QR task, we propose an attentive Siamese LSTM approach for question retrieval, referred to as ASLSTM, to detect semantically similar questions in cQA. The approach is composed of three main modules, namely question preprocessing, word embedding learning and attentive Siamese LSTM. The basic principle underlying the ASLSTM approach is to map every question word token into a fixed-size vector. The word vectors of the questions are then fed to the Siamese LSTM, whose final hidden states encode the semantic meaning of the questions. An attention mechanism is integrated in the Siamese architecture to determine which words of the question should receive more attention than others. Community questions are then ranked by means of the Manhattan similarity function based on the vector representation of each question. A previously posted question is considered to be semantically equivalent to a queried question if their corresponding LSTM representations lie close to each other according to the Manhattan similarity measure. The historical question with the highest Manhattan score is returned as the most similar question to the newly posted one. The components of ASLSTM and the dataset used are described below.

3.1 Dataset

We used the dataset released in [19] for the QR evaluation. The questions of the community collection were harvested from all categories in the Yahoo! Answers platform, and were randomly split into the test and search sets while maintaining their distributions in all categories. The community questions in the collection have various structures and lengths and belong to diverse categories, e.g., Health, Sports, Computers and Internet, Diet and Fitness, Pets, Travel, Business and Finance, Entertainment and Music, etc. Table 1 gives some statistics on the experimental data set.

Table 1. Description of the data set

For our experiments in Arabic, we translated the same English collection using Google Translate, followed by careful manual verification, as there is no large Arabic dataset available for the question retrieval task. Note that the Arabic collection includes exactly the same number of questions as the English set.

3.2 Question Preprocessing

Preprocessing is important to make the question collections cleaner and easier to process. The question preprocessing module aims to filter the community questions and extract the useful terms in order to represent them in a formal way. It comprises text cleaning, tokenization, stopword removal and stemming. Punctuation marks, non-letters, diacritics, and special characters are removed. English letters are lowercased, while dates are normalized to the token date and numerical digits are normalized to the token num. For the Arabic question collection, in addition to the aforementioned tasks, orthographic normalization was applied, including Tachkil removal, Tatweel removal, and letter normalization.
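The cleaning steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the date pattern, the tiny stopword list, and the specific Arabic letter mappings (alef variants, alef maqsura, ta marbuta) are hypothetical choices that the paper does not specify, and stemming is left out.

```python
import re

# Arabic diacritics (tashkil) and the tatweel elongation character.
TASHKIL = re.compile(r"[\u064B-\u0652]")
TATWEEL = "\u0640"

# Assumed date pattern: numeric dates such as 12/05/2019 or 2019-05-12 only.
DATE = re.compile(r"\b\d{1,4}[/-]\d{1,2}[/-]\d{1,4}\b")
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
# Anything that is neither a word character nor whitespace, plus underscore.
NON_LETTER = re.compile(r"[^\w\s]|_")

# Tiny illustrative stopword list, not the one used in the paper.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "in"}


def normalize_arabic(text):
    """Orthographic normalization; the letter mappings are assumptions."""
    text = TASHKIL.sub("", text)            # remove diacritics
    text = text.replace(TATWEEL, "")        # remove elongation
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")  # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")  # ta marbuta -> ha
    return text


def preprocess(question, arabic=False):
    """Return the list of cleaned tokens for one question."""
    text = question.lower()
    if arabic:
        text = normalize_arabic(text)
    text = DATE.sub(" date ", text)      # dates first, so their digits are not split
    text = NUMBER.sub(" num ", text)     # remaining digit runs
    text = NON_LETTER.sub(" ", text)     # punctuation and special characters
    return [tok for tok in text.split() if tok not in STOPWORDS]
```

Dates are replaced before numbers so that a date is not broken into several num tokens, and diacritics are stripped before punctuation removal so that they do not split Arabic words.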

3.3 Word Embedding Learning

Word embeddings are low-dimensional vector representations of words, learned from large text corpora using shallow neural networks. In the word embedding learning module, we map every word into a fixed-size vector using Word2Vec pretrained on an external corpus. For the English word embeddings, we resorted to the publicly available word2vec vectors, with a dimensionality of 300, that were trained on 100 billion words from Google News.

For the experiments in Arabic, we used the Yahoo! Webscope dataset translated into Arabic, which includes 1,256,173 questions with 2,512,034 distinct words. The Continuous Bag-of-Words (CBOW) model was used, as experiments showed it to be more efficient than Skip-gram and to outperform it on our dataset [10]. The training parameters of the CBOW model on the Arabic collection were set after several tests as follows:

  • Size=300: feature vector dimension. We tested different values in the range [50, 500] but did not get significant difference in terms of precision.

  • Sample=1e-4: down-sampling ratio for the most frequent words in the corpus.

  • Negative samples=25: number of noise words drawn for negative sampling.

  • min-count=1: we set the minimum word frequency to 1 to make sure that no word is discarded.

  • Context window=5: fixed window size.
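As an illustration, the settings listed above can be written as a keyword-argument dictionary. The sketch assumes the gensim 4.x `Word2Vec` parameter names; the paper does not say which toolkit was used, so this mapping and the helper `train_cbow` are hypothetical.

```python
# CBOW settings from the list above, expressed with gensim 4.x parameter names
# (assumption: the paper does not name the toolkit that was used).
CBOW_PARAMS = {
    "vector_size": 300,  # Size=300: feature vector dimension
    "sample": 1e-4,      # Sample=1e-4: down-sampling of frequent words
    "negative": 25,      # Negative samples=25: noise words per positive example
    "min_count": 1,      # min-count=1: keep every word
    "window": 5,         # Context window=5
    "sg": 0,             # 0 selects CBOW; 1 would select Skip-gram
}


def train_cbow(tokenized_questions):
    """Train CBOW embeddings on an iterable of token lists (requires gensim)."""
    # Imported lazily so the configuration can be inspected without gensim installed.
    from gensim.models import Word2Vec
    return Word2Vec(sentences=tokenized_questions, **CBOW_PARAMS)
```

Calling `train_cbow` on the preprocessed Arabic questions would return a model whose `wv` attribute holds the learned word vectors.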

3.4 Attentive Siamese LSTM

3.5 Siamese LSTM

The overall aim of Siamese LSTM is to compare a pair of sentences to decide whether or not they are semantically equivalent. Siamese LSTM uses the Siamese network [9] architecture which is known to have identical sub-networks LSTMleft and LSTMright that are passed vector representations of two sentences and return a hidden state encoding semantic meaning of the sentences. These hidden states are then compared using a similarity metric to return a similarity score.

In our work, Siamese LSTM was adapted to the context of question retrieval, that is to say, the sentence pairs become pairs of questions. The LSTM learns a mapping from the space of variable-length sequences \(d_{in}\) and encodes the input sequences into a fixed-dimension hidden state representation \(d_{rep}\). More concretely, each question is represented as a word vector sequence and fed into the LSTM, which updates its hidden state at each sequence index. The final state of the LSTM for each question is a vector of d dimensions, which holds the inherent context of the question. Unlike vanilla RNN language models, which predict next words, the given network compares pairs of sequences. A major feature of the Siamese architecture is the shared weights across the sub-networks, which reduce not only the number of parameters but also the tendency to overfit.

To measure the similarity between the two question vectors, we tested several similarity measures and finally adopted the Manhattan one, which gave the best results, as shown in the next section.

The Manhattan similarity between the last hidden states of a sequence pair \(h^{(left)}\) and \(h^{(right)}\) is computed as follows:

$$\begin{aligned} y=exp(-\parallel h^{(left)}-h^{(right)}\parallel _{1}) \end{aligned}$$
(1)
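Eq. 1 and the ranking step it supports can be sketched as follows. The helper names `manhattan_similarity` and `rank_by_similarity` are hypothetical, and the vectors stand in for the LSTM representations of the questions.

```python
import math


def manhattan_similarity(h_left, h_right):
    """Eq. 1: exp of the negative L1 distance between two vectors, in (0, 1]."""
    if len(h_left) != len(h_right):
        raise ValueError("vectors must have the same dimension")
    l1 = sum(abs(a - b) for a, b in zip(h_left, h_right))
    return math.exp(-l1)


def rank_by_similarity(query_vec, archive):
    """Rank archived questions (dict: id -> vector) by similarity to the query.

    Returns a list of (question_id, score) pairs, most similar first.
    """
    scored = [(qid, manhattan_similarity(query_vec, vec)) for qid, vec in archive.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Identical vectors give a score of 1, and the score decays towards 0 as the L1 distance grows, so the score is directly comparable with the 0/1 similarity labels used for training.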

For Siamese LSTM training, we employed the publicly available Quora Question Pairs dataset. The given collection encompasses 400,000 samples of question pairs, where each sample has a pair of questions along with the ground truth about their similarity (1: similar, 0: dissimilar). During LSTM training, we applied the Adadelta method for weight optimization to automatically decrease the learning rate. Gradient clipping was also used with a threshold value of 1.25 to avoid the exploding gradient problem [11]. The size of the LSTM layers was set to 50 and that of the embedding layer to 300. We employed backpropagation with mini-batches of size 64 to reduce the cross-entropy loss, and we resorted to the Mean Squared Error (MSE), a common regression loss function, for prediction. We trained the model for several epochs to observe how the results varied with the epochs. We found that the accuracy changed with the number of epochs but stabilized after epoch 25. The given parameters were set based on several empirical tests; each parameter was tuned separately on a development set to pick the best value. Note that we used the same LSTM configuration for both languages.

3.6 Attention Mechanism

Attention mechanisms with neural networks have recently achieved tremendous success in several NLP tasks [4, 12]. We assume that every word in a question contributes to the meaning of the whole question, but that the words are not equally informative. Thus, we should assign a probability to every word to determine how influential it is to the entire question.

The general architecture of the Siamese LSTM model augmented with an attention layer is illustrated in Fig. 1, where the different constituent layers are shown from the input (question words) to the output (similarity score). The Siamese LSTM model employs only the last hidden states of a sequence pair, e.g., \(h_{5}^{(a)}\) and \(h_{4}^{(b)}\), which may ignore some information. To remedy this problem, in the attention layer, we used all hidden states \(H=\{h_{1},h_{2},...,h_{L}\}\), where \(h_{i}\) is the hidden state of the LSTM at time step i summarizing all the information of the question up to \(x_{i}\) and L denotes the length of the question. Note that \(\alpha ^{(a)}\) and \(\alpha ^{(b)}\) denote the attention weights associated with \(LSTM_{a}\) and \(LSTM_{b}\), respectively. Basically, the attention mechanism measures the importance of a word through a context vector. It computes a weight \(\alpha _{i}\) for each word annotation \(h_{i}\) according to its importance. The final question representation r is the weighted sum of all the word annotations using the attention weights, computed by Eq. 4.

Fig. 1. An illustration of the attentive Siamese LSTM model

In the attention layer, a context vector \(u_{h}\) is introduced, which is randomly initialized and can be viewed as a fixed query that helps identify the informative words.

$$\begin{aligned} e_{i}=\tanh (W_{h}h_{i}+b_{h}), e_{i}\in \left[ -1,1\right] \end{aligned}$$
(2)
$$\begin{aligned} \alpha _{i}=\frac{\exp (e_{i}^{\top }u_{h})}{\sum _{t=1}^{T}\exp (e_{t}^{\top }u_{h})},\sum _{i=1}^{T}\alpha _{i}=1 \end{aligned}$$
(3)
$$\begin{aligned} r=\sum _{i=1}^{T}\alpha _{i}h_{i}, r\in R^{2L} \end{aligned}$$
(4)

where \(W_{h}\), \(b_{h}\), and \(u_{h}\) are the learnable parameters, \(W_{h}\) is a weight matrix and \(b_{h}\) is a bias vector used to project each context vector into a common dimensional space and L is the size of each LSTM.
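The attention layer of Eqs. 2 to 4 can be sketched in plain Python as follows. The function name `attention_pool` and the list-based representation are illustrative assumptions; in the actual model \(W_{h}\), \(b_{h}\) and \(u_{h}\) are learned jointly with the LSTM rather than supplied by hand.

```python
import math


def attention_pool(H, W, b, u):
    """Attention pooling over hidden states.

    H: list of T hidden states, each a list of d floats.
    W: k x d weight matrix (list of rows); b and u: lists of k floats.
    Returns (alpha, r): the T attention weights and the pooled d-dimensional vector.
    """
    scores = []
    for h in H:
        # Eq. 2: e_i = tanh(W h_i + b), computed row by row.
        e = [math.tanh(sum(w * x for w, x in zip(row, h)) + b_i)
             for row, b_i in zip(W, b)]
        # Unnormalized score: dot product of e_i with the context vector u.
        scores.append(sum(e_j * u_j for e_j, u_j in zip(e, u)))

    # Eq. 3: softmax over time steps (max subtracted for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [x / total for x in exps]

    # Eq. 4: weighted sum of the hidden states.
    d = len(H[0])
    r = [sum(a * h[j] for a, h in zip(alpha, H)) for j in range(d)]
    return alpha, r
```

With a zero context vector all scores are equal, so the weights are uniform and r reduces to the mean of the hidden states; a trained context vector shifts the weights towards the more informative words.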

4 Experimental Evaluation

4.1 Evaluation Metrics

For the automatic evaluation, we used the following metrics: Mean Average Precision (MAP), Precision@n (P@n) and Recall, as they are the most widely used for assessing performance on the QR task. MAP assumes that the user is interested in finding many relevant questions for each query and therefore rewards methods that not only return relevant questions early but also rank the results well. Precision@n reflects the ability of the system not to return an irrelevant question as a relevant one. It returns the proportion of the top-n retrieved questions that are equivalent to the query. Recall measures how good the model is at finding all the positive samples of the dataset. It returns the proportion of relevant similar questions that have been retrieved over the total number of relevant questions. We also used accuracy, which returns the proportion of questions correctly classified as relevant or irrelevant.
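For concreteness, the ranking metrics described above can be computed as in the following sketch. It assumes that each ranked list contains no duplicate questions, that P@n divides by n even when fewer than n questions are returned, and that average precision is normalized by the total number of relevant questions; the paper does not state which conventions it follows.

```python
def precision_at_n(ranked, relevant, n):
    """Proportion of the top-n retrieved questions that are relevant."""
    return sum(1 for q in ranked[:n] if q in relevant) / n


def recall(ranked, relevant):
    """Proportion of all relevant questions that were retrieved."""
    if not relevant:
        return 0.0
    return sum(1 for q in ranked if q in relevant) / len(relevant)


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks where a relevant question appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, q in enumerate(ranked, start=1):
        if q in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """MAP over a list of (ranked, relevant) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For a query whose ranked list is q3, q7, q1, q9 and whose relevant set is {q3, q1, q5}, P@2 is 1/2, recall is 2/3, and average precision is (1/1 + 2/3) / 3.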

4.2 Results and Discussion

We compare ASLSTM against our previous approach called WEKOS as well as the competitive state-of-the-art question retrieval methods tested in [19] on the same datasets. The methods being compared are briefly described below:

  • WEKOS [10]: A word embedding based method which uses the cosine distance to measure the similarity between the weighted continuous valued vectors of the clustered questions.

  • TLM [16]: A translation based language model which uses a query likelihood approach for the question and the answer parts, and integrates word-to-word translation probabilities learned through various information sources.

  • ETLM [14]: An entity based translation language model, which is an extension of TLM where the word translation was replaced with entity translation to integrate semantic information within the entities.

  • PBTM [20]: A phrase based translation model which uses machine translation probabilities assuming that QR should be performed at the phrase level.

  • WKM [22]: A world knowledge based model which integrates the knowledge of Wikipedia into the questions by deriving the concept relationships that make it possible to identify related topics between the questions.

  • M-NET [21]: A word embedding based model, which integrates the category information of the questions to get a category based word embedding.

  • ParaKCM [19]: A key concept paraphrasing based approach which explores the translations of pivot languages and expands queries with paraphrases.

Table 2 gives a comparison of the performance of ASLSTM against the aforementioned models on the English Yahoo! Answers dataset.

As illustrated in Table 2, ASLSTM outperforms all the compared methods in English on all criteria by successfully returning a significant number of similar questions among the retrieved ones. This good performance indicates that the use of Siamese LSTM along with the attention mechanism is effective in the QR task. Word embeddings provide an efficient input representation for the LSTM, capturing syntactic and semantic information at the word level.

Table 2. Question retrieval performance comparison of different models in English.

Interestingly, our approach does not require extensive feature generation owing to the use of a pre-trained model. The results show that ASLSTM performs better than translation and knowledge based methods, which provides evidence that the question representations produced by the Siamese LSTM sub-networks can capture the semantic relatedness between pairs of questions and are therefore more adequate for representing questions in the question similarity task. The Siamese network was trained using backpropagation-through-time under the MSE loss function, which compels the LSTM sub-networks to detect textual semantic differences during training. A key virtue of LSTM is that it can accept variable-length sequences and map them into fixed-length vector representations, which helps overcome the problems of varying question length and structure in cQA.

Another significant finding is the effectiveness of the attention mechanism, which was able to improve the performance of the approach. We assume that the attention mechanism managed to boost the similarity learning process by assigning a weight to each element of the question. The weights then indicate which elements in the sequence the neural network should attend to more.

WEKOS averages the weighted embeddings, which is one of the simplest and most widely used techniques to derive a sequence embedding, but it loses the word order, whereas in ASLSTM the LSTMs update their state following the order of the words to capture the main contextual meaning of the text sequence. The goal of the Siamese architecture is to learn a function that maps a question to an appropriate fixed-length vector which is suitable for similarity measurement. Interestingly, it offers a vector representation for a very short text fragment that should grasp most of the semantic information in that fragment.

In order to properly assess the Siamese LSTM model performance on the similarity prediction problem, we plot training data vs validation data accuracy using the Matplotlib library.

Fig. 2. Epochs vs accuracy of Siamese LSTM on the English and Arabic datasets

From the plots of accuracy given in Figs. 2a and 2b, we observe that we get about 82% and 81% accuracy on the validation data for English and Arabic, respectively. The model has comparable and consistent accuracy on both the training and validation sets. Both training and validation accuracy continue to increase without a sudden decrease of the validation accuracy, indicating a good fit. Therefore, we can conclude that, while the performance on the training set is slightly better than that on the validation set in terms of accuracy, the model converged to a stable value without any typical signs of overfitting.

It is worth mentioning that the accuracy used in the epochs-accuracy plots is the binary accuracy calculated by Keras, which uses a threshold of 0.5: a predicted score above 0.5 is treated as the positive class (similar), and a prediction is counted as correct when the resulting class matches the ground-truth label.
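The thresholding behind this binary accuracy can be illustrated with a short, library-free sketch. The function below mirrors the described behaviour (scores strictly above the threshold count as positive) but is not the Keras implementation itself.

```python
def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions whose thresholded class matches the 0/1 label."""
    correct = 0
    for label, score in zip(y_true, y_pred):
        predicted = 1 if score > threshold else 0  # strictly above the threshold
        if predicted == label:
            correct += 1
    return correct / len(y_true)
```

For labels 1, 0, 1, 0 and predicted scores 0.9, 0.2, 0.4, 0.5, the thresholded classes are 1, 0, 0, 0, so three of the four predictions are counted as correct.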

Our results are fairly stable across different similarity functions, namely the cosine and Euclidean distances. We found that the Manhattan distance outperformed them on both the English and Arabic datasets, as depicted in Tables 3a and 3b, which suggests that it is the most relevant measure for high-dimensional text data.

Table 3. Comparison between similarity measures

Furthermore, we observed that ASLSTM could find the context mapping between certain expressions mostly used in the same context, such as "bug" and "error message" or "need help" and "suggestions". ASLSTM was also able to retrieve similar questions containing certain common misspelled terms like "recieve" instead of "receive", but it failed to capture other less common spelling mistakes like "relyable" or "realible" instead of "reliable". Such cases show that our approach can address some lexical disagreement problems. Moreover, there are a few cases where ASLSTM fails to detect semantic equivalence, including queries that have only one similar question whose words mostly do not appear in a context similar to that of the query words.

Table 4. Question retrieval performance of ASLSTM in Arabic

Table 4 shows that ASLSTM outperforms the best compared system in Arabic, which suggests that it can also perform well with morphologically complex languages.

Nevertheless, a major limitation of the proposed approach is that it ignores the morphological structure of Arabic words. Harnessing the internal structure of words might help to capture semantically similar words. Therefore, endowing word embeddings with grammatical information such as person, gender, number and tense could help to obtain more meaningful embeddings that detect morphological and semantic similarity. In terms of recall, ASLSTM reaches 0.4136 for Arabic, which implies that the number of omitted similar questions is not too large. Interestingly, unlike traditional RNNs, Siamese LSTM is able to effectively handle long questions and learn long-range dependencies thanks to its use of memory cell units that can store information across long input sequences.

4.3 Conclusion

In this paper, we presented an Attention-based Siamese LSTM approach, aiming at solving the question retrieval problem, which is of great importance in real-world cQA. For this purpose, we suggested using Siamese LSTM to capture the semantic similarity between the community questions. An attention mechanism was integrated to let the model give different attention to different words while modeling questions. Interestingly, we showed that Siamese LSTM is capable of modeling complex structures and covering the context information of question pairs. Experiments on large scale Yahoo! Answers datasets showed that the proposed approach can successfully improve the question retrieval task in English and Arabic and outperform some competitive methods evaluated on the same dataset. In the future, we plan to integrate morphological features into the embedding layer to improve the question representations.