
1 Introduction

In recent years, visual question answering (VQA) has been widely studied by researchers in both the computer vision and natural language processing communities [2, 8, 11, 27, 31, 34]. Most existing works perform VQA by utilizing attention mechanisms and combining features from the two modalities to predict answers.

Fig. 1. VQA-E provides insightful information that can explain, elaborate or enhance predicted answers compared with the traditional VQA task. Q = Question, A = Answer, E = Explanation. (Left) From the answer, there is no way to trace the corresponding visual content to tell the name of the hotel. The explanation clearly points out where to look for the answer. (Middle) The explanation provides a real answer to the aspect asked. (Right) The word “anything” in the question refers to a vague concept without specific indication. The answer is enhanced by the “madonna shirt” in the explanation.

Although promising performance has been reported, a huge gap remains for humans to truly understand model decisions when no explanation accompanies them. A popular way to explain predicted answers is to visualize attention maps that indicate ‘where to look’, tracing the predicted answer back to the attended image regions. However, such visual justification through attention visualization is implicit: it cannot fully reveal what the model captures from the attended regions to answer the question, and there are many cases where the model attends to the right regions but still predicts a wrong answer. Worse, visual justification is not accessible to visually impaired people, who are potential users of VQA techniques. Therefore, in this paper we explore textual explanations to compensate for these weaknesses of visual attention in VQA.

Another crucial advantage of textual explanations is that they elaborate on and enhance the predicted answer with more relevant information. As shown in Fig. 1, a textual explanation can be a clue that justifies the answer, a complementary delineation that elaborates on the context of the question and answer, or a detailed specification of abstract concepts mentioned in the QA that enhances the short answer. Such textual explanations are important for effective communication since they provide feedback that enables questioners to extend the conversation. Unfortunately, although textual explanations are desirable for both model interpretation and effective communication in natural contexts, little progress has been made in this direction, partly because almost all public datasets, such as VQA [2, 8], COCO-QA [22], and Visual7W [34], do not provide explanations for the annotated answers.

In this work, we aim to address the above limitations of existing VQA systems by introducing a new task called VQA-E (VQA with Explanations). In VQA-E, the models are required to provide a textual explanation for the predicted answer. We conduct our research in two steps. First, to foster research in this area, we construct a new dataset with textual explanations for the answers. The VQA-E dataset is automatically derived from the popular VQA v2 dataset [8] by synthesizing an explanation for each image-question-answer triple. The VQA v2 dataset is one of the largest VQA datasets, with over 650k question-answer pairs, and more importantly, each image in the dataset is coupled with five descriptions from MSCOCO captions [4]. Although these captions were written without considering the questions, they do include some QA-related information, so exploiting them is a good starting point for obtaining explanations at no cost. We further explore several simple but effective techniques to synthesize an explanation from the caption and the associated question-answer pair. To address concerns about the quality of the synthesized explanations, we conduct a comprehensive user study on a randomly selected subset of the explanations. The user study shows that the explanation quality is good for most question-answer pairs, while being somewhat inadequate for questions that ask for a subjective response or require common sense (pragmatic knowledge). Overall, we believe the newly created dataset is good enough to serve as a benchmark for the proposed VQA-E task.

To show the advantages of learning with textual explanations, we also propose a novel VQA-E model, which addresses both answer prediction and explanation generation in a multi-task learning architecture. Our dataset enables us to train and evaluate the VQA-E model, which goes beyond a short answer by producing a textual explanation to justify and elaborate on it. Through extensive experiments, we find that the additional supervision from explanations helps the model better localize the important image regions and leads to an improvement in the accuracy of answer prediction. Our VQA-E model outperforms the state-of-the-art methods on the VQA v2 dataset.

2 Related Work

Attention in Visual Question Answering. The attention mechanism was first used in machine translation [3] and later brought into vision-to-language tasks [9, 10, 15, 18, 19, 28, 29, 30, 31, 32, 33]. Visual attention in vision-to-language tasks addresses the problem of “where to look” [25]. In VQA, the question is used as a query to search for relevant regions in the image. [31] proposes a stacked attention model which queries the image multiple times to infer the answer progressively. Beyond visual attention, Lu et al. [18] exploit a hierarchical question-image co-attention strategy to attend to both related regions in the image and crucial words in the question. [19] proposes a dual attention network, which refines the visual and textual attention via multiple reasoning steps. The attention mechanism can find question-related regions in the image, which account for the answer to some extent. [6] studies how well visual attention is aligned with human gaze; the results show that when answering a question, current attention-based models do not seem to be “looking” at the same regions of the image as humans do. Although attention is a good visual explanation for the answer, it is not accessible to visually impaired people and is somewhat limited in real-world applications.

Model with Explanations. Recently, a number of works [14, 17, 20] have sought to explain the decisions of deep learning models, which are typically black boxes due to end-to-end training. [14] proposes a novel explanation model for bird classification. However, its class relevance metrics are not applicable to VQA since there is no pre-defined semantic category for questions and answers. We therefore build a reference dataset to directly train and evaluate models for VQA with explanations. The most similar work to ours is Multimodal Explanations [20], which proposes a human-annotated, high-quality multimodal explanation dataset for VQA. In contrast, our dataset focuses on textual explanations, is built at no cost, and is over six times larger (269,786 vs. 41,817).

3 VQA-E Dataset

We now introduce our VQA-E dataset. We begin by describing the process of synthesizing explanations from image descriptions for question-answer pairs, followed by dataset analysis and a user study to assess the quality of our dataset.

Fig. 2. An example of the pipeline to fuse the question (Q), the answer (A) and the relevant caption (C) into an explanation (E). Each question-answer pair is converted into a statement (S). The statement and the most relevant caption are both parsed into constituency trees. These two trees are then aligned by the common node. The subtree including the common node in the statement is merged into the caption tree to obtain the explanation.

3.1 Explanation Synthesis

Approach. The first step is to find the caption most relevant to the question and answer. Given an image caption \(\mathcal {C}\), a question \(\mathcal {Q}\) and an answer \(\mathcal {A}\), we tokenize and encode them into GloVe word embeddings [21]: \(W_c = \{\varvec{w}_1, ..., \varvec{w}_{T_c}\}, W_q = \{\varvec{w}_1, ..., \varvec{w}_{T_q}\}, W_a = \{\varvec{w}_1, ..., \varvec{w}_{T_a}\}\), where \(T_c, T_q, T_a\) are the number of words in the caption, question, and answer, respectively. We compute the similarity between the caption and question-answer pair as follows:

$$\begin{aligned}&s(\varvec{w}_i, \varvec{w}_j) = \frac{1}{2} (1+\frac{\varvec{w}_i^T \varvec{w}_j}{||\varvec{w}_i||\cdot ||\varvec{w}_j||}) \end{aligned}$$
(1a)
$$\begin{aligned}&S(\mathcal {Q},\mathcal {C}) = \frac{1}{T_q} \sum _{\varvec{w}_i\in W_q}\max _{\varvec{w}_j\in W_c} s(\varvec{w}_i, \varvec{w}_j) \end{aligned}$$
(1b)
$$\begin{aligned}&S(\mathcal {A},\mathcal {C}) = \frac{1}{T_a} \sum _{\varvec{w}_i\in W_a}\max _{\varvec{w}_j\in W_c} s(\varvec{w}_i, \varvec{w}_j) \end{aligned}$$
(1c)
$$\begin{aligned}&S(<\mathcal {Q}, \mathcal {A}>,\mathcal {C}) = \frac{1}{2}(S(\mathcal {Q},\mathcal {C}) + S(\mathcal {A},\mathcal {C})) \end{aligned}$$
(1d)

For each question-answer pair, we find the most relevant caption, coupled with a similarity score. We have also tried more complex techniques, such as using term frequency-inverse document frequency (TF-IDF) to adjust the weights of different words, but we find that the simple mean-max formula in Eq. (1) works better.
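To make the procedure concrete, below is a minimal Python sketch of the mean-max similarity in Eq. (1). It assumes a dictionary `glove` mapping tokens to NumPy vectors (e.g., loaded from a GloVe text file); the whitespace tokenizer and out-of-vocabulary handling are simplifications for illustration, not the exact implementation used to build the dataset.

```python
import numpy as np

def embed(sentence, glove):
    # Look up GloVe vectors for the tokens of a sentence (toy whitespace tokenizer).
    return [glove[t] for t in sentence.lower().rstrip("?.").split() if t in glove]

def s(wi, wj):
    # Eq. (1a): cosine similarity rescaled into [0, 1].
    return 0.5 * (1.0 + np.dot(wi, wj) / (np.linalg.norm(wi) * np.linalg.norm(wj)))

def mean_max(words_a, words_b):
    # Eq. (1b)/(1c): for each word in A, take its best match in B, then average.
    return float(np.mean([max(s(wi, wj) for wj in words_b) for wi in words_a]))

def qa_caption_similarity(question, answer, caption, glove):
    wq, wa, wc = embed(question, glove), embed(answer, glove), embed(caption, glove)
    # Eq. (1d): average the question-caption and answer-caption similarities.
    return 0.5 * (mean_max(wq, wc) + mean_max(wa, wc))

# For each QA pair, keep the best-scoring caption; captions below the 0.6
# threshold of Sect. 3.1 are later discarded:
# best = max(captions, key=lambda c: qa_caption_similarity(q, a, c, glove))
```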

Fig. 3. Top: similarity score distribution. Bottom: illustration of VQA-E examples at different similarity levels.

To generate a good explanation, we fuse information from both the question-answer pair and the most relevant caption. First, the question and answer are merged into a declarative statement using simple merging rules based on the question and answer types. Similar rule-based methods have been explored in NLP to generate questions from declarative statements [13] (i.e., the opposite direction). We then fuse this QA statement with the caption by aligning and merging their constituency parse trees. Finally, we refine the combined sentence with a grammar check and correction tool to obtain the final explanation, and compute its similarity to the question-answer pair with Eq. 1. An example of our pipeline is shown in Fig. 2.
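As an illustration of the first step only, the toy sketch below converts a question-answer pair into a declarative statement with two assumed rules; the actual rule set covers many more question and answer types, and the constituency-tree alignment and grammar-correction steps (which require a parser and a grammar tool) are not shown.

```python
def qa_to_statement(question, answer):
    # Toy rule-based conversion of a QA pair into a declarative statement.
    q = question.rstrip("?").strip().lower()
    if q.startswith("what color is "):
        # "What color is the bus?" + "red" -> "the bus is red"
        return f"{q[len('what color is '):]} is {answer}"
    if q.startswith("what room is "):
        # "What room is this?" + "kitchen" -> "this is a kitchen"
        return f"{q[len('what room is '):]} is a {answer}"
    # ... further type-specific rules; naive fallback keeps Q and A side by side
    return f"{q}, {answer}"

print(qa_to_statement("What room is this?", "kitchen"))  # -> "this is a kitchen"
```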

Similarity Distribution. Due to the large size and diversity of the questions and the limited number of captions per image, a good explanation cannot be generated for every QA pair. Explanations with low similarity scores are removed from the dataset to reduce noise. We present some examples in Fig. 3, which show a gradual improvement in explanation quality as the similarity score increases. Based on empirical investigation, we select a similarity threshold of 0.6 to filter out noisy explanations. We also plot the similarity score histogram in Fig. 3; interestingly, we observe a clear trough at 0.6, so the explanations are well separated by this threshold.

Fig. 4. Distribution of synthesized explanations by different question types.

Table 1. Statistics for our VQA-E dataset.

3.2 Dataset Analysis

In this section, we analyze our VQA-E dataset, particularly the automatically synthesized explanations. Out of the 658,111 question-answer pairs in the original VQA v2 dataset, our approach generates relevant explanations with high similarity scores for 269,786 QA pairs (41%). More statistics about the dataset are given in Table 1.

We plot the distribution of the number of synthesized explanations for each question type in Fig. 4. The percentage of relevant explanations varies considerably from type to type.

Abstract Questions vs. Specific Questions. The percentage of relevant explanations is generally higher for ‘is/are’ and ‘what’ questions than for ‘how’, ‘why’ and ‘do’ questions, because ‘is/are’ and ‘what’ questions tend to concern specific visual content that is more likely to be described by image captions. A more specific question type further helps explanation generation: for ‘what sport is’ and ‘what room is’ questions, our approach successfully generates explanations for 90% and 87% of question-answer pairs, respectively, much higher than for general ‘what’ questions (40%).

Fig. 5. Subjective examples: our method cannot handle the questions involving emotional feeling (left), commonsense knowledge (middle) or behavioral reasoning (right).

Subjective Questions: Do You/Can You/Do/Could? The existing VQA datasets contain questions that require subjective feeling, logical thinking or behavioral reasoning. These questions often fall into the question types starting with ‘do you’, ‘can you’, ‘do’, ‘could’, etc. For such questions, there may be underlying clues in the image content, but the evidence is usually opaque and indirect, so it is hard to synthesize a good explanation. We illustrate examples of such questions in Fig. 5; the generated explanations are generally inadequate to provide relevant details regarding the questions and answers.

Due to this inadequacy, we only achieve small percentages of good explanations for these question types: 4% for ‘do you’, 5% for ‘can you’, 13% for ‘do’ and 6% for ‘could’ questions, far below the 41% average.

3.3 Dataset Assessment – User Study

It is not easy to evaluate with quantitative metrics whether the synthesized explanations provide valid, relevant and complementary information for the answers to the visual questions. Therefore, we conduct a user study to assess our VQA-E dataset from a human perspective. In particular, we measure explanation quality along four aspects: fluent, correct, relevant, and complementary.

Fluent measures the fluency of the explanation: a fluent explanation should be grammatically correct and idiomatic in wording. Correct indicates whether the explanation is correct according to the image content. Relevant assesses the relevance of an explanation to the question-answer pair; if an explanation is relevant, users should be able to infer the answer from it. This metric is important for measuring whether the proposed word-embedding similarity can effectively select and filter explanations; through the user study, we verify from human judgment whether the synthesized explanations are closely tied to their corresponding QA pairs. Last but not least, complementary evaluates whether an explanation adds complementary details to the abbreviated answer, so that the visual accordance between the answer and the image is enhanced.

Table 2. User assessment results for the synthesized explanation, the most similar caption, the random caption, and the generated explanation. To avoid bias, they are evaluated jointly, and in each sample their order is shuffled and unknown to the users. They are assessed by human evaluators on a 1–5 scale: 1-very poor, 2-poor, 3-barely acceptable, 4-good, 5-very good. Here we show the average scores over 2,000 questions.

Evaluation Results Summary. We show the human evaluation results in Table 2. Since the synthesized explanations are derived from existing human-annotated captions, their average fluency and correctness scores are both close to 5. More importantly, their relevance and complementariness scores are both above 4, which indicates that the overall quality of the explanations is good from a human perspective. These two metrics differentiate a general image caption from our specific explanation dedicated to a visual question-answer pair.

4 Multi-task VQA-E Model

Based on the well-constructed VQA-E dataset, in this section, we introduce the proposed multi-task VQA-E model. Figure 6 gives an overview of our model. Given an image \(\mathcal {I}\) and a question \(\mathcal {Q}\), our model can simultaneously predict an answer \(\mathcal {A}\) and generate a textual explanation \(\mathcal {E}\).

Fig. 6. An overview of the multi-task VQA-E network. Firstly, an image is represented by a pre-trained CNN, while the question is encoded via a single-layer GRU. Then the image features and question features are input to the Attention module to obtain image features for question-guided regions. Finally, the question features and attended image features are used to simultaneously predict an answer and generate an explanation.

4.1 Image Features

We adopt a pre-trained convolutional neural network (CNN) to extract a high-level representation \(\phi \) of the input image \(\mathcal {I}\):

$$\begin{aligned} \phi = \text {CNN}(\mathcal {I}) = \{\varvec{v}_1, ..., \varvec{v}_P\} \end{aligned}$$
(2)

where \(\varvec{v}_i\) is the feature vector of the \(i^{th}\) image patch and P is the total number of patches. We experiment with three types of image features (a minimal extraction sketch follows the list):

  • Global. We extract the outputs of the final pooling layer (‘pool5’) of the ResNet-152 [12] as global features of the image. For these image features, \(P=1\), and visual attention is not applicable.

  • Grid. We extract the outputs of the final convolutional layer (‘res5c’) of ResNet-152 as the feature map of the image, which corresponds to a uniform grid of equally-sized image patches. In this case, \(P=7\times 7=49\).

  • Bottom-up. [1] proposes a new type of image features based on object detection techniques. They utilize Faster R-CNN to propose salient regions, each with an associated feature vector from the ResNet-101. The bottom-up image features provide a more natural basis at the object level for attention to be considered. We choose \(P=36\) in this case.
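The sketch below shows how the Global and Grid variants could be extracted with torchvision's pre-trained ResNet-152; this is an assumed reproduction for illustration, and the Bottom-up variant, which requires the Faster R-CNN detector of [1], is not reproduced here.

```python
import torch
import torchvision.models as models

resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1).eval()
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # keep layers up to 'res5c'

image = torch.randn(1, 3, 224, 224)            # a preprocessed image tensor (dummy here)
with torch.no_grad():
    fmap = backbone(image)                     # (1, 2048, 7, 7) convolutional feature map
grid_feats = fmap.flatten(2).transpose(1, 2)   # Grid: (1, P=49, 2048)
global_feat = fmap.mean(dim=(2, 3))            # Global ('pool5'): (1, 2048), P=1
```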

4.2 Question Embedding

The question \(\mathcal {Q}\) is tokenized and encoded into word embeddings \(W_q = \{\varvec{w}_1, ..., \varvec{w}_{T_q}\}\). Then the word embeddings are fed into a gated recurrent unit [5]: \( \varvec{q} = \text {GRU}(W_q). \) We use the final state of the GRU as the representation of the question.
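A minimal sketch of this question encoder, assuming the 300-d embeddings and 1024-d GRU of Sect. 5.1 and ignoring padding and masking:

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # initialized from GloVe in practice
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, question_tokens):      # (B, T_q) integer token indices
        w = self.embed(question_tokens)      # (B, T_q, 300) word embeddings W_q
        _, h = self.gru(w)                   # h: (1, B, 1024) final GRU state
        return h.squeeze(0)                  # q: (B, 1024)
```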

4.3 Visual Attention

We use a classical question-guided soft attention mechanism similar to most modern VQA models. For each image patch, the feature vector \(\varvec{v}_i\) and the question embedding \(\varvec{q}\) are first projected by non-linear layers to the same dimension. We then combine the projected representations with a Hadamard product (i.e., element-wise multiplication) and feed the result to a linear layer to obtain a scalar attention weight for that patch. The attention weights \(\pmb {\tau }\) are normalized over all patches with a softmax function. Finally, the image features from all patches are weighted by the normalized attention weights and summed into a single vector \(\varvec{v}\) representing the attended image. The formulas are as follows (bias terms are omitted for simplicity):

$$\begin{aligned} \begin{aligned}&{\tau }_i = \varvec{w}^T~(\text {Relu}(W_v\varvec{v}_i) \odot \text {Relu}(W_q\varvec{q})) \\&\pmb {\alpha } = \text {softmax}(\pmb {\tau }) \\&\varvec{v} = \sum _{i=1}^P \alpha _i \varvec{v}_i \end{aligned} \end{aligned}$$
(3)

Note that we adopt a simple one-glimpse, one-way attention, as opposed to the more complex schemes proposed in recent works [16, 18, 31].

Next, the representations of the question \(\varvec{q}\) and the image \(\varvec{v}\) are projected to the same dimension by non-linear layers and then fused by a Hadamard product:

$$\begin{aligned} \varvec{h} = \text {Relu}(W_{qh}\varvec{q}) \odot \text {Relu}(W_{vh} \varvec{v}) \end{aligned}$$
(4)

where \(\varvec{h}\) is a joint representation of the question and the image; it is then fed to the subsequent modules for answer prediction and explanation generation.
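Under assumed dimensions (2048-d image features, 1024-d question and joint features), Eqs. (3) and (4) can be sketched as a single module; biases are included in the linear layers although they are omitted from the equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    def __init__(self, v_dim=2048, q_dim=1024, hid=1024):
        super().__init__()
        self.v_proj, self.q_proj = nn.Linear(v_dim, hid), nn.Linear(q_dim, hid)
        self.w = nn.Linear(hid, 1)                       # scalar attention weight per patch
        self.v_fuse, self.q_fuse = nn.Linear(v_dim, hid), nn.Linear(q_dim, hid)

    def forward(self, v, q):                             # v: (B, P, 2048), q: (B, 1024)
        joint = F.relu(self.v_proj(v)) * F.relu(self.q_proj(q)).unsqueeze(1)
        alpha = F.softmax(self.w(joint), dim=1)          # Eq. (3): weights over the P patches
        v_att = (alpha * v).sum(dim=1)                   # attended image feature (B, 2048)
        h = F.relu(self.q_fuse(q)) * F.relu(self.v_fuse(v_att))   # Eq. (4): Hadamard fusion
        return h, alpha.squeeze(-1)
```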

4.4 Answer Prediction

We formulate answer prediction as a multi-label regression problem, instead of the single-label classification problem used in many other works. The set of candidate answers is pre-determined from all correct answers in the training set that appear more than 8 times, giving \(N=3129\) candidate answers. Each question in the dataset has \(K=10\) human-annotated answers, which are not always the same, especially when the question is ambiguous or subjective and has multiple correct or synonymous answers. To fully exploit the disagreement between annotators, we adopt soft accuracies as the regression targets. The accuracy for each answer is computed as:

$$\begin{aligned} \begin{aligned} \text {Accuracy}(a)&= \frac{1}{K} \sum _{k=1}^K \min (\frac{\sum _{1\le j \le K,j \ne k}\mathbbm {1}(a=a_j)}{3}, 1)\\ \end{aligned} \end{aligned}$$
(5)

Such soft targets provide more information for training and are also in line with the evaluation metric.
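Eq. (5) can be computed directly from the K annotated answers; below is a small sketch with an illustrative example.

```python
def soft_target(human_answers, candidate):
    # Eq. (5): leave-one-annotator-out accuracy, averaged over the K annotations.
    K = len(human_answers)                   # K = 10 in the VQA v2 dataset
    total = 0.0
    for k in range(K):
        others = human_answers[:k] + human_answers[k + 1:]
        total += min(others.count(candidate) / 3.0, 1.0)
    return total / K

# If 3 of 10 annotators answered "orange", the soft target is 0.9, not a hard 0/1.
print(soft_target(["red"] * 7 + ["orange"] * 3, "orange"))  # -> 0.9 (up to floating point)
```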

The joint representation \(\varvec{h}\) is input into a non-linear layer and then through a linear mapping to predict a score for each answer candidate:

$$\begin{aligned} \hat{s} = \text {sigmoid}~(W_o~\text {Relu}~(W_f~\varvec{h})) \end{aligned}$$
(6)

The sigmoid function squeezes the scores into (0, 1), interpreted as the probability of each answer candidate. Our loss function is similar to the binary cross-entropy loss but uses soft targets:

$$\begin{aligned} L_{\text {vqa}} = -\sum _{i=1}^M\sum _{j=1}^N s_{ij}\log \hat{s}_{ij} + (1-s_{ij})\log (1-\hat{s}_{ij}) \end{aligned}$$
(7)

where M is the number of training samples and \(\mathbf s \) denotes the soft targets computed in Eq. 5. This final step can be seen as a regression layer that predicts the correctness of each answer candidate.
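A sketch of the classifier in Eq. (6) and the soft-target loss in Eq. (7), with assumed layer sizes; the sigmoid is folded into `binary_cross_entropy_with_logits` for numerical stability, which is mathematically equivalent to applying Eqs. (6) and (7) separately.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerHead(nn.Module):
    def __init__(self, hid=1024, num_answers=3129):
        super().__init__()
        self.fc = nn.Linear(hid, hid)             # W_f in Eq. (6)
        self.out = nn.Linear(hid, num_answers)    # W_o in Eq. (6)

    def forward(self, h):                         # h: (B, hid) joint feature from Eq. (4)
        return self.out(F.relu(self.fc(h)))       # logits; sigmoid applied inside the loss

head = AnswerHead()
h = torch.randn(8, 1024)                          # a batch of joint representations
targets = torch.rand(8, 3129)                     # soft accuracies from Eq. (5)
loss_vqa = F.binary_cross_entropy_with_logits(head(h), targets, reduction="sum")  # Eq. (7)
```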

4.5 Explanation Generation

To generate an explanation, we adopt an LSTM-based language model that takes the joint representation \(\varvec{h}\) as input. Given the ground-truth explanation \(\mathcal {E}=\{w_1, w_2, ..., w_{T_e}\}\), the loss function is:

$$\begin{aligned} \begin{aligned} L_{\text {vqe}}&= -\log (p(\mathcal {E}|\varvec{h}))\\&= -\sum _{t=0}^{T_e} \log (p(w_t|\varvec{h},w_1,...,w_{t-1}) ) \end{aligned} \end{aligned}$$
(8)

The final loss of multi-task learning is the sum of the VQA and VQE loss:

$$\begin{aligned} L = L_{\text {vqa}} + L_{\text {vqe}} \end{aligned}$$
(9)
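A sketch of the explanation decoder and the combined loss in Eqs. (8) and (9), assuming teacher forcing on ground-truth explanation tokens that include start/end tokens and the shared 300-d word embeddings of Sect. 5.1:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExplanationDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hid=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # shared with the question encoder
        self.lstm = nn.LSTM(emb_dim + hid, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab_size)

    def forward(self, h, exp_tokens):     # h: (B, hid); exp_tokens: (B, T_e+1) with start token
        w = self.embed(exp_tokens[:, :-1])               # inputs w_0 .. w_{T_e-1}
        ctx = h.unsqueeze(1).expand(-1, w.size(1), -1)   # condition every step on h
        out, _ = self.lstm(torch.cat([w, ctx], dim=-1))
        logits = self.out(out)
        # Eq. (8): negative log-likelihood of the next word at each step.
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               exp_tokens[:, 1:].reshape(-1), reduction="sum")

# Eq. (9): the multi-task loss simply sums the two branch losses.
# loss = loss_vqa + decoder(h, exp_tokens)
```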

5 Experiments and Results

5.1 Experiment Setup

Model Setting. We use 300-dimensional word embeddings, initialized with pre-trained GloVe vectors [21]. For question embedding, we use a single-layer GRU with 1024 hidden units; for explanation generation, a single-layer forward LSTM with 1024 hidden units. The question encoder and the explanation generator share the word embedding matrix to reduce the number of parameters. We use the Adam solver with a fixed learning rate of 0.01 and a batch size of 512. We use weight normalization [24] to accelerate training, and dropout and early stopping (15 epochs) to reduce overfitting.

Model Variants. We experiment with the following model variants:

  • Q-E: generating explanation from question only.

  • I-E: generating explanation from image only.

  • QI-E: generating explanation from question and image and only training the branch of explanation generation.

  • QI-A: predicting answer from question and image and only training the branch of answer prediction.

  • QI-AE: predicting answer and generating explanations, training both branches.

  • QI-AE(relevant): predicting answer and generating explanation, training both branches. The explanation used in this variant is the relevant caption obtained during explanation synthesis in Sect. 3.1.

  • QI-AE(random): predicting answer and generating explanation, training both branches. The explanation is randomly selected from the ground-truth captions of the same image, excluding the relevant caption.

5.2 Evaluation of Explanation Generation

In this section, we evaluate the task of explanation generation. Table 3 shows the performance of all model variants on the validation split of the VQA-E dataset. First, the I-E model outperforms Q-E, implying that it is easier to generate an explanation from the image alone than from the question alone; this image bias is the opposite of the well-known language bias in VQA, where it is easier to predict an answer from the question alone than from the image alone. Second, the QI-E models outperform both I-E and Q-E by a large margin, which shows that both the question and the image are critical for generating good explanations. The attention mechanism improves performance, and bottom-up image features are consistently better than grid image features. Finally, QI-AE with bottom-up image features improves the performance further and achieves the best results across all evaluation metrics. This shows that supervision on the answer side helps the explanation generation task, demonstrating the effectiveness of our multi-task learning scheme.

Table 3. Performance of explanation generation task on the validation split of the proposed VQA-E dataset, where B-N, M, R, and C are short for BLEU-N, METEOR, ROUGE-L, and CIDEr-D. All scores are reported in percentage (%).

5.3 Evaluation of Answer Prediction

In this section, we evaluate the task of answer prediction, as shown in Table 4. Overall, the QI-AE models consistently outperform the QI-A models across all question types, indicating that forcing the model to explain helps it predict a more accurate answer. We argue that the explanation supervision in QI-AE models alleviates the language bias that afflicts QI-A models: to generate a good explanation, the model has to fully exploit the image content, learn to attend to important regions, and explicitly interpret the attended regions in the context of the question. In contrast, when training QI-A models without explanations, if an answer can be guessed from the question itself, the model can easily drive the loss to zero by understanding the question alone, regardless of the image content; in that case, the training sample is not fully exploited to teach the model how to attend to important regions. Another observation from Table 4 further supports this argument: the additional explanation supervision produces a much bigger improvement for the attention-based models (Grid and Bottom-up) than for the models without attention (Global).

Table 4. Performance of the answer prediction task on the validation split of VQA v2 dataset. Accuracies in percentage (%) are reported.
Table 5. Performance comparison with state-of-the-art VQA methods on the test-standard split of the VQA v2 dataset. The marked entry is an ensemble of 30 models and does not participate in the ranking. Accuracies in percentage (%) are reported.

QI-AE(random)-Bottom-up yields much lower accuracy than QI-AE-Bottom-up, even lower than QI-A-Bottom-up. This implies that low-quality or irrelevant explanations can confuse the model and cause a large drop in performance. It also dispels the concern that the improvement comes from learning to describe the image rather than from explaining the answer, further substantiating the effectiveness of the additional explanation supervision.

Table 5 presents the performance of our method and the state-of-the-art approaches on the test-standard split of the VQA v2 dataset. Our method outperforms the state-of-the-art methods on the answer types ‘Yes/No’ and ‘Other’ as well as in overall accuracy, while producing a slightly lower accuracy on the answer type ‘Number’ than BUTD [1, 26].

Fig. 7. Qualitative comparison between the QI-A and QI-AE models (both using bottom-up image features). We visualize the attention by rendering a red box over the region that has the biggest attention weight.

5.4 Qualitative Analysis

In this section, we show qualitative examples that demonstrate the strength of our multi-task VQA-E model (Fig. 7). Overall, the QI-AE model generates relevant and complementary explanations for the predicted answers. For example, in Fig. 7(a), the QI-AE model not only predicts the correct answer ‘Yes’ but also provides more details about the ‘kitchen’, i.e., the ‘fridge’, ‘sink’, and ‘cabinets’. Moreover, the QI-AE model localizes the important regions better than the QI-A model. As shown in Fig. 7(b), the QI-AE model gives the largest attention weight to the person’s hand and thus predicts the right answer ‘Feeding giraffe’, while the QI-A model focuses more on the giraffe, leading to the wrong answer ‘Standing’. In Fig. 7(c), both the QI-AE and QI-A models attend to the right region, yet predict opposite answers. This contrast implies that the QI-AE model, which has to fully exploit the image content to generate an explanation, understands the attended region better than the QI-A model, which only needs to predict a short answer.

6 Conclusions and Future Work

In this work, we have constructed a new dataset and proposed the VQA-E task to promote research on justifying answers to visual questions. The explanations in our dataset are of high quality for visually specific questions, but inadequate for subjective ones whose evidence is indirect; for such subjective questions, extra knowledge bases will be needed to find good explanations.

We have also proposed a novel multi-task learning architecture for the VQA-E task. The additional supervision from explanations not only enables our model to generate reasons that justify its predicted answers, but also brings a large improvement in the accuracy of answer prediction. Our VQA-E model localizes and understands the important regions in images better than the original VQA model. In the future, we will adopt more advanced training approaches, such as the reinforcement learning used in image captioning [23].