
1 Introduction

The image caption task aims at automatically generating a descriptive English sentence for the content of an image [1]. With the explosive increase in digital images and the rapid development of deep learning, teaching machines to understand images as humans do is drawing great interest. At the outset, computer vision focused on classifying a single image into a category (image classification). Researchers then tried to locate the positions of objects in more complicated scenes (object detection), and later to distinguish the category of every pixel (semantic segmentation). Along this fruitful development route, models comprehend the semantic information of a picture better and better. Meanwhile, another way to understand an image's semantic information is to describe its content with a human-like sentence (image captioning), which is closer to what humans do when a scene is in front of their eyes. While the caption task seems obvious for human beings, it is much more difficult for machines, since it requires the 'translation' model to capture several kinds of semantic information from a given image, such as scenes, objects, attributes and relative positions. Another challenge of the caption task is to generate descriptive sentences that follow grammar rules.

Recently, neural network methods [2, 3] have dominated the literature on image captioning. These methods are heavily inspired by the encoder-decoder architecture in Neural Machine Translation [4]. In contrast to the original Neural Machine Translation model, image caption models replace the recurrent neural network (RNN) encoder with a convolutional neural network (CNN). The CNN encodes the input image into a feature vector that represents the semantic information of the image; then a sequence modeling approach (e.g., Long Short-Term Memory (LSTM) [5]) decodes the semantic feature vector into a sequence of words. This architecture underlies the vast majority of image caption models.
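To make this pipeline concrete, the following is a minimal PyTorch-style sketch of a CNN-encoder / LSTM-decoder captioner; the backbone choice, feature dimensions and vocabulary size are illustrative assumptions, not the configuration used later in this paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SimpleCaptioner(nn.Module):
    """Minimal CNN-encoder / LSTM-decoder captioning sketch (illustrative only)."""
    def __init__(self, vocab_size=10000, embed_dim=512, hidden_dim=512):
        super().__init__()
        # CNN encoder: a ResNet backbone with its classifier head removed.
        backbone = models.resnet50()
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # globally pooled features
        self.img_proj = nn.Linear(2048, embed_dim)  # map image feature into word-embedding space
        # LSTM decoder.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)      # (B, 2048) semantic feature vector
        x0 = self.img_proj(feats).unsqueeze(1)       # image fed as the first "word"
        words = self.embed(captions)                 # (B, T, embed_dim) ground-truth words
        inputs = torch.cat([x0, words], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                      # per-step vocabulary logits
```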

Combining CNNs and RNNs directly in this way means that the information from the input image fades over the decoding iterations. Researchers therefore started to feed image guidance [3], attributes [6] or region attention [7] as extra input into the LSTM decoder for better performance. The original intention comes from visual attention, which has long been studied in psychology and neuroscience. The attention mechanism relies heavily on the quality of the input image: if there is too much redundant information, it is hard for the attention mechanism to capture the principal information. As shown in Fig. 1, the proportion of the principal objects (humans and surfboards) is very low. CNN encoders usually reduce the dimension of the feature vector considerably, which makes it harder for the attention mechanism to capture the information for the subject, object and other noun compositions. In this situation, if we insist on applying the attention mechanism to the whole image, as in [7], the caption model may not know what to describe.

Fig. 1.

This is an example picture from the MS-COCO dataset. The caption ground truth is “Several surfers riding a small wave into the beach”. The proportion of the principal objects (humans and surfboards) is very low. There is too much redundant information, such as the sky, which makes it harder for the attention mechanism to align the principal objects with the noun compositions in the descriptive sentence.

In Natural Language Processing, the noun compositions of a sentence are treated as the focus that people care most about. In the image caption task, the noun compositions correspond to the principal objects in an image. To help the caption model capture the principal objects more accurately, we propose to get help from the object detection task. Object detection has been studied for a long time, and CNN frameworks are widely used and rapidly developing there, such as R-CNN [11], Fast-RCNN [12] and Faster-RCNN [13]. These models capture the principal objects in an image very well. We therefore propose to use the features of an object detection method to encode the image and to generate guidance for the language generation model; we call this accurate guidance. This advance also means combining the higher-level semantic information of computer vision tasks with the semantic meaning in human-readable sentences.

We implement our model based on a single state-of-the-art object detection network, Faster-RCNN [13], for both accuracy and speed. Moreover, our model can be trained end-to-end, which allows the object detection module to adjust itself to the image caption task. We take Google NIC [7] as the baseline and compare our method with popular attention models on the commonly used MS-COCO dataset [9], with publicly available splits for training, validation and testing. We evaluate the methods on standard metrics. Our proposed method outperforms all of them and achieves state-of-the-art results across different evaluation metrics.

The main contributions of our paper are as follows. First, we propose an accurate guidance mechanism that helps the caption model capture the principal objects more precisely while simultaneously inferring their relationships from global information. Second, the proposed method utilizes a single object detection network as a multi-level feature extractor and demonstrates a less complicated way to achieve end-to-end training of an attention-based captioning model, whereas state-of-the-art methods [3, 6, 19] involve LSTM hidden states or image attributes for attention, which compromises the possibility of end-to-end optimization.

2 Related Work

Recent successes of deep neural networks in machine translation have catalyzed the adoption of neural networks [8] for image captioning. Early neural network-based captioning works include the multimodal RNN [10] and LSTM [5]. In these methods, neural networks are used both for image-text embedding and for sentence generation.

The attention mechanism has recently attracted considerable interest in LSTM-based image captioning [3, 6]. Xu et al. [7] proposed to integrate visual attention through the hidden state of the LSTM model. You et al. [6] propose to fuse visual attributes extracted from images with the input or output of the LSTM. These methods achieve state-of-the-art performance, but they rely heavily on the quality of the pre-specified visual attributes. Our method also uses an attention mechanism. Different from its predecessors, it applies object detection-dependent attention to generate high-quality guidance rather than searching over the whole noisy image. It is an adaptive method to obtain high-quality features.

Reinforcement Learning has recently been introduced into the image caption task [20] and achieved state-of-the-art performance by optimizing the evaluation metrics directly. These methods are generally applicable training approaches rather than improvements to the caption model itself. Thus, we do not compare with them, but we believe that our model would gain much higher performance with Reinforcement Learning.

[19] first proposes to utilize an object detection method in the image caption task. However, it uses Fast-RCNN to detect and a VGG net [15] to locate, so the caption model is very redundant. While generating guidance, for each object in the image it keeps the region inside the bounding box unchanged and sets the remaining regions to the mean value of the training set, which introduces considerable interference into the caption model. Our method solves these issues by taking a single object detection network as the multi-level feature extractor. In this way, our method is a clean architecture that eases end-to-end learning.

3 Methods

Our accurate guidance model includes a multi-level feature extraction module (MFEM) and a principal object guiding LSTM (po-gLSTM). Figure 2 shows the structure of our model. We first describe how to use an object detection network as the MFEM to simultaneously extract the features of the whole image (\( fea_{w} \)) and of the principal objects (\( fea_{o} \)) in Sect. 3.1. Then, in Sect. 3.2, we introduce our po-gLSTM, which takes advantage of the multi-level features to guide the LSTM to describe the image more precisely.

Fig. 2.

The structure of our accurate guidance model

3.1 Multi-level Feature Extraction Module

Figure 3 shows the framework of the multi-level feature extraction module. The MFEM, a variant of Faster-RCNN [13], consists of two parts: (1) the \( fea_{o} \) extraction network (above the red dotted line); (2) the \( fea_{w} \) extraction network (below the red dotted line). To capture the principal objects better, for an input image \( I \) we utilize the object detection network to find the potential objects and extract \( fea_{o} \), denoted as \( fea_{o} = \left\{ {obj_{1} , \ldots ,obj_{N} } \right\} \) and formulated in formula (1), where N is the number of potential objects. The RPN (Region Proposal Network) separates the principal object regions from the whole image, and \( CNN_{\theta 2} \) further extracts features from the proposed regions.

Fig. 3.

The structure of the MFEM (Color figure online)

$$ fea_{o} = CNN_{\theta 2} \left\{ {RPN\left[ {CNN_{\theta 1} \left( I \right)} \right]} \right\} $$
(1)

Simultaneously, we also need \( fea_{w} \) so that the po-gLSTM can obtain scene information and infer the relationships between objects. The original output of the Faster-RCNN framework does not meet this requirement, so we modify its framework so that it can extract \( fea_{w} \) at the same time. As shown below the red dotted line in Fig. 3, we take a copy of the feature map after \( CNN_{\theta 1} \) and feed it into \( CNN_{\theta 2} \) directly. This yields an imitation of a classification network whose output is \( fea_{w} \), formulated as formula (2).

$$ fea_{w} = CNN_{\theta 2} \left[ {CNN_{\theta 1} \left( I \right)} \right] $$
(2)

Notice that the \( CNN_{\theta 2} \) with dotted border (below the red dotted line) is the same as the \( CNN_{\theta 2} \) with solid border (above the red dotted line), so we obtain \( fea_{w} \) without increasing the model parameters. Faster-RCNN requires the input image to be larger than 600 pixel × 600 pixel. To further reduce the model parameters, we replace its fully connected layer with a Global Average Pooling layer to embed \( fea_{w} \) and resize it to fit the input size of the principal object guiding LSTM, formulated as follows:

$$ x_{0} = Pool_{ave} \left( {fea_{w} } \right) $$
(3)

\( x_{0} \) is utilized to initialize the decoder in Sect. 3.2. At this point we have obtained the multi-level feature of the input image, which carries multi-level semantic information. As later experiments will demonstrate, the multi-level feature extraction module helps the model focus more on the principal objects and achieve better performance.
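For illustration, a minimal sketch of such a two-branch extractor is given below. The components `cnn_theta1`, `cnn_theta2` and `rpn_with_roi_pool` are stand-ins for the corresponding Faster-RCNN stages, and the dimensions are assumptions; the point is only that both branches share the same \( CNN_{\theta 2} \) weights.

```python
import torch
import torch.nn as nn

class MFEM(nn.Module):
    """Sketch of the multi-level feature extraction module: one shared trunk,
    two outputs (per-object features fea_o and whole-image embedding x_0)."""
    def __init__(self, cnn_theta1, cnn_theta2, rpn_with_roi_pool, feat_dim=2048, embed_dim=512):
        super().__init__()
        self.cnn_theta1 = cnn_theta1        # early conv stages (shared)
        self.cnn_theta2 = cnn_theta2        # later conv stages, reused by both branches
        self.rpn = rpn_with_roi_pool        # proposes regions and RoI-pools them
        self.gap = nn.AdaptiveAvgPool2d(1)  # global average pooling instead of FC layers
        self.embed = nn.Linear(feat_dim, embed_dim)

    def forward(self, image):
        base = self.cnn_theta1(image)                   # shared base feature map
        # fea_o: per-object features, Eq. (1)
        rois = self.rpn(base)                           # (N, C, h, w) RoI-pooled regions
        fea_o = self.cnn_theta2(rois).mean(dim=(2, 3))  # (N, feat_dim), one vector per object
        # fea_w / x_0: whole-image feature, Eqs. (2)-(3), same CNN_theta2 weights
        fea_w = self.cnn_theta2(base)
        x0 = self.embed(self.gap(fea_w).flatten(1))     # (1, embed_dim), initializes the decoder
        return fea_o, x0
```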

3.2 Principal Object Guiding LSTM (po-gLSTM)

As shown in Fig. 4, the function of the po-gLSTM is to decode the multi-level semantic information of the image and generate the corresponding descriptive sentence. In this section, we first introduce the condition attention module, which obtains the principal object information for the current word. We then introduce how the principal object information guides the LSTM to generate the sentence. We treat the two components as a whole and call it the po-gLSTM.

Fig. 4.

(a) CAM is the Condition Attention Module, which generates the guidance information \( \left( {gui_{t} } \right) \) from the principal object features \( \left( {fea_{o} } \right) \) and the hidden state at the previous step \( \left( {h_{t - 1} } \right) \). (b) This sketch map shows how \( x_{0} \) and \( gui_{t} \) are used to generate the descriptive sentence. Together, (a) and (b) make up the po-gLSTM.

Condition Attention Module

With the multi-level feature extraction module, \( fea_{o} \) and \( fea_{w} \) of an input image are extracted easily. Each word in a caption is represented by a one-hot vector, and the caption is a sequence of input vectors \( \left( {x_{1} , \ldots ,x_{T} } \right) \). As in previous methods, we utilize \( fea_{w} \) to initialize the decoder (LSTM); the decoder then computes a sequence of hidden states \( \left( {h_{1} , \ldots ,h_{T} } \right) \) and a sequence of outputs \( \left( {y_{1} , \ldots ,y_{T} } \right) \). A plain decoder only accesses \( fea_{w} \) (encoded as \( x_{0} \)) once at the beginning of the generation process, so most of the information of image I is lost over the iterations, and the decoder may output incorrect words or stop too early. To avoid this, we propose to utilize a condition attention module (CAM) [6] to stress the role of the principal objects and supply the information lost over the iterations. CAM is formulated as follows:

$$ a_{t}^{i} = W\,tanh\left( {W_{ao} obj_{i} + W_{ah} h_{t - 1} } \right),\quad i = 1, \ldots ,N $$
(4)
$$ \alpha_{t} = softmax\left( {a_{t} } \right) $$
(5)
$$ gui_{t} = \sum\nolimits_{i = 1}^{N} {\alpha_{t}^{i} obj_{i} } $$
(6)

\( W,W_{ao} ,W_{ah} \) are learnable parameters, \( N \) is the number of principal objects in an image, and \( a_{t}^{i} \) is the relevance of \( obj_{i} \) to \( h_{t - 1} \). The elements of \( \alpha_{t} \) are used to weight the guiding information (the principal object features), and \( gui_{t} \) is the guidance at iteration t.

With the attention mechanism, the model knows “where to look” while generating each word. We also visualize the attention mechanism in a later experiment (Sect. 4.6).
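A minimal sketch of the CAM computation in Eqs. (4)-(6) could look as follows; the attention dimension and the single-image (unbatched) interface are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionAttentionModule(nn.Module):
    """Sketch of the condition attention module, Eqs. (4)-(6): scores each object
    feature against the previous hidden state and returns their weighted sum gui_t."""
    def __init__(self, obj_dim, hidden_dim, att_dim=512):
        super().__init__()
        self.W_ao = nn.Linear(obj_dim, att_dim, bias=False)
        self.W_ah = nn.Linear(hidden_dim, att_dim, bias=False)
        self.W = nn.Linear(att_dim, 1, bias=False)

    def forward(self, fea_o, h_prev):
        # fea_o: (N, obj_dim) object features; h_prev: (hidden_dim,) previous hidden state
        a_t = self.W(torch.tanh(self.W_ao(fea_o) + self.W_ah(h_prev))).squeeze(-1)  # Eq. (4), (N,)
        alpha_t = F.softmax(a_t, dim=0)                                             # Eq. (5)
        gui_t = (alpha_t.unsqueeze(-1) * fea_o).sum(dim=0)                          # Eq. (6)
        return gui_t, alpha_t
```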

Guiding LSTM

A sentence generated by a plain LSTM model may lose track of the original image content, since the model only accesses the image content once at the beginning of the generation process and forgets the image even after a short time. To make use of the \( gui_{t} \) defined above and supplement the forgotten information when necessary, we adopt an extension of the LSTM model, the guiding LSTM (gLSTM) [3], which extracts semantic information from the input image and feeds it into the LSTM model at every time step as extra information. Its gates and memory cell can be formulated as follows:

$$ i_{t}^{\prime } = \sigma \left( {W_{i}^{\prime } \left[ {h_{t - 1}^{\prime } ,x_{t}^{\prime } ,gui_{t} } \right]} \right) $$
(7)
$$ f_{t}^{\prime } = \sigma \left( {W_{f}^{\prime } \left[ {h_{t - 1}^{\prime } ,x_{t}^{\prime } ,gui_{t} } \right]} \right) $$
(8)
$$ o_{t}^{\prime } = \sigma \left( {W_{o}^{\prime } \left[ {h_{t - 1}^{\prime } ,x_{t}^{\prime } ,gui_{t} } \right]} \right) $$
(9)
$$ \widetilde{{C_{t}^{\prime } }} = tanh\left( {W_{c}^{\prime } \left[ {h_{t - 1}^{\prime } ,x_{t}^{\prime } ,gui_{t} } \right]} \right) $$
(10)
$$ C_{t}^{\prime } = f_{t}^{\prime } *C_{t - 1}^{\prime } + i_{t}^{\prime } *\widetilde{{C_{t}^{\prime } }} $$
(11)
$$ h_{t}^{\prime } = o_{t}^{\prime } *tanh\left( {C_{t}^{\prime } } \right) $$
(12)
$$ x_{t + 1}^{\prime } = W_{emb}^{\prime } \left( {\log softmax\left( {W_{h}^{\prime } h_{t}^{\prime } } \right)} \right) $$
(13)

where the \( W^{\prime } \) matrices denote learnable weights, \( * \) represents element-wise multiplication, \( \sigma \left( \cdot \right) \) is the sigmoid function, \( tanh\left( \cdot \right) \) is the hyperbolic tangent function, \( x_{t}^{\prime } \) is the input at the t-th iteration, \( i_{t}^{\prime } \) is the input gate, \( f_{t}^{\prime } \) the forget gate, \( o_{t}^{\prime } \) the output gate, \( C_{t}^{\prime } \) the state of the memory cell, and \( h_{t}^{\prime } \) the hidden state.
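A minimal sketch of one gLSTM step, Eqs. (7)-(12), is given below; the dimensions are illustrative, and Eq. (13), which projects \( h_{t}^{\prime } \) to the vocabulary and embeds the predicted word as the next input, is omitted for brevity.

```python
import torch
import torch.nn as nn

class GuidingLSTMCell(nn.Module):
    """Sketch of one po-gLSTM step, Eqs. (7)-(12): a standard LSTM cell whose
    gates additionally see the guidance vector gui_t at every time step."""
    def __init__(self, input_dim, guide_dim, hidden_dim):
        super().__init__()
        cat_dim = hidden_dim + input_dim + guide_dim   # [h_{t-1}, x_t, gui_t]
        self.W_i = nn.Linear(cat_dim, hidden_dim)
        self.W_f = nn.Linear(cat_dim, hidden_dim)
        self.W_o = nn.Linear(cat_dim, hidden_dim)
        self.W_c = nn.Linear(cat_dim, hidden_dim)

    def forward(self, x_t, gui_t, h_prev, c_prev):
        z = torch.cat([h_prev, x_t, gui_t], dim=-1)
        i_t = torch.sigmoid(self.W_i(z))       # input gate,  Eq. (7)
        f_t = torch.sigmoid(self.W_f(z))       # forget gate, Eq. (8)
        o_t = torch.sigmoid(self.W_o(z))       # output gate, Eq. (9)
        c_tilde = torch.tanh(self.W_c(z))      # candidate cell state, Eq. (10)
        c_t = f_t * c_prev + i_t * c_tilde     # Eq. (11)
        h_t = o_t * torch.tanh(c_t)            # Eq. (12)
        return h_t, c_t
```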

\( f_{t}^{\prime } \) decides what to forget in \( C_{t}^{\prime } \); its decision depends on \( h_{t - 1}^{\prime } \) and \( x_{t}^{\prime } \). In the original LSTM, once \( f_{t}^{\prime } \) decides that forgetting some information is helpful for \( x_{t + 1}^{\prime } \), it becomes impossible for \( x_{{t^{\prime } }}^{\prime } (t^{\prime } > t + 1) \) to utilize the forgotten information. The longer the descriptive sentence, the worse this problem becomes.

The gLSTM is able to supplement the forgotten information when necessary, and the condition attention module helps to pick the most helpful principal object for \( x_{t + 1}^{\prime } \). We call our gLSTM with the principal object condition attention module the po-gLSTM. One may doubt whether emphasizing the principal objects so strongly is helpful. Our experiments verify that the model infers the relationships better with stronger principal object information, and that this causes no trouble for extracting the scene from \( fea_{w} \).

One benefit of the po-gLSTM is that it allows the language model to learn semantic attention automatically through back-propagation of the training loss. In contrast, [19] only utilizes objects and locations, discarding other semantic information such as scenes and motion relationships.

4 Experiments

4.1 Dataset and Experiment Setup

Dataset

We use the MS-COCO dataset [9] in our experiments. The dataset contains 123287 images, each annotated with 5 sentences collected via Amazon Mechanical Turk, and covers 80 object classes. We use 113287 images for training, 5000 for validation and 5000 for testing.

Experiment Setup

The input image is resized to 600 pixel × 600 pixel. The training process contains three stages: (1) pre-train the object detection network (Faster RCNN) on the MS-COCO dataset; (2) combine the multi-level feature extraction module (a variant of the pre-trained Faster RCNN) with our po-gLSTM and train the po-gLSTM to equip it with the ability to decode; (3) train the whole model end-to-end so that the multi-level feature extraction module and the po-gLSTM fuse better. Four standard evaluation metrics, i.e., BLEU, METEOR, ROUGE_L, and CIDEr, are used to evaluate the quality of the generated sentences.
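A rough sketch of this three-stage schedule is given below; the attribute names (`model.mfem`, `model.po_glstm`, `model.detection_loss`), learning rates and loss handling are assumptions about how such staged training could be organized, not the exact recipe used here, and the epoch loops are omitted.

```python
import torch
import torch.nn as nn

def staged_caption_training(model, detection_loader, caption_loader,
                            caption_loss=nn.CrossEntropyLoss(ignore_index=0)):
    """Sketch of the three-stage training schedule (hypothetical interfaces)."""
    # Stage 1: pre-train the Faster R-CNN part on detection annotations.
    det_opt = torch.optim.SGD(model.mfem.parameters(), lr=1e-3, momentum=0.9)
    for images, targets in detection_loader:
        det_opt.zero_grad()
        model.detection_loss(images, targets).backward()
        det_opt.step()

    # Stage 2: freeze the MFEM and train only the po-gLSTM decoder.
    for p in model.mfem.parameters():
        p.requires_grad = False
    dec_opt = torch.optim.Adam(model.po_glstm.parameters(), lr=4e-4)
    for images, captions in caption_loader:
        dec_opt.zero_grad()
        logits = model(images, captions)                       # (B, T, vocab)
        caption_loss(logits.flatten(0, 1), captions.flatten()).backward()
        dec_opt.step()

    # Stage 3: unfreeze everything and fine-tune end-to-end at a lower learning rate.
    for p in model.mfem.parameters():
        p.requires_grad = True
    e2e_opt = torch.optim.Adam(model.parameters(), lr=1e-5)
    for images, captions in caption_loader:
        e2e_opt.zero_grad()
        logits = model(images, captions)
        caption_loss(logits.flatten(0, 1), captions.flatten()).backward()
        e2e_opt.step()
```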

4.2 Comparison Between Different CNNs Encoders

The encoder extracts the semantic feature of the input image, and the quality of the extracted feature is decisive for our caption model. To explore which encoder is most suitable, we use three different CNNs in our experiments: the 50-layer and 101-layer ResNets [14] and the 16-layer VGGNet [15]. Table 1 shows the experimental results.

Table 1. Results of different CNNs encoders. All values are reported as percentage (%).

The experimental results show that deeper CNNs achieve higher scores on all metrics. This indicates that deeper CNNs capture better semantic features, which carry more useful information for descriptive sentence generation and thus provide more accurate guidance.

4.3 Comparison to the State-of-the-Art

Several related models have been proposed in arXiv preprints since the original submission of this work. We also include these in Table 2 for comparison.

Table 2. Results of different caption models. All values are reported as percentage (%).

Table 2 shows the comparison results. Our models, both VGG16-based and ResNet101-based, outperform the other models at the same scale in most metrics by a large margin, ranging from 1% to 5%. Models with attention mechanisms, such as ATT [6] and Det+Loc [19], achieve better scores than models without attention, such as NIC [7] and LRCN [16]. Det+Loc [19] also utilizes an object detection network and scores better than the models built on classification networks. Notice that our VGG16-based model achieves performance comparable to FC-2K [20] (ResNet-101 based), while our ResNet101-based model outperforms FC-2K in all metrics, by up to 5.1% in CIDEr. Det+Loc is an object detection-based model that uses beam search (beam size 4) at test time. Without beam search, our VGG16-based model outperforms it in BLEU_1 and CIDEr and is slightly inferior in the other metrics. Det+Loc introduces too much redundant information, which results in its poorer performance.

The comparison results are strong evidence that (1) the object detection task does have the ability to help the image caption model, and our multi-level feature extraction module is well suited to the caption task; and (2) our end-to-end training helps the two modules merge to obtain better captioning performance.

4.4 Comparison Between Different Beam Search Size

In this section, we introduce Beam Search (BS) to replace the maximum-probability sampling mechanism. BS is a heuristic algorithm that keeps multiple candidate sequences during decoding, which allows it to generate better sentences at test time; the larger the beam size, the more candidates are considered (a minimal sketch of the decoding procedure is given at the end of this section). We take gLSTM as the comparison, and Table 3 shows the experimental results.

Table 3. Results of different Beam Size. All values are reported as percentage (%).

From Table 3, we can see that the performance of a model varies with the beam size. Our model always outperforms gLSTM, and it surpasses Det+Loc at beam size 4. This is further evidence that our accurate guidance model is better than the other methods.
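For reference, a minimal beam-search decoding sketch is given below; `step_logprobs`, which returns (word id, log-probability) pairs for extending a partial sequence, and the special token ids are assumed interfaces rather than part of our model.

```python
def beam_search(step_logprobs, bos_id, eos_id, beam_size=4, max_len=20):
    """Minimal beam search: keep the `beam_size` best partial sentences at every
    step and return the highest-scoring finished one."""
    beams = [([bos_id], 0.0)]                    # (token sequence, accumulated log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for word_id, logp in step_logprobs(seq):      # expand each beam by every word
                candidates.append((seq + [word_id], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos_id else beams).append((seq, score))
        if not beams:                            # all surviving beams have ended
            break
    finished.extend(beams)                       # fall back to unfinished beams if needed
    return max(finished, key=lambda c: c[1])[0]
```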

4.5 Qualitative Results

Figure 5 shows qualitative captioning results. To emphasize the effectiveness of our accurate guidance model and for a fair comparison, we compare our VGG16-based model with the baseline model (NIC).

Fig. 5.

Qualitative results: NIC is the baseline model; Ours is our VGG16-based model; GT is the ground truth.

The example images contain similar colors and rare actions. Our proposed model better captures the objects in the target image, such as “a slice of pizza” in image (a) and “a little girl” in image (b). Our po-gLSTM also better captures the scenes and relationships between objects, such as “on a pile of rocks” in image (b), “in the air” in image (d), “holding” in image (c) and “jumping” in image (d). Admittedly, our model may fail in some cases, such as “bear” in image (c). This is mainly because there is no class named “hamburger” in the training set of the object detection network, and the hamburger is covered with white wrapping paper, which is hard for object detection. As the performance of object detection improves, our proposed model can achieve better performance as well. The qualitative results show that the object detection network does help greatly in capturing the principal objects. Moreover, our model does not lose the information about scenes and relationships between objects; it can even describe them better.

4.6 Visualization of Condition Attention Mechanism

In this section, we visualize the focus of the CAM; brighter regions indicate higher attention. Taking the first row as an example, our proposed model focuses exactly on the bus in the image while generating the word “bus”. When generating “parked”, the CAM focuses more on the area where the vehicle and the ground are in contact. This indicates that our po-gLSTM does have the ability to focus on the relevant objects throughout generation (Fig. 6).
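One simple way to produce such maps, assuming the proposal boxes of the attended objects are available in pixel coordinates, is to accumulate each object's attention weight \( \alpha_{t}^{i} \) over its box; the sketch below is an approximation of this kind of visualization, not necessarily the exact procedure used to render Fig. 6.

```python
import numpy as np

def attention_heatmap(alpha_t, boxes, image_h, image_w):
    """Sketch: accumulate each proposal's attention weight over its (x1, y1, x2, y2)
    box to obtain a per-pixel saliency map (brighter = more attention)."""
    heat = np.zeros((image_h, image_w), dtype=np.float32)
    for weight, (x1, y1, x2, y2) in zip(alpha_t, boxes):
        heat[int(y1):int(y2), int(x1):int(x2)] += float(weight)
    if heat.max() > 0:
        heat /= heat.max()      # normalize to [0, 1] before overlaying on the image
    return heat
```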

Fig. 6.

The visualization of condition attention mechanism on feature maps.

5 Conclusion

In this work, we propose the accurate guidance framework for image captioning. It combines a variant of an object detection network (the MFEM) with a gLSTM equipped with an attention mechanism (the po-gLSTM). Our experiments show that the proposed method significantly improves the baseline and outperforms the current state-of-the-art on the MS-COCO dataset, which supports our argument for explicitly getting help from the object detection task.