1 Introduction

Image captioning, the task of generating short descriptions for given images, has received increasing attention in recent years. Recent works on this task [1,2,3,4] mostly adopt the encoder-decoder paradigm, where a recurrent neural network (RNN) or one of its variants, e.g. GRU [5] and LSTM [6], is used for generating the captions. Specifically, the RNN maintains a series of latent states. At each step, it takes the visual features together with the preceding word as input, updates the latent state, then estimates the conditional probability of the next word. Here, the latent states serve as pivots that connect the visual and the linguistic domains.

Following the standard practice in language models [5, 7], existing captioning models usually formulate the latent states as vectors and the connections between them as fully-connected transforms. Whereas this is a natural choice for purely linguistic tasks, whether it remains the best choice becomes a question when the visual domain comes into play, e.g. in the task of image captioning.

Along with the rise of deep learning, convolutional neural networks (CNN) have become the dominant models for many computer vision tasks [8, 9]. Convolution has a distinctive property, namely spatial locality, i.e. each output element corresponds to a local region in the input. This property allows the spatial structures to be maintained by the feature maps across layers. The significance of spatial locality for vision tasks has been repeatedly demonstrated in previous work [8, 10,11,12,13].

Image captioning is a task that needs to bridge both the linguistic and the visual domains. Thus for this task, it is important to capture and preserve properties of the visual content in the latent states. This motivates us to explore an alternative formulation for image captioning, namely representing the latent states with 2D maps and connecting them via convolutions. As opposed to the standard formulation, this variant is capable of preserving spatial locality, and therefore it may strengthen the role of visual structures in the process of caption generation.

We compared both formulations, namely the standard one with vector states and the alternative one that uses 2D states, which we refer to as RNN-2DS. Our study shows: (1) The spatial structures significantly impact the captioning process. Editing the latent states, e.g. suppressing certain regions in the states, can lead to substantially different captions. (2) Preserving the spatial structures in the latent states is beneficial for captioning. On two public datasets, MSCOCO [14] and Flickr30k [15], RNN-2DS achieves notable performance gains consistently across different settings. In particular, a simple RNN-2DS without gating functions already outperforms more sophisticated networks with vector states, e.g. LSTM. Using 2D states in combination with more advanced cells, e.g. GRU, can further boost the performance. (3) Using 2D states makes the captioning process amenable to visual interpretation. Specifically, we take advantage of the spatial locality and develop a simple yet effective way to identify the connections between latent states and visual regions. This enables us to visualize the dynamics of the states as a caption is being generated, as well as the connections between the visual domain and the linguistic domain.

In summary, our contributions mainly lie in three aspects. First, we rethink the form of latent states in image captioning models, for which existing work simply follows the standard practice and adopts vectorized representations. To the best of our knowledge, this is the first study that systematically explores two-dimensional states in the context of image captioning. Second, our study challenges the prevalent practice: it reveals the significance of spatial locality in image captioning and suggests that the formulation with 2D states and convolution is more effective. Third, leveraging the spatial locality of the alternative formulation, we develop a simple method that can visualize the dynamics of the latent states in the decoding process.

2 Related Work

Image Captioning. Image captioning has been an active research topic in computer vision. Early techniques mainly relied on detection results. Kulkarni et al. [16] proposed to first detect visual concepts including objects and visual relationships [17], and then generate captions by filling sentence templates. Farhadi et al. [18] proposed to generate captions for a given image by retrieving from the training captions based on detected concepts.

In recent years, methods based on neural networks have been gaining ground. Particularly, the encoder-decoder paradigm [1], which uses a CNN [19] to encode visual features and then uses an LSTM net [6] to decode them into a caption, was shown to outperform classical techniques and has been widely adopted. Along this direction, many variants have been proposed [2, 20,21,22]. For example, Xu et al. [2] proposed to use a dynamic attention map to guide the decoding process, and Yao et al. [22] additionally incorporated visual attributes detected from the images, obtaining further improvement. While achieving significant progress, all these methods rely on vectors to encode visual features and to represent latent states.

Multi-dimensional RNN. Existing works that aim at extending RNN to more dimensions roughly fall into three categories:

  1. RNNs are applied on multi-dimensional grids, e.g. the 2D grid of pixels, via recurrent connections along different dimensions [23, 24]. Such extensions have been used in image generation [25] and CAPTCHA recognition [26].

  2. Latent states of RNN cells are stacked across multiple steps to form feature maps. This formulation is usually used to capture temporal statistics, e.g. those in language processing [27, 28] and audio processing [29].

  3. Latent states themselves are represented as multi-dimensional arrays.

For the first two categories, the latent states are still represented by 1D vectors; hence, they are essentially different from this work. The RNN-2DS studied in this paper belongs to the third category, where latent states are represented as 2D feature maps. The idea of extending RNN with 2D states has been explored in various vision problems, such as rainfall prediction [30], super-resolution [11], instance segmentation [12], and action recognition [13]. It is worth noting that all these works focused on tackling visual tasks, where both the inputs and the outputs are in 2D forms. To the best of our knowledge, this is the first work that studies recurrent networks with 2D states in image captioning. A key contribution of this work is that it reveals the significance of 2D states in connecting the visual and the linguistic domains.

Interpretation. There are also studies that analyze recurrent networks. Karpathy et al. [31] tried to interpret the latent states of conventional LSTM models for natural language understanding. Similar studies have been conducted by Ding et al. [32] for neural machine translation. However, these studies focused on linguistic analysis, while our study tries to identify the connections between the linguistic and visual domains by leveraging the spatial locality of the 2D states.

Our visualization method on 2D latent states also differs fundamentally from the attention module [2], in both theory and implementation. (1) Attention is a mechanism specifically designed to guide the focus of a model, while the 2D states are a form of representation. (2) Attention is usually implemented as a sub-network. In our work, the 2D states by themselves do not introduce any attention mechanism. The visualization method is mainly for the purpose of interpretation, which helps us better understand the internal dynamics of the decoding process. To the best of our knowledge, this is the first time such an interpretation has been accomplished for image captioning.

3 Formulations

To begin with, we review the encoder-decoder framework [1] which represents latent states as 1D vectors. Subsequently, we reformulate the latent states as multi-channel 2D feature maps for this framework. These formulations are the basis for our comparative study.

3.1 Encoder-Decoder for Image Captioning

The encoder-decoder framework generates a caption for a given image in two stages, namely encoding and decoding. Specifically, given an image I, it first encodes the image into a feature vector \(\mathbf {v}\), with a Convolutional Neural Network (CNN), such as VGGNet [19] or ResNet [8]. The feature vector \(\mathbf {v}\) is then fed to a Recurrent Neural Network (RNN) and decoded into a sequence of words \((w_1, \ldots , w_T)\). For decoding, the RNN implements a recurrent process driven by latent states, which generates the caption through multiple steps, each yielding a word. Specifically, it maintains a set of latent states, represented by a vector \(\mathbf {h}_t\) that is updated along the way. The computational procedure can be expressed by the formulas below:

$$\begin{aligned}&\mathbf {h}_0 = \mathbf {0}, \quad \mathbf {h}_t = g(\mathbf {h}_{t - 1}, \mathbf {x}_t, \mathbf {v}), \end{aligned}$$
(1)
$$\begin{aligned}&\mathbf {p}_{t|1:t-1} = \text {Softmax}(\mathbf {W}_p \mathbf {h}_t), \end{aligned}$$
(2)
$$\begin{aligned}&w_t \sim \mathbf {p}_{t|1:t-1}. \end{aligned}$$
(3)

The procedure can be explained as follows. First, the latent state \(\mathbf {h}_0\) is initialized to be zeros. At the t-th step, \(\mathbf {h}_t\) is updated by an RNN cell g, which takes three inputs: the previous state \(\mathbf {h}_{t-1}\), the word produced at the preceding step (represented by an embedded vector \(\mathbf {x}_t\)), and the visual feature \(\mathbf {v}\). Here, the cell function g can take a simple form:

$$\begin{aligned} g(\mathbf {h}, \mathbf {x}, \mathbf {v}) = \tanh \left( \mathbf {W}_h \mathbf {h}+ \mathbf {W}_x \mathbf {x}+ \mathbf {W}_v \mathbf {v}\right) . \end{aligned}$$
(4)

More sophisticated cells, such as GRU [5] and LSTM [6], are also increasingly adopted in practice. To produce the word \(w_t\), the latent state \(\mathbf {h}_t\) will be transformed into a probability vector \(\mathbf {p}_{t|1:t-1}\) via a fully-connected linear transform \(\mathbf {W}_p \mathbf {h}_t\) followed by a softmax function. Here, \(\mathbf {p}_{t|1:t-1}\) can be considered as the probabilities of \(w_t\) conditioned on previous states.
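To make the procedure above concrete, the following is a minimal sketch of one decoding step with vector states, written as an illustrative PyTorch module implementing Eqs. (1), (2) and (4). The class and parameter names are ours, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleRNNDecoderCell(nn.Module):
    """One decoding step with 1D vector states, following Eqs. (1), (2) and (4)."""

    def __init__(self, state_dim, embed_dim, visual_dim, vocab_size):
        super().__init__()
        self.W_h = nn.Linear(state_dim, state_dim, bias=False)   # W_h in Eq. (4)
        self.W_x = nn.Linear(embed_dim, state_dim, bias=False)   # W_x in Eq. (4)
        self.W_v = nn.Linear(visual_dim, state_dim, bias=False)  # W_v in Eq. (4)
        self.W_p = nn.Linear(state_dim, vocab_size)              # W_p in Eq. (2)

    def forward(self, h_prev, x_t, v):
        # Eq. (4): update the latent state from the previous state, the
        # embedding of the preceding word, and the visual feature.
        h_t = torch.tanh(self.W_h(h_prev) + self.W_x(x_t) + self.W_v(v))
        # Eq. (2): conditional probabilities of the next word.
        p_t = F.softmax(self.W_p(h_t), dim=-1)
        return h_t, p_t
```

At inference time, \(\mathbf {h}_0\) is a zero vector and each word \(w_t\) is sampled, or taken greedily, from the returned probabilities, as in Eq. (3).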

Despite the differences in their architectures, all existing RNN-based captioning models represent latent states as vectors without explicitly preserving the spatial structures. In what follows, we will discuss the alternative choice that represents latent states as 2D multi-channel feature maps.

3.2 From 1D to 2D

From a technical standpoint, a natural way to maintain spatial structures in latent states is to formulate them as 2D maps and employ convolutions for state transitions, which we refer to as RNN-2DS.

Fig. 1.

The overall structure of the encoder-decoder framework with RNN-2DS. Given an image I, a CNN first turns it into a multi-channel feature map \(\mathbf {V}\) that preserves high-level spatial structures. \(\mathbf {V}\) will then be fed to an RNN-2DS, where the latent state \(\mathbf {H}_t\) is also represented by multi-channel maps and the state transition is via convolution. At each step, the 2D states are transformed into a 1D vector and then decoded into conditional probabilities of words.

Specifically, as shown in Fig. 1, the visual feature \(\mathbf {V}\), the latent state \(\mathbf {H}_t\), and the word embedding \(\mathbf {X}_t\) are all represented as 3D tensors of size \(C \times H \times W\). Such a tensor can be considered as a multi-channel map, which comprises C channels, each of size \(H \times W\). Unlike the normal setting where the visual feature is derived from the activation of a fully-connected layer, \(\mathbf {V}\) here is derived from the activation of a convolutional layer that preserves spatial structures. \(\mathbf {X}_t\) is the 2D word embedding for \(w_{t - 1}\), of size \(C \times H \times W\). To reduce the number of parameters, we use a lookup table of smaller size \(C_x \times H_x \times W_x\) to fetch the raw word embedding, which is then enlarged to \(C \times H \times W\) by two convolutional layers (see Footnote 1). With these representations, state updating can then be formulated using convolutions. For example, Eq. (4) can be converted into the following form:

$$\begin{aligned} \mathbf {H}_t = \text {relu} \left( \mathbf {K}_h \circledast \mathbf {H}_{t-1} + \mathbf {K}_x \circledast \mathbf {X}_t + \mathbf {K}_v \circledast \mathbf {V}\right) . \end{aligned}$$
(5)

Here, \(\circledast \) denotes the convolution operator, and \(\mathbf {K}_h\), \(\mathbf {K}_x\), and \(\mathbf {K}_v\) are convolution kernels of size \(C \times C \times H_k \times W_k\). It is worth stressing that the modification presented above is very flexible and can easily incorporate more sophisticated cells. For example, the original updating formulas of GRU are

$$\begin{aligned} \mathbf {z}_t&= \sigma \left( \mathbf {W}_{zh} \mathbf {h}_{t-1} + \mathbf {W}_{zx} \mathbf {x}_t + \mathbf {W}_{zv} \mathbf {v}\right) , \nonumber \\ \mathbf {r}_t&= \sigma \left( \mathbf {W}_{rh} \mathbf {h}_{t-1} + \mathbf {W}_{rx} \mathbf {x}_t + \mathbf {W}_{rv} \mathbf {v}\right) , \nonumber \\ \tilde{\mathbf {h}}_t&= \tanh \left( \mathbf {W}_{h} (\mathbf {r}_t \odot \mathbf {h}_{t-1}) + \mathbf {W}_{x} \mathbf {x}_t + \mathbf {W}_{v} \mathbf {v}\right) , \nonumber \\ \mathbf {h}_t&= (1 - \mathbf {z}_t) \odot \mathbf {h}_{t-1} + \mathbf {z}_t \odot \tilde{\mathbf {h}}_t, \end{aligned}$$
(6)

where \(\sigma \) is the sigmoid function, and \(\odot \) is the element-wise multiplication operator. In a similar way, we can convert them to the 2D form as

$$\begin{aligned} \mathbf {Z}_t&= \sigma \left( \mathbf {K}_{zh} \circledast \mathbf {H}_{t-1} + \mathbf {K}_{zx} \circledast \mathbf {X}_t + \mathbf {K}_{zv} \circledast \mathbf {V}\right) , \nonumber \\ \mathbf {R}_t&= \sigma \left( \mathbf {K}_{rh} \circledast \mathbf {H}_{t-1} + \mathbf {K}_{rx} \circledast \mathbf {X}_t + \mathbf {K}_{rv} \circledast \mathbf {V}\right) , \nonumber \\ \tilde{\mathbf {H}}_t&= \tanh \left( \mathbf {K}_{h} \circledast (\mathbf {R}_t \odot \mathbf {H}_{t-1}) + \mathbf {K}_{x} \circledast \mathbf {X}_t + \mathbf {K}_{v} \circledast \mathbf {V}\right) , \nonumber \\ \mathbf {H}_t&= (1 - \mathbf {Z}_t) \odot \mathbf {H}_{t-1} + \mathbf {Z}_t \odot \tilde{\mathbf {H}}_t. \end{aligned}$$
(7)

Given the latent states \(\mathbf {H}_t\), the word \(w_t\) can be generated as follows. First, we compress \(\mathbf {H}_t\) (of size \(C \times H \times W\)) into a C-dimensional vector \(\mathbf {h}_t\) by mean pooling across spatial dimensions. Then, we transform \(\mathbf {h}_t\) into a probability vector \(\mathbf {p}_{t|1:t-1}\) and draw \(w_t\) therefrom, following Eqs. (2) and (3). Note that the pooling operation could be replaced with more sophisticated modules, such as an attention module, to summarize the information from all locations for word prediction. We choose the pooling operation as it adds zero extra parameters, which makes the comparison between 1D and 2D states fair.
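To make this concrete, here is a minimal PyTorch sketch of one RNN-2DS decoding step, pairing the convolutional transition of Eq. (5) with the mean pooling and softmax described above. It is an illustration under our own naming and tensor-layout assumptions, not the released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class RNN2DSCell(nn.Module):
    """One RNN-2DS decoding step: convolutional transition (Eq. (5)),
    mean pooling over the spatial dimensions, and word probabilities (Eq. (2))."""

    def __init__(self, channels, vocab_size, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2  # keep the H x W resolution of the state
        self.K_h = nn.Conv2d(channels, channels, kernel_size, padding=pad, bias=False)
        self.K_x = nn.Conv2d(channels, channels, kernel_size, padding=pad, bias=False)
        self.K_v = nn.Conv2d(channels, channels, kernel_size, padding=pad, bias=False)
        self.W_p = nn.Linear(channels, vocab_size)

    def forward(self, H_prev, X_t, V):
        # Eq. (5): convolution-based state transition over C x H x W maps
        # (inputs are batched as N x C x H x W).
        H_t = F.relu(self.K_h(H_prev) + self.K_x(X_t) + self.K_v(V))
        # Compress H_t into a C-dimensional vector by mean pooling, then
        # decode it into word probabilities as in Eqs. (2) and (3).
        h_t = H_t.mean(dim=(2, 3))
        p_t = F.softmax(self.W_p(h_t), dim=-1)
        return H_t, p_t
```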

Since this reformulation is generic, besides the encoder-decoder framework, it can be readily extended to other captioning models that adopt RNNs as the language module, e.g. Att2in [3] and Review Net [33].
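For completeness, the 2D word embedding mentioned in this section, i.e. a lookup table of raw \(C_x \times H_x \times W_x\) embeddings enlarged to \(C \times H \times W\) by two convolutional layers, could be realized as sketched below. How the spatial resolution is matched is not specified here; the stride-2 first convolution mapping \(15 \times 15\) to \(7 \times 7\) is purely our assumption.

```python
import torch.nn as nn

class WordEmbed2D(nn.Module):
    """Sketch of the 2D word embedding: a lookup table of raw embeddings of
    size C_x x H_x x W_x, enlarged to C x H x W by two convolutional layers.
    The stride-2 first convolution (15x15 -> 7x7) is an assumption."""

    def __init__(self, vocab_size, c_x=4, h_x=15, w_x=15, channels=256):
        super().__init__()
        self.raw_shape = (c_x, h_x, w_x)
        self.table = nn.Embedding(vocab_size, c_x * h_x * w_x)
        self.enlarge = nn.Sequential(
            nn.Conv2d(c_x, channels, kernel_size=3, stride=2),         # 15x15 -> 7x7
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),   # stays 7x7
            nn.ReLU(),
        )

    def forward(self, words):                     # words: LongTensor of shape (N,)
        raw = self.table(words).view(words.size(0), *self.raw_shape)
        return self.enlarge(raw)                  # (N, C, 7, 7), used as X_t
```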

4 Qualitative Studies on 2D States

Thanks to the preserved spatial locality, the use of 2D states makes the framework amenable to qualitative analysis. Taking advantage of this, we present three studies in this section: (1) We manipulate the 2D states and investigate how this impacts the generated captions. The results of this study corroborate the statement that 2D states help to preserve spatial structures. (2) Leveraging the spatial locality, we identify the associations between the activations of latent states and certain subregions of the input image. Based on the dynamic associations between state activations and the corresponding subregions, we can visually reveal the internal dynamics of the decoding process. (3) Through the latent states we also interpret the connections between the visual and the linguistic domains.

Fig. 2.

This figure lists several images with generated captions relying on various parts of RNN-2DS’s states. The accessible part is marked with color in each case. (Color figure online)

4.1 State Manipulation

We study how the spatial structures of the 2D latent states influence the resultant captions by controlling the accessible parts of the latent states.

As discussed in Sect. 3.2, the prediction at the t-th step is based on \(\mathbf {h}_t\), which is pooled from \(\mathbf {H}_t\) across H and W. In other words, \(\mathbf {h}_t\) summarizes the information from the entire area of \(\mathbf {H}_t\). In this experiment, we replace the original region, which spans from the corner (1, 1) to (H, W), with a subregion between the corners \((x_1, y_1)\) and \((x_2, y_2)\), obtaining a modified summarizing vector \(\mathbf {h}_t^\prime \) as

$$\begin{aligned} \mathbf {h}_t^\prime = \frac{1}{(y_2 - y_1 + 1) (x_2 - x_1 + 1)}\sum _{i = y_1}^{y_2} \sum _{j = x_1}^{x_2} \mathbf {H}_t |_{(i, j)}. \end{aligned}$$
(8)

Here, \(\mathbf {h}_t^\prime \) only captures a subregion of the image, on which the probabilities for the word \(w_t\) are computed. We thus expect the resulting caption to reflect the visual semantics only partially.
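A minimal sketch of this subregion pooling (Eq. (8)), assuming the state is stored as a PyTorch tensor of shape \(C \times H \times W\) with 1-based, inclusive corner coordinates as in the text:

```python
def subregion_summary(H_t, x1, y1, x2, y2):
    """Eq. (8): mean-pool the 2D state over the subregion from (x1, y1) to (x2, y2).

    H_t has shape (C, H, W); coordinates are 1-based and inclusive.
    Returns the modified C-dimensional summarizing vector h'_t.
    """
    patch = H_t[:, y1 - 1:y2, x1 - 1:x2]   # restrict to the chosen subregion
    return patch.mean(dim=(1, 2))
```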

Figure 2 shows several images together with the captions generated using different subregions of the 2D states. Take the bottom-left image in Fig. 2 as an example: when using only the upper half of the latent states, the decoder generates a caption focusing on the cat, which indeed appears in the upper half of the image. Similarly, using only the lower half of the latent states results in a caption that talks about the book located in the lower half of the image. In other words, depending on the specific subregion of the latent states, a decoder with 2D states tends to generate a caption that conveys the visual content of the corresponding area in the input image. This observation suggests that the 2D latent states do preserve the spatial structures of the input image.

Manipulating latent states differs essentially from the passive data-driven attention module [2] commonly adopted in captioning models. It is a controllable operation, and does not require a specific module to achieve such functionality. With this operation, we can extend a captioning model with 2D states to allow active management of the focus, which, for example, can be used to generate multiple complementary sentences for an image. While the attention module can be considered as an automatic manipulation of latent states, the combination of 2D states and the attention mechanism is worth exploring in future work.

4.2 Revealing Decoding Dynamics

This study intends to analyze the internal dynamics of the decoding process, i.e. how the latent states evolve over a series of decoding steps. We believe that it can help us better understand how a caption is generated based on the visual content. The spatial locality of the 2D states allows us to study this in an efficient and effective way.

We use activated regions to align the activations of the latent states at different decoding steps with the subregions in the input image. Specifically, we treat the channels of the 2D states as the basic units in our study, which are 2D maps of activation values. Given a state channel c at the t-th decoding step, we resize it to the size of the input image I via bicubic interpolation. The pixel locations in I whose corresponding interpolated activations are above a certain threshold (see Footnote 2) are considered to be activated. The collection of all such pixel locations is referred to as the activated region for the state channel c at the t-th decoding step, as shown in Fig. 3.
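The computation can be sketched as follows; the tensor layout and the way the threshold is supplied are our assumptions (the exact threshold choice is deferred to Footnote 2).

```python
import torch.nn.functional as F

def activated_region(H_t, channel, image_size, threshold):
    """Activated region of one state channel at a given decoding step.

    H_t: state tensor of shape (C, H, W); image_size: (height, width) of the
    input image; threshold: activation cut-off. Returns a boolean mask over
    the image marking the activated pixel locations.
    """
    c = H_t[channel].unsqueeze(0).unsqueeze(0)              # shape (1, 1, H, W)
    up = F.interpolate(c, size=image_size, mode='bicubic',  # resize to image size
                       align_corners=False)
    return up.squeeze(0).squeeze(0) > threshold
```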

Fig. 3.

This figure shows our procedure of finding the activated region of a latent channel at the t-th step.

Fig. 4.

This figure shows the changes of several channels, in terms of their activated regions, during the decoding processes. In the last two cases, changes of two channels in the same decoding process are shown and compared. (Best viewed in high resolution)

With activated regions computed respectively at different decoding steps for one state channel, we may visually reveal the internal dynamics of the decoding process at that channel. Figure 4 shows several images and their generated captions, along with the activated regions of some channels throughout the decoding processes. These channels are selected as they are associated with nouns in the generated captions, which we will introduce in the next section. Via this study we found that: (1) The activated regions of channels often capture salient visual entities in the image, and occasionally also reflect the surrounding context. (2) During a decoding process, different channels have different dynamics. For a channel associated with a noun, its activated regions become prominent as the decoding process approaches the point where the noun is produced, and the channel becomes deactivated afterwards.

The revealed dynamics can help us better understand the decoding process, and also point out some directions for future study. For instance, in Fig. 4, the visual semantics are distributed to different channels, and the decoder moves its focus from one channel to another. The mechanism that triggers such movements remains to be explored.

4.3 Connecting Visual and Linguistic Domains

Here we investigate how the visual domain is connected to the linguistic domain. As the latent states serve as pivots that connect both domains, we try to use the activations of the latent states to identify the detailed connections.

First, we find the associations between the latent states and the words. Similar to Sect. 4.2, we use state channels as the basic units here, so that we can use the activated regions which connect the latent states to the input image. In Sect. 4.2, we have observed that a channel associated with a certain word is likely to remain active until the word is produced, and its activation level drops significantly afterwards, thus preventing that word from being generated again. Hence, one way to judge whether a channel is associated with a word is to estimate the difference in its level of activation before and after the word is generated. The channel that yields the maximum difference can be considered as the one associated with the word (see Footnote 3).
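The selection rule described above can be sketched as follows; the window length and the use of mean absolute activation as the per-step activation level are our own illustrative choices.

```python
import torch

def associated_channel(states, t_word, window=2):
    """Pick the channel most associated with the word produced at step t_word.

    states: tensor of shape (T, C, H, W) holding the 2D states of one decoding
    run. For each channel, compare its activation level shortly before and
    after the word is generated; the channel with the largest drop is taken
    as the associated one.
    """
    levels = states.abs().mean(dim=(2, 3))                    # (T, C) per-step levels
    before = levels[max(0, t_word - window):t_word + 1].mean(dim=0)
    after_slice = levels[t_word + 1:t_word + 1 + window]
    after = after_slice.mean(dim=0) if len(after_slice) > 0 else torch.zeros_like(before)
    return int(torch.argmax(before - after))
```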

Fig. 5.

Sample words and their associated channels in RNN-2DS-(512, 7, 7). For each word, 5 activated regions of its associated channel on images that contain this word in the generated captions are shown. The activated regions are chosen at the steps where the words are produced. (Best viewed in high resolution)

Words and Associated Channels. For each word in the vocabulary, we could find its associated channel as described above, and study the corresponding activated regions, as shown in Fig. 5. We found that: (1) Only nouns have strong associations with the state channels, which is consistent with the fact that spatial locality is highly related to the visual entities described by nouns. (2) Some channels have multiple associated nouns. For example, Channel-066 is associated with “cat”, “dog”, and “cow”. This is not surprising – since there are more nouns in the vocabulary than the number of channels, some nouns have to share channels. Here, it is worth noting that the nouns that share a channel tend to be visually relevant. This shows that the latent channels can capture meaningful visual structures. (3) Not all channels have associated words. Some channels may capture abstract notions instead of visual elements. The study of such channels is an interesting direction for future work.

Match of Words and Associated Channels. On top of the activated regions, we can also estimate the match between a word and its associated channel. Specifically, noticing that the activated regions visually resemble the attention maps in [34], we borrow the measurement of attention correctness from [34] to estimate the match. Attention correctness computes the similarity between a human-annotated segmentation mask of a word and the activated region of its associated channel, at the step the word is produced. The computation is done by summing up the normalized activations within that mask. On MSCOCO [14], we evaluated the attention correctness on 80 nouns that have human-annotated masks. The average attention correctness is 0.316. For reference, following the same setting except for replacing the activated regions with the attention maps, AdaptiveAttention [4], a state-of-the-art captioning model, obtains 0.213.
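A sketch of this measurement, under the assumption that the activations are non-negative and already resized to the image resolution:

```python
def attention_correctness(activation_map, mask, eps=1e-8):
    """Attention correctness in the spirit of [34]: normalize the (non-negative)
    activation map to sum to one and sum the normalized activations that fall
    inside the human-annotated segmentation mask of the word.

    activation_map: tensor of shape (H, W); mask: boolean tensor of shape (H, W).
    """
    weights = activation_map / (activation_map.sum() + eps)
    return float(weights[mask].sum())
```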

Fig. 6.

This figure lists some images with generated captions before and after some word-associated channel is deactivated. The word associated with the deactivated channel is marked in color. (Color figure online)

Deactivation of Word-Associated Channels. We further verify the found associations between state channels and words via an ablation study, where we compare the generated captions with and without the involvement of a certain channel. Specifically, on images that contain the target word w in their generated captions, we re-run the decoding process, in which we deactivate the associated channel of w by clipping its value to zero at all steps, and then compare the generated captions with the previous ones. As shown in Fig. 6, deactivating a word-associated channel leads to the omission of the corresponding word in the generated captions, even though the input still contains the visual semantics for that word. This ablation study corroborates the validity of the found associations.
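The ablation itself amounts to zeroing one channel of the state after every update, e.g.:

```python
def deactivate_channel(H_t, channel):
    """Clip one state channel to zero, as used in the ablation above.

    H_t: batched 2D state of shape (N, C, H, W); applied right after each
    state update so the zeroed channel never influences word prediction.
    """
    H_t = H_t.clone()          # avoid modifying the original state in place
    H_t[:, channel] = 0.0
    return H_t
```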

5 Comparison on Captioning Performance

In this section, we compare the encoder-decoder framework with 1D states and with 2D states. Specifically, we run our studies on MSCOCO [14] and Flickr30k [15]. We first introduce the settings and then present the results.

5.1 Settings

MSCOCO [14] contains 122,585 images. We follow the splits in [35], using 112,585 images for training, 5,000 for validation, and the remaining 5,000 for testing. Flickr30k [15] contains 31,783 images in total, and we follow the splits in [35], which reserve 1,000 images each for validation and testing, and the rest for training. In both datasets, each image comes with 5 ground-truth captions. To obtain a vocabulary, we turn words to lowercase and remove those with non-alphabet characters. Then we replace words that appear less than 6 times with a special token UNK, resulting in a vocabulary of size 9,487 for MSCOCO, and 7,000 for Flickr30k. Following the common convention [35], we truncate all ground-truth captions to have at most 18 words.
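The preprocessing described above could be implemented roughly as follows; whitespace tokenization is a simplifying assumption on our part.

```python
from collections import Counter

def build_vocab(captions, min_count=6, max_len=18):
    """Lowercase the captions, drop tokens with non-alphabet characters,
    truncate to max_len words, and map words appearing fewer than min_count
    times to a special UNK token."""
    counter = Counter()
    tokenized = []
    for cap in captions:
        tokens = [w for w in cap.lower().split() if w.isalpha()][:max_len]
        tokenized.append(tokens)
        counter.update(tokens)
    vocab = {'UNK': 0}
    for word, count in counter.items():
        if count >= min_count:
            vocab[word] = len(vocab)
    encoded = [[vocab.get(w, vocab['UNK']) for w in toks] for toks in tokenized]
    return vocab, encoded
```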

All captioning methods in our experiments are based on the encoder-decoder paradigm [1]. We use ResNet-152 [8] pretrained on ImageNet [9] as the encoder in all methods. In particular, we take the output of the layer res5c as the visual feature \(\mathbf {V}\). We use the combination of the cell type and the state shape to refer to each type of decoder, e.g. LSTM-1DS-(L) refers to a standard LSTM-based decoder with latent states of size L, and GRU-2DS-(C, H, W) refers to an RNN-2DS decoder with GRU cells as in Eq. (7), whose latent states are of size \(C \times H \times W\). Moreover, all RNN-2DS models adopt a raw word embedding of size \(4 \times 15 \times 15\), except when a different size is explicitly specified. The convolution kernels \(\mathbf {K}_h\), \(\mathbf {K}_x\), and \(\mathbf {K}_v\) share the same size \(C \times C \times 3 \times 3\).

The focus of this paper is the representation of latent states. To ensure fair comparison, no additional modules, including the attention module [2], are added to the methods. Moreover, no additional training strategies, such as scheduled sampling [36], are utilized; all models are trained with the maximum likelihood objective using the ADAM optimizer [37]. During training, we first fix the CNN encoder and optimize the decoder with learning rate 0.0004 in the first 20 epochs, and then jointly optimize both the encoder and the decoder, until the performance on the validation set saturates.

For evaluation, we report results using the metrics BLEU-4 (B4) [38], METEOR (MT) [39], ROUGE (RG) [40], CIDEr (CD) [41], and SPICE (SP) [42].

5.2 Comparative Results

First, we compared RNN-2DS with LSTM-1DS. The former has 2D states with the simplest type of cells while the latter has 1D states with sophisticated LSTM cells. As the capacity of a model is closely related to the number of parameters, to ensure a fair comparison, each config of RNN-2DS is compared to an LSTM-1DS config with a similar number of parameters. In this way, the comparative results will signify the differences in the inherent expressive power of both formulations.

Fig. 7.

The results, in terms of different metrics, obtained using RNN-2DS and LSTM-1DS on the MSCOCO offline test set with similar parameter sizes. Specifically, RNN-2DS models of sizes 10.57M, 13.48M and 21.95M are compared to LSTM-1DS models of sizes 10.65M, 13.52M and 22.14M. (Color figure online)

The resulting curves in terms of different metrics are shown in Fig. 7, in which we can see that RNN-2DS outperforms LSTM-1DS consistently, across different parameter sizes and under different metrics. These results show that RNN-2DS, with the states that preserve spatial locality, can capture both visual and linguistic information more efficiently.

Table 1. The results obtained using different decoders on the offline and online test sets of MSCOCO, and on the test set of Flickr30k, where METEOR (MT) [39] is omitted due to space limitation, and no SPICE (SP) [42] is reported on the online test set of MSCOCO.

We also compared different types of decoders with similar numbers of parameters, namely RNN-1DS, GRU-1DS, LSTM-1DS, RNN-2DS, GRU-2DS, and LSTM-2DS. Table 1 shows the results of these decoders on both datasets, from which we observe: (1) RNN-2DS outperforms RNN-1DS, GRU-1DS, and LSTM-1DS, indicating that embedding latent states in 2D forms is more effective. (2) GRU-2DS, which is also based on the proposed formulation but adds several gate functions, surpasses other decoders and yields the best result. This suggests that the techniques developed for conventional RNNs including gate functions and attention modules [2] are very likely to benefit RNNs with 2D states as well.

Fig. 8.

This figure shows some qualitative samples of captions generated by different decoders, where the highlighted words are inconsistent with the image. (Color figure online)

Table 2. The results obtained on the MSCOCO offline test set using RNN-2DS with different choices of pooling functions, activation functions, word embeddings, kernels and latent states. Except for the first row, each row only lists the choice that is different from the first row. “-” means the same.

Figure 8 includes some qualitative samples, in which we can see that the captions generated by LSTM-1DS rely heavily on language priors, and sometimes contain phrases that are not consistent with the visual content but appear frequently in the training captions. On the contrary, the sentences from RNN-2DS and GRU-2DS are more relevant to the visual content.

5.3 Ablation Study

Table 2 compares the performances obtained with different design choices in RNN-2DS, including pooling methods, activation functions, and sizes of word embeddings, kernels and latent states. The results show that mean pooling outperforms max pooling by a significant margin, indicating that information from all locations matters. The table also shows the best combination of modeling choices for RNN-2DS: mean pooling, ReLU, word embeddings of size \(4\times 15\times 15\), a kernel of size \(3\times 3\), and latent states of size \(256\times 7\times 7\).

6 Conclusions and Future Work

In this paper, we studied the impact of representing latent states as 2D multi-channel feature maps in the context of image captioning. Compared to the standard practice that represents latent states as 1D vectors, 2D states consistently achieve higher captioning performance across different settings. Such representations also preserve the spatial locality of the latent states, which helps reveal the internal dynamics of the decoding process, and interpret the connections between the visual and linguistic domains. We plan to combine the 2D-state decoder with other modules commonly used in the captioning community, including the attention module [2], for further exploration.