
1 Introduction

Recently, various video-language tasks have drawn much interest in computer vision research [1,2,3], including video captioning [4,5,6,7,8,9], video question answering (QA) [10, 11], and video retrieval for a natural language query [8, 12, 13]. To solve such challenging tasks, it is important to learn a hidden joint representation between word and frame sequences, for correctly measuring their semantic similarity. Video classification [14,15,16,17,18] can be a candidate solution, but tagging only a few labels to a video may be insufficient to fully relate multiple latent events in the video to a language description. Thanks to recent advances in deep representation learning, many methods for multimodal semantic embedding (e.g. [19,20,21]) have been proposed. However, most existing methods embed each of the visual and language inputs into a single vector, which is often insufficient, especially for a video and a natural sentence. With single vectors for the two sequence modalities, it is hard to directly compare multiple relations between subsets of sequence data (i.e. matchings between subevents in a video and short phrases in a sentence), for which hierarchical matching is more adequate. There have been some attempts to learn representations of the hierarchical structure of natural sentences and visual scenes (e.g. [22, 23] using recursive neural networks), but they require groundtruth parse trees or segmentation labels (Fig. 1).

Fig. 1.

The intuition of the Joint Sequence Fusion (JSFusion) model. Given a pair of a video clip and a language query, Joint Semantic Tensor (in purple) encodes a pairwise joint embedding between the two sequence data, and Convolutional Hierarchical Decoder (in blue) discovers hierarchical matching relations from JST. Our model is easily adaptable to many video QA and retrieval tasks. (Color figure online)

In this paper, we propose an approach that can measure semantic similarity between any pair of multimodal sequence data, by learning bottom-up recursive matches via attention mechanisms. We apply our method to tackle several video question answering and retrieval tasks. Our approach, named the Joint Sequence Fusion (JSFusion) model, consists of two key components. First, the Joint Semantic Tensor (JST) performs dense Hadamard products between frames and words and encodes all pairwise embeddings between the two sequence data into a 3D tensor. JST further takes advantage of learned attentions to refine the 3D matching tensor. Second, the Convolutional Hierarchical Decoder (CHD) discovers local alignments on the tensor, by using a series of attention-based decoding modules, consisting of convolutional layers and gates. These two attention mechanisms promote well-matched representation patterns and prune out misaligned ones in a bottom-up manner. Finally, CHD obtains hierarchical composable representations of the two modalities, and computes a semantic matching score of the sequence pair.

We evaluate the performance of our JSFusion model on multiple video question answering and retrieval tasks on the LSMDC [1] and MSR-VTT [2] datasets. First, we participate in three challenges of LSMDC: multiple-choice test, movie retrieval, and fill-in-the-blank, which require the model to correctly measure a semantic matching score between a descriptive sentence and a video clip, or to predict the most suitable word for a blank in a sentence for a query video. Our JSFusion model achieves the best accuracies reported so far with significant margins for the LSMDC tasks. Second, we newly create multiple-choice and retrieval annotations for the MSR-VTT dataset, on which our approach also outperforms many state-of-the-art methods in diverse video topics (e.g. TV shows, web videos, and cartoons).

We summarize the contributions of this work as follows.

  1.

    We propose the Joint Sequence Fusion (JSFusion) model, consisting of two key components: JST and CHD. To the best of our knowledge, it is the first attempt to leverage recursively learnable attention modules for measuring semantic matching scores between multimodal sequence data. Specifically, we propose two different attention models: soft attention in JST, and Conv layers with Conv-gates in CHD.

  2.

    To validate the applicability of our JSFusion model, especially on video question answering and retrieval, we participate in three tasks of LSMDC [1], and achieve the best performance reported so far. We newly create video retrieval and QA benchmarks based on the MSR-VTT [2] dataset, on which our JSFusion outperforms many state-of-the-art VQA models. Our source code and benchmark annotations are publicly available on our project page.

2 Related Work

Our work can be uniquely positioned in the context of two recent research directions: video retrieval and video question answering.

Video Retrieval with Natural Language Sentences. Visual information retrieval with natural language queries has long been tackled via joint visual-language embedding models [12, 19, 24,25,26,27,28]. In the video domain, it is more difficult to learn latent relations between a sequence of frames and a sequence of descriptive words, given that a video is more than a simple collection of images. Recently, there has been much progress in this line of research. Several deep video-language embedding methods [8, 12, 13] have been developed by extending image-language embeddings [20, 21]. Other recent successful methods benefit from incorporating concept words as semantic priors [9, 29], or from strong video representations like RNN-FV [30]. Another dominant approach is to leverage RNNs or their variants such as LSTMs to encode the whole multimodal sequences (e.g. [9, 12, 29, 30]).

Compared to these existing methods, our model first finds dense pairwise embeddings between the two sequences, and then composes higher-level similarity matches from fine-grained ones in a bottom-up manner, leveraging hierarchical attention mechanisms. This idea improves our model’s robustness especially for local subset matching (e.g. at the activity-phrase level), which places our work in a unique position with respect to previous works.

Video Question Answering. VQA is a relatively new problem at the intersection of computer vision and natural language research [31,32,33]. Video-based VQA is often recognized as a more difficult challenge than its image-based counterpart, because video VQA models must learn spatio-temporal reasoning to answer questions, which requires large-scale annotated data. Fortunately, large-scale video QA datasets have recently emerged from the community using crowdsourcing on various sources of data (e.g. movies for MovieQA [10] and animated GIFs for TGIF-QA [11]). Rohrbach et al. [1] extend the LSMDC movie description dataset to the VQA domain, introducing several new tasks such as multiple-choice [12] and fill-in-the-blank [34].

The multiple-choice problem is, given a video query and five descriptive sentences, to choose the single best answer among the candidates. To tackle this problem, ranking losses on deep representations [9, 11, 12] or nearest neighbor search on the joint space [30] are exploited. Torabi et al. [12] use temporal attention on the joint representation between the query videos and answer choice sentences. Yu et al. [9] use LSTMs to sequentially feed the query and the answer embedding conditioned on detected concept words. The fill-in-the-blank task is, given a video and a sentence with a single blank, to select a suitable word for the blank. To encode the query sentence conditioned on the video context, MergingLSTMs [35] and LR/RL LSTMs [36] are proposed. Yu et al. [9, 29] attempt to detect semantic concept words from videos and integrate them with a Bidirectional LSTM that encodes the language query. However, most previous approaches tend to focus too much on the sentence information and easily ignore visual cues. On the other hand, our model focuses on learning multi-level semantic similarity between videos and sentences, and consequently achieves the best results reported so far in these two QA tasks, as will be presented in Sect. 4.

3 The Joint Sequence Fusion Model

We first explain the preprocessing steps for describing videos and sentences in Sect. 3.1, and then discuss the two key components of our JSFusion model in Sects. 3.2–3.4, respectively. We present the training procedure of our model in Sect. 3.5, and its applications to three video-language tasks in Sect. 3.6.

Fig. 2.

The architecture of the Joint Sequence Fusion (JSFusion) model. Arrows indicate the information flows for multimodal similarity matching tasks, while the green paths are used for the fill-in-the-blank task. (a) JST composes pairwise joint representations of language and video sequences into a 3D tensor, using a soft-attention mechanism. (b) CHD learns hierarchical relation patterns between the sequences, using a series of convolutional decoding modules which share parameters at each stage. \(\odot \) is Hadamard product, \(\oplus \) is addition, and \(\otimes \) is multiplication between representations and attentions described in Eqs. (2)–(4). We omit some fully-connected layers for visualization purposes. (Color figure online)

3.1 Preprocessing

Sentence Representation. We encode each sentence at the word level. We first define a vocabulary dictionary \(\mathcal V\) by collecting the words that occur more than three times in the dataset (e.g. the dictionary size is \(|\mathcal V| = 16,824\) for LSMDC). We ignore the words that are not in the dictionary. Next we use the pretrained glove.42B.300d [37] to obtain the word embedding matrix \(\mathbf E \in \mathcal R^{d \times |\mathcal V|}\) where \(d=300\) is the word embedding dimension. We denote the description of each sentence by \(\{\mathbf w_m \}_{m=1}^{M}\) where M is the number of words in the sentence. We limit the maximum number of words per sentence to \(M_{max}=40\). If a sentence is too long, we discard the excess words, because only 0.07\(\%\) of training sentences exceed this limit, and no performance gain is observed for larger \(M_{max}\). Throughout this paper, we use m for denoting the word index.
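
For reference, below is a minimal sketch of the vocabulary construction and GloVe lookup described above; the file path, function names, and handling of out-of-vocabulary words are our own assumptions rather than details from the paper.

```python
# Minimal sketch of the vocabulary thresholding and GloVe embedding lookup
# described above. File paths and helper names are placeholders.
from collections import Counter
import numpy as np

def build_vocab(sentences, min_count=3):
    """Keep words occurring more than `min_count` times, as in the paper."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    kept = [w for w, c in counts.items() if c > min_count]
    return {w: i for i, w in enumerate(kept)}

def load_glove_matrix(vocab, glove_path="glove.42B.300d.txt", dim=300):
    """Build E in R^{d x |V|}; words missing from GloVe stay zero-initialized."""
    E = np.zeros((dim, len(vocab)), dtype=np.float32)
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vec = " ".join(parts[:-dim]), parts[-dim:]
            if word in vocab:
                E[:, vocab[word]] = np.asarray(vec, dtype=np.float32)
    return E

def encode_sentence(sentence, vocab, E, max_words=40):
    """Map a sentence to word vectors {w_m}, dropping OOV words, truncated at M_max."""
    ids = [vocab[w] for w in sentence.lower().split() if w in vocab][:max_words]
    return E[:, ids].T  # shape (M, 300)
```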

Video Representation. We sample a video at five fps, to reduce the frame redundancy while minimizing information loss. We employ CNNs to encode both visual and audio information in videos. For visual description, we extract the feature map of each frame from the pool5 layer (\(\mathbb R^{2,048}\)) of ResNet-152 [38] pretrained on ImageNet. For audio information, we extract the feature map using VGGish [39] followed by PCA for dimensionality reduction (\(\mathbb R^{128}\)). We then concatenate both features as the video descriptor \(\{\mathbf v_n \}_{n=1}^N \in \mathbb R^{2,176\times N}\) where N is the number of frames in the video. We limit the maximum number of frames to \(N_{max}=40\). If a video is too long, we select \(N_{max}\) equidistant frames. We observe no performance gain for larger \(N_{max}\). We use n for denoting the video frame index.
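
A small sketch of the frame subsampling and feature concatenation above is given below, assuming the frames have already been sampled at five fps; `resnet_pool5` and `vggish_pca` are placeholders standing in for the pretrained ResNet-152 and VGGish + PCA extractors, and the per-frame alignment of audio features is our simplification.

```python
# Sketch of equidistant frame selection and visual/audio feature concatenation.
# `resnet_pool5` and `vggish_pca` are placeholder callables for the pretrained
# extractors; they are assumptions, not code released with the paper.
import numpy as np

def select_equidistant(frames, n_max=40):
    """Pick N_max equidistant frames when a (5 fps sampled) video is too long."""
    if len(frames) <= n_max:
        return frames
    idx = np.linspace(0, len(frames) - 1, n_max).round().astype(int)
    return [frames[i] for i in idx]

def video_descriptor(frames, resnet_pool5, vggish_pca, n_max=40):
    """Return {v_n}: per-step 2048-d visual + 128-d audio features."""
    frames = select_equidistant(frames, n_max)
    visual = np.stack([resnet_pool5(f) for f in frames])  # (N, 2048)
    audio = np.stack([vggish_pca(f) for f in frames])     # (N, 128)
    return np.concatenate([visual, audio], axis=1)        # (N, 2176)
```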

3.2 The Joint Semantic Tensor

The Joint Semantic Tensor (JST) first composes pairwise representations between two multimodal sequences into a 3D tensor. Next, JST applies a self-gating mechanism to the 3D tensor to refine it into an attention map that discovers fine-grained matching between all pairwise embeddings of the two sequences while pruning out unmatched joint representations.

Sequence Encoders. Given a pair of multimodal sequences, we first represent them using encoders. We use a bidirectional LSTM (BLSTM) encoder [40, 41] for the word sequence and a CNN encoder for the video frames. It is often advantageous to consider both future and past contexts to represent each element in a sequence, which motivates the use of BLSTM encoders. \(\{ \mathbf h^f_t \}_{t=1}^T\) and \(\{ \mathbf h^b_t \}_{t=1}^T\) are the forward and backward hidden states of the BLSTM, respectively:

$$\begin{aligned} \mathbf h^f_t = \text{ LSTM } (\mathbf x_t , \mathbf h^f_{t-1} ), \mathbf h^b_t = \text{ LSTM } (\mathbf x_t , \mathbf h^b_{t+1} ), \end{aligned}$$
(1)

where we set \(\mathbf h^b_t, \mathbf h^f_t \in \mathbb R^{512}\), initializing them as zeros: \(\mathbf {h}^b_{T+1} = \mathbf {h}^f_0 = \mathbf 0\). Finally, we obtain the representation of each modality at each step by concatenating the forward/backward hidden states and the input features: \(\mathbf x_{w,t} = [\mathbf h^f_{w,t}, \mathbf h^b_{w,t}, \mathbf w_t]\) for words. For the visual domain, we instead use a 1-d CNN encoder representation \(\mathbf h^{cnn}_{v,t} \in \mathbb R^{2,048}\) for \(\mathbf v_t\), so that \(\mathbf x_{v,t} = [\mathbf h^{cnn}_{v,t}, \mathbf v_t]\).
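
The following PyTorch sketch mirrors these encoders under stated assumptions: the hidden sizes come from the text, but the 1-d CNN configuration (kernel size, padding, activation) and all module names are placeholders of ours.

```python
# Minimal PyTorch sketch of the sequence encoders around Eq. (1). Only the
# dimensions quoted in the text are from the paper; the 1-d CNN settings and
# names are assumptions.
import torch
import torch.nn as nn

class WordEncoder(nn.Module):
    def __init__(self, in_dim=300, hidden=512):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, w):                 # w: (B, M, 300)
        h, _ = self.blstm(w)              # (B, M, 1024) = [h^f ; h^b]
        return torch.cat([h, w], dim=-1)  # x_w = [h^f, h^b, w]

class FrameEncoder(nn.Module):
    def __init__(self, in_dim=2176, hidden=2048):
        super().__init__()
        # 1-d temporal convolution stands in for the "1-d CNN encoder" h^cnn.
        self.conv = nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1)

    def forward(self, v):                 # v: (B, N, 2176)
        h = torch.relu(self.conv(v.transpose(1, 2))).transpose(1, 2)  # (B, N, 2048)
        return torch.cat([h, v], dim=-1)  # x_v = [h^cnn, v]
```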

Table 1. The detailed setting of layers in the JSFusion model. No padding is used for any layer. Dk denotes a fully-connected dense layer, and Convk and ConvGk indicate convolutional and convolutional-gating layers, respectively.

Attention-Based Joint Embedding. We then feed the output of the sequence encoder into a fully-connected (dense) layer [D1] for each modality separately, which results in \( D1^v ( \mathbf x_v ), D1^w ( \mathbf x_w ) \in \mathbb R^{d_{D1}}\), where \(d_{D1}\) is the hidden dimension of [D1]. We summarize the details of all the layers in our JSFusion model in Table 1. Throughout the paper, we denote fully-connected layers as Dk and convolutional layers as Convk.

Next, we compute attention weights \(\varvec{\alpha }\) and representation \(\varvec{\gamma }\), from which we obtain the JST as a joint embedding between every pair of sequential features:

$$\begin{aligned} \mathbf j_{nm} = \alpha _{nm} \varvec{\gamma }_{nm}, \ \text{ where }\, \alpha _{nm}&= \sigma (\mathbf w^T D2( \mathbf t_{nm} )) , \; \varvec{\gamma }_{nm} = D4 ( D3 (\mathbf t_{nm}) ), \end{aligned}$$
(2)
$$\begin{aligned} \mathbf t_{nm}&= D1^v ( \mathbf x_{v,n} ) \odot {D1}^w ( \mathbf x_{w,m}). \end{aligned}$$
(3)

\(\odot \) is a Hadamard product, \(\sigma \) is a sigmoid function, and \(\mathbf w \in \mathbb R^{d_{D2}}\) is a learnable parameter. Since the output of the sequence encoder represents each frame conditioned on the neighboring frames (or each word conditioned on the whole sentence), the attention \(\alpha \) is expected to figure out which pairs should be weighted more in the joint embedding among all possible pairs. For the example in Fig. 2, we expect \(\alpha _{3,6}(v_3, w_6) > \alpha _{8,6} (v_8, w_6)\) if \(w_6\) is truck and the third video frame contains the truck while the eighth frame does not.

From Eqs. (2)–(3), we obtain JST in a form of 3D tensor: \(\mathbf J = [\mathbf j_{n,m}]^{n=1:N_{max}}_{m=1:M_{max}} \) and \(\mathbf J \in \mathbb R^{N_{max} \times M_{max} \times d_{D4}}\).
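
To make Eqs. (2)–(3) concrete, here is a sketch of the JST computation; the hidden sizes default to placeholder values (the actual ones are in Table 1), and the tanh activations inside [D2]–[D4] follow the paper's dense-layer convention but are otherwise our assumption.

```python
# Sketch of the Joint Semantic Tensor (Eqs. (2)-(3)): pairwise Hadamard fusion
# followed by the self-gating attention. Hidden sizes are placeholders.
import torch
import torch.nn as nn

class JointSemanticTensor(nn.Module):
    def __init__(self, dim_v, dim_w, d1=512, d2=512, d3=512, d4=256):
        super().__init__()
        self.d1_v = nn.Linear(dim_v, d1)   # [D1]^v
        self.d1_w = nn.Linear(dim_w, d1)   # [D1]^w
        self.d2 = nn.Linear(d1, d2)        # [D2]
        self.att = nn.Linear(d2, 1)        # w^T in Eq. (2)
        self.d3 = nn.Linear(d1, d3)        # [D3]
        self.d4 = nn.Linear(d3, d4)        # [D4]

    def forward(self, x_v, x_w):           # x_v: (B, N, dim_v), x_w: (B, M, dim_w)
        a = self.d1_v(x_v).unsqueeze(2)    # (B, N, 1, d1)
        b = self.d1_w(x_w).unsqueeze(1)    # (B, 1, M, d1)
        t = a * b                          # Eq. (3): Hadamard product, (B, N, M, d1)
        alpha = torch.sigmoid(self.att(torch.tanh(self.d2(t))))  # (B, N, M, 1)
        gamma = self.d4(torch.tanh(self.d3(t)))                  # (B, N, M, d4)
        return alpha * gamma               # J: (B, N, M, d4)
```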

Fig. 3.

Attention examples for (a) Joint Semantic Tensor (JST) and (b) Convolutional Hierarchical Decoder (CHD). Higher values are shown in darker. (a) JST assigns high weights to positively aligned joint semantics in the two sequence data. Attentions are highlighted darker where words coincide well with frames. (b) Each layer in CHD assigns high weights to where structure patterns are well matched between the two sequence data. For a wrong pair of sequences, the series of Conv-gating layers (ConvG2) prunes out misaligned patterns with low weights.

3.3 The Convolutional Hierarchical Decoder

The Convolutional Hierarchical Decoder (CHD) computes a compatibility score for a pair of multimodal sequences by exploiting the compositionality in the joint vector space of JST. We pass the JST tensor through a series of convolutional (Conv) layer and Conv-gating blocks, whose learnable kernels progressively find matched embeddings from those of each previous layer. That is, starting from the JST tensor, the CHD recursively activates the weights of positively aligned pairs more strongly than those of negatively aligned ones.

Specifically, we apply three sets of Conv layer and Conv-gating to the JST:

$$\begin{aligned} \mathbf J^{(k)} = Convk(\mathbf J^{(k-1)})\cdot \sigma (ConvGk(\mathbf J^{(k-1)})) \end{aligned}$$
(4)

for \(k=1,2,3\). We initialize \(\mathbf J^{(0)} = \mathbf J\) from the JST; [Convk] is the k-th Conv layer for joint representation, and [ConvGk] is the k-th Conv-gating layer for matching filters, whose details are summarized in Table 1.

We apply mean pooling to \(\mathbf J^{(3)}\) to obtain a single video-sentence vector representation \(\mathbf J_{out}\) (e.g.  \(\mathbb R^{17 \times 17 \times 256} \rightarrow \mathbb R^{1 \times 1 \times 256} \)). Finally, we compute the similarity matching score by feeding \(\mathbf J_{out}\) into four dense layers [D5, D6, D7, D8]:

$$\begin{aligned} \text{ score } =&\mathbf W_{D8}(D7(D6(D5(\mathbf J_{out})))) + \mathbf b_{D8} \nonumber \\ \text{ where }&Dk (\mathbf x) = \text{ tanh } (\mathbf W_{Dk} \mathbf x + \mathbf b_{Dk} ), \ \ k= 5,6,7. \end{aligned}$$
(5)

We use the tanh activation for all dense layers except [D8].
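
A compact sketch of Eqs. (4)–(5) is shown below. The kernel sizes, strides, and channel widths are placeholders rather than the settings of Table 1, and we assume the JST channel dimension matches the first Conv layer's input channels.

```python
# Sketch of the Convolutional Hierarchical Decoder (Eqs. (4)-(5)): three
# Conv / Conv-gating stages, mean pooling, then the dense layers [D5]-[D8].
# Kernel sizes, strides and widths are placeholders, not the Table 1 values.
import torch
import torch.nn as nn

class ConvHierarchicalDecoder(nn.Module):
    def __init__(self, in_ch=256, ch=256, d_hidden=512, out_dim=1):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_ch if k == 0 else ch, ch, 3, stride=2) for k in range(3)])
        self.gates = nn.ModuleList(
            [nn.Conv2d(in_ch if k == 0 else ch, ch, 3, stride=2) for k in range(3)])
        self.mlp = nn.Sequential(                 # [D5]-[D7] with tanh, [D8] linear
            nn.Linear(ch, d_hidden), nn.Tanh(),
            nn.Linear(d_hidden, d_hidden), nn.Tanh(),
            nn.Linear(d_hidden, d_hidden), nn.Tanh(),
            nn.Linear(d_hidden, out_dim))         # out_dim = 1 (score) or |V| (blank)

    def forward(self, J):                         # J: (B, N, M, d) from the JST
        x = J.permute(0, 3, 1, 2)                 # to (B, d, N, M) for Conv2d
        for conv, gate in zip(self.convs, self.gates):
            x = conv(x) * torch.sigmoid(gate(x))  # Eq. (4), k = 1, 2, 3
        j_out = x.mean(dim=(2, 3))                # mean pooling -> J_out: (B, ch)
        return self.mlp(j_out)                    # Eq. (5): similarity score / logits
```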

3.4 An Illustrative Example of How the JSFusion Model Works

Figure 3 illustrates with actual examples how the attentions of JST and CHD work.

Figure 3(a) visualizes the learned attention weights \(\alpha _{nm}\) in Eq. (2) of all pairs between frames in a video and words in a positive and a negative sentence. The attentions are highlighted with higher values (shown in darker) when the words coincide better with the content in the frames, dominantly in a positive pair.

Figure 3(b) shows the output \(\mathbf J^{(k)}\) of each Conv layer and Conv-gating block in Eq. (4) for the same example. During training, each Conv layer learns to compose joint embeddings from the ones in the lower layer, while the Conv-gating layer learns frequent matching patterns in the training pairs of videos and sentences. At test time, when it comes to computing a similarity score, the Conv-gating layers prune out misaligned patterns; if the pair is negative and there is no common aligned structure in the two sequences, as shown on the right of Fig. 3(b), most elements of \(\mathbf J^{(k)}\) have very low values. As a result, the CHD can selectively filter lower-layer information that needs to be propagated to the final-layer representation, and the final layer of CHD assigns a high score only if the jointly aligned patterns are significant between the sequence data.

The motivation behind the JSFusion model is that long sequence data like videos and sentences are too complicated to compare in a single vector space, although most previous approaches depend on a single LSTM embedding, such as neural visual semantic embedding [19] and previous LSMDC winners [9, 30]. Instead, in our approach, JST first composes a dense pairwise 3D tensor representation between multimodal sequence data, from which CHD then exploits convolutional gated layers to learn multi-stage similarity matching. Therefore, our JSFusion model can be more robust for detecting partial matches between short phrases and subshots.

3.5 Training

We train our JSFusion model using the ranking loss. Each training batch consists of L video-sentence pairs, including a single positive pair and \(L-1\) randomly sampled negative pairs. We use batch shuffling in every training epoch. Finally, we train the model using a max-margin structured loss objective as follows:

$$\begin{aligned} \mathcal {L} =&\sum _k \sum _{l=1}^{L} \max (0, S_{k,l} - S_{k,l^*} + \varDelta ) + \lambda ||\mathbf \theta ||^2 \end{aligned}$$
(6)

where \(l^*\) is the index of the positive (answer) pair among the L candidates, \(\lambda \) is a hyperparameter and \(\theta \) denotes the weight parameters. This objective encourages a positive video-sentence pair to have a higher score than a misaligned negative pair by a margin \(\varDelta \). We use \(\lambda = 0.0005, \varDelta = 10\) in our experiments. We train all of our models using the Adam optimizer [42], with an initial learning rate on the order of \(10^{-4}\). For regularization, we apply batch normalization [43] to every dense layer.
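
For concreteness, a sketch of the per-query term of Eq. (6) is given below; variable names are placeholders, the \(l=l^*\) term is excluded from the sum (it would otherwise contribute a constant \(\varDelta\)), and the weight-decay term is assumed to be handled by the optimizer.

```python
# Sketch of the max-margin ranking objective in Eq. (6) for one batch of L
# pairs: a single positive at index `pos` and L-1 sampled negatives.
import torch

def ranking_loss(scores, pos, margin=10.0):
    """scores: (L,) similarity scores S_{k,l}; pos: index l* of the positive pair."""
    hinge = torch.clamp(scores - scores[pos] + margin, min=0.0)
    mask = torch.ones_like(hinge)
    mask[pos] = 0.0   # exclude the l = l* term from the sum
    return (hinge * mask).sum()
```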

3.6 Implementation of Video-Language Models

We below discuss how the JSFusion model is implemented for three video-language tasks: video retrieval, multiple-choice test, and fill-in-the-blank. We apply the same JSFusion model to both video retrieval and multiple-choice test with slightly different hyperparameter settings. For the fill-in-the-blank task, we make a minor modification to our model to predict a word for a blank in the middle of the sentence.

For Retrieval. The retrieval model takes a query sentence and ranks 1,000 test videos according to the relevance between the query and the videos. For training, we set \(L=10\) as the size of each training batch. At test time, for each query sentence k, we compute scores \(\{S_{k,l}\}_{l}\) for all videos l in the test set. From the score matrix, we can rank the videos for the query. As will be presented in Sects. 4.3 and 4.4, our method successfully finds hierarchical matching patterns between complex natural language queries and video frames with sounds.
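
A tiny sketch of how the score matrix is turned into a per-query ranking follows; `model.score` is a hypothetical wrapper around the JSFusion forward pass returning a scalar similarity.

```python
# Sketch of test-time ranking for retrieval: score one query against all
# candidate videos and sort. `model.score` is a placeholder wrapper.
import numpy as np

def rank_videos(model, query, videos):
    """Return candidate indices sorted by descending similarity to the query."""
    scores = np.array([float(model.score(query, v)) for v in videos])
    return np.argsort(-scores)   # best-matching video first
```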

For Multiple-Choice Test. The multiple-choice model takes a video and five choice sentences, among which only one is the correct answer. Since our model can calculate the compatibility score between the query video and each sentence choice, we use the same model as for the retrieval task, and simply select the choice with the highest score as the answer. For training, we set \(L=10\) so that each training batch contains 10 pairs of videos and sentences, which include the single correct sentence, the four wrong choices, and five randomly selected sentences from other training data.

For Fill-in-the-Blank. The fill-in-the-blank model takes a video and a sentence with one blank, and predicts a correct word for the blank. Since this task requires more difficult inference (i.e. selecting a word out of the vocabulary \(\mathcal V\), instead of computing a similarity score), we make two modifications as follows. First, we use larger dimensions for layers: \(d_{D1}= d_{D5} = d_{D6} = d_{D7} =1,024\), \(d_{D2} = d_{D3} = d_{D4} = 2,048\), \(\; d_{D8} = |\mathcal V|\), \(\; d_{Conv1\_1} = d_{Conv2\_1} = d_{Conv3\_1} = 1,024\), instead of the numbers in Table 1.

Second, we add a skip-connection to our model, which is illustrated as the green paths of Fig. 2. Letting b be the blank position in the query sentence, we use the BLSTM output at the blank word token BLANK as a sentential context of the blank position: \(\mathbf t_{b} = \text{ D1 }^w ( \mathbf w_{b})\). We sum the output of [D7] \(\in \mathbb R^{1,024}\) and the sentential context \(\mathbf t_{b} \in \mathbb R^{1,024}\), and then feed the result into [D8] to predict a word.

For training, we set the batch size as \(L=32\). We use a different objective, the cross-entropy loss, because this task is classification rather than ranking:

$$\begin{aligned} \mathcal {L} = - \log p(\mathbf y) + \lambda ||\mathbf \theta ||^2 \end{aligned}$$
(7)

where \(\theta \) denotes weight parameters and \(\lambda = 0.0005\). We use dropout with a rate of 0.2.
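
The blank-prediction head described above can be sketched as follows; the layer sizes come from Sect. 3.6, while the module name and the training snippet are placeholders of ours.

```python
# Sketch of the fill-in-the-blank head: the [D7] output is summed with the
# sentential context t_b at the blank position (the green skip-connection),
# and [D8] produces logits over the vocabulary, trained with Eq. (7).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlankHead(nn.Module):
    def __init__(self, d7_dim=1024, vocab_size=16824):
        super().__init__()
        self.d8 = nn.Linear(d7_dim, vocab_size)   # [D8]

    def forward(self, d7_out, t_b):               # both (B, 1024)
        return self.d8(d7_out + t_b)              # logits over the vocabulary

# one training step (cross-entropy instead of the ranking loss):
# loss = F.cross_entropy(head(d7_out, t_b), blank_word_ids)
```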

4 Experiments

We report the experimental results of JSFusion models for the three tasks of LSMDC [1] and two tasks of MSR-VTT [2].

Table 2. Performance comparison for the movie retrieval task using Recall@k (R@k, higher is better) and Median Rank (MedR, lower is better). We report the results on the two datasets of LSMDC [1] (L) and MSR-VTT [2] (M).
Table 3. Left: Performance comparison for the multiple-choice test using the accuracy in percentage. We report the results on the two datasets of LSMDC (L) and MSR-VTT (M). Right: Accuracy comparison (in percentage) for the movie fill-in-the-blank task.

4.1 LSMDC Dataset and Tasks

The LSMDC 2017 challenge consists of four video-language tasks for movie understanding and captioning, among which we focus on three tasks in our experiments: movie retrieval, multiple-choice test, and fill-in-the-blank. The challenge provides a subset of the LSMDC dataset, which contains a parallel corpus of 118,114 sentences and 118,081 video clips of about 4–5 s long sampled from 202 movies. We strictly follow the evaluation protocols of the challenge. We defer more details of the dataset and challenge rules to [1] and the challenge homepage.

Multiple-Choice Test. Given a video query and five candidate captions, the goal is to find the correct one for the query out of five possible choices. The correct answer is the groundtruth (GT) caption and four other distractors are randomly chosen from other captions that have different activity-phrase labels from the correct answer. The evaluation metric is the percentage of correctly answered test questions out of 10,053 public-test data.

Movie Retrieval. The test set consists of 1,000 video/activity phrase pairs sampled from the LSMDC17 public-test data. Then, the objective is, given a short query activity-phrase (e.g.  answering phone), to find its corresponding video out of 1,000 test videos. The evaluation metrics include Recall@1, Recall@5, Recall@10, and Median Rank (MedR). The Recall@k means the percentage of GT videos in the first k retrieved videos, and the MedR indicates the median rank of GT videos. The challenge winner is determined by the metric of Recall@10.
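
The two retrieval metrics can be computed directly from the 1-based rank of each ground-truth video, as in the following sketch (names are ours).

```python
# Sketch of the retrieval metrics: Recall@k and Median Rank (MedR), computed
# from the 1-based rank of each ground-truth video among the candidates.
import numpy as np

def recall_at_k(gt_ranks, k):
    """Fraction of queries whose GT video appears in the top-k retrieved results."""
    return float((np.asarray(gt_ranks) <= k).mean())

def median_rank(gt_ranks):
    """Median rank of the GT videos (lower is better)."""
    return float(np.median(gt_ranks))
```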

Movie Fill-in-the-Blank. This track is related to visual question answering. The task is, given a video clip and a sentence with a blank in it, to predict a single correct word for the blank. The test set includes 30,000 examples from 10,000 clips (i.e. about 3 blanks per sentence). The evaluation metric is the prediction accuracy (i.e. the percentage of predicted words that match with GTs).

4.2 MSR-VTT-(RET/MC) Dataset and Tasks

The MSR-VTT [2] is a large-scale video description dataset. It collects 118 videos for each of 257 popular queries, and manually filters them to 7,180 videos. From these videos, it selects 10K video clips, totaling 41.2 h, with 200K clip-sentence pairs.

Based on the MSR-VTT dataset, we newly create two video-text matching tasks: (i) multiple-choice test and (ii) video retrieval. The task objectives for these tasks are identical to those of corresponding tasks in the LSMDC benchmark. To collect annotations for the two tasks, we exactly follow the protocols that are used in the LSMDC dataset, as described in [12].

Multiple-Choice Test: We generate 2,990 questions in total for the multiple-choice test, using all the test video clips of MSR-VTT. For each test video, we use the associated GT caption as the correct answer, while the four negative choices are randomly sampled from the descriptions of other test data.

Video Retrieval: For retrieval, we first sample 1,000 pairs of video clips and description queries from the test set of MSR-VTT. We use 1,000 as the size of the test set, following the LSMDC benchmark. As a result, the retrieval task is to find the video that corresponds to the query caption out of 1,000 candidates.

4.3 Quantitative Results

Tables 2 and 3 summarize the results of our experiments for the three video-language tasks. For the LSMDC experiments, we report the results in the published papers and the official leaderboard of LSMDC 2017. For the MSR-VTT experiments, we run several participants of LSMDC, including SNUVL, EITanque, VSE-LSTM, ST-VQA-Sp.Tp and CT-SAN, using the source codes provided by the original authors. We implement the other baselines ourselves, except for Miech et al., which requires an additional person tracker that is unavailable to us. Other variants of our method are discussed in detail below in the ablation study.

Tables 2 and 3 clearly show that our JSFusion achieves the best performance with significant margins over all the baselines on the three tasks of both datasets. That is, the two components of our approach, JST and CHD, indeed help to better measure the semantic similarity between multimodal sequences than a wide range of state-of-the-art models, such as a multimodal embedding method (VSE-LSTM), a spatio-temporal attention-based QA model (ST-VQA-Sp.Tp), and a language-model-based QA inference (Text-only BLSTM). Encouragingly, the JSFusion single model outperforms even the ensemble method of the runner-up (LR/RL LSTMs) in the fill-in-the-blank task.

Among the baselines, the multimodal low-rank bilinear attention network (MLB) [45] is competitive. The main differences of our model from (MLB) are two-fold. First, JSFusion embeds both a video and a sentence as feature sequences, whereas (MLB) represents the sentence as a single feature. Second, JSFusion uses the self-gating to generate fine-grained matching between all pairwise embeddings of the two sequences, while (MLB) uses attention to find a position in the visual feature space that best fits the sentence vector. Moreover, JSFusion consistently shows better performance than (MLB) in all experiments.

Ablation Study. We conduct ablation experiments on different variants of our JSFusion model and present the results in Tables 2 and 3. As one naive variant of our model, we test a simple LSTM baseline (LSTM-fusion) that only carries out the Hadamard product on a pair of final states of the video and language LSTM encoders. That is, (LSTM-fusion) is our JSFusion model with neither JST nor CHD, which are the two main contributions of our model. We train (LSTM-fusion) in the same way as the JSFusion model in Sect. 3.5. As expected, the performance of (LSTM-fusion) is significantly worse than our JSFusion in all the tasks.

To further validate the contribution of each component, we remove or replace key components of our model with simpler ones. To understand the effectiveness of BLSTM encoding, we test two baselines: (JSTfc), which replaces the BLSTM with fully-connected layers, and (JSTlstm), which replaces the BLSTM with a unidirectional LSTM. (JSTmax) and (JSTmean) denote our variants that use max pooling and mean pooling, respectively, instead of the Convk convolutional layers in CHD. That is, they use fixed max/mean pooling operations instead of convolutions with learnable kernels. These comparisons reveal that the proposed CHD is critical to improving the performance of JSFusion nontrivially on all the tasks of both datasets. We also compare our model with (JSFusion-noattention), which discards the Conv-gating operations of CHD. (JSFusion-noattention) shows nontrivial performance drops: \(4.1\%p\) and \(4.2\%p\) in multiple-choice accuracy and \(5.7\%p\) and \(3.7\%p\) in retrieval R@10 on LSMDC and MSR-VTT, respectively. Finally, we test our model without audio information, denoted by (JSFusion-noaudio), which is still much better than the other baselines and only slightly worse than our original model.

Fig. 4.
figure 4

Qualitative examples of the three video-language tasks: movie retrieval on LSMDC (a)–(b) and MSR-VTT-RET (c)–(d), multiple-choice on LSMDC (e)–(f) and MSR-VTT-MC (g)–(h), and (i)–(j) fill-in-the-blank on LSMDC. The left column shows correct examples, while the right column shows near-miss examples. In (b),(d), we show our retrieval ranks of the GT clips (in the red box). (Color figure online)

4.4 Qualitative Results

Figure 4 illustrates qualitative results of our JSFusion algorithm with correct (left) and near-miss (right) examples for each task. In each set, we show a natural language query and sampled frames of a video, and present both the groundtruth (GT) and our prediction (Ours).

Movie Retrieval. Figure 4(a) is an example showing that our model can understand human behaviors like gaze. Figure 4(b) shows a failure case where the model cannot distinguish a small motion (e.g. a facial expression) and simply retrieves videos containing the face of a woman. Figure 4(c) shows that our model successfully catches the features of horses in both web videos and 3D animation, and correctly selects the highest-ranking video by focusing on the word stall. In Fig. 4(d), although the model can retrieve relevant videos of cooking with a bowl, it fails to find the answer video that contains the query description of baking mix.

Movie Multiple-Choice Test. Figure 4(e) provides evidence that our model uses the whole sentence for computing matching scores, because the model successfully chooses the correct answer over a distractor that shares the same phrases (e.g. shakes his head). Figure 4(f) is an example of focusing on a wrong video subsequence, where our model chooses the word club by looking at a subsequence with crowded people, but the answer is related to another subsequence with a grandmother. Figure 4(g) is an example showing that the model learns words in a phrase. One distractor choice can be very tempting, since it contains the words kids, tv and show, but our model successfully chooses the right answer by identifying that kids tv show and kids in tv show mean different things. Figure 4(h) shows that our model fails to distinguish the details.

Movie Fill-in-the-Blank. In Fig. 4(i), the model successfully finds the answer by using the structural information of both the sentence and the video (e.g. door is a likely word after shuts the). Figure 4(j) is an example where the model focuses too much on the word picture that follows the blank, instead of the visual information, and thus chooses the wrong answer framed picture rather than flash picture.

5 Conclusion

We proposed the Joint Sequence Fusion (JSFusion) model for measuring hierarchical semantic similarity between two multimodal sequence data. The two key components of the model, Joint Semantic Tensor (JST) and Convolutional Hierarchical Decoder (CHD), are easily adaptable in many video-and-language tasks, including multimodal matching or video question answering. We demonstrated that our method significantly improved the performance of video understanding through natural language description. Our method achieved the best performance in challenge tracks of LSMDC, and outperformed many state-of-the-art models for VQA and retrieval tasks on the MSR-VTT dataset.

Moving forward, we plan to expand the applicability of JSFusion; since our model is applicable to any multimodal sequence data, we can explore other retrieval tasks of different modalities, such as videos-to-voices or text-to-human-motions.