1 Introduction

Actions in the real world are structured, as are objects. We design an architecture that exploits some semantics of this structure through a target hierarchy. In particular, we show that networks extract better features by learning targets at different levels of granularity, yielding better performance at all levels. We use the Something-Something [3] dataset for our experiments. It comprises over 220,000 videos and a hierarchy of targets, namely, 50 coarse-grained action groups, 174 fine-grained action categories, and video captions with information about the actions, the objects, and their properties. Our contribution is two-fold:

  1. We present a hierarchical architecture that makes use of different levels of task granularity by jointly learning fine-grained and coarse-grained targets, as well as captions.

  2. We show that such a model performs better than networks trained solely on one level of target labels.

2 Related Work

The literature on action recognition is vast (e.g. [1, 6, 7]), as is the growing literature on video captioning [2, 5, 8]. This paper focuses on an architecture that learns a common encoding for both recognition and captioning, leveraging different levels of granularity in video targets, which has not been studied extensively to date.

3 Approach

Our deep neural network leverages information at different levels of granularity. It simultaneously performs coarse-grained classification of action groups, fine-grained classification of action categories, and video captioning. It is trained end-to-end with a joint loss on all aforementioned targets.

The overall architecture is depicted in Fig. 2. Video frames are fed into the video encoder, resulting in an encoding vector V, which is then passed to processing modules (classifiers or captioners) for targets with different granularity. Let i denote the \(i^{th}\) level of target granularity, with higher values of i representing more fine-grained targets, and let \(T_i\) be the set of targets at level i. In order to capture the structure between actions, the concatenation of the probability distribution over \(T_i\) and V is passed to the \((i+1)^{th}\) processing module, which then estimates the probability distribution over \(T_{i+1}\). In our model, \(\forall c \in T_{i+1} : P(c|V) = \sum_{g \in T_{i}} P(c|g,V)\, P(g|V)\). In other words, at level \(i+1\), given \(P(g|V)\), instead of estimating the posterior \(P(c|V)\) directly, we implicitly estimate \(P(c|g,V)\) and integrate out g. Decomposing the posterior in this way is expected to simplify learning, as it provides hints from coarse- to fine-grained classification. The total loss being optimized is the sum of the losses at the different levels: \(loss = \sum_{i} w_i \cdot loss_i\).
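As a concrete illustration, the sketch below (PyTorch; the module names, layer sizes, and use of plain linear heads are our assumptions, not the exact implementation) shows how each level receives the encoding V concatenated with the previous level's output distribution, and how the per-level losses are combined into the weighted sum above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalHead(nn.Module):
    """Cascade of classifiers over increasingly fine-grained target sets.

    Level i+1 receives [V ; P(T_i | V)], i.e. the video encoding
    concatenated with the softmax output of level i.
    """
    def __init__(self, enc_dim, level_sizes):  # e.g. level_sizes = [50, 174]
        super().__init__()
        self.classifiers = nn.ModuleList()
        in_dim = enc_dim
        for n_targets in level_sizes:
            self.classifiers.append(nn.Linear(in_dim, n_targets))
            in_dim = enc_dim + n_targets  # next level also sees this level's distribution

    def forward(self, v):
        logits_per_level, x = [], v
        for clf in self.classifiers:
            logits = clf(x)
            logits_per_level.append(logits)
            x = torch.cat([v, F.softmax(logits, dim=-1)], dim=-1)
        return logits_per_level

def joint_loss(logits_per_level, labels_per_level, weights):
    # loss = sum_i w_i * loss_i
    return sum(w * F.cross_entropy(logits, y)
               for w, logits, y in zip(weights, logits_per_level, labels_per_level))
```

Passing the softmax output rather than a hard prediction keeps the cascade differentiable, so the joint loss can be optimized end-to-end.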

As mentioned before, Something-Something has 3 levels of targets. In the first step (yellow box in Fig. 2), the model infers the action group, e.g. “Putting” vs. “Taking”. In the second step (orange box in Fig. 2), the output distribution over coarse-grained groups (grey boxes in Fig. 2) is used along with the video features to infer the fine-grained action category; e.g. if the model has figured out the group “Putting” in the previous step, here it needs to distinguish between “Putting on top”, “Putting behind”, and “Putting next to”. We then move one step further, where the model needs to generate a caption (red box in Fig. 2) and describe what is happening in the video. At this step, the output distribution over fine-grained classes is passed to the caption decoder. For an accurate caption, the model needs to recognize and explain the objects and their properties as well as the action. The rest of the section explains each component of our architecture in more detail.

Video Encoder. We use a minimal video encoder with 6 layers of 3D-CNN [4] with \(3 \times 3 \times 3\) filters, and 4 spatial max pooling layers (see Fig. 1). For temporal aggregation we use a bidirectional LSTM. We then average the outputs over different time steps, yielding the final video encoding V.
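A possible PyTorch realization of this encoder is sketched below; the channel widths, the placement of the pooling layers, and the spatial average taken before the LSTM are our assumptions (the text only fixes the layer counts and the filter size).

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """6 x 3D-conv (3x3x3) with 4 spatial-only max-pooling steps,
    then a bidirectional LSTM over time; returns the time-averaged encoding V."""
    def __init__(self, channels=(16, 32, 64, 64, 128, 128), lstm_dim=256):
        super().__init__()
        layers, in_ch = [], 3
        for i, out_ch in enumerate(channels):
            layers += [nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU()]
            if i < 4:  # 4 spatial max-pooling layers; the time dimension is preserved
                layers.append(nn.MaxPool3d(kernel_size=(1, 2, 2)))
            in_ch = out_ch
        self.cnn = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))   # collapse remaining spatial dims
        self.lstm = nn.LSTM(channels[-1], lstm_dim, bidirectional=True, batch_first=True)

    def forward(self, frames):                 # frames: (B, 3, T, H, W)
        feat = self.pool(self.cnn(frames))     # (B, C, T, 1, 1)
        feat = feat.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, C)
        out, _ = self.lstm(feat)               # (B, T, 2 * lstm_dim)
        return out.mean(dim=1)                 # time-averaged encoding V
```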

Video Classification. For classification at level i, we pass V to a fully connected layer of size \(|V|\times |T_i|\) followed by a softmax layer, \(T_i\) being the set of targets at level i. Note that the dimensions of this layer vary with the level i. We train the classifiers using the usual cross-entropy loss over the classes at the intended target level.

Video Captioning. We use a two-layer LSTM for decoding captions. At each time-step, we use a softmax over the vocabulary words, conditioned on previously generated words. The captioning loss is the negative log-probability of the word sequence.
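A sketch of such a decoder under teacher forcing is given below, where `cond` stands for the conditioning vector (e.g. V concatenated with the fine-grained output distribution); the embedding size, hidden size, and padding handling are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionDecoder(nn.Module):
    """Two-layer LSTM language model conditioned on the video encoding
    (here concatenated to every word embedding)."""
    def __init__(self, vocab_size, cond_dim, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim + cond_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, cond, words):            # cond: (B, cond_dim), words: (B, L)
        emb = self.embed(words)                                  # (B, L, emb_dim)
        cond = cond.unsqueeze(1).expand(-1, emb.size(1), -1)     # repeat per time step
        h, _ = self.lstm(torch.cat([emb, cond], dim=-1))
        return self.out(h)                                       # logits over the vocabulary

def caption_loss(logits, targets, pad_idx=0):
    # negative log-probability of the ground-truth word sequence (padding ignored)
    return F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=pad_idx)
```

At inference time, the same module would be unrolled step by step, feeding back the previously generated word at each time step.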

Fig. 1. Our video encoder architecture includes 6 layers of 3D-CNN and 4 layers of spatial max-pooling, followed by an LSTM layer. T denotes the number of frames across time.

Fig. 2. Our hierarchical architecture for targets with different levels of granularity, along with the loss functions at the different levels. (Color figure online)

4 Experiments

As a baseline, we train the network for each specific target level separately. To show the effectiveness of training on the target hierarchy, we first train the group classifier, then activate the category classifier, and finally add the captioner; see Table 1. The results reveal that adding each level of detail boosts the model's performance on all the target levels it has been trained on. Captioning accuracy measures the fraction of captions that the model gets exactly right, word by word. Note that this is a very harsh metric: the model receives no credit if it misses even a single word of the correct caption. We do not investigate other captioning metrics, as that is not the focus of this experiment; we only need to include captioning in the joint loss.
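For reference, this strict metric amounts to an exact word-by-word match between generated and ground-truth captions; a hypothetical helper (not from the paper) could compute it as follows.

```python
def caption_accuracy(predicted, ground_truth):
    """Fraction of captions reproduced exactly, word by word."""
    correct = sum(p.split() == g.split() for p, g in zip(predicted, ground_truth))
    return correct / len(ground_truth)

# caption_accuracy(["putting a cup on the table"],
#                  ["putting a cup on the table"])  -> 1.0
```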

Below you can find examples of videos, accompanied by their ground truth captions and the captions generated by our model (Fig. 3).

Fig. 3. Ground truth captions and model outputs for video examples from the Something-Something dataset.

Training Settings. We use a frame rate of 12 fps and randomly pick 48 consecutive frames. Videos with fewer than 48 frames are padded by replicating the first and last frames. We re-size frames to \(128\times 128\) and then apply random cropping of size \(96\times 96\). For validation and testing, we use center cropping of the same size. We optimize all models using Adam with an initial learning rate of 0.001. Each step is finished after 20 epochs of training over the full dataset.
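As an illustration, this clip sampling and cropping could be implemented roughly as follows; the function itself and the exact split of padding between the front and back of short videos are our assumptions.

```python
import random
import torch
import torch.nn.functional as F

def sample_clip(frames, n_frames=48, train=True, resize=128, crop=96):
    """frames: (T, 3, H, W) float tensor sampled at 12 fps."""
    t = frames.size(0)
    if t < n_frames:                       # pad short videos by replicating first/last frames
        pad_front = (n_frames - t) // 2
        pad_back = n_frames - t - pad_front
        frames = torch.cat([frames[:1].repeat(pad_front, 1, 1, 1),
                            frames,
                            frames[-1:].repeat(pad_back, 1, 1, 1)])
    else:                                  # random 48-frame window (center window at test time)
        start = random.randint(0, t - n_frames) if train else (t - n_frames) // 2
        frames = frames[start:start + n_frames]
    frames = F.interpolate(frames, size=(resize, resize),
                           mode='bilinear', align_corners=False)
    if train:                              # random 96x96 crop
        y = random.randint(0, resize - crop)
        x = random.randint(0, resize - crop)
    else:                                  # center crop for validation/testing
        y = x = (resize - crop) // 2
    return frames[:, :, y:y + crop, x:x + crop]
```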

Table 1. Comparing models trained with different weight vectors for the joint loss. W shows the weights on the different levels' losses, with the rightmost weight denoting the weight for the captioning loss.

5 Conclusions

We present a new architecture that leverages information from a hierarchy of targets at different levels of granularity. To train the architecture, we use a joint loss over the different target levels. This makes it possible to simultaneously classify the input video at different levels and produce a caption. Our experiments show that the architecture achieves a performance boost over models that aim to solve the tasks individually.