
1 Introduction

Human activity recognition has a wide range of potential applications, from autonomous driving and robotics to security surveillance [7, 8, 12]. Knowledge transfer from external sources is crucial for using such models in practice, as their 3-D convolution kernels lead to a high number of parameters, making them especially sensitive to the amount of training data [2].

Fig. 1. Zero-shot action recognition paradigm: instances of the new target classes are recognized without any training data by linking visual features learned from the known source categories with a language-based representation of the action labels. Our work demonstrates that the zero-shot premise of disjoint source and target categories may be violated when using external datasets for training.

The intersection of vision and language allows us to generalize to new actions without any visual training data through Zero-Shot Learning (ZSL) [15]. ZSL connects a visual model trained on a dataset of known (source) classes to the unknown (target) classes through a high-level semantic description of an action, e.g., the action label. The description is often represented by a word embedding model (e.g., word2vec [4, 5]) previously trained on web data. Such ZSL methods first map a visual representation of a new instance to the common semantic space and then assign it to one of the previously unseen categories by selecting the category with the closest semantic representation (Fig. 1).
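For illustration, the assignment step of such a pipeline reduces to a nearest-neighbor search in the semantic space. The following minimal Python sketch assumes two hypothetical helpers: `embed`, wrapping the pretrained word embedding model, and `visual_to_semantic`, the learned mapping from visual features into the semantic space.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two word vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def assign_unseen_class(visual_feature, target_labels, embed, visual_to_semantic):
    # Project the visual representation into the shared semantic space,
    # then pick the unseen category with the closest label embedding.
    w = visual_to_semantic(visual_feature)
    return max(target_labels, key=lambda t: cosine(embed(t), w))
```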

ZSL for action recognition has gained popularity over the past few years, usually dividing the dataset into seen categories for training and unseen categories for evaluation [9, 11, 14, 16, 17]. The recent emergence of large-scale action recognition datasets has led to an increasing interest in domain adaptation, where a model trained on a large amount of external data classifies instances from a smaller, potentially application-specific dataset [18]. At first glance, one would assume that classifying data from a foreign source is a harder problem because of the existing domain shift. However, recent works that use data from origins other than the evaluation dataset to train the visual recognition model report extraordinary results in zero-shot action recognition, doubling the performance of previous models based on the intra-dataset split [18]. A single dataset would not contain the same activity twice. Action labels of an external dataset, on the other hand, may intersect with the test categories, violating the ZSL premise of assigning action classes not seen during training and turning the problem into supervised classification. We argue that, in order to draw the line between zero-shot and standard supervised recognition across different datasets, it is crucial to take a closer look at the similarity of action categories of the source and target data and to create a standardized evaluation procedure which eliminates the influence of such overlapping activities.

Contributions and Summary. This work aims to highlight the fact that cross-dataset evaluation of zero-shot action recognition is greatly influenced by the presence of overlapping activity classes in the source and target datasets. We quantitatively analyze the similarities of labels used for training (source) and testing (target) in the intra-dataset and cross-dataset setups and demonstrate that external labels tend to include categories excessively similar to the unseen target classes, therefore violating the ZSL assumption of disjoint source and target categories. We propose a novel procedure that enables the fair use of external data in zero-shot action recognition settings by using the maximum internal semantic similarity within the target dataset to restrict the external classes. We evaluate our method on the HMDB-51 dataset and show how using external data improves the performance of ZSL approaches, even in our stricter, fairer evaluation setting. Finally, we propose a novel hybrid ZSL regime, where the model is allowed to use all the internal labels and additional large-scale external data, consistently improving the zero-shot action recognition accuracy.

2 Fair Transfer of External Categories

Problem Definition. We define the zero-shot learning task as follows. Let \(A = \{a_k\}_{k=1}^K\) be a set of K previously seen source categories. Given the set of previously unseen target categories \(T = \{t_m\}_{m=1}^M\) and a new data sample X, our goal is to predict the correct action category \(t \in T\) without having any training data (i.e. labeled samples) for this class. Since the core idea of ZSL is to recognize previously unseen visual categories, source labels and target labels are set to be strictly disjoint. This is known as the zero-shot premise and is formalized as: \(A \cap T = \emptyset \).

2.1 Evaluation Protocols for ZSL

Intra-dataset Protocol. A common way to evaluate zero-shot learning approaches is to divide a dataset into seen and unseen categories. That is, while a subset of unseen categories is held out during training, both the source and target labels belong to the same dataset: \(A = A_\text {intra}\). In this setting, source and target categories do not overlap, since well-designed datasets contain no category duplicates.

Cross-dataset Protocol. The main goal of zero-shot learning, however, is to apply knowledge from available data to tasks from a different domain where labeled data is difficult to obtain. This setting is evaluated by training and testing on different datasets: \(A = A_\text {cross}\). In that case, however, the zero-shot premise is not given by default. In the most extreme case, if \(T \subset A\), no semantic transfer is needed at all.

Intra- and Cross-dataset Protocol. Recently, several approaches in other computer vision areas have investigated ways of increasing performance by mixing the available domain-specific datasets with large amounts of training data from external sources [10]. We transfer this paradigm to zero-shot action recognition and formalize this hybrid evaluation regime as \(A = A_\text {intra}\cup A_\text {cross}\); a schematic sketch of all three regimes is given below. As in the previous setting, the zero-shot premise is not ensured.
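The three regimes thus differ only in how the source label set A is constructed; a schematic sketch with made-up labels (the real experiments use HMDB-51 and Kinetics):

```python
# Illustrative (made-up) label sets.
A_intra = {"climb", "dive", "golf"}                # seen classes of the target dataset
A_cross = {"rock climbing", "springboard diving"}  # classes of an external dataset
T       = {"kick ball", "laugh", "pour"}           # unseen target classes

regimes = {
    "intra-dataset":            A_intra,
    "cross-dataset":            A_cross,
    "intra- and cross-dataset": A_intra | A_cross,
}

# The zero-shot premise requires disjoint source and target labels; with
# real external datasets this must be checked explicitly (Sect. 2.2).
for name, A in regimes.items():
    assert A.isdisjoint(T), name
```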

2.2 Proposed Protocol to Incorporate External Datasets

In the intra-dataset protocol, compliance with the zero-shot premise is taken for granted and generally well accepted by researchers [9, 14, 16]. However, when external datasets are involved, one has to ensure that the terms of ZSL are still met and the source and target categories are disjoint. For example, Zhu et al. [18] exclude classes from the training dataset whose category label overlaps with a tested label. This procedure would remove the action brushing hair, present in both ActivityNet [1] and Kinetics [2], since the label brush hair is present in the target classes from the HMDB-51 [3] dataset.

However, it is not trivial to determine whether a source class should be excluded, and eliminating direct category matches may not be enough. External datasets often contain slightly diverging variants or specializations of the target actions (e.g., drinking beer and drink), leading to a much closer relation of source and target actions compared to the intra-dataset protocol, even if the direct matches are excluded. We argue that taking into account the similarity of source and target labels is a key element for the evaluation of zero-shot action recognition when external source datasets are used.

We propose a standardized procedure to decide whether an external class should be used or discarded when training the visual model. Our corrective method is based on the fact that zero-shot learning is well-defined for the intra-dataset protocol, i.e., all source categories of the intra-dataset split can always be used to train our model. We remove a source category if its label is semantically too similar to any of the target categories, leveraging the maximum similarity observed inside the same dataset as a rejection threshold for categories of foreign origin. Formally, an external category \(a_k \in A\) is allowed if and only if the following condition is satisfied:

$$\begin{aligned} \forall t_m \in T, \text { } s(\omega (a_k),\omega (t_m)) \leqslant s_{th}. \end{aligned}$$
(1)

The similarity threshold \(s_{th}\) corresponds to the maximum pairwise similarity between the source and target labels in the intra-dataset setting:

$$\begin{aligned} s_{th} = \underset{a_k \in A_\text {intra}, t_m \in T}{{\text {max}}} s(\omega (a_k),\omega (t_m)). \end{aligned}$$
(2)
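Equations (1) and (2) translate directly into the following sketch, where `embed` is a stand-in for the word embedding \(\omega (\cdot )\) (one possible realization is given in Sect. 3):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_threshold(intra_sources, targets, embed):
    # Eq. (2): maximum pairwise similarity between the intra-dataset
    # source labels and the target labels.
    return max(cosine(embed(a), embed(t))
               for a in intra_sources for t in targets)

def filter_external(cross_sources, targets, s_th, embed):
    # Eq. (1): an external category is allowed only if it is no more
    # similar to any target label than the threshold s_th.
    return [a for a in cross_sources
            if all(cosine(embed(a), embed(t)) <= s_th for t in targets)]
```

In this way, the intra-dataset split itself calibrates how close an external label may be to the unseen classes.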
Fig. 2. Histogram of semantic similarities between all target labels and the most similar source label.

Fig. 3. Proportion of source labels allowed depending on the semantic similarity threshold \(s_{th}\).

3 Experiments

Experimental Setup. To evaluate our idea, we adapt an off-the-shelf ZSL approach, Convex Combination of Semantic Embeddings (ConSE) [6]. While ConSE has been used for zero-shot action recognition before [17], the underlying visual model there was based on dense trajectory features [13] encoded as Fisher Vectors; we instead employ a model based on CNNs.

We denote the model mapping an action label to its word vector representation as \(\omega (\cdot )\) and the cosine similarity of two word vectors as \(s(\omega (a_i),\omega (a_j))\). Given a new sample X, the visual model first predicts the probabilities \(p(a_k|X)\) of the K source classes. A word vector embedding for X is then synthesized as a linear combination of the source class embeddings weighted by these probabilities: \(w^*(X) = \sum _{k=1}^{K} p(a_k|X)\omega (a_k)\). Finally, X is assigned to the target category whose semantic representation is most similar to the synthesized word embedding:

$$\begin{aligned} t^*_X = \underset{t_m \in T}{{\text {argmax}}}\, s(\omega (t_m),w^*(X)). \end{aligned}$$
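This prediction rule can be sketched as follows, assuming `class_probs` holds the source class probabilities \(p(a_k|X)\) produced by the visual model and `embed` implements \(\omega (\cdot )\):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def conse_predict(class_probs, source_labels, target_labels, embed):
    # Synthesize w*(X) as the probability-weighted combination of the
    # source label embeddings, then return the closest target label.
    w_star = sum(p * embed(a) for p, a in zip(class_probs, source_labels))
    return max(target_labels, key=lambda t: cosine(embed(t), w_star))
```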

As our visual recognition model, we use I3D [2], which is the current state-of-the-art method for action recognition. The model is trained using SGD with a momentum of 0.9 and an initial learning rate of 0.005 for 100 epochs. To compute the word vector embeddings of the action categories, we use the publicly available word2vec model trained on 100 billion words from Google News articles, which maps the input into a 300-dimensional semantic space [5].
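For reference, \(\omega (\cdot )\) can be realized with this public model, e.g., via the gensim library. Averaging the token vectors of multi-word labels is an assumed convention here, since the handling of such labels is not spelled out above:

```python
import numpy as np
from gensim.models import KeyedVectors

# Standard download name of the public 300-dimensional GoogleNews
# word2vec model [5]; assumed to be available locally.
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def embed(label):
    # Average the word vectors of the label's tokens (e.g. "brush hair"),
    # skipping out-of-vocabulary words.
    vecs = [kv[w] for w in label.lower().split() if w in kv]
    return np.mean(vecs, axis=0)
```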

We use HMDB-51 [3] as our target dataset, and we follow the zero-shot learning setup of Wang et al.  [14]: we generate 10 random splits with 26 seen and 25 unseen categories each. As a foreign data source we use the Kinetics dataset [2], which covers 400 activity categories.
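The splits can be generated along these lines (the exact random draws of [14] are not reproducible from the text, so the seed below is an arbitrary choice):

```python
import random

def make_zsl_splits(all_labels, n_splits=10, n_seen=26, seed=0):
    # Randomly partition the 51 HMDB-51 classes into 26 seen and
    # 25 unseen categories, repeated for 10 independent splits.
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        labels = list(all_labels)
        rng.shuffle(labels)
        splits.append((sorted(labels[:n_seen]), sorted(labels[n_seen:])))
    return splits
```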

Intra- and Cross-dataset Class Similarity. First, we verify our assumption that labels of seen actions tend to be significantly closer to the unseen categories if they originate from an external dataset. Figure 2 shows the distribution of the maximum pairwise source-target similarity for each source label. We observe that actions from the external dataset are far closer, often even identical, to the target classes than actions from the same dataset. We explain this distribution by the nature of dataset design, as a single dataset does not contain duplicates or activities that are too close to each other.

Fig. 4. Influence of source-target label similarity on ZSL performance. The X-axis denotes the semantic similarity threshold \(s_{th}\) above which source categories are excluded. Having similar classes in the seen and unseen sets strongly affects accuracy, an effect that is more pronounced when using external datasets.

Effect of Similar Activities on the Classification Accuracy. Our next area of investigation is the influence of such analogue activities and external data on the classification results. We report the average and standard deviation of the recognition accuracy over the splits for different similarity thresholds \(s_{th}\) used to restrict the source categories (Fig. 4 and Table 1). Extending the model trained on the native data (intra-dataset) with external datasets (intra- and cross-dataset regime) increases the accuracy by almost 15%, with a 10% improvement observed when an external source is used alone (cross-dataset regime). Excluding direct matches (\(s_{th}\) of 0.95) leads to a performance decline of 4% for the cross-dataset scenario, although only around 1% of the external action categories are excluded (Fig. 3). In other words, only 1% of the external action labels (those extremely similar to the target) account for almost half of the cross-dataset performance boost.

The accuracy saturates at a similarity threshold of around 0.8 in the intra-dataset regime, as no duplicate activities are present (Fig. 3). Our evaluation procedure leverages this maximum intra-dataset similarity to effectively eliminate synonyms from external sources, while not influencing the intra-dataset performance. In our framework, the majority of the external dataset is kept (on average 384.7 of the 400 classes). However, the influence of analogue activities is clearly tamed, leading to a performance drop from 34.77% to 25.67% for the intra- and cross-dataset protocol. Still, using external data is very beneficial for the recognition results, and using both internal and external data sources consistently outperforms single-source models. A clear, standardized protocol for defining allowed external source classes without violating the ZSL rules is a crucial step towards a more adequate model evaluation.

Table 1. ZSL on HMDB-51 for different evaluation regimes with and without our corrective approach. Naively using external sources may not honor the ZSL premise.
Fig. 5. Effect of eliminating unfamiliar concepts: source categories with similarity to the target labels below 0.4 hinder the performance.

Context of Previous Work. In this work, our goal is to highlight the ambiguities which arise when external datasets come into play in zero-shot action recognition; we do not aim at state-of-the-art performance. The vast majority of evaluated methods have used the intra-dataset split, e.g., a similar ConSE model employed by [17] reaches 15.0%, while our variant with an underlying deep visual model improves this to 19.92%. The state-of-the-art approach using intra-dataset evaluation achieves 22.6% [9], while the recent work of Zhu et al. [18] reports highly impressive results of 51.8% employing an external data source (ActivityNet). We want to note that our model also consistently outperforms the state of the art which uses the intra-dataset split only. However, we find that a systematic elimination of synonyms is crucial for a fair comparison, as we do not know which actions were allowed in the setting of [18], and we show that a few analogue actions might lead to a clear performance boost.

Eliminating Too Unfamiliar Concepts for Better Domain Adaptation. As a side observation, we found that using an additional lower bound on the similarity of the external and target categories leads to a performance increase of around 2% in every evaluation setting (Fig. 5). In other words, unfamiliar concepts act as a distractor for the purposes of ZSL.
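Together with the upper bound of Eq. (1), this suggests a band filter on the external labels; a sketch reusing the `cosine` and `embed` helpers from above, with the lower bound of 0.4 read off Fig. 5:

```python
def filter_external_banded(cross_sources, targets, s_th, embed, s_low=0.4):
    # Keep an external category only if its closest target label lies in
    # [s_low, s_th]: the upper bound removes near-synonyms (Eq. 1), while
    # the lower bound drops concepts too unfamiliar to aid the transfer.
    kept = []
    for a in cross_sources:
        closest = max(cosine(embed(a), embed(t)) for t in targets)
        if s_low <= closest <= s_th:
            kept.append(a)
    return kept
```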

4 Conclusions

Current machine learning methods based on CNNs benefit immensely from large amounts of data. Hence, it is sensible to integrate external datasets within the context of zero-shot learning to improve its performance. However, blindly using external datasets may break the zero-shot learning premise, i.e., that source and target categories should not overlap. In this work, we have proposed an objective metric that defines which source categories may constitute a synonym of a target category. By pruning these categories from the source set, we honor the zero-shot learning premise. We evaluate this approach in the context of action recognition and show that adding external data still helps considerably to improve the accuracy of zero-shot learning, even after removing all overly similar categories from the source datasets.