1 Introduction

Recent advancements in deep convolutional neural networks (CNNs) have been successful in various vision tasks such as image recognition [7, 17, 23] and object detection [13, 43, 44]. A notable bottleneck for deep learning, when applied to multimodal videos, is the lack of massive, clean, and task-specific annotations, as collecting annotations for videos is far more time-consuming and expensive than for images. Furthermore, restrictions such as privacy or runtime may limit access to only a subset of the video modalities at test time.

The scarcity of training data and modalities is encountered in many real-world applications including self-driving cars, surveillance, and health care. A representative example is activity understanding on health care data that contain Personally Identifiable Information (PII) [16, 34]. On the one hand, the number of labeled videos is usually limited because either important events such as falls [40, 63] are extremely rare or the annotation process requires a high level of medical expertise. On the other hand, RGB video compromises individual privacy and optical flow cannot be computed in real time; both are known to be important for activity understanding but are often unavailable at test time. Therefore, detection can only be performed on real-time, privacy-preserving modalities such as depth or thermal videos.

Fig. 1. Our problem statement. In the source domain, we have abundant data from multiple modalities. In the target domain, we have limited data and a subset of the modalities during training, and only one modality during testing. The curved connectors between modalities represent our proposed graph distillation.

Inspired by these problems, we study action detection in the setting of limited training data and partially observed modalities. To do so, we use a large action classification dataset containing various heterogeneous modalities as the source domain to assist the training of the action detection model in the target domain, as illustrated in Fig. 1. Following the standard assumption in transfer learning [59], we assume that the source and target domains are similar to each other. We define a modality as a privileged modality if (1) it is available in the source domain but not in the target domain, or (2) it is available during training but not during testing.

We identify two technical challenges in this problem. First, due to the discrepancy in the types and quantities of modalities, traditional domain adaptation or transfer learning methods [12, 41] cannot be directly applied. Recent work on knowledge and cross-modal distillation [18, 26, 33, 48] provides a promising way of transferring knowledge between two models. Given two models, we can specify the distillation as the direction from the strong model to the weak model. With some adaptations, these methods can be used to distill knowledge between modalities. However, these adapted methods fail to address the second challenge: how to leverage the privileged modalities effectively. More specifically, given multiple privileged modalities, the distillation directions and weights are difficult to pre-specify. Instead, the model should learn to dynamically adjust the distillation based on different actions or examples. For instance, some actions are easier to detect from optical flow whereas others are easier to detect from skeleton features, and the model should adjust its training accordingly. However, this dynamic distillation paradigm has not yet been explored by existing methods.

To this end, we propose a novel graph distillation method that learns a dynamic distillation across multiple modalities for action detection in multimodal videos. Graph distillation is designed as a layer attachable to the original model and is end-to-end learnable with the rest of the network. The graph can dynamically learn an example-specific distillation to better utilize the complementary information in multimodal data. As illustrated in Fig. 1, by effectively leveraging the privileged modalities from both the source domain and the training stage of the target domain, graph distillation significantly improves test-time performance on a single modality. Note that graph distillation can be applied in both single-domain (from training to testing) and cross-domain (from one task to another) settings. For our cross-domain experiment (from action classification to detection), we use the most basic transfer learning approach, i.e., pre-training and fine-tuning, as this choice is orthogonal to our contributions. We could potentially achieve even better results with advanced transfer learning and domain adaptation techniques, and we leave this for future study.

We validate our method on two public multimodal video benchmarks: PKU-MMD [28] and NTU RGB+D [45]. These are among the largest public multimodal video benchmarks for action detection and classification. The experimental results show that our method outperforms the state-of-the-art approaches. Notably, it improves the state-of-the-art by 9.0% on PKU-MMD [28] (at 0.5 tIoU threshold) and by 6.6% on NTU RGB+D [45]. The remarkable improvement on the two benchmarks is a convincing validation of our method.

To summarize, our contribution is threefold. (1) We study a realistic and challenging condition for multimodal action detection with limited training data and modalities. To the best of our knowledge, we are the first to effectively transfer multimodal privileged information across domains for action detection and classification. (2) We propose a novel graph distillation layer that dynamically learns to distill knowledge across multiple privileged modalities and can be attached to existing models and trained in an end-to-end manner. (3) Our method outperforms the state-of-the-art by a large margin on two popular benchmarks: action classification on the challenging NTU RGB+D [45] and action detection on PKU-MMD [28].

2 Related Work

Multimodal Action Classification and Detection. The field of action classification [3, 49, 51] and action detection [2, 11, 14, 64] in RGB videos has been studied by the computer vision community for decades. The success in RGB videos has given rise to a series of studies on action recognition in multimodal videos [10, 20, 22, 25, 50, 54]. Specifically, with the availability of depth sensors and joint tracking algorithms, extensive research has been done on action classification and detection in RGB-D videos [39, 46, 47, 60] as well as skeleton sequences [24, 30, 31, 32, 45, 62]. Different from previous work, our model focuses on leveraging privileged modalities on a source dataset with abundant training examples. We show that it benefits action detection when the target training dataset is small in size, and when only one modality is available at test time.

Video Understanding Under Limited Data. Our work is largely motivated by real-world situations where data and modalities are limited. For example, surveillance systems for fall detection [40, 63] often face the challenge that annotated videos of fall incidents are hard to obtain and, more importantly, that the recording of RGB videos is prohibited due to privacy concerns. Existing approaches to tackling this challenge include using transfer learning [36, 41] and leveraging noisy data from web queries [5, 27, 58]. Specific to our problem, it is common to transfer models trained on action classification to action detection.

Transfer learning methods have proved effective. However, they require the source and target domains to have the same modalities, whereas in reality the source domain often contains richer modalities. For instance, if the depth video is the only available modality in the target domain, it remains nontrivial to transfer the other modalities (e.g., RGB, optical flow) even though they are readily available in the source domain and could make the model more accurate. Our method provides a practical approach to leveraging the rich multimodal information in the source domain, benefiting a target domain of limited modalities.

Learning Using Privileged Information. Vapnik and Vashist [52] introduced a Student-Teacher analogy: in real-world human learning, the role of a teacher is crucial to the student’s learning process since the teacher can provide explanations, comments, comparisons, metaphors, etc. They proposed a new learning paradigm called Learning Using Privileged Information (LUPI), where at training time, additional information about the training example is provided to the learning model. At test time, the privileged information is not available, and the student operates without the supervision of the teacher [52].

Several works employed privileged information (PI) with SVM classifiers [52, 55]. Ding et al. [8] handled missing-modality transfer learning using a latent low-rank constraint. Recently, the use of privileged information has been combined with deep learning in various settings such as PI reconstruction [48, 56], information bottleneck [38], and Multi-Instance Multi-Label (MIML) learning [57]. The idea most related to our work is the combination of distillation and privileged information, which we discuss next.

Knowledge Distillation. Hinton et al. [18] introduced the idea of knowledge distillation, where knowledge from a large model is distilled to a small model, improving the performance of the small model at test time. This is done by adding a loss function that matches the outputs of the small network to the high-temperature soft outputs of the large network [18]. Lopez-Paz et al. [33] later proposed a generalized distillation that combined distillation and privileged information. This approach was adopted by [15, 19] in cross-modality knowledge transfer. Our graph distillation method is different from prior work [18, 26, 33, 48] in that the privileged information contains multiple modalities and that the distillation directions and weights are dynamically learned rather than being predefined by human experts.

3 Method

Our goal is to assist the training in the target domain with limited labeled data and modalities by leveraging the source domain dataset with abundant examples and multiple modalities. We address the problem by distilling the knowledge from the privileged modalities. Formally, we model action classification and detection as an L-way classification problem, where a “background class” is added in action detection.

Let \(\mathcal {D}_{t} = \{(x_i, y_i)\}_{i=1}^{|\mathcal {D}_{t}|}\) denote the training set in the target domain, where \(x_i\in \mathbb {R}^d\) is the input and \(y_i\in \mathbb {R}\) is an integer denoting the class label. Since training data in the target domain is limited, we are interested in transferring knowledge from a source dataset \(\mathcal {D}_{s} = \{(x_i, \mathcal {S}_i, y_i)\}_{i=1}^{|\mathcal {D}_{s}|}\), where \(|\mathcal {D}_{s}| \gg |\mathcal {D}_{t}|\), and the source and target data may have different classes. The new element \(\mathcal {S}_i = \{x_i^{(1)},...,x_i^{(|\mathcal {S}|)}\}\) is a set of privileged information about the i-th sample, where the superscript indexes the modality in \(\mathcal {S}_i\). As an example, \(x_i\) could be the depth image of the i-th frame in a video and \(x_i^{(1)},x_i^{(2)},x_i^{(3)} \in \mathcal {S}_i\) might be RGB, optical flow and skeleton features about the same frame, respectively. For action classification, we employ the standard softmax cross entropy loss:

$$\begin{aligned} \ell _c(f(x_i), y_i) = -\sum _{j=1}^L \mathbbm {1}(y_i =j) \log \sigma _j(f(x_i)), \end{aligned}$$
(1)

where \(\mathbbm {1}\) is the indicator function and \(\sigma \) is the softmax function. The class prediction function \(f:\mathbb {R}^d \rightarrow \mathbb {R}^L\) computes a score for each of the L action classes.
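
For concreteness, a minimal PyTorch sketch of Eq. (1); the function and tensor names are ours, not the authors' code, and the batch dimension is an assumption.

```python
import torch
import torch.nn.functional as F

def classification_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: (B, L) scores f(x_i); labels: (B,) integer class labels y_i."""
    # cross_entropy applies log-softmax internally, matching -log sigma_j(f(x_i)) at j = y_i.
    return F.cross_entropy(logits, labels)
```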

In the rest of this section, Sect. 3.1 discusses the overall objective of privileged knowledge distillation. Section 3.2 details the proposed graph distillation over multiple modalities.

3.1 Knowledge Distillation with Privileged Modalities

To leverage the privileged information in the source domain, we follow the standard transfer learning paradigm. We first train a model with graph distillation using all modalities in the source domain, and then transfer only the visual encoders (detailed in Sect. 4.1) of the target-domain modalities. Finally, the visual encoder is fine-tuned together with the rest of the target model on the target task. Since the visual feature encoding step is shared between the source and target tasks, it is natural to use the same visual encoder architecture (shown in Fig. 2) for both.

To train a graph distillation model on the source data, we minimize:

$$\begin{aligned} \min \frac{1}{|\mathcal {D}_{s}|} \sum _{(x_i, y_i) \in \mathcal {D}_{s}} \ell _c(f(x_i),y_i) + \ell _m(x_i, \mathcal {S}_i). \end{aligned}$$
(2)

The loss consists of two parts: the first term is the standard classification loss in Eq. (1) and the second is the imitation loss [18]. The imitation loss is often defined as the cross-entropy loss on the soft logits [18]. In the existing literature, the imitation loss is computed using a pre-specified distillation direction. For example, Hinton et al. [18] computed the soft logits by \(\sigma (f_{\mathcal {S}}(x_i)/T)\), where T is the temperature and \(f_{\mathcal {S}}\) is the class prediction function of the cumbersome model. Gupta et al. [15] employed the "soft logits" obtained from different layers of the labeled modality. In both cases, the distillation is pre-specified, i.e., from a cumbersome model to a small model in [18] or from a labeled modality to an unlabeled modality in [15]. In our problem, the privileged information comes from multiple heterogeneous modalities, and it is difficult to pre-specify the distillation directions and weights. To this end, our imitation loss in Eq. (2) is derived from a dynamic distillation graph.
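
A minimal sketch of how the two terms of Eq. (2) combine per example, assuming the imitation loss of Sect. 3.2 has already been computed; names are illustrative.

```python
import torch
import torch.nn.functional as F

def source_domain_loss(logits: torch.Tensor, labels: torch.Tensor,
                       imitation: torch.Tensor) -> torch.Tensor:
    """logits: (B, L); labels: (B,); imitation: (B,) values of l_m(x_i, S_i)."""
    ce = F.cross_entropy(logits, labels, reduction="none")  # Eq. (1), per example
    return (ce + imitation).mean()                          # Eq. (2), averaged over D_s
```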

Fig. 2. An overview of our network architectures. (a) Action classification with graph distillation (attached as a layer) in the source domain; the visual encoders for each modality are trained here. (b) Action detection with graph distillation in the target domain at training time; in our setting, the target training modalities are a subset (one or more) of the source modalities, and the visual encoder trained in the source domain is transferred and fine-tuned in the target domain. (c) Action detection in the target domain at test time, with a single modality.

3.2 Graph Distillation

First, consider a special case of graph distillation where only two modalities are involved. We employ an imitation loss that combines the logits and the feature representation. For notational convenience, we denote \(x_i\) as \(x_i^{(0)}\) and fold it into \(\mathcal {S}_i = \{x_i^{(0)}, \cdots , x_i^{(|\mathcal {S}|)}\}\). Given two modalities \(a,b \in [0, |\mathcal {S}|]\) \((a \ne b)\), we use the network architectures discussed in Sect. 4 to obtain the logits and take the output of the last convolution layer as the visual feature representation.

The proposed imitation loss between two modalities consists of a loss on the logits \(l_{logits}\) and a loss on the representations \(l_{rep}\). The cosine distance is used for both, as we found the angle of the prediction to be more indicative than the KL divergence or the L1 distance for our problem.

The imitation loss \(\ell _m\) from modality b to a is computed as the weighted sum of the logits loss and the representation loss. We encapsulate the loss between two modalities into a message \(m_{a \leftarrow b}\) passed from b to a, calculated as:

$$\begin{aligned} m_{a \leftarrow b}(x_i) = \ell _m(x_i^{(a)}, x_i^{(b)}) = \lambda _1 l_{logits}+\lambda _2 l_{rep}, \end{aligned}$$
(3)

where \(\lambda _1\) and \(\lambda _2\) are hyperparameters. Note that the message is directional, and \(m_{a \leftarrow b}(x_i) \ne m_{b \leftarrow a}(x_i)\).
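
A sketch of the pairwise message in Eq. (3), with cosine distances on the logits and on the last-layer representations; the defaults \(\lambda_1=10\) and \(\lambda_2=5\) follow Sect. 5.1. Detaching the sender (teacher) side is our assumption, and it is what makes the message directional in practice.

```python
import torch
import torch.nn.functional as F

def message(logits_a, rep_a, logits_b, rep_b, lambda1=10.0, lambda2=5.0):
    """m_{a<-b}: imitation loss applied to modality a, with modality b as teacher.
    All tensors are per-example batches of shape (B, dim); returns shape (B,)."""
    # Cosine distance = 1 - cosine similarity; the teacher side is detached (assumption)
    # so that only modality a is updated by this message.
    l_logits = 1.0 - F.cosine_similarity(logits_a, logits_b.detach(), dim=-1)
    l_rep = 1.0 - F.cosine_similarity(rep_a, rep_b.detach(), dim=-1)
    return lambda1 * l_logits + lambda2 * l_rep
```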

For multiple modalities, we introduce a directed graph with \(|\mathcal {S}|\) vertices, called the distillation graph, where each vertex \(v_k\) represents a modality and each edge weight \(e_{k \leftarrow j} \ge 0\) is a real number indicating the strength of the connection from \(v_j\) to \(v_k\). For a fixed graph, the total imitation loss for modality k is:

$$\begin{aligned} \ell _m(x_i^{(k)}, \mathcal {S}_i) = \sum _{v_j \in \mathcal {N}(v_k)} e_{k \leftarrow j} \cdot m_{k \leftarrow j}(x_i), \end{aligned}$$
(4)

where \(\mathcal {N}(v_k)\) is the set of vertices pointing to \(v_k\).

To exploit the dynamic interactions between modalities, we propose to learn the distillation graph along with the original network in an end-to-end manner. Denote the graph by an adjacency matrix \(\mathbf {G}\) where \(\mathbf {G}_{jk} = e_{k \leftarrow j}\). Let \(\phi _k^l\) be the logits and \(\phi _k^{l-1}\) be the representation for modality k, where l indicates the number of layers in the network. Given an example \(x_i\), the graph is learned by:

$$\begin{aligned} z_i^{(k)}(x_i)&= W_{11} \phi _k^{l-1}(x_i^{(k)}) + W_{12} \phi _k^{l}(x_i^{(k)}), \end{aligned}$$
(5)
$$\begin{aligned} \mathbf {G}_{jk}(x_i)&= e_{k \leftarrow j} = W_{21} [z_i^{(j)}(x_i) \Vert z_i^{(k)}(x_i)] \end{aligned}$$
(6)

where \(W_{11}\), \(W_{12}\) and \(W_{21}\) are parameters to learn and \(\cdot \Vert \cdot \) denotes vector concatenation. \(W_{21}\) maps a pair of inputs to an entry of \(\mathbf {G}\). The entire graph is learned by repeatedly applying Eq. (6) over all pairs of modalities in \(\mathcal {S}\).
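
A minimal sketch of Eqs. (5)-(6): the representation and logits of each modality are projected by \(W_{11}\) and \(W_{12}\), and \(W_{21}\) scores every ordered pair of modalities to produce the raw edge weights. The module name, the intermediate dimension, and the bias-free linear layers are assumptions.

```python
import torch
import torch.nn as nn

class EdgeScorer(nn.Module):
    def __init__(self, rep_dim: int, logit_dim: int, z_dim: int = 64):
        super().__init__()
        self.w11 = nn.Linear(rep_dim, z_dim, bias=False)    # W_11
        self.w12 = nn.Linear(logit_dim, z_dim, bias=False)  # W_12
        self.w21 = nn.Linear(2 * z_dim, 1, bias=False)      # W_21

    def forward(self, reps: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
        """reps: (S, B, rep_dim), logits: (S, B, logit_dim) for S modalities."""
        z = self.w11(reps) + self.w12(logits)                # Eq. (5): (S, B, z_dim)
        S = z.size(0)
        zj = z.unsqueeze(1).expand(S, S, *z.shape[1:])       # sender j
        zk = z.unsqueeze(0).expand(S, S, *z.shape[1:])       # receiver k
        raw = self.w21(torch.cat([zj, zk], dim=-1))          # Eq. (6)
        return raw.squeeze(-1)                               # (S, S, B) with G[j, k] = e_{k<-j}
```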

As the distillation graph is expected to be sparse, we normalize \(\mathbf {G}\) such that the weights concentrate on a small number of vertices. Let \(\mathbf {G}_{j:} \in \mathbb {R}^{1 \times |\mathcal {S}|}\) denote its j-th row. The graph is normalized by:

$$\begin{aligned} \mathbf {G}_{j:}(x_i) = \sigma (\alpha [\mathbf {G}_{j1}(x_i), ..., \mathbf {G}_{j|\mathcal {S}|}(x_i)]), \end{aligned}$$
(7)

where \(\alpha \) is used to scale the input to the softmax operator.
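
A one-line sketch of Eq. (7): each row of \(\mathbf{G}\) is normalized with a scaled softmax over receivers; \(\alpha=10\) follows Sect. 5.1.

```python
import torch

def normalize_graph(raw_G: torch.Tensor, alpha: float = 10.0) -> torch.Tensor:
    """raw_G: (S, S) raw edge weights for one example, with raw_G[j, k] = e_{k<-j}.
    Each row j is softmax-normalized over the receivers k."""
    return torch.softmax(alpha * raw_G, dim=1)
```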

Message passing on the distillation graph can be conveniently implemented by attaching a new layer to the original network. As shown in Fig. 2(a), each vertex represents a modality and messages are propagated through the graph layer. In the forward pass, we compute \(\mathbf {G} \in \mathbb {R}^{|\mathcal {S}| \times |\mathcal {S}|}\) by Eqs. (6) and (7) and the message matrix \(\mathbf {M} \in \mathbb {R}^{|\mathcal {S}| \times |\mathcal {S}|}\) by Eq. (3) such that \(\mathbf {M}_{jk}(x_i)=m_{k \leftarrow j}(x_i)\). The imitation loss for all modalities is calculated by:

$$\begin{aligned} \ell _m = (\mathbf {G}(x_i) \odot \mathbf {M}(x_i))^T \mathbf {1}, \end{aligned}$$
(8)

where \(\mathbf {1} \in \mathbb {R}^{|\mathcal {S}| \times 1}\) is a column vector of ones, \(\odot \) is the element-wise product between two matrices, and \(\mathbf {\ell _m} \in \mathbb {R}^{|\mathcal {S}| \times 1}\) contains the imitation loss for every modality in \(\mathcal {S}\). In the backward pass, the imitation loss \(\ell _m\) is incorporated into Eq. (2) to compute the gradient of the total training loss. The graph distillation layer is trained end-to-end with the rest of the network. The distillation graph is thus an essential structure: it not only provides a basis for learning dynamic message passing across modalities, but also casts the distillation as a few matrix operations that can be conveniently implemented as a new layer in the network.
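
A sketch of the forward pass in Eq. (8) for a single example: the learned graph gates the message matrix and the gated messages are summed over senders. The matrix layout follows the text (\(\mathbf{G}_{jk}=e_{k\leftarrow j}\), \(\mathbf{M}_{jk}=m_{k\leftarrow j}\)); everything else is an assumption.

```python
import torch

def graph_distillation_layer(G: torch.Tensor, M: torch.Tensor) -> torch.Tensor:
    """G, M: (S, S). Returns (S,) imitation losses, one per receiving modality."""
    # Summing the gated messages over senders j is exactly (G ⊙ M)^T 1.
    return (G * M).sum(dim=0)
```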

For a given modality, its performance on the cross-validation set often turns out to be a reasonable estimate of its contribution to distillation. Therefore, we add a constant bias term \(\mathbf {c}\) in Eq. (7), where \(\mathbf {c} \in \mathbb {R}^{|\mathcal {S}| \times 1}\), \(c_j\) is set according to the cross-validation performance of modality j, and \(\sum _{k=1}^{|\mathcal {S}|} c_k = 1\). Eq. (8) can then be rewritten as:

$$\begin{aligned} \ell _m&= ((\mathbf {G}(x_i)+ \mathbf {1} \mathbf {c}^T)\odot \mathbf {M}(x_i))^T \mathbf {1} \end{aligned}$$
(9)
$$\begin{aligned}&= (\mathbf {G}(x_i)\odot \mathbf {M}(x_i))^T\mathbf {1}+(\mathbf {G}_{prior}\odot \mathbf {M}(x_i))^T\mathbf {1} \end{aligned}$$
(10)

where \(\mathbf {G}_{prior} = \mathbf {1} \mathbf {c}^T\) is a constant matrix. Interestingly, by adding a bias term in Eq. (7), we decompose the distillation graph into two graphs: a learned, example-specific graph \(\mathbf {G}\) and a prior, modality-specific graph \(\mathbf {G}_{prior}\) that is independent of specific examples. Messages are propagated on both graphs, and the sum of the messages is used to compute the total imitation loss. This decomposition has an intuitive interpretation: the model learns a graph based on the likelihood of the observed examples to exploit complementary information in \(\mathcal {S}\), while the prior encourages more accurate modalities to contribute more. Adding a constant bias is also more computationally efficient than actually performing message passing on two separate graphs.
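
A sketch of Eqs. (9)-(10): the rank-one prior graph \(\mathbf{G}_{prior}=\mathbf{1}\mathbf{c}^T\), built from cross-validation scores, is added to the learned graph before gating the messages. Normalizing the raw scores into \(\mathbf{c}\) in this way is our assumption.

```python
import torch

def distillation_with_prior(G: torch.Tensor, M: torch.Tensor,
                            cv_scores: torch.Tensor) -> torch.Tensor:
    """G, M: (S, S); cv_scores: (S,) per-modality cross-validation performance."""
    c = cv_scores / cv_scores.sum()        # bias term c, entries sum to 1
    G_prior = torch.ones_like(G) * c       # 1 c^T: every row equals c
    return ((G + G_prior) * M).sum(dim=0)  # Eqs. (9)-(10), per-modality losses
```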

So far, we have only discussed distillation in the source domain. In practice, our method may also be applied to the target domain when privileged modalities are available during training. In this case, we apply the same method to minimize Eq. (2) on the target training data. As illustrated in Fig. 2(b), a graph distillation layer is added during the training of the target model. At test time, as shown in Fig. 2(c), only a single modality is used.

4 Action Classification and Detection Models

In this section, we discuss our network architectures as well as the training and testing procedures for action classification and detection. The objective of action classification is to classify a trimmed video into one of the predefined categories. The objective of action detection is to predict the start time, the end time, and the class of an action in an untrimmed video.

4.1 Network Architecture

For action classification, we encode a short clip of video into a feature vector using the visual encoder. For action detection, we first encode all clips in a window of video (a window consists of multiple clips) into initial feature vectors using the visual encoder, then feed these initial feature vectors into a sequence encoder to generate the final feature vectors. For either task, each feature vector is fed into a task-specific linear layer and a softmax layer to get the probability distribution across classes for each clip. Note that a background class is added for action detection. Our action classification and detection models are inspired by [49] and [37], respectively. We design two types of visual encoders depending on the input modalities.

Visual Encoder for Images. Let \(X=\{x_t\}_{t=1}^{T_c}\) denote a video clip of an image modality (e.g. RGB, depth, flow), where \(x_t\in \mathbb {R}^{H\times W\times C}\), \(T_c\) is the number of frames in a clip, and \(H\times W\times C\) is the image dimension. Similar to the temporal stream in [49], we stack the frames into an \(H\times W\times (T_c\cdot C)\) tensor and encode the video clip with a modified ResNet-18 [17] that has \(T_c\cdot C\) input channels and no final fully-connected layer. Note that we do not use the Convolutional 3D (C3D) network [3, 51] because it is hard to train with a limited amount of data [3].
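
A minimal sketch of this image-modality encoder, assuming a torchvision ResNet-18 whose first convolution is widened to \(T_c\cdot C\) channels and whose classification head is dropped, yielding a 512-d feature.

```python
import torch.nn as nn
from torchvision.models import resnet18

def make_image_encoder(T_c: int = 10, C: int = 3) -> nn.Module:
    net = resnet18(weights=None)  # use pretrained=False on older torchvision versions
    net.conv1 = nn.Conv2d(T_c * C, 64, kernel_size=7, stride=2, padding=3, bias=False)
    net.fc = nn.Identity()        # drop the last fully-connected layer -> 512-d output
    return net

# Usage: a clip of shape (B, T_c, C, H, W) is reshaped to (B, T_c*C, H, W) before encoding.
```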

Visual Encoder for Vectors. Let \(X=\{x_t\}_{t=1}^{T_c}\) denote a video clip of vector modalities (e.g. skeleton), where \(x_t\in \mathbb {R}^{D}\) and D is the vector dimension. Similar to [24], we encode the input with a 3-layer GRU network [6] with \(T_c\) timesteps. The encoded feature is computed as the average of the outputs of the highest layer across time. The hidden size of the GRU is chosen to be the same as the output dimension of the visual encoder for images.
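
A sketch of the vector-modality encoder, assuming a hidden size of 512 to match the image encoder's output dimension.

```python
import torch
import torch.nn as nn

class VectorEncoder(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, num_layers=3, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T_c, D)
        out, _ = self.gru(x)       # top-layer outputs: (B, T_c, hidden_dim)
        return out.mean(dim=1)     # average across time -> (B, hidden_dim)
```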

Sequence Encoder. Let \(X = \{x_t\}_{t=1}^{T_c\cdot T_w}\) denote a window of video with \(T_w\) clips, where each clip contains \(T_c\) frames. The visual encoder first encodes each clip individually into a single feature vector. These \(T_w\) feature vectors are then passed into the sequence encoder, which is a 1-layer GRU network, to obtain the class distributions of these \(T_w\) clips. Note that the sequence encoder is only used in action detection.
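
A sketch of the detection-side sequence encoder, assuming 512-d clip features and a linear classification head over the actions plus background; the exact head placement is our reading of Sect. 4.1.

```python
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    def __init__(self, feat_dim: int = 512, num_classes: int = 52):  # e.g. 51 actions + background
        super().__init__()
        self.gru = nn.GRU(feat_dim, feat_dim, num_layers=1, batch_first=True)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:  # (B, T_w, feat_dim)
        out, _ = self.gru(clip_feats)
        return self.head(out)      # per-clip class logits: (B, T_w, num_classes)
```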

4.2 Training and Testing

Our proposed graph distillation can be applied to both action detection and classification. For action detection, the model can optionally be pre-trained on action classification, and graph distillation can be applied in both the pre-training and training stages. Both models are trained to minimize the loss in Eq. (2) on per-clip classification, and the imitation loss is calculated based on the representations and the logits.

Action Classification. Figure 2(a) shows how graph distillation is applied in training. During training, we randomly sample a video clip of \(T_c\) frames from the video, and the network outputs a single class distribution. During testing, we uniformly sample multiple clips spanning the entire video and average the outputs to obtain the final class distribution.

Action Detection. Figure 2(b) and (c) show the training and testing of action detection in the target domain, respectively. As discussed earlier, graph distillation can be applied in both the source domain and the target domain. During training, we randomly sample a window of \(T_w\) clips from the video, where each clip has length \(T_c\) and is sampled with step size \(s_c\). As the data is imbalanced, we weight each class by its inverse frequency in the training set (see the sketch below). During testing, we uniformly sample multiple windows spanning the entire video with step size \(s_w\), where each window is sampled in the same way as in training. The outputs of the model are the class distributions of all clips in all windows (potentially overlapping, depending on \(s_w\)). These outputs are then post-processed using the method in [37] to generate the detection results, with the activity threshold \(\gamma \) introduced as a hyperparameter.
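
A small sketch of the class-specific weighting used to counter class imbalance: each class is weighted by its inverse frequency among training clips, and the weights are passed to the cross-entropy loss. The normalization is an assumption.

```python
import torch
import torch.nn.functional as F

def inverse_frequency_weights(clip_labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    """clip_labels: (N,) integer labels of all training clips (background included)."""
    counts = torch.bincount(clip_labels, minlength=num_classes).float().clamp(min=1)
    weights = 1.0 / counts
    return weights * num_classes / weights.sum()   # keep the average weight near 1

# Usage: F.cross_entropy(logits, labels, weight=inverse_frequency_weights(all_labels, L))
```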

5 Experiments

In this section, we evaluate our method on two large-scale multimodal video benchmarks. The results show that our method outperforms representative baseline methods and achieves the state-of-the-art performance on both benchmarks.

5.1 Datasets and Setups

We evaluate our method on two large-scale multimodal video benchmarks: NTU RGB+D [45] (classification) and PKU-MMD [28] (detection). These datasets are selected for the following reasons. (1) They are among the largest RGB-D video benchmarks for their respective tasks. (2) Transferring privileged information between them is reasonable because the domains of the two datasets are similar. (3) They contain abundant modalities, which are required for graph distillation.

We use NTU RGB+D as the source-domain dataset and PKU-MMD as the target-domain dataset. In our experiments, unless stated otherwise, we apply graph distillation whenever applicable. Specifically, the visual encoders of all modalities are jointly trained on NTU RGB+D with graph distillation. On PKU-MMD, after initializing the visual encoder with the pre-trained weights obtained from NTU RGB+D, we also train all available modalities with graph distillation in the target domain. By default, only a single modality is used at test time.

NTU RGB+D [45]. It contains 56,880 videos from 60 action classes. Each video has exactly one action class and comes with four modalities: RGB, depth, 3D joints, and infrared. The training and testing sets have 40,320 and 16,560 videos, respectively. All results are reported with cross-subject evaluation.

PKU-MMD [28]. It contains 1,076 long videos from 51 action classes. Each video contains approximately 20 action instances of various lengths and consists of four modalities: RGB, depth, 3D joints, and infrared. All results are evaluated by the mean Average Precision (mAP) at different temporal Intersection over Union (tIoU) thresholds between the predicted and the ground-truth intervals.

Modalities. We use a total of six modalities in our experiments: RGB, depth (D), optical flow (F), and three skeleton features (S) named Joint-Joint Distances (JJD), Joint-Joint Vector (JJV), and Joint-Line Distances (JLD) [9, 24]. The RGB and depth videos are provided with the datasets. The optical flow is computed from the RGB videos using the dual TV-L1 method [61]. The three spatial skeleton features are extracted from the 3D joints using the method in [9, 24]. Note that we select a subset of the ten skeleton features in [9, 24] to ensure the simplicity and reproducibility of our method; our approach can potentially perform better with the complete set of features.

Baselines. In addition to comparing with the state-of-the-art, we implement three representative baselines that can leverage multimodal privileged information: multi-task learning [4], knowledge distillation [18], and cross-modal distillation [15]. For the multi-task model, we predict the raw pixels of the other modalities from the representation of a single modality and use the \(L_2\) distance as the multi-task loss. For knowledge distillation [18], the imitation loss is the high-temperature cross-entropy loss on the soft logits; for cross-modal distillation [15], it is the \(L_2\) loss on both the representations and the soft logits. These distillation methods originally support only two modalities, so we average the pairwise losses to obtain the final loss.

Table 1. Comparison with state-of-the-art on NTU RGB+D. Our models are trained on all modalities and tested on the single modality specified in the table. The available modalities are RGB, depth (D), optical flow (F), and skeleton (S).
Table 2. Comparison of action detection methods on PKU-MMD with state-of-the-art models. Our models are trained with graph distillation using all privileged modalities and tested on the modalities specified in the table. “Transfer” refers to pre-training on NTU RGB+D on action classification. The available modalities are RGB, depth (D), optical flow (F), and skeleton (S).

Implementation Details. For action classification, we train the visual encoder from scratch for 200 epochs using SGD with momentum, with an initial learning rate of \(10^{-2}\) decayed by a factor of 10 at epochs 125 and 175. \(\lambda _1\) and \(\lambda _2\) in Eq. (3) are set to 10 and 5, respectively. At test time we sample 5 clips per video for inference. For action detection, the visual and sequence encoders are trained for 400 epochs. The visual encoder is trained using SGD with momentum with a learning rate of \(10^{-3}\), and the sequence encoder is trained with the Adam optimizer [21] with a learning rate of \(10^{-3}\). The activity threshold \(\gamma \) is set to 0.4. For both tasks, we down-sample the frame rate of the datasets by a factor of 3. The clip length \(T_c\) and the window size \(T_w\) are both set to 10. For graph distillation, \(\alpha \) in Eq. (7) is set to 10. The output dimensions of the visual and sequence encoders are both set to 512. Since it is nontrivial to jointly train multiple modalities from scratch, we employ curriculum learning [1] to train the distillation graph. To do so, we first fix the distillation graph as an identity matrix (uniform graph) for the first 200 epochs. In the second stage, we compute the constant vector \(\mathbf {c}\) in Eq. (9) from the cross-validation results and then learn the graph in an end-to-end manner.
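
A rough sketch of the two-stage curriculum for the distillation graph; the dictionary-based interface and the normalization of the cross-validation scores are placeholders, not the authors' code.

```python
def graph_stage(epoch: int, cv_scores=None):
    """Return distillation-graph settings for the current epoch."""
    if epoch < 200:
        # Stage 1: keep the graph fixed so each modality trains with a constant structure.
        return {"learn_graph": False, "prior_c": None}
    # Stage 2: enable end-to-end graph learning with the prior c from cross-validation (Eq. (9)).
    total = float(sum(cv_scores))
    return {"learn_graph": True, "prior_c": [s / total for s in cv_scores]}
```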

Fig. 3. A comparison of the prediction results on PKU-MMD. (a) Both models make correct predictions. (b) The model without distillation in the source makes errors. Our model learns motion and skeleton information from the privileged modalities in the source domain, which helps the prediction for classes such as "hand waving" and "falling". (c) Both models make reasonable errors.

5.2 Comparison with State-of-the-Art

Action Classification. Table 1 compares action classification with state-of-the-art models on the NTU RGB+D dataset. Our graph distillation models are trained and tested on the same dataset in the source domain. NTU RGB+D is a very challenging dataset and has recently been studied in numerous works [24, 29, 32, 35, 46]. Nevertheless, our model achieves state-of-the-art results on NTU RGB+D. It yields a 4.5% improvement over the previous best result using the depth video and a remarkable 6.6% improvement using the RGB video. After inspecting the results, we found that the improvement is mainly attributed to the learned graph capturing complementary information across multiple modalities. Figure 4 shows example distillation graphs learned on NTU RGB+D. The results show that our method, even without transfer learning, is effective for action classification in the source domain.

Action Detection. Table 2 compares our method on PKU-MMD with previous work. Our model outperforms existing methods across all modalities. The results substantiate that our method can effectively leverage the privileged knowledge from multiple modalities. Figure 3 illustrates detection results on the depth modality with and without the proposed distillation.

5.3 Ablation Studies on Limited Training Data

Section 5.2 has shown that our method achieves state-of-the-art results on two public benchmarks. However, in practice, the training data are often limited in size. To systematically evaluate our method on limited training data, as motivated in the introduction, we construct mini-NTU RGB+D and mini-PKU-MMD by randomly sub-sampling 5% of the training data from each full dataset and use them for training. For evaluation, we test the models on the full test sets.

Table 3. The comparison with (a) baseline methods using Privileged Information (PIs) on mini-NTU RGB+D, (b) distillation graphs on mini-NTU RGB+D and mini-PKU-MMD. Empty graph trains each modality independently. Uniform graph uses a uniform weight in distillation. Prior graph is built according to the cross-validation accuracy of each modality. Learned graph is learned by our method. “D” refers to the depth modality.
Table 4. The mAP comparison on mini-PKU-MMD at different tIoU threshold \(\theta \). The depth modality is chosen for testing. “src”, “trg”, and “PI” stand for source, target, and privileged information, respectively.

Comparison with Baseline Methods. Table 3(a) shows the comparison with the baseline methods that use privileged information (see Sect. 5.1). The fact that our method outperforms these representative baselines validates the efficacy of graph distillation.

Efficacy of Distillation Graph. Table 3(b) compares the performance of predefined and learned distillation graphs. The proposed learned graph is compared with an empty graph (no distillation), a uniform graph of equal weights, and a prior graph computed using the cross-validation accuracy of each modality. Results show that the learned graph structure with modality-specific prior and example-specific information obtains the best results on both datasets.

Fig. 4. The visualization of graph distillation on NTU RGB+D. The numbers indicate the ranks of the distillation weights, with 1 being the largest and 5 being the smallest. (a) Class "falling": our graph assigns more weight to optical flow because optical flow captures the motion information. (b) Class "brushing teeth": in this case, motion is negligible, and our graph assigns the smallest weight to it; instead, it assigns the largest weight to skeleton data.

Efficacy of Privileged Information. Table 4 compares our distillation and transfer under different training settings. The input at test time is the single depth modality. Comparing rows 2 and 3 of Table 4, we see that when the visual encoder is transferred to the target domain, the one pre-trained with privileged information in the source domain performs better than its counterpart. As discussed in Sect. 3.2, graph distillation can also be applied to the target domain. Comparing rows 3 and 5 (or rows 2 and 4) of Table 4, we see that a performance gain is achieved by applying graph distillation in the target domain. These results show that graph distillation captures useful information from multiple modalities in both the source and target domains.

Efficacy of Having More Modalities. The last three rows of Table 4 show that a performance gain is achieved by increasing the number of modalities used as privileged information. Note that the test modality is depth, the first privileged modality is RGB, and the second privileged modality is the skeleton feature JJD. The results also suggest that these modalities provide complementary information to each other during graph distillation.

6 Conclusion

This paper tackles the problem of action classification and detection in multimodal videos with limited training data and partially observed modalities. We propose a novel graph distillation method that assists the training of the model by dynamically leveraging privileged modalities. Our model outperforms representative baseline methods and achieves the state-of-the-art for action classification on the NTU RGB+D dataset and for action detection on PKU-MMD. A direction for future work is to combine graph distillation with advanced transfer learning and domain adaptation techniques.